MATHM-47150 Structured Documents Adjunct Professor Ossi Nykänen, ossi.nykanen@tut.fi Tampere University of Technology, Dept. of Mathematics, Hypermedia Laboratory Slides for the Spring 2012 course, 6 cu, two periods of lectures and assignments; everyone must register in Moodle, course material (in Moodle) provides an outline to the topic; seek also Web examples This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
1 Introduction Problem statement Course overview Opportunities and challenges Examples to start with Basic ideas Looking forward Structured Documents /1 Introduction (ON 2012)
1.1 Teaser
What do the following have in common?
  o Web pages (e.g. HTML+CSS)
  o (Interactive) graphics (e.g. SVG)
  o (3D) visualisations of CAD models (e.g. Web3D, Collada)
  o Office documents (e.g. spreadsheets)
  o Android applications (e.g. GUI layout and manifests)
  o Distributed Internet systems (e.g. SOAP)
  o Semantic knowledge models (e.g. RDF)
  o Speech-driven phone applications (e.g. VoiceXML)
  o Documentation systems (e.g. DITA), news systems (e.g. RSS), Google map applications (e.g. KML), ...
...parts of each system are implemented (and thus accessed...) using structured documents, in XML
1.2 Problem statement (as an educated wish-list)
"I want a generic, simple programmatic/text editor access to all kinds of documents' structure and semantics, with standard tools and applications."
"I want self-describing documents, separating content from document structure, and logical structure from particular representation (e.g. visual layout)."
"I want computers to be able to validate and process documents using abstract instructions (i.e. besides procedural programs written in e.g. Java, Python, or C)."
"I want popular technology with lots of content and tools, and trained professionals. And I'm in a bit of a hurry."
1.3 Short answer: Use XML-based structured documents
Adopt a Unicode text format with established parsers
Define logical structures and document types on top of it
Use standard APIs/tools for...
  o Reading (validating) and writing documents
  o Querying, including, transforming, styling and otherwise processing documents (+editing, adapting, and viewing)
  o Listening and reacting to events, scripting, etc.
Document and open (standard) applications for everyone
1.4 About this course (beyond this intro)
Basic ideas of structured documents (SDs)
XML markup grammar and (simple) document type definitions using XML DTD, namespaces
Basic processing with XSL transformations and programming with DOM/Java (XSLT is the main "tool" to be learned from this course)
XHTML, SVG and other significant apps
Vocabulary and (publ.) application examples
Observations and design notes
The aim is to see a "big picture" so not all topics are covered in detail; also assignments play an important role here (see also following courses)
1.5 Assumed basic preliminaries...
Computer literacy
  o ...so that you know typical productivity apps, can work with files and data, install and use (also cmd-line) applications, without causing (severe) problems or losing critical data
Programming
  o ...so that you can design and deploy simple computer programs with (object-oriented) procedural languages (e.g. in Java, C++, Python, etc.)
Hypermedia/Web literacy
  o ...so that you are familiar with Web architecture and typical applications, and can design simple HTML/CSS applications and publish things using HTTP
1.6 SDs 101: The traditional definition
Structured documents allow the following parts/aspects to be separated computationally:
  o Content
  o Structure
  o Application (traditionally presentation)
In practice, adopting a suitable enabling technology is required (in this course, we focus on XML)
The basic idea is also well suited to structured data in general (persistent or messaging): The point is designing and publishing data with machine processing in mind
1.7 SDs 101: The traditional architecture
Intuitive from the perspective of technical documentation
In the abstract sense: Structured data-processing-application
1.8 A nice visual example to start with: Graphics
Scalable Vector Graphics (SVG) and other standard(s) for (interactive) graphics
Authors/generators provide (standard) SVG documents, requested/processed/consumed by applications
SVG is essentially a vocabulary and an API definition, on top of the other standard well-defined (SD/XML) technologies (!)
Pros (of the open SD background)
  o Easy to learn & access when familiar with XML, technology integration, clear definition, robustness, no vendor lock-in
Cons
  o Certain verbosity, legacy structures, in some cases leads to "too large" systems with several components/layers
1.9 Example: A simple interactive SVG document

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/css" href="helloworld.css"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
  "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg width="10cm" height="10cm" viewBox="-10 -10 120 120" xml:lang="en"
     xmlns="http://www.w3.org/2000/svg" version="1.1" onload="init();">
  <defs>
    <title>SVG Hello world</title>
    <desc>A simple example for discussing structured documents: A blue
      gradient rectangle with text and a black rectangle including a circle.
      Clicking the circle makes it move and change color. ON 2012</desc>
    <script type="text/javascript">
    <![CDATA[
      // Add a click listener to the circle with id "c1"
      function init() {
        var e = document.getElementById("c1");
        if (e != null) { e.addEventListener("click", change, false); }
      }
      // When clicked, change color and translate
      function change(evt) {
        // JS trick for creating the static var p
        if (typeof change.p == "undefined") { change.p = 0; }
        change.p = change.p + 20;
        if (change.p > 255) { change.p = 0; }
        var n = evt.target;
        n.setAttribute("style", "fill:rgb(" + change.p + "," + change.p + ",255)");
        var x = 80 * change.p / 255;
        n.setAttribute("transform", "translate(" + x + ")");
      }
    ]]>
    </script>
    <linearGradient id="mygradient">
      <stop offset="5%" stop-color="rgb(255,255,255)" />
      <stop offset="95%" stop-color="rgb(200,210,255)" />
    </linearGradient>
  </defs>
  <!-- Basic shapes to play with -->
  <rect width="100" height="100" fill="url(#mygradient)" stroke="black" stroke-width="1" />
  <text x="10" y="50" xml:space="default"> Hello world </text>
  <!-- Notice the change of coordinates (tip: try changing 70 to 10)... -->
  <g transform="translate(50,70)">
    <rect x="-50" y="-10" width="100" height="20" fill="black" />
    <circle id="c1" cx="-40" cy="0" r="8" stroke="none" fill="blue"/>
  </g>
</svg>

Psst. The file helloworld.css looks like this:

text { fill: blue; font-family: sans-serif; }

Pssst. See also more complicated SVG examples...
1.10 Some observations...
General SD concepts: Character encoding, document type and structure, elements and attributes, special characters, processing instructions, comments, SVG-specific vocabulary vs. generic XML syntax, styling,...
Concepts related to the runtime (viewer) environment: Scripts (in JavaScript, not SVG), event listeners and managers...
  o Do not worry about the scripts too much for now, but pay attention to document.getElementById("c1") etc.; this is how the document instance is perceived from the API point of view (markup is simply a structuring method)
Note that compared to "traditional programming", the focus in SD typically lies in enriching the data (with little scripting)
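The API point of view can also be tried outside a viewer. The following is a minimal sketch in Python, using the standard library module xml.etree.ElementTree in place of the browser DOM (an assumption made for illustration; the helper find_by_id and the trimmed SVG string are hypothetical, not part of the course material). It locates the circle by its id attribute, roughly as document.getElementById("c1") does in the viewer, and changes an attribute on the node.

```python
import xml.etree.ElementTree as ET

# A trimmed-down variant of the Hello world drawing (illustrative only).
SVG_SOURCE = """<svg xmlns="http://www.w3.org/2000/svg" version="1.1">
  <g transform="translate(50,70)">
    <rect x="-50" y="-10" width="100" height="20" fill="black"/>
    <circle id="c1" cx="-40" cy="0" r="8" stroke="none" fill="blue"/>
  </g>
</svg>"""


def find_by_id(root, wanted):
    """Return the first element whose 'id' attribute equals wanted, else None."""
    for element in root.iter():
        if element.get("id") == wanted:
            return element
    return None


root = ET.fromstring(SVG_SOURCE)
circle = find_by_id(root, "c1")
# The application manipulates tree nodes, not angle brackets.
circle.set("fill", "rgb(20,20,255)")
print(circle.tag, circle.get("fill"))
```

Note that ElementTree reports the element name together with its namespace, which is a first hint of the namespace mechanism discussed later in the course.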
1.11 Some fundamental concepts and terms...
Extensible Markup Language (XML)
Document
Structured document
Document type (Schema)
Document instance
Vocabulary
Logical structure
Physical structure
Text, Markup, Char.Data
Document fragment
Serialisation
Character encoding
Presentation
Stylesheet
Link (and Path)
Query
Processor (cf. Parser)
Interface (cf. Adapter)
Application
View (cf. Viewport)
Runtime environment
Script
[Inter|Re]active system
1.12 Notes
SD ideas can be implemented using different modelling approaches and technologies (in this course, we adopt XML)
SD is mostly an application-neutral paradigm; "documents" can stand for almost anything
SD is not "just for publishing"; the same approach works for file formats, dynamic applications, and network messages
SD is not just "about writing markup"; markup is simply a way to encode logical tree-form data structures for all kinds of data/processing (a use case for software in general)
In most "real" applications, SD requires re-thinking (human) processes; this is usually the most difficult part
1.13 An example of SD without using XML: LaTeX
Careless "What you see is what you get" (WYSIWYG) editing breaks the roles of publishing, yielding spaghetti documents
LaTeX is a nice typesetting system, essentially separating the tasks of authoring and document (layout) design, mostly used for scientific and mathematical texts (with pretty equations)
Basic roles: Author (you), Designer (LaTeX [using macros written by others]), Typesetter (TeX) + Printing/Publishing etc.
A typical LaTeX session resembles coding & compiling:
  o Edit manuscript (as text source documents etc.)
  o Process (latex lintro.tex) and preview (xdvi lintro.dvi &)
  o Convert for printing (dvips -o lintro.ps lintro.dvi) etc.
1.14 A simple document in LaTeX

\documentclass[a4paper]{article}
\begin{document}
\title{Almost Trivial LaTeX Introduction}
\author{Ossi Nyk\"{a}nen}
\maketitle
\begin{abstract}
This little document (in itself) demonstrates some LaTeX basics.
\end{abstract}
\tableofcontents

\section{Introduction}
LaTeX is a quality authoring and typesetting system that separates
the roles of document author and (layout) designer. It is good for,
e.g., typesetting documents with mathematical formulae; in fact, the
concise LaTeX math commands are widely used in mathematical software
in general.

This article (in itself) demonstrates some LaTeX basics. For more
information, please read ``The Not So Short Introduction to
\LaTeXe{}'' \cite{lshort} or find some suitable introductory book
for details.

\section{Example}
Here's an example: The following formula asserts that an invariance
$P$ exists for a set $X$:
\begin{equation}\label{inv}
\forall x \in X: P(x).
\end{equation}
Of course, being able to typeset (\ref{inv}) in a fancy way does not
necessarily mean one understands what the formula actually means.

Nevertheless, LaTeX (or \LaTeX{}...) really is something, especially
if you like relatively easy-to-edit, clean manuscripts, and wish to
include great-looking equations in your text\footnote{Assuming you
know \LaTeX{} codes, that is.}.

\begin{thebibliography}{longtitle}
\bibitem{lshort} Oetiker, T., Partl, H., Hyna, I., Schlegl, E. 2011.
\emph{The Not So Short Introduction to \LaTeXe{}}. Tobias Oetiker and
Contributors. Available at
ftp://ftp.funet.fi/pub/tex/ctan/info/lshort/english/lshort.pdf
\end{thebibliography}
\end{document}
1.15 Some observations
Compare with the SVG document example (!)
Concepts: Character encoding, document type and structure, commands[optional parameters]{parameters}, comments,...
Compare with SVG or HTML authoring (!)
The catch:
  o Source is written according to the intended author role (which does not typically include layout design etc.)
  o Source documents are text documents that can be written and read in various ways & tools
  o If you need a feature not in your current macro set, you either have to look for one, or develop it by yourself...
1.16 Ok. What about other, non-SD approaches?
Key-value pair systems (e.g. properties and storages)
Ad hoc file formats (where "markup syntax" is different in every application)
Relational (and other) databases (which share, however, many useful techniques and concepts with SD)
Interactive client-server protocols (e.g. (E)SMTP)
Note that these, too, may utilise SD concepts & techniques, or could be implemented using SD
In general, despite certain competition, different approaches and technologies may complement each other (using adapters, via query languages, as message formats, etc.)
1.17 One more example: JavaScript Object Notation
JSON is a text-based open standard designed for human-readable data interchange
Derived from the JavaScript scripting language for representing simple data structures and associative arrays (which denote objects)
Basically language-independent with parsers available for many languages. See RFC 4627
1.18 A JSON example
"Image member is an object whose Thumbnail member is an object and whose IDs member is an array of numbers"

{
  "Image": {
    "Width": 800,
    "Height": 600,
    "Title": "View from 15th Floor",
    "Thumbnail": {
      "Url": "http://www.example.com/image/481989943",
      "Height": 125,
      "Width": "100"
    },
    "IDs": [116, 943, 234, 38793]
  }
}
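To see how such a document looks from the application side, here is a small sketch in Python using the standard json module (the choice of Python and the variable names are assumptions for illustration). The parser turns the text into nested dictionaries and lists, so the application never handles braces or quotes itself.

```python
import json

# The RFC 4627 example object, as serialised text.
TEXT = """
{
  "Image": {
    "Width": 800,
    "Height": 600,
    "Title": "View from 15th Floor",
    "Thumbnail": {
      "Url": "http://www.example.com/image/481989943",
      "Height": 125,
      "Width": "100"
    },
    "IDs": [116, 943, 234, 38793]
  }
}
"""

data = json.loads(TEXT)
image = data["Image"]

# Objects become dictionaries, arrays become lists.
print(type(image["Thumbnail"]).__name__, type(image["IDs"]).__name__)
# Data types come from the syntax: 125 is a number, "100" is a string.
print(repr(image["Thumbnail"]["Height"]), repr(image["Thumbnail"]["Width"]))
```

The last line illustrates a point worth comparing with XML later: JSON carries a few basic datatypes in its syntax, whereas in plain XML all character data is text until a schema or an application says otherwise.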
1.19 Layers of models in data interpretation
In most information processing applications, acknowledging three layers or categories of models makes sense:
  o Conceptual (e.g. intuitive application concepts)
  o Logical (e.g. pivotal design structures)
  o Physical (e.g. implementation structures)
Roughly speaking, concepts "live" in designers' minds, logical structures appear in application (programming) interfaces, which the physical structures beneath implement
Intuitively, complexity increases from conceptual to logical to physical, and a logical model can be implemented using several physical approaches (this is why one usually starts from conceptual design)
1.20 What does "structured" mean?
By default, computers only deal with well-defined formats (it is even hard to speak about ill-defined structures because of the communication problem!); two things are needed:
  o Named container and serialisation of the content (cf. files)
Here "structure" simply means some specification of organising the content into well-defined parts, e.g., as a network of nodes (network data model), rows and columns (table data model), or elements and attributes (tree data model)
SD follows certain design principles to allow easy access to data according to the tree data model
1.21 The Document Parser pattern
The observation that "any suitable serialisation specification would in principle suffice for SD" highlights the role of (standard, re-usable) parsers
As a result, the most primitive, natural (?) structured document application architecture looks like this (note that markup is for the parser, not the application):
  Doc-Parser-App
In practice, "parsers" can do much more than simply parse content, as we will see (i.e. they also get delegated other generic tasks, ideally specified using modular standards)
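The Doc-Parser-App arrangement can be sketched in a few lines. The following Python example is only an illustration under stated assumptions: the standard xml.etree.ElementTree module plays the parser, and the function names parse and application as well as the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Doc: serialised markup (a made-up sample document).
DOCUMENT = "<greeting xml:lang='en'>Hello <b>world</b>!</greeting>"


def parse(text):
    """Parser role: turn serialised markup into a tree (raises on ill-formed input)."""
    return ET.fromstring(text)


def application(root):
    """Application role: works on the tree only and never sees '<' or '>'."""
    return "".join(root.itertext()).upper()


result = application(parse(DOCUMENT))
print(result)
```

Because the application depends only on the tree interface, the same application function would work unchanged if the parser were swapped for one that reads a different serialisation of the same logical structure.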
1.22 The data pipeline processing architecture
The SD paradigm is closely related to the general idea of the Data Processor Pattern:
  o A data processor is a piece of software that accepts data via interface X and provides output via interface Y (where a special case is a function f: X -> Y)
Connecting data processors with compatible input and output ("message") formats yields a data processing pipeline which itself is a data processor (cf. work and data flow systems and tools like Apache Ant, XProc,...; reusing processors and data)
[Figure: applications A1, A2, A3 connected through processors P1, P2, P3, P4 via interfaces X1 to X6]
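The claim that a pipeline of data processors is itself a data processor can be made concrete with plain functions. This is a minimal Python sketch under the assumption that each processor is a one-argument function; the helper pipeline and the three toy processors are hypothetical names invented for the example.

```python
from functools import reduce


def pipeline(*processors):
    """Chain processors left to right; the result is itself a processor."""
    def run(data):
        return reduce(lambda value, processor: processor(value), processors, data)
    return run


# Three toy processors with compatible input and output formats.
def tokenize(text):
    return text.split()


def drop_short(words):
    return [word for word in words if len(word) > 2]


def count(words):
    return len(words)


word_count = pipeline(tokenize, drop_short, count)
# A pipeline can be reused as a stage of a larger pipeline.
nested = pipeline(pipeline(tokenize, drop_short), count)

print(word_count("a structured document is a tree"))
```

The nested variant gives the same result as the flat one, which is exactly the closure property stated above.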
1.23 Abstract SD (or SD technology) use cases
Implementation of off-the-shelf applications (e.g. common document formats, desktop publishing and single-sourcing systems)
Accessing data from an application X (e.g. databases and generic production systems, editors, spreadsheets, CAD systems, pre-processing)
Deploying data into an application Y (e.g. views, execution in certain applications, post-processing)
Utilisation of specific SD technology Z (e.g. validation, transforming, object manipulation, linking, signing, etc.)
Practical motivation often lies in integrating with legacy systems and data, using standard components and tools
1.24 Concluding notes
Structured documents aim at making information processing more efficient (introducing design patterns and technologies)
The basic idea of structured documents lies in ensuring easy editing/programmatic access of source documents, separating structure from presentation, and introducing general-purpose technologies for the job (but end users don't usually care how)
The fundamental challenge of SD is common to all design: First ideas and prototypes do not usually scale, and successful adoption of SD requires thinking about not only data structures, but also (workflows and) data process flows
Not all good "technical" SD ideas are always fully adopted in practice, usually due to legacy reasons, lack of planning, etc.
2 XML Basics Introduction XML document instance structure Overview of XML technologies Early design ideas Namespaces
2.1 Introduction to Extensible Markup Language (XML)
XML is a family of (standard) World Wide Web Consortium (W3C) XML technologies and related tools and technologies
Historically, XML is based on Standard Generalized Markup Language, SGML (an ISO std from 1986); originally XML was defined as an SGML application profile, influenced by SGML/HTML tools, dialects, and Internet applications
Today, XML is well-established and part of computer science de facto technologies
Looking into the future, perhaps the biggest challenge for XML lies in its strongest design principle: easy machine readability, which means intolerance towards ill-formed data; in practice, this leads to competing, relaxed syntaxes
Structured Documents /2 XML Basics (ON 2012)
2.2 Working definitions for core XML specifications
In this course, we define the core of the XML family to include the following technical specifications:
  o XML 1.0 (replace v1.0 with v1.1 when you have to)
  o Namespaces in XML (repl. v1.0 with v1.1 when you have to)
  o Document Object Model (DOM) (Level 2 or 3)
We also consider XSL Transformations (XSLT; 1.0 and 1.1) a critical tool technology for making the most of XML
In addition, several other technologies are needed, but these depend on the application (or are needed for standardisation)
Note: When people (ambiguously) speak about XML in general, they are often referring to the XML family (here and there), not necessarily the particular XML 1.0 specification
2.3 Design goals (XML v1.0 in 1998, 5th edition in 2008)
  o XML shall be straightforwardly usable over the Internet.
  o XML shall support a wide variety of applications.
  o XML shall be compatible with SGML.
  o It shall be easy to write programs which process XML documents.
  o The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
  o XML documents should be human-legible and reasonably clear.
  o The XML design should be prepared quickly.
  o The design of XML shall be formal and concise.
  o XML documents shall be easy to create.
  o Terseness in XML markup is of minimal importance.
2.4 A simple example

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE myformat SYSTEM "myformat.dtd">
<myformat xml:lang="en">
  <title>Hello world!</title>
  <desc>This document demonstrates XML syntax (and is quite useless otherwise).</desc>
  <example>In principle, the markup is simple (as ABC) if you can write
    structured text and remember to escape some special characters, such as &lt;!
    <!-- The following section does not declare a logical element <foo/>, only character data: -->
    <![CDATA[ <foo/> ]]>
  </example>
</myformat>
<!-- This comment lies in the "XML epilog" - using this part should be avoided... -->
2.5 ...with an optional Document Type Definition (DTD)

<!ELEMENT myformat (title,desc?,example*)>
<!ATTLIST myformat xml:lang CDATA "fi">
<!ELEMENT title (#PCDATA)>
<!ELEMENT desc (#PCDATA)>
<!ELEMENT example (#PCDATA)>

...but let us ignore this mostly for now. (But do experiment with the syntax using XML editors.)
2.6 Mostly all about "neutral" structure
An XML document includes three parts (partly recursive):
  o The prolog, the root element, and following misc parts
In practice, three major parts (first two are optional):
  o XML declaration (<?xml...)
  o Document type declaration (<!DOCTYPE...)
  o Document instance (root element and its contents)
Essentially, the document "is" a (parse) tree of element and other kinds of nodes (no particular presentation, etc.)
A document that fails to follow XML syntax is simply a text document (and almost useless from the XML point of view...)
A document that does not follow/include a type declaration is not valid (still useful since processing the DTD is optional)
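The difference between an XML document and "simply a text document" is easy to test with any conforming parser. The sketch below uses Python's standard xml.etree.ElementTree (an assumption for illustration; the helper is_well_formed and the sample strings are hypothetical). Note that this parser checks well-formedness only and does not validate against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """True if the parser accepts text as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


GOOD = "<myformat><title>Hello world!</title></myformat>"
BAD_NESTING = "<myformat><title>Hello world!</myformat></title>"
BAD_ESCAPE = "<myformat><title>1 < 2</title></myformat>"

for sample in (GOOD, BAD_NESTING, BAD_ESCAPE):
    print(is_well_formed(sample), sample)
```

The second sample fails because start-tags and end-tags do not nest properly, the third because a literal "<" in character data must be escaped (for example as &lt;).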
2.7 Markup categories
XML declaration
Text declaration
Comment
Character reference
Entity reference
Start-tag (incl. encoding attributes and attribute value normalisation)
End-tag
Empty-element tag
CDATA section
Processing instruction
Document type declaration
2.8 Basic constructs
Version, encoding, and stand-alone declaration
Names, the ":" char, and the use of the prefix "[xX][mM][lL]"
Logical structure, uniqueness of attributes
Physical structure (come back to this in detail when equipped with the notion of XML DTD & entities)
Pre-defined entities (lt, gt, amp, quot, and apos)
Pre-defined attributes (xml:lang, xml:space)
XML document, XML Processor, XML Application
Basic structure design philosophy (using elements, attributes, PIs, & comments)
2.9 Formal syntax specification (e.g. XML 1.0)
A (well-formed) XML document is a textual object (Unicode string) formally defined by a set of rules in EBNF:

  [1]  document    ::= prolog element Misc*
  ...
  [5]  Name        ::= NameStartChar (NameChar)*
  ...
  [39] element     ::= EmptyElemTag | STag content ETag   [WFC: Element Type Match] [VC: Element Valid]
  [40] STag        ::= '<' Name (S Attribute)* S? '>'   [WFC: Unique Att Spec]
  ...
  [45] elementdecl ::= '<!ELEMENT' S Name S contentspec S? '>'   [VC: Unique Element Type Declaration]
  ...

Resolving conflicts, good to know, & useful when encountering odd syntax errors
2.10 Notes about basic operations
What does an XML document "mean"? (Ask application)
Editing manually vs. serialising data in XML (tool categories)
Viewing (?)
  o Associating with a stylesheet
  o Debugging, event tracking, etc.
Checking well-formedness (and dealing with errors)
Validating (...and dealing with errors)
Processing (...we'll come back to this later)
Useful additional concepts (not defined by XML 1.0)
  o Project, source, target, versioning,... (cf. e.g. Eclipse)
  o Element, attribute and data types
  o Data island, inline code (...namespaces)
2.11 (Preliminary) notes to avoid misunderstandings
Document- vs. Data-oriented modelling
Persistent "documents" vs. messages
XML "files" vs. XML Objects
XML as a (query) interface
  o Sometimes there is no clear "underlying file" at all
  o Typical asymmetry in reading and writing
XML syntax, SGML syntax, HTML syntax, dialects
In a strict sense, the XML 1.0 (1.1) standard specifies fatal errors which a conforming XML processor MUST detect and report to the application
2.12 XML namespaces (NS)
At some point, there comes a shortage of short names...
XML namespaces provide a simple mechanism to associate XML vocabularies with a URI (IRI) string
  o Cf. packages or modules in programming languages
New structure for naming things (now elements and attrs):
  o Expanded name = (namespace name, local name)
Two methods of using:
  o Default namespaces & qualified names
Slight variations between specifications 1.0 and 1.1
Main use case: Mixing applications using inline code
(A mental image to start with: A namespace is a bag of local names. The bag's name is the namespace name.)
2.13 Default namespace
Familiar example, now binding the local element names:

<?xml version="1.0"?>
<!-- elements are in the HTML namespace, in this case by default -->
<html xmlns='http://www.w3.org/1999/xhtml'>
  <head><title>Frobnostication</title></head>
  <body><p>Moved to
    <a href='http://frob.example.com'>here</a>.</p></body>
</html>

One "active" default namespace at a time, only elements inherit the binding, can be overwritten or "turned off" (="")
Note that in the above, there is no namespace for the attribute href; this does not really hurt since one can always read an attribute's context from the document
2.14 Qualified namespace
Two operations available
  o Declare prefix binding (in 1.1 one can also remove a binding with ="")
  o (and then, when appropriate) qualify name with the prefix

<x xmlns:edi='http://ecommerce.example.org/schema'>
  <!-- the 'taxClass' attribute's namespace is http://ecommerce.example.org/schema -->
  <lineItem edi:taxClass="exempt">Baby food</lineItem>
</x>

Several prefixes can be declared, but only the use of qualified names binds namespaces to local names; only prefix bindings are inherited, and sometimes inheriting prefixes is a problem
Using commonly used prefixes usually makes sense
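What a namespace-aware parser reports for the example above can be checked directly. This is a small Python sketch using the standard xml.etree.ElementTree module (an assumption for illustration; the element and attribute names follow the example in the Namespaces in XML specification, and the variable names are hypothetical). ElementTree writes an expanded name as "{namespace name}local name".

```python
import xml.etree.ElementTree as ET

EDI = "http://ecommerce.example.org/schema"

DOC = """<x xmlns:edi='http://ecommerce.example.org/schema'>
  <lineItem edi:taxClass="exempt">Baby food</lineItem>
</x>"""

root = ET.fromstring(DOC)
item = root[0]

# Declaring the prefix does not put the elements into the namespace...
print(root.tag, item.tag)
# ...only the qualified attribute name is bound to it.
print(item.attrib)
tax_class = item.get("{%s}taxClass" % EDI)
```

The prefix edi itself is not part of the expanded name; another document could bind a different prefix to the same URI and the attribute would have the same identity.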
2.15 A bigger example about scoping

<?xml version="1.0"?>
<!-- initially, the default namespace is "books" -->
<book xmlns='urn:loc.gov:books'
      xmlns:isbn='urn:isbn:0-395-36341-6'>
  <title>Cheaper by the Dozen</title>
  <isbn:number>1568491379</isbn:number>
  <notes>
    <!-- make HTML the default namespace for some commentary -->
    <p xmlns='http://www.w3.org/1999/xhtml'>
      This is a <i>funny</i> book!
    </p>
  </notes>
</book>
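The scoping rules can be observed by listing the expanded name of every element in the book example. The following Python sketch uses the standard xml.etree.ElementTree module (an assumption for illustration); the helper expanded_name is a hypothetical name that splits ElementTree's "{namespace}local" notation into the pair (namespace name, local name).

```python
import xml.etree.ElementTree as ET

DOC = """<book xmlns='urn:loc.gov:books' xmlns:isbn='urn:isbn:0-395-36341-6'>
  <title>Cheaper by the Dozen</title>
  <isbn:number>1568491379</isbn:number>
  <notes>
    <p xmlns='http://www.w3.org/1999/xhtml'>This is a <i>funny</i> book!</p>
  </notes>
</book>"""


def expanded_name(tag):
    """Split '{namespace}local' into (namespace name, local name)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return None, tag


names = [expanded_name(element.tag) for element in ET.fromstring(DOC).iter()]
for namespace, local in names:
    print(local, "is in", namespace)
```

The i element inherits the XHTML default namespace from its parent p, while notes still belongs to the books namespace declared on the root.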
2.16 Notes
Comparing URI strings (char by char)
Namespace document ("the NS home page")
Namespaces provide a concrete technique for adding information to XML documents, for...
  o Rigorously identifying element and attribute names with unique identifiers
  o Avoiding name collisions in applications when merging multiple XML applications
  o Communicating about specifications
NS demonstrates a very significant step away from the old SGML semantics and some legacy (XML) applications/specs
2.17 Which namespace, then?
Names starting with "[xX][mM][lL]" are reserved, and the prefix xml is by definition bound to the namespace http://www.w3.org/XML/1998/namespace
In principle, different application designers decide their namespaces as they see fit (and document it)
Beware that there is no explicit method for globally declaring (or non-trivially validating...) namespaces themselves; one simply writes references to namespaces in documents using the above methods
  o The URI naming quite naturally supports decentralisation
  o Sometimes XML NS is criticised for favouring the use of domain names (that need registering)...
2.18 XML in the W3C Web technology stack
2.19 Conclusion: What have we learnt so far?
Instance syntax, bit of the specifications
Architecture and application ideas
Namespace syntax
However, so far it is unclear what to do with XML and why to favour certain kinds of modelling structures (examples in the introduction should help, but still...)
To make these things useful, a few things are needed
  o Ways to define schemas (restricted formats)
  o Ways to implement applications
Let us move forward...
3 XML DTD and Schemas Introduction Document classes Logical structure Physical structure Catalogs, namespaces, etc. Other schema languages
3.1 Introduction
XML and XML NS establish the class of XML documents
Applications, however, are typically designed to manage only certain structures (e.g. vector graphics), so a more specific contract about the communication interface is needed
A generic way to do this is to provide the type of structures via a schema definition. We identify four main use cases:
  1. Document the data structure (of domain of interest)
  2. Describe instance data (e.g. point out associations)
  3. Validate instance data (and verify schema)
  4. Assert information into instance data
For historical reasons, XML DTD can also...
  5. Declare physical structures (schema and instance)
Structured Documents /3 XML DTD and Schemas (ON 2012)
3.2 Schema languages for XML
Several schema languages exist for XML, including
  o XML DTD (built-in part of the XML 1.0/1.1 spec)
  o XML Schema (integrates nicely with the XML family, supports simple object-oriented design concepts)
  o ISO Schematron (powerful rule-based reporting language)
In addition, XML applications following the network data model (RDF) introduce schema languages of their own
Despite its problems, XML DTD (Document Type Definition) is important because of its simplicity and very wide support
Recall that some schemas (incl. DTD) may have side-effects, i.e. interpreting them can add information to the instance (!)
3.3 DTD and logical structure of an XML document
In general, logical document structure is a concept that is related to a particular processor (or parser)
Considering the XML DTD, the logical structure is roughly defined as follows:
  o Compute the parse tree of an XML document
  o Ignore node types other than elements, attributes, and text
  o Write out entity and character references
  o Crop certain text nodes and normalise attributes
In brief, this means that the logical structure is essentially the element structure; put another way, XML DTD is insensitive towards comments, processing instructions, etc.
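Two of the steps listed above can be observed with an ordinary parser. This Python sketch relies on the default behaviour of the standard xml.etree.ElementTree parser, which happens to discard comments and processing instructions and to write out character and predefined entity references; it only illustrates the idea of a logical structure and is not an implementation of DTD processing. The sample document is made up.

```python
import xml.etree.ElementTree as ET

# Comment, processing instruction, entity and character references in one element.
SOURCE = "<note><!-- hidden remark --><?app do-something?>1 &lt; 2 &amp; &#65;</note>"

root = ET.fromstring(SOURCE)

# Only the element and its text survive; references are written out.
print(repr(root.text))
print(len(root))
```

The resulting tree has a single note element with plain text content and no child nodes, even though the serialised form contained four different kinds of markup inside it.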
3.4 Declaring the document type using XML DTD
Declaration referring to the external DTD subset, perhaps associated with a public identifier, and/or including the internal DTD subset

<!DOCTYPE greeting PUBLIC "-//ExampleOrg//DTD Common Hellow//EN"
  "http://www.example.org/hello.dtd" [
  <!ELEMENT greeting (#PCDATA)>
]>
<greeting></greeting>

Note that when used, XML DTD insists on adding the type declaration to the document (as markup...)
Public identifiers are needed for catalogues etc.
3.5 XML DTD declarations in a nutshell
The XML 1.0/1.1 built-in XML Document Type Definition (XML DTD) language provides the following main primitives:
  o Element type declaration
  o Attribute-list declaration
  o Entity declaration
The first two are related to logical structures (which is usually the main use case), the third (entities) to physical structures
Additional primitives include
  o Conditional sections
  o Notation declaration
Let us start from the logical structures
3.6 Recall the simple XML DTD example

<!ELEMENT myformat (title,desc?,example*)>
<!ATTLIST myformat xml:lang CDATA "fi">
<!ELEMENT title (#PCDATA)>
<!ELEMENT desc (#PCDATA)>
<!ELEMENT example (#PCDATA)>

Technically, this particular DTD is all about declaring the logical structure for XML documents of this type...
XML DTD has two major weaknesses: No rich definitions for datatypes, no support for XML NS semantics (use e.g. XML Schema for these)
3.7 Element declaration basics
Intuitively, XML DTD element declarations specify the element name and the content model
  <!ELEMENT NAME CONTENT_MODEL>
In practice, there are four kinds of element declarations:
o Element content: (foo, (bar+ | data*)?)
o Mixed content: (#PCDATA | foo | bar)*
o EMPTY
o ANY
During validation, an element found in the instance is compared to the corresponding element declaration
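The comparison of instance elements against their declarations can be tried out with the JAXP parser that ships with the JDK. The following is only a minimal sketch: the class name DtdValidationSketch, the doc/title/para/em vocabulary, and the choice of an internal DTD subset are assumptions made for illustration, not part of the course material.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

/** Hypothetical helper: validates a document body against an internal DTD subset. */
final class DtdValidationSketch {

    // One element-content model (doc) and one mixed-content model (para).
    static final String DTD =
        "<!DOCTYPE doc [\n"
      + "  <!ELEMENT doc (title, para*)>\n"
      + "  <!ELEMENT title (#PCDATA)>\n"
      + "  <!ELEMENT para (#PCDATA | em)*>\n"
      + "  <!ELEMENT em (#PCDATA)>\n"
      + "]>\n";

    /** Returns the validity errors reported by the parser (empty list means valid). */
    static List<String> validate(String body) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // DTD validation is switched off by default
        DocumentBuilder builder = factory.newDocumentBuilder();
        List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignored in this sketch */ }
            public void error(SAXParseException e) { errors.add(e.getMessage()); }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(DTD + body)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        // Matches (title, para*) and the mixed content model of para.
        System.out.println(validate("<doc><title>T</title><para>a <em>b</em> c</para></doc>"));
        // The required title is missing, so at least one validity error is collected.
        System.out.println(validate("<doc><para>x</para></doc>"));
    }
}
```

Validity errors are recoverable, which is why they are collected instead of thrown; a well-formedness problem would still surface as a fatal error.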
3.8 Attribute (list) declaration basics
Attribute declarations are associated with elements, giving the attribute name, attribute type and default declaration:
  <!ATTLIST E_NAME A_NAME ATT_TYPE DEFAULT_DECL>
Types: CDATA, ID (keys), IDREF, IDREFS (foreign keys), ENTITY, ENTITIES, NOTATION (identifier of "helper apps"), NMTOKEN, NMTOKENS, and enumeration, e.g. (a | b | c)
Default declarations: #REQUIRED, #IMPLIED, (#FIXED) ATT_VALUE
Values are normalised according to type (strip the outer quotes " or ', write out entity and char refs, convert white space characters into spaces; if not CDATA, also strip leading and trailing spaces and collapse runs of spaces into one)
Valid docs must also declare the xml:lang and xml:space attributes when these are used; for interoperability, they should declare the predefined entities (lt, gt, amp, apos, quot) as well
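How a default declaration reaches the application can be illustrated with a small DOM program. This is a hedged sketch only: the class name AttributeDefaultSketch and the helper langOf are hypothetical, and the DTD is the xml:lang declaration of the myformat example placed in an internal subset.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Illustrative sketch: attribute defaults declared in a DTD are visible via the DOM. */
final class AttributeDefaultSketch {

    // xml:lang is declared with the default value "fi".
    static final String DTD =
        "<!DOCTYPE myformat [\n"
      + "  <!ELEMENT myformat (#PCDATA)>\n"
      + "  <!ATTLIST myformat xml:lang CDATA \"fi\">\n"
      + "]>\n";

    /** Parses DTD + body and returns the xml:lang value seen on the root element. */
    static String langOf(String body) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Element root = builder.parse(new InputSource(new StringReader(DTD + body)))
                              .getDocumentElement();
        return root.getAttribute("xml:lang");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(langOf("<myformat>hei</myformat>"));                 // default applies
        System.out.println(langOf("<myformat xml:lang='en'>hello</myformat>")); // explicit value wins
    }
}
```

Note that the parser supplies the default even though validation is not switched on, because the declaration is read from the internal subset.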
3.9 Simple tree diagrams
XML DTD does not specify a visual notation for DTD design; in practice, simple tree diagrams may help in communicating the element structure (sometimes complemented with state or "structure" diagrams)
Attributes, partial diagrams, references, use of UML and similar notation, tool-specific diagrams,...
3.10 Empty and constraint-free models, recursion
EMPTY (placeholder, think e.g. html img)
ANY (any known element content allowed)
When appropriate, recursive content models are also allowed (essentially tree structures of arbitrary depth)
Note that recursion can also take place on the application level, e.g., when navigating document structures recursively, which does not necessarily require recursive content modelling in the above sense
3.11 DTD and physical structure of an XML document In general, physical document structure is a concept that is related to the way the document is serialised (or represented as a program object) In basic XML, the physical structure is captured in terms of the concept entity (cf. file) which is a pair (name, content) Three basic categories of parsed entities: Document entity, general entity, parameter entity Depending on their definition, general and parameter entities are either internal or external There is also the concept of unparsed entity, but nowadays it is not too commonly used Structured Documents /3 XML DTD and Schemas (ON 2012)
3.12 Entity examples: General
  ...
  <!-- Internal -->
  <!ENTITY Pub-Status "This is a pre-release of the specification.">
  <!ENTITY foo "ABC">
  <!-- External -->
  <!ENTITY open-hatch SYSTEM "http://www.textuality.com/boilerplate/openhatch.xml">
  <!ENTITY open-hatch2 PUBLIC "-//Textuality//TEXT Standard open-hatch boilerplate//en"
    "http://www.textuality.com/boilerplate/openhatch.xml">
  ...
  <!-- References within XML instance -->
  ... &Pub-Status; ... &open-hatch;
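From the application's point of view, references to internal general entities are normally already written out by the parser. A minimal sketch of this, assuming a hypothetical note element and the class name EntityExpansionSketch (external entities are deliberately left out, since fetching them depends on network access and parser security settings):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Illustrative sketch: internal general entities are expanded during parsing. */
final class EntityExpansionSketch {

    // The two internal entities from the slide, declared in an internal DTD subset.
    static final String XML =
        "<!DOCTYPE note [\n"
      + "  <!ENTITY Pub-Status \"This is a pre-release of the specification.\">\n"
      + "  <!ENTITY foo \"ABC\">\n"
      + "]>\n"
      + "<note>&foo;: &Pub-Status;</note>";

    /** Returns the text content of the root element after entity expansion. */
    static String expandedText(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Prints "ABC: This is a pre-release of the specification."
        System.out.println(expandedText(XML));
    }
}
```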
3.13 Entity examples: Parameter
  <!ENTITY % ents SYSTEM "namelist.ent">
  %ents;
  ...
  <!ENTITY % Boolean.datatype "( false | true )" >
  <!ATTLIST data public %Boolean.datatype; #IMPLIED>
  ...
  <!-- Conditional section -->
  <!ENTITY % draft 'INCLUDE' >
  <!ENTITY % final 'IGNORE' >
  <![%draft;[
  <!ELEMENT book (comments*, title, body, supplements?)>
  ]]>
  <![%final;[
  <!ELEMENT book (title, body, supplements?)>
  ]]>
3.14 Entity examples: Unparsed (and notation)
  ...
  <!NOTATION gif SYSTEM "gifutility.exe">
  ...
  <!ENTITY hatch-pic SYSTEM "../grafix/openhatch.gif" NDATA gif >
3.15 DTD driver
The external DTD subset which controls (refers to) all the others is called the DTD driver
In practice, a document type declaration written in the XML document prolog refers to a (particular) DTD driver, which then refers to the various DTD modules, etc.
This also suggests that a single DTD module may appear in the context of several DTD drivers (e.g. character entities)
Similar "driver" concepts can be adopted elsewhere as well (e.g. in XML Schema modules, XSLT modules, etc.)
3.16 Notes about entities Common use cases Organising document instance into several files Organising DTD into several files Efficiently managing DTD with parameters and documenting design concepts (cf. pre-processor in C) In general, entities should not be used for styling, etc. Sometimes XML DTD is used solely for declaring entities (even in combination with other schema technologies) In the future, techniques such as XInclude might be used instead of XML DTD entities Structured Documents /3 XML DTD and Schemas (ON 2012)
3.17 OASIS XML Catalogs
Downloading content from the Internet each time a schema is needed is not always practical
URIs and public identifiers allow identifying archived schemas
The OASIS XML Catalogs specification allows mapping URLs and public names to (local) resources; for instance:
catalog2.xml:
  <!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"
    "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
  <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog" prefer="public">
    <public publicId="-//a//xml CATALOG IDENTIFIER//EN" uri="example.ent"/>
  </catalog>
3.18 XML DTD and XML namespaces
Unfortunately, XML DTD does not recognise NS semantics: ':' is just a funny (but legal) character in a name...
Also, while modular DTDs are possible, each DTD driver needs to be well-defined and known: no "future extensions" (the DTD driver needs to be rewritten each time a new vocabulary is introduced)
This document-centric thinking in DTDs implies two things:
o Declaring mixed inline text formats is not convenient...
o ...but it is not a real problem if all the vocabularies and the syntax for mixing are completely known at design time, and we don't mind enforcing specific namespace prefixes, etc. (cf. SVG)
3.19 XML schemas using, well,... XML Schema
In some cases, the aforementioned challenges with XML DTD and its peculiar legacy syntax do suggest using other XML schema languages
Looking at the W3C XML technology stack, the next choice is XML Schema (who invents these names anyway...)
XML Schema provides an XML syntax for declarations (!), is namespace-aware, supports additional design concepts (e.g. derivations, element and attribute-level types), and extends typing into datatypes (string, date, decimal, boolean, etc.); it does not, however, include an equivalent to general entities
However, introducing general-purpose, easy-to-use (inline application) schema drivers is still quite difficult
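The practical effect of datatypes can be seen by validating against a tiny schema with the javax.xml.validation API of the JDK. The sketch below is an assumption-laden illustration: the order vocabulary and the class name SchemaValidationSketch are made up here (loosely following the purchase order example), and the schema is embedded as a string only to keep the example self-contained.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

/** Illustrative sketch: XML Schema datatypes catch errors that a DTD would let through. */
final class SchemaValidationSketch {

    // Hypothetical mini schema: a typed attribute (xsd:date) and a typed element.
    static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
      + "  <xsd:element name='order'>"
      + "    <xsd:complexType>"
      + "      <xsd:sequence>"
      + "        <xsd:element name='quantity' type='xsd:positiveInteger'/>"
      + "      </xsd:sequence>"
      + "      <xsd:attribute name='orderDate' type='xsd:date' use='required'/>"
      + "    </xsd:complexType>"
      + "  </xsd:element>"
      + "</xsd:schema>";

    /** Returns true when the instance is valid with respect to XSD. */
    static boolean isValid(String xml) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(XSD)));
        try {
            // Without a custom error handler, the first validity error is thrown.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid("<order orderDate='1999-10-20'><quantity>1</quantity></order>"));
        System.out.println(isValid("<order orderDate='yesterday'><quantity>1</quantity></order>"));
    }
}
```

The instance itself carries no reference to the schema; the application decides which schema to apply.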
3.20 Example: Compare XML DTD and Schema for...
  <?xml version="1.0"?>
  <purchaseOrder orderDate="1999-10-20">
    <shipTo country="US">
      <name>Alice Smith</name>
      <street>123 Maple Street</street>
      <city>Mill Valley</city>
      <state>CA</state>
      <zip>90952</zip>
    </shipTo>
    <billTo country="US">
      <name>Robert Smith</name>
      <street>8 Oak Avenue</street>
      <city>Old Town</city>
      <state>PA</state>
      <zip>95819</zip>
    </billTo>
    <comment>Hurry, my lawn is going wild!</comment>
    <items>
      <item partNum="872-AA">
        <productName>Lawnmower</productName>
        <quantity>1</quantity>
        <USPrice>148.95</USPrice>
        <comment>Confirm this is electric</comment>
      </item>
      <item partNum="926-AA">
        <productName>Baby Monitor</productName>
        <quantity>1</quantity>
        <USPrice>39.98</USPrice>
        <shipDate>1999-05-21</shipDate>
      </item>
    </items>
  </purchaseOrder>
3.21 First some DTD to compare with, and...
  <!ENTITY % USAddress "name,street,city,state,zip">
  <!ENTITY % shipattrs "country NMTOKEN #FIXED 'US'">
  <!ENTITY % itemattrs "partNum NMTOKEN #REQUIRED">
  <!ELEMENT purchaseOrder (shipTo, billTo, comment?, items)>
  <!ATTLIST purchaseOrder orderDate CDATA #REQUIRED>
  <!ELEMENT shipTo (%USAddress;)>
  <!ATTLIST shipTo %shipattrs;>
  <!ELEMENT billTo (%USAddress;)>
  <!ATTLIST billTo %shipattrs;>
  <!ELEMENT items (item*)>
  <!ELEMENT item (productName, quantity, USPrice, comment?, shipDate?)>
  <!ATTLIST item %itemattrs;>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT street (#PCDATA)>
  <!ELEMENT city (#PCDATA)>
  <!ELEMENT state (#PCDATA)>
  <!ELEMENT zip (#PCDATA)>
  <!ELEMENT comment (#PCDATA)>
  <!ELEMENT productName (#PCDATA)>
  <!ELEMENT quantity (#PCDATA)>
  <!ELEMENT USPrice (#PCDATA)>
  <!ELEMENT shipDate (#PCDATA)>
  <!ELEMENT partNum (#PCDATA)>
  <!-- When used, remember to add document type declaration to the instance... -->
3.22 ...then an XML Schema (informative, not for exam)
  <!-- For details, see XML Schema Part 0: Primer -->
  <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:annotation>
      <xsd:documentation xml:lang="en">
        Purchase order schema for Example.com.
        Copyright 2000 Example.com. All rights reserved.
      </xsd:documentation>
    </xsd:annotation>
    <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
    <xsd:element name="comment" type="xsd:string"/>
    <xsd:complexType name="PurchaseOrderType">
      <xsd:sequence>
        <xsd:element name="shipTo" type="USAddress"/>
        <xsd:element name="billTo" type="USAddress"/>
        <xsd:element ref="comment" minOccurs="0"/>
        <xsd:element name="items" type="Items"/>
      </xsd:sequence>
      <xsd:attribute name="orderDate" type="xsd:date"/>
    </xsd:complexType>
    <xsd:complexType name="USAddress">
      <xsd:sequence>
        <xsd:element name="name" type="xsd:string"/>
        <xsd:element name="street" type="xsd:string"/>
        <xsd:element name="city" type="xsd:string"/>
        <xsd:element name="state" type="xsd:string"/>
        <xsd:element name="zip" type="xsd:decimal"/>
      </xsd:sequence>
      <xsd:attribute name="country" type="xsd:NMTOKEN" fixed="US"/>
    </xsd:complexType>
    <xsd:complexType name="Items">
      <xsd:sequence>
        <xsd:element name="item" minOccurs="0" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="productName" type="xsd:string"/>
              <xsd:element name="quantity">
                <xsd:simpleType>
                  <xsd:restriction base="xsd:positiveInteger">
                    <xsd:maxExclusive value="100"/>
                  </xsd:restriction>
                </xsd:simpleType>
              </xsd:element>
              <xsd:element name="USPrice" type="xsd:decimal"/>
              <xsd:element ref="comment" minOccurs="0"/>
              <xsd:element name="shipDate" type="xsd:date" minOccurs="0"/>
            </xsd:sequence>
            <xsd:attribute name="partNum" type="SKU" use="required"/>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
    <!-- Stock Keeping Unit, a code for identifying products -->
    <xsd:simpleType name="SKU">
      <xsd:restriction base="xsd:string">
        <xsd:pattern value="\d{3}-[A-Z]{2}"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:schema>
  <!-- Modifying the XML instance is not necessary -->
Observations and notes
3.23 Conclusion
Schemas provide a general-purpose method for ensuring a well-defined communication interface between applications
Essentially, applications can delegate this task to parsers
Note that validation is a special case of reporting; in practice, however, technical validation (or reporting) can deal only with certain machine-readable aspects of useful application data (intuitively a marble game of "structures")
Despite its problems and legacy background, XML DTD is a quite useful and simple technology to work with, and sufficient for many practical applications (and from a pipeline point of view, one can simply crop the DTD declaration from the document when happy with the structure)
4 Transformations Introduction Extensible stylesheet language family XPaths XSL Transformations Examples and design notes
4.1 Introduction
The data processing pipeline architecture points out the benefits of mapping information between applications by using common message formats
XML also provides a nice way to implement the adapter pattern in general, which allows delegation, reuse and decoupling of application modules (and allows freezing App 1):
  App 1 → Data 1 → Transform(Rules 1→2) → Data 2 → App 2
Note that to actually achieve this, we come back to SD fundamentals, self-describing data and machine-readability, since it is usually easier to map, e.g., rich content onto some presentation format than vice versa (when e.g. "collapsing" shipTo, billTo → p, information gets lost)
Structured Documents /4 Transformations (ON 2012)
4.2 Extensible Stylesheet Language Family
From extensible stylesheet technology into functional programming technology
W3C XSL Family
o The XML Path Language (XPath) (and thus XQuery (etc.))
o XSL Transformations (XSLT)
o XSL Formatting Objects (XSL-FO)
In this course, we focus on XPath and XSLT, which are very influential technologies in the SD world
In principle, XSLT is powerful enough for any XML → XML mappings (even if implementing some kinds of mappings is extremely clumsy using pure transformations)
4.3 The XML perspective to data, again
Note that just like validation, mapping XML instance data "simply requires some programming"; and in principle, "anything found in the document could be processed"
In practice, it (usually, again) makes sense to transform things from the parse tree point of view (the "XML perspective" to data)
Now XPath defines the parse tree and what logical structure means...
Having standard, abstract specifications, processors, and APIs simplifies the mapping (transformation) task, but in turn requires learning yet another programming language (or several)
4.4 For instance, considering the following source data...
  <?xml version="1.0" encoding="utf-8"?>
  <music>
    <album id="jmi8c5tpbqdxh03ku_kkxn0jvl8-" year="1978" cover="ds-ds.jpg">
      <artist>Dire Straits</artist>
      <name>Dire Straits</name>
      <tracks>
        <track len="04m03s">Down to the Waterline</track>
        <track len="05m27s">Water of Love</track>
        <track len="03m20s">Setting Me Up</track>
        <track len="04m13s">Six Blade Knife</track>
        <track len="03m00s">Southbound Again</track>
        <track len="05m49s">Sultans of Swing</track>
        <track len="06m17s">In the Gallery</track>
        <track len="04m42s">Wild West End</track>
        <track len="05m04s">Lions</track>
      </tracks>
    </album>
    ...
  <!-- This is a toy example; good formats/repositories already exist (check the id...) -->
4.5 ...a simple XSL transformation into HTML(4)
  <?xml version="1.0" encoding="iso-8859-1"?>
  <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="html" indent="yes" encoding="iso-8859-1"
      doctype-public="-//W3C//DTD HTML 4.0 Transitional//EN" />
    <xsl:template match="/music">
      <html lang="en" xml:lang="en">
        <head>
          <title>Album list</title>
        </head>
        <body>
          <h1>Albums</h1>
          <xsl:apply-templates/>
          <hr />
        </body>
      </html>
    </xsl:template>
    <xsl:template match="album">
      <h2> <xsl:value-of select="name"/> </h2>
      <h3>Cover</h3>
      <img alt="">
        <xsl:attribute name="src">
          <xsl:value-of select="@cover"/>
        </xsl:attribute>
      </img>
      <xsl:apply-templates/>
    </xsl:template>
    <xsl:template match="tracks">
      <h3>Tracks</h3>
      <ul>
        <xsl:apply-templates/>
      </ul>
    </xsl:template>
    <xsl:template match="track">
      <li> <xsl:value-of select="."/> (<xsl:value-of select="@len"/>) </li>
    </xsl:template>
    <!-- Ignore unwanted nodes. -->
    <xsl:template match="*"> </xsl:template>
  </xsl:stylesheet>
Important note: HTML plays no special role in XSLT, this is just an example using a familiar output vocabulary
4.6 Preliminary notes (will come back in detail...) Questions: What is it and what does it do? What kinds of tools are needed? Is it safe? Document structure (root, output, templates) Processing model, references to the source document (now match and select attributes and location paths) Three vocabularies present inline: xsl (transformation), music (source), html (output) need XML namespaces Parse-tree level processing, enhanced with XPath programming concepts (nodes, strings, etc.) Essentially about functional programming... Structured Documents /4 Transformations (ON 2012)
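As for tools, a standard JDK already contains an XSLT 1.0 processor behind the javax.xml.transform API. The following sketch is only one possible way to run a transformation: the class name XsltRunSketch is hypothetical, and the stylesheet is a cut-down text-output variant of the track listing rule rather than the full HTML stylesheet of the slides.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Illustrative sketch: running an XSLT 1.0 stylesheet with the JDK's built-in processor. */
final class XsltRunSketch {

    // Root rule selects the tracks; the track rule writes "title (length)" lines.
    static final String XSL =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/music'>"
      + "<xsl:apply-templates select='album/tracks/track'/>"
      + "</xsl:template>"
      + "<xsl:template match='track'>"
      + "<xsl:value-of select='.'/> (<xsl:value-of select='@len'/>)<xsl:text>&#10;</xsl:text>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies XSL to the given source document and returns the serialised result. */
    static String transform(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String music =
            "<music><album year='1978'><name>Dire Straits</name><tracks>"
          + "<track len='05m49s'>Sultans of Swing</track>"
          + "<track len='06m17s'>In the Gallery</track>"
          + "</tracks></album></music>";
        System.out.print(transform(music));
    }
}
```

The match attribute selects which rule fires for a node, while the select attributes are location paths evaluated relative to the current node.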
4.7 A very short introduction to XPath XPath is an expression language for addressing and processing parts of XML documents (local or in the Web) Currently two major versions: XPath 1.0: The definitions needed for XSLT and XPointer XPath 2.0: Superset of XPath 1.0, essentially taking XML Schema (types) and query use cases into account In a nutshell, XPath provides Location path syntax Data types and general ("scripting") expressions Built-in functions From a broader perspective XPath (2) might be considered as a special case of XQuery... Structured Documents /4 Transformations (ON 2012)
4.8 Path expressions
A (location) path is a (string) expression that allows identifying pieces of data from a source; for instance
  /child::doc/child::chap[position()=5]/child::sect[position()=2]
  (: Intuitively means /step1/step2/step3 :)
Paths are absolute (e.g. starting with '/') or relative, and are made of steps consisting of axes, node tests, and predicates
When evaluated, expressions may yield any (typed) values acknowledged by the XPath version:
o 1.0: node-set, boolean, number, string
o 2.0: sequence types (of) complex and simple (XML) schema types (latter including list, union, and atomic types)
4.9 Step syntax
  axis::node[predicate]*
Axes
o Forward: child, descendant, attribute, self, descendant-or-self, following-sibling, following, namespace
o Reverse: parent, ancestor, preceding-sibling, preceding, ancestor-or-self
Node (kind) tests
o Node types: comment, text, processing-instruction, node,... (depends on XPath version)
o When more specific, use qualified or local names
Predicates
o '[' LOGICAL_EXPRESSION ']'
4.10 (Un)abbreviated syntax
In many cases, the unabbreviated syntax looks unnecessarily complicated; here the abbreviated syntax comes to the rescue
  child::para[attribute::type='warning'][position()=5]
  para[@type="warning"][5]
  child::*[self::chapter or self::appendix][position()=last()]
  employee[@secretary and @assistant]
Notes on specific notation: /, *, @*, text(), node(), //, .., ., ...
Some expressions need the full syntax
See explained examples from the specs (!)
Note that indexing starts from 1 (and not 0...), '=' tests equality, '/' is reserved for paths, and error management is quite poor
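That the two notations select the same nodes can be checked with the XPath 1.0 engine of the JDK (javax.xml.xpath). This is a small sketch under stated assumptions: the class name XPathSyntaxSketch and the three-paragraph sample document are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Illustrative sketch: abbreviated and unabbreviated XPath 1.0 syntax side by side. */
final class XPathSyntaxSketch {

    // Hypothetical sample data: one note followed by two warnings.
    static final String XML =
        "<doc>"
      + "<para type='note'>n1</para>"
      + "<para type='warning'>w1</para>"
      + "<para type='warning'>w2</para>"
      + "</doc>";

    /** Evaluates the expression with the doc element as context node; returns the string value. */
    static String eval(String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc.getDocumentElement());
    }

    public static void main(String[] args) throws Exception {
        // Both lines select the second warning paragraph, "w2".
        System.out.println(eval("child::para[attribute::type='warning'][position()=2]"));
        System.out.println(eval("para[@type=\"warning\"][2]"));
        // Indexing starts from 1: the first para is the note, "n1".
        System.out.println(eval("para[1]"));
    }
}
```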
4.11 Chickens and eggs
Ok; XSLT is built onto XPath; on the other hand, XPath starts making sense when actually used (e.g. in XSLT)
So, let us again proceed with learning XSLT, assuming we know a bit of both
Disclaimer: Quite obviously, we shall not cover all aspects of XSLT, but should get the big picture (see the specs for details, they are quite readable)
4.12 More general XPath expressions
Variables: $v
Operators: +, -, div, mod; or, and, =, !=, <, <=, >, >=; ...
Functions: concat($v,"foo"), ...
See built-in functions etc. (very important for developers)
o XPath 1.0 (node-set, string, boolean, number)
o XQuery 1.0 and XPath 2.0 Functions and Operators
o XSLT built-ins (cf. document(...), key(...))
o ...
Note that XML serialisation is often needed (e.g. in the XSLT context), so prepare to write "$v<2" as "$v&lt;2"...
4.13 XSLT basic principles revisited...
XSL transformation is a mapping between (classes of) document parse trees using specific output methods
o XSLT1: T1: xml* → xml | html | text (multiple xml inputs, single xml, html, or text output)
o XSLT2: T2: (xml | text)* → (xml | html | text)* (multiple xml and text inputs, multiple outputs)
XSLT specifies a certain kind of (programmable) data processor; there are usually several ways to design transformations
Execution may fail for several reasons (syntax error, undefined operation, infinite loop, out of memory,...)
Used as explicit transformations or embedded stylesheets
Malicious XSLT2 transformations may pose security risks
4.14 Template rules
  ...
  <xsl:template match="procedure">
    <block>
      <xsl:value-of select="ancestor-or-self::*[@security][1]/@security"/>
    </block>
    <xsl:apply-templates/>
  </xsl:template>
  ...
Concepts: (template) rule, template, literal, match pattern, XSLT "command" (elements), rule application, default rules
Transformation is a process of creating and extending transformation result tree(s), by finding and instantiating the best matching template rule w.r.t. the source tree, and updating the currently active evaluation context (incl. mode)
4.15 Top-level (declaration) elements XSLT1 xsl:import xsl:include xsl:strip-space xsl:preserve-space xsl:output xsl:key xsl:decimal-format xsl:namespace-alias xsl:attribute-set xsl:variable xsl:param xsl:template XSLT2 incl. also xsl:character-map xsl:function xsl:import-schema Structured Documents /4 Transformations (ON 2012)
4.16 I/O basics Accessing input Primary source document Secondary source documents (e.g. document("foo.xml")) Parameters Generating output Default result tree (XSLT1), to be serialised In addition, the explicit result documents (XSLT2); xsl:result-document
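Passing a parameter into a transformation from the calling application might look as follows with the JDK's javax.xml.transform API. This is a hedged sketch: the class name XsltParamSketch, the greeting parameter, and the person/name source vocabulary are all hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Illustrative sketch: a top-level xsl:param set from the calling application. */
final class XsltParamSketch {

    // The parameter has a default value, which the caller may override.
    static final String XSL =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
      + "<xsl:output method='text'/>"
      + "<xsl:param name='greeting' select=\"'Hello'\"/>"
      + "<xsl:template match='/'>"
      + "<xsl:value-of select='concat($greeting, \", \", /person/name)'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Transforms the primary source; a null greeting keeps the stylesheet default. */
    static String greet(String xml, String greeting) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSL)));
        if (greeting != null) {
            transformer.setParameter("greeting", greeting); // overrides the default
        }
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String person = "<person><name>Ada</name></person>";
        System.out.println(greet(person, null));  // Hello, Ada
        System.out.println(greet(person, "Moi")); // Moi, Ada
    }
}
```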
4.17 Example: xsl:output in XSLT1
  <!-- Category: top-level-element -->
  <xsl:output
    method = "xml" | "html" | "text" | qname-but-not-ncname
    version = nmtoken
    encoding = string
    omit-xml-declaration = "yes" | "no"
    standalone = "yes" | "no"
    doctype-public = string
    doctype-system = string
    cdata-section-elements = qnames
    indent = "yes" | "no"
    media-type = string />
4.18 Example: Multiple output documents in XSLT2
  <!-- Takes an XHTML document as input, and breaks it up so that the text
       following each <h1> element is included in a separate document.
       A new document toc.html is constructed to act as an index: -->
  <xsl:stylesheet xmlns:xhtml="http://www.w3.org/1999/xhtml"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output name="toc-format" method="xhtml" indent="yes"
      doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
      doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"/>
    <xsl:output name="section-format" method="xhtml" indent="no"
      doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
      doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN"/>
    <xsl:template match="/">
      <xsl:result-document href="toc.html" format="toc-format" validation="strict">
        <html xmlns="http://www.w3.org/1999/xhtml">
          <head><title>Table of Contents</title></head>
          <body>
            <h1>Table of Contents</h1>
            <xsl:for-each select="/*/xhtml:body/(*[1] | xhtml:h1)">
              <p><a href="section{position()}.html"><xsl:value-of select="."/></a></p>
            </xsl:for-each>
          </body>
        </html>
      </xsl:result-document>
      <xsl:for-each-group select="/*/xhtml:body/*" group-starting-with="xhtml:h1">
        <xsl:result-document href="section{position()}.html"
            format="section-format" validation="strip">
          <html xmlns="http://www.w3.org/1999/xhtml">
            <head><title><xsl:value-of select="."/></title></head>
            <body>
              <xsl:copy-of select="current-group()"/>
            </body>
          </html>
        </xsl:result-document>
      </xsl:for-each-group>
    </xsl:template>
  </xsl:stylesheet>
4.19 Built-in template rules (XSLT1)
  <xsl:template match="*|/">
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:template match="*|/" mode="m">
    <xsl:apply-templates mode="m"/>
  </xsl:template>
  <xsl:template match="text()|@*">
    <xsl:value-of select="."/>
  </xsl:template>
  <xsl:template match="processing-instruction()|comment()"/>
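The effect of the built-in rules is easiest to see with a stylesheet that declares no template rules at all. The sketch below assumes the JDK's built-in XSLT 1.0 processor; the class name BuiltInRulesSketch and the sample album fragment are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Illustrative sketch: with no explicit rules, only the built-in template rules apply. */
final class BuiltInRulesSketch {

    // A stylesheet without any xsl:template elements.
    static final String EMPTY_XSL =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
      + "<xsl:output method='text'/>"
      + "</xsl:stylesheet>";

    /** Runs the rule-free stylesheet over the given source document. */
    static String transform(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(EMPTY_XSL)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Text nodes are copied; the comment is dropped; the year attribute is never
        // visited, because xsl:apply-templates selects child nodes by default.
        System.out.println(transform(
            "<album year='1978'><name>Dire Straits</name><!-- c --><track>Lions</track></album>"));
    }
}
```

This explains why forgetting a rule typically shows up as stray text in the output, and why the earlier example ends with an empty rule matching "*".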
4.20 Creating specific node types "Scripting" using attribute value templates (also works with commands; {...} is used more extensively in XQuery...) <xsl:template match="photograph"> <img src="{$image-dir}/{href}" width="{size/@width}"/> </xsl:template> Specific nodes can also be created using special node constructor commands xsl:element, xsl:attribute (xsl:attribute-set), xsl:text, xsl:processing-instruction, xsl:comment,... Structured Documents /4 Transformations (ON 2012)
4.21 Basic control and command structures (XSLT1) Rule control xsl:apply-templates, xsl:call-template, (xsl:with-param) Repetition xsl:for-each (+rule control) Conditional processing xsl:if, xsl:choose, xsl:when, xsl:otherwise Text xsl:value-of Copying xsl:copy, xsl:copy-of Numbering xsl:number Structured Documents /4 Transformations (ON 2012)
Sorting xsl:sort Variables xsl:variable, xsl:param Modular stylesheets, and imports xsl:include, xsl:import Messages xsl:message Extensions (when possible, e.g. via the XML SAX API) Before using, test availability with function-available() Important XSLT 2 additions: Declaring new functions, (general) grouping, regular expressions, reading unparsed text files,... Structured Documents /4 Transformations (ON 2012)
4.22 Notes about variables
Central method of passing values to functions, templates, and entire transformations; also useful for simplifying expressions
Variables (e.g. $v) are bound to values and cannot be "overwritten" (cf. functional programming); in practice, this sometimes requires carefully planning the intended variable scope (visibility)
Variables can also include (be bound to) document fragments (node-set); this allows, e.g., the definition of structured variables
4.23 Simple examples to start with...
o One vs. many rules vs. loops
o References, ordering, number formats, etc.
o Secondary (external) source documents (local/global refs)
o Parameters and functions (XSLT1, XSLT2)
o Modes (going through the sources several times)
o Working with namespaces (literal default, qualified, excluding result prefixes)
o Recursion
o Processor applications (e.g. plotting graphs with SVG)
o Technology/application integration (OpenOffice, X3D, VRML,...), GRDDL, etc.
4.24 Design notes and hard problems
In practice, developers need to learn XSLT and other data processing programming technologies (to see this, we'll cover DOM programming later), to find a proper balance when implementing processing tasks (don't try doing everything e.g. with XSLT; construct hybrid data processing pipelines instead)
Designing modular (linked) applications works when structures and names can be identified in a reasonably persistent way
Machine-readability is always an issue and well-formedness is merely a starting point: any significant changes on the schema level may break interoperability (this is the hard problem of SD in general and hard to avoid)
Tricky topics in global apps or with several stakeholders (!)
4.25 XSL formatting objects
XSL-FO: an XML-based formatting language (cf. PostScript, TeX, etc.)
Intuitively, provides a formatting vocabulary & model generalised from CSS
Often an intermediate step when generating e.g. PDF documents for paper (e.g. Apache tools)
Technically an XSLT use case, but in practice yet another formatting language; in many cases formatting with XHTML etc. makes more sense (common viewers & tools available)
4.26 A simple XSL-FO example
  <?xml version="1.0" encoding="iso-8859-1"?>
  <fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
    <fo:layout-master-set>
      <fo:simple-page-master master-name="a4">
        <fo:region-body />
      </fo:simple-page-master>
    </fo:layout-master-set>
    <fo:page-sequence master-reference="a4">
      <fo:flow flow-name="xsl-region-body">
        <fo:block>Hello world!</fo:block>
      </fo:flow>
    </fo:page-sequence>
  </fo:root>
(Figure: rendered output, the text "Hello world!" in an A4 canvas)
4.27 Being productive: Pipeline processing revisited
In general, data processing pipelines include the following kinds of steps or components:
o Generate (read)
o Transform (process)
o Serialise (write)
While XSLT can implement all of these, XSLT fundamentally considers things on a rather low level ("coding from scratch")
In practice, component pipeline frameworks (cf. Apache Cocoon, and to a certain extent XProc and Ant) provide common components and tasks for the job, and introduce many useful management concepts (e.g. sitemap, events, exceptions[!])
Incl. generators and serialisers for common mime types...
4.28 Conclusive notes Transformations provide a very important method for linking, integrating, generating, etc. applications In principle, any XML-compliant program could include following kinds of options in its Import and Export menus: Transform source (output) using.xsl In practice, however, seldom present in the GUI level Transformations can be perceived as "stylesheets", but most concrete applications are related to data processing (XSLT tools, incl. Apache Xalan and Saxon) and pipeline applications (cf. XProc, Apache Ant, Apache Cocoon, etc.) Structured Documents /4 Transformations (ON 2012)
4.29 Conclusive notes (cont'd)
XSLT is powerful enough for implementing "genuine" applications (and when properly encapsulated, users don't know/care XSLT was used); examples:
o Schema and reporting processors (e.g. ISO Schematron)
o System integration (e.g. mapping XML message interfaces, GRDDL,...)
o Document management, publishing, and single-sourcing applications (e.g. DITA), plot applications,...
While XSLT by definition specifies data processors, it can also be used for "generating" reactive applications (e.g. for an appropriate runtime environment or build application)
At some point, however, additional programming is needed...
5 Simple Application Programming Introduction Event based programming with SAX Object model based programming with DOM Scripting, examples and notes
5.1 Introduction In most cases, the basic strategy of applying SD/XML, is to seek, adapt (configure, script,...), and integrate existing XML data and components, perhaps implemented using several technologies (cf. component-based development) However, when suitable applications or tools cannot be found, one can implement the missing bits by him/herself While implementing complex apps can be quite hard, working with plain XML data and basic components is relatively simple, and typically relies on standard (XML) parsers, processors and application programming interfaces (API) (recall the parser, data processor, and pipeline patterns) Use an integrated development environment (such as Eclipse) Structured Documents /5 Simple Application Programming (ON 2012)
5.2 Common XML application development use cases Reading XML (e.g. configuration file or some specific productivity tool format) Implementing some specific (event) handler (e.g. for extending XSLT processor behaviour) Implementing some (missing) data processor for certain data processing pipeline (e.g. Ant task or Cocoon component) Utilising some specific XML component or property (e.g. for messaging, validating, transforming, signing, or viewing data) Writing (well-formed, properly character encoded) XML (!) Today almost any development system supports XML in one way or another, and provides built-in libraries for it Structured Documents /5 Simple Application Programming (ON 2012)
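Writing well-formed XML is safest when the escaping is left to a library instead of concatenating strings by hand. A minimal sketch with the StAX writer of the JDK (javax.xml.stream), assuming a hypothetical note element and the class name XmlWritingSketch:

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

/** Illustrative sketch: letting the XML writer take care of escaping markup characters. */
final class XmlWritingSketch {

    /** Serialises one note element with a title attribute and text content. */
    static String write(String title, String text) throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartElement("note");
        writer.writeAttribute("title", title); // attribute value is escaped by the writer
        writer.writeCharacters(text);          // so is character data
        writer.writeEndElement();
        writer.flush();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The ampersand and the less-than sign come out as &amp; and &lt;.
        System.out.println(write("A & B", "1 < 2"));
    }
}
```

When writing to a byte stream instead of a string, the character encoding should be given explicitly when the writer is created, so that the declared and the actual encoding agree.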
5.3 Java crash course
  /**
   * A simple hello world example to discuss basic Java concepts.
   * @author Ossi
   */
  package fi.tut.hlab.rd.hellow;

  /**
   * Sample HelloWorld class
   */
  public class HelloWorld {
      String msg;

      public HelloWorld(String s) {
          msg = s;
      }

      void sayHello(int n) {
          for (int i = 0; i < n; i++)
              System.out.println(msg);
          int val = 0;
          boolean error = false;
          try {
              // Demonstrate exception (error) handling:
              // integer division, fails with ArithmeticException when n is 0
              val = 1 / n;
          } catch (Exception e) {
              System.out.println("Problem! " + e.toString());
              e.printStackTrace();
              error = true;
          }
          if (!error)
              System.out.println("Psst. Dividing 1 by " + n + " gives " + val);
          System.out.println("I'm done.");
      }

      public static void main(String[] args) {
          HelloWorld app = new HelloWorld("Hello world!");
          app.sayHello(2);
      }
  }
5.4 Notes about design patterns
Understanding and planning complex systems can be supported with the help of design patterns
o Delegation, interface, adapter, factory, filter, tree-walker, event listener, iterator,...
An XML processor could be thought of as a design pattern (or, when used in a certain nice way, a best practice)
In practice, patterns appear in (XML) development guidelines, naming of APIs, and programming techniques; examples:
o Using a factory to request a namespace-aware parser
o Declaring an event listener, and delegating the task to the associated event handler
o Iterating over a query result set (in some order)
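The first pattern example, using a factory to request a namespace-aware parser, might look like this in JAXP. It is a sketch only; the class name FactorySketch and the helper rootNamespace are hypothetical names introduced for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class FactorySketch {
    /** Parses xml and returns the namespace URI of the root element, as seen by the requested parser. */
    static String rootNamespace(String xml, boolean namespaceAware) {
        try {
            // The factory hides which parser implementation is actually used
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Element root = builder.parse(new InputSource(new StringReader(xml))).getDocumentElement();
            return root.getNamespaceURI();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String svg = "<svg xmlns='http://www.w3.org/2000/svg'/>";
        System.out.println(rootNamespace(svg, true));  // namespace-aware parser
        System.out.println(rootNamespace(svg, false)); // parser that ignores namespaces
    }
}
```

The application never names a concrete parser class; it only states the properties it needs, which is the essence of the factory pattern here.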
5.5 "Levels" of SD programming
Conceptual specifications (e.g. model-based or requirement-based design; perhaps to be implemented by someone else...)
Declarative specifications ("high-level programming")
o E.g. XProc, compound Web services, and certain specific declarative languages such as dialogue-based systems and declarative animations (cf. SVG, again)
Functional specifications
o E.g. XSLT, XQuery
Procedural specifications ("low-level programming")
o E.g. Java programming or scripting with SAX or DOM
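To make the difference between the levels concrete, the sketch below answers the same question twice: once with a single XPath expression (close to the functional level) and once with explicit DOM iteration (procedural level). The class LevelsSketch, its method names and the sample music data are assumptions made for illustration; the structure of the real music.xml is not shown in these slides.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LevelsSketch {
    // Hypothetical sample data
    static final String MUSIC =
        "<music><album id='ds'><name>Demo</name><track len='3m20s'/><track len='4m05s'/></album>"
        + "<album id='x'><track len='2m00s'/></album></music>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Expression-based: the whole query is one XPath expression that yields a number. */
    static int countWithXPath(Document doc, String expression) {
        try {
            Double n = (Double) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NUMBER);
            return n.intValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Procedural: explicit loops and tests over the DOM tree. */
    static int countWithDom(Document doc, String albumId) {
        int count = 0;
        NodeList albums = doc.getElementsByTagName("album");
        for (int i = 0; i < albums.getLength(); i++) {
            Element album = (Element) albums.item(i);
            if (albumId.equals(album.getAttribute("id"))) {
                count += album.getElementsByTagName("track").getLength();
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Document doc = parse(MUSIC);
        System.out.println(countWithXPath(doc, "count(//album[@id='ds']/track)"));
        System.out.println(countWithDom(doc, "ds"));
    }
}
```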
5.6 Disclaimer
To honour the course prerequisites, the following presentation only explains the very rudimentary programming techniques for XML application processing; before considering developing a significant application, please consult appropriate development or (software) engineering books/tutorials/etc.
In real life, coding is easy, but systematic development of large applications is difficult (capturing and managing requirements, technology management, iterating design, testing, licensing, deploying applications, versioning, etc.)
5.7 SAX: Simple API for XML
Follows the typical parser pattern: XML data -> parser -> SAX API -> application
"Select a suitable parser component, declare appropriate event handlers & logic, and include them in your application"
SAX means the Simple API for XML; it was originally developed for Java (see saxproject), and other implementations exist as well
Developers perceive SAX as an event-based parser interface, bundled with some parser distribution (often a built-in library); the parser generates events while parsing the source XML
Different versions exist (2.x onwards supports namespaces)
5.8 A simple SAX example (Java JAXP); see saxproject
import java.io.FileReader;
import org.xml.sax.XMLReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.XMLReaderFactory;
import org.xml.sax.helpers.DefaultHandler;

public class MySAXApp extends DefaultHandler {

    public static void main (String args[]) throws Exception {
        XMLReader xr = XMLReaderFactory.createXMLReader();
        MySAXApp handler = new MySAXApp();
        xr.setContentHandler(handler);
        xr.setErrorHandler(handler);
        // Parse each file provided on the command line.
        for (int i = 0; i < args.length; i++) {
            FileReader r = new FileReader(args[i]);
            xr.parse(new InputSource(r));
        }
    }

    public MySAXApp () {
        super();
    }

    public void startDocument () {
        System.out.println("Start document");
    }

    public void endDocument () {
        System.out.println("End document");
    }

    public void startElement (String uri, String name, String qName, Attributes atts) {
        if ("".equals (uri))
            System.out.println("Start element: " + qName);
        else
            System.out.println("Start element: {" + uri + "}" + name);
    }

    public void endElement (String uri, String name, String qName) {
        if ("".equals (uri))
            System.out.println("End element: " + qName);
        else
            System.out.println("End element: {" + uri + "}" + name);
    }

    public void characters (char ch[], int start, int length) {
        System.out.print("Characters: \"");
        for (int i = start; i < start + length; i++) {
            switch (ch[i]) {
            case '\\': System.out.print("\\\\"); break;
            case '"': System.out.print("\\\""); break;
            case '\n': System.out.print("\\n"); break;
            case '\r': System.out.print("\\r"); break;
            case '\t': System.out.print("\\t"); break;
            default: System.out.print(ch[i]); break;
            }
        }
        System.out.print("\"\n");
    }
}
5.9 Notes
Simple, very fast, and with a small memory footprint, but...
o "Looking forward" means processing the source several times
o Bookkeeping data and maintaining any data structure requires case-specific approaches (need a tree model, perhaps...)
o Used (only) for reading data
o Navigating to the element(s) of interest is clumsy
Today, event-based programming typically takes place in the context of some processor or component-based framework (e.g. adding a custom handler to an XSLT processor)
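The bookkeeping remark can be illustrated with a small handler that has to remember, in its own fields, where in the document the parser currently is. This is a sketch under assumptions: the class SaxBookkeepingSketch, the helper trackLengths and the album/track vocabulary are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxBookkeepingSketch extends DefaultHandler {
    private final String wantedId;
    private boolean inWantedAlbum = false;        // case-specific state, kept by hand
    final List<String> lengths = new ArrayList<>();

    SaxBookkeepingSketch(String wantedId) {
        this.wantedId = wantedId;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("album") && wantedId.equals(atts.getValue("id"))) {
            inWantedAlbum = true;
        } else if (qName.equals("track") && inWantedAlbum) {
            lengths.add(atts.getValue("len"));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("album")) {
            inWantedAlbum = false;
        }
    }

    /** Collects the len attributes of the tracks inside the album with the given id. */
    static List<String> trackLengths(String xml, String albumId) {
        SaxBookkeepingSketch handler = new SaxBookkeepingSketch(albumId);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.lengths;
    }

    public static void main(String[] args) {
        String xml = "<music><album id='ds'><track len='3m20s'/><track len='4m05s'/></album>"
                + "<album id='x'><track len='2m00s'/></album></music>";
        System.out.println(trackLengths(xml, "ds"));
    }
}
```

The flag inWantedAlbum is exactly the kind of hand-made context that a tree model would provide for free.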
5.10 Yet another simple SAX example in Perl...
use XML::Parser;

$parser = new XML::Parser( Handlers => {
    Start => \&element_start,
    End => \&element_end,
    Char => \&characters});
$in_right_album = 0; @lens = (); $in_right_name = 0; $name = "";

sub element_start {
    my ($xp, $element, %attr) = @_;
    if ($element eq "album" && $attr{id} eq "ds") {
        $in_right_album = 1;
    }
    if ($element eq "track" && $in_right_album==1) {
        $v = $attr{len};
        $v =~ s/m/\./;
        $v =~ s/s//;
        push(@lens, $v+0);
    }
    if ($element eq "name" && $in_right_album==1) {
        $in_right_name = 1;
    }
}

sub element_end {
    my ($xp, $element) = @_;
    if ($element eq "album" && $in_right_album==1) {
        $in_right_album = 0;
    }
    if ($element eq "name" && $in_right_name==1) {
        $in_right_name = 0;
    }
}

sub characters {
    my ($xp, $text) = @_;
    if ($in_right_name==1) {
        $name .= $text;
    }
}

$parser->parsefile('music.xml');
print "$name \n";
foreach my $k (@lens) {
    print "$k ";
}
5.11 Document Object Model (DOM)
Follows the typical parser pattern: XML data -> parser -> DOM API -> application
"Select a suitable parser component, request a document object, and query/iterate/modify/serialise the object as you wish"
The W3C DOM (levels 1, 2, 3,...) provides abstract specifications for manipulating XML data as programmatic objects (see the old DOM Activity page)
The work originally started from the (SGML-based) HTML DOM, the abstract XML DOM was then standardised, and since then work has continued w.r.t. individual technologies (e.g. SVG, HTML5,...)
Intuitively close to (procedural) processing à la XPath, but the definition of the parse tree is again slightly different...
5.12 W3C DOM overview Properties standardised in several (interface-level) specifications (level 2 adds support for namespaces) Document Object Model Level 1 Document Object Model Level 2 Core Document Object Model Level 2 Views Document Object Model Level 2 Events Document Object Model Level 2 Style Document Object Model Level 2 Traversal and Range Document Object Model Level 2 HTML Document Object Model Level 3 Core Document Object Model Level 3 Load and Save Document Object Model Level 3 Validation Structured Documents /5 Simple Application Programming (ON 2012)
5.13 W3C DOM overview Other DOM specifications include (and development continues): Document Object Model Level 1 (Second Edition) Document Object Model Level 3 XPath Document Object Model Requirements Document Object Model Level 3 Views and Formatting Document Object Model Level 3 Events Document Object Model Level 3 Abstract Schemas Developers access DOM functionality with particular parser (framework) API that (partially) supports DOM Level X+ Y+ (at least DOM Level 2 Core is typically needed), e.g., with Java, one can use JAXP (part of J2SE) or Apache components Structured Documents /5 Simple Application Programming (ON 2012)
5.14 A simple DOM example: Read XML (Java JAXP)
/*
 * A simple JAXP/DOM example that (partially) pretty-prints the DOM tree.
 * ON 2012
 */
package fi.tut.hlab.rd;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.File;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class PrintDOM {
    Document document;
    // Indexed by the DOM node type constants (1 = ELEMENT_NODE, ..., 12 = NOTATION_NODE)
    static String nodeTypes[] = {"N/A","Element","Attribute","Text","CData","EntRef","EntNode","PI","Comment","Document","DTD","Fragment","Notation"};

    void indent(int dp, String c) {
        for (int ws=0; ws<dp; ws++) System.out.print(c);
    }

    void printParseTree(Node e, int dp) {
        NodeList nl = e.getChildNodes();
        for (int i=0; i<nl.getLength(); i++) {
            Node cn = nl.item(i);
            String nv = cn.getNodeValue();
            indent(dp,"==");
            System.out.print(nodeTypes[cn.getNodeType()]);
            System.out.print("("+cn.getNodeName()+") ");
            if (cn.getNodeType()==1) { // If an element, might have attributes
                NamedNodeMap al = ((Element)cn).getAttributes();
                for (int j=0; j<al.getLength(); j++) {
                    System.out.print("@"+al.item(j).getNodeName()+" ");
                }
            }
            System.out.print("\n");
            if (cn.getNodeType()!=10) { // Do not parse the DTD
                if (cn.hasChildNodes()) printParseTree(cn, dp+1);
                else if (nv != null) { // Empty elements have no node value
                    indent(dp," ");
                    System.out.print("[" + nv.substring( 0, Math.min(20,nv.length()) ) );
                    if (nv.length()>20) System.out.print("...");
                    System.out.print("]\n");
                }
            }
        }
    }

    public PrintDOM(String file) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        //factory.setValidating(true);
        //factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        document = builder.parse(new File(file));
        printParseTree((Node)document,0);
    }

    public static void main(String[] argv) throws Exception {
        if (argv.length != 1) {
            System.err.println("Usage: java PrintDOM filename");
            System.exit(1);
        }
        new PrintDOM(argv[0]);
    }
}
5.15 Example input/output (stdout)
Input:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE root [ <!ENTITY data "Some data"> ]>
<?P1 aa?>
<root lang="en" type="example"> Hello World!
  <?P2 aa?>
  <body foo="bar">
    <p>this is</p>
    <p>a simple test.</p>
    <!-- See the following entities? -->
    <p> &lt; &data; </p>
  </body>
</root>
Output:
DTD(root)
PI(P1) [aa ]
Element(root) @lang @type
==Text(#text) [ Hello World! ]
==PI(P2) [aa ]
==Text(#text) [ ]
==Element(body) @foo
====Text(#text) [ ]...
5.16 Notes
XML document interpreted as the (Java) Document object
Simple, usually sufficiently fast, but consumes (more) memory (because by default the document is an in-memory object; however, different DOM implementations might physically behave differently)
Thanks to Java, a decent exception mechanism is now also available (!), I/O considerations (?)
Note that (when implementing genuine apps) explicit XML application programming typically takes place (only) in some of the related software components (often I/O)
5.17 About the DOM2 Core
Basis for the basic APIs for XML programming
Implemented in Java JAXP & Xerces, Python minidom,...
"By default, everything is a Node; when convenient, perceive a node via the appropriate, more specific interface by type."
Node types: ELEMENT_NODE, ATTRIBUTE_NODE, TEXT_NODE, CDATA_SECTION_NODE, ENTITY_REFERENCE_NODE, ENTITY_NODE, PROCESSING_INSTRUCTION_NODE, COMMENT_NODE, DOCUMENT_NODE, DOCUMENT_TYPE_NODE, DOCUMENT_FRAGMENT_NODE, NOTATION_NODE
5.18 DOM2 Core Interfaces Interfaces DOMException, ExceptionCode, DOMImplementation, DocumentFragment, Document, Node, NodeList, NamedNodeMap, CharacterData, Attr, Element, Text, Comment, CDATASection, DocumentType, Notation, Entity, EntityReference, ProcessingInstruction "Use Node to navigate, Document to create Elements, Element to access Attributes, etc." Structured Documents /5 Simple Application Programming (ON 2012)
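The division of labour between the interfaces ("Use Node to navigate, Document to create Elements, Element to access Attributes") can be sketched as follows. The class DomInterfacesSketch, the method summarise and the element names are hypothetical and only serve the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomInterfacesSketch {
    /** Adds a <note lang="en"/> child to the root and lists the root's child elements with their lang values. */
    static String summarise(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        Element root = doc.getDocumentElement();

        Element note = doc.createElement("note"); // Document creates elements
        note.setAttribute("lang", "en");          // Element accesses attributes
        root.appendChild(note);

        StringBuilder sb = new StringBuilder();
        // Node navigates: every child is a Node, whatever its concrete type
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                Element e = (Element) n; // switch to the more specific interface
                sb.append(e.getTagName()).append('[').append(e.getAttribute("lang")).append("] ");
            }
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(summarise("<root><p lang='fi'>a</p>text<p>b</p></root>"));
    }
}
```

Text nodes are visited by the loop as well but skipped by the node type test; a missing attribute is reported as an empty string.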
5.19 Example: Node IDL Definition (see also the others)
interface Node {
  // NodeType
  const unsigned short ELEMENT_NODE = 1;
  const unsigned short ATTRIBUTE_NODE = 2;
  const unsigned short TEXT_NODE = 3;
  const unsigned short CDATA_SECTION_NODE = 4;
  const unsigned short ENTITY_REFERENCE_NODE = 5;
  const unsigned short ENTITY_NODE = 6;
  const unsigned short PROCESSING_INSTRUCTION_NODE = 7;
  const unsigned short COMMENT_NODE = 8;
  const unsigned short DOCUMENT_NODE = 9;
  const unsigned short DOCUMENT_TYPE_NODE = 10;
  const unsigned short DOCUMENT_FRAGMENT_NODE = 11;
  const unsigned short NOTATION_NODE = 12;

  readonly attribute DOMString nodeName;
           attribute DOMString nodeValue;
                                  // raises(DOMException) on setting
                                  // raises(DOMException) on retrieval
  readonly attribute unsigned short nodeType;
  readonly attribute Node parentNode;
  readonly attribute NodeList childNodes;
  readonly attribute Node firstChild;
  readonly attribute Node lastChild;
  readonly attribute Node previousSibling;
  readonly attribute Node nextSibling;
  readonly attribute NamedNodeMap attributes;
  // Modified in DOM Level 2:
  readonly attribute Document ownerDocument;
  Node insertBefore(in Node newChild, in Node refChild) raises(DOMException);
  Node replaceChild(in Node newChild, in Node oldChild) raises(DOMException);
  Node removeChild(in Node oldChild) raises(DOMException);
  Node appendChild(in Node newChild) raises(DOMException);
  boolean hasChildNodes();
  Node cloneNode(in boolean deep);
  // Modified in DOM Level 2:
  void normalize();
  // Introduced in DOM Level 2:
  boolean isSupported(in DOMString feature, in DOMString version);
  // Introduced in DOM Level 2:
  readonly attribute DOMString namespaceURI;
  // Introduced in DOM Level 2:
           attribute DOMString prefix;
                                  // raises(DOMException) on setting
  // Introduced in DOM Level 2:
  readonly attribute DOMString localName;
  // Introduced in DOM Level 2:
  boolean hasAttributes();
};
5.20 Example: Create & write XML doc (Java JAXP)
/* A simple JAXP/DOM example that creates and writes the DOM tree. ON 2012 */
package fi.tut.hlab.rd;

import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class WriteDOM {

    public WriteDOM(String file) throws Exception {
        Document document;
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        document = builder.newDocument();

        Element e, e2, root = document.createElement("data");
        document.appendChild(root);
        e = document.createElement("head");
        root.appendChild(e);
        e2 = document.createElement("title");
        e.appendChild(e2);
        e2.appendChild( document.createTextNode("Hello ") );
        e2.appendChild( document.createTextNode("world!") );
        e = document.createElement("body");
        root.appendChild(e);
        document.getDocumentElement().normalize(); // Merge text nodes...

        // Note that since not specified below, the output method is inferred from root.
        // Tip: Try changing root name from data to html...
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();
        // For a non-identity transformation, use a specific transformation instead:
        // Transformer transformer = transformerFactory.newTransformer(xsltSource);
        DOMSource source = new DOMSource(document);
        StreamResult result = new StreamResult(new File(file));
        transformer.transform(source, result);
    }

    public static void main(String[] argv) throws Exception {
        if (argv.length != 1) {
            System.err.println("Usage: java WriteDOM filename (Careful: Rewrites filename!)");
            System.exit(1);
        }
        new WriteDOM(argv[0]);
    }
}
5.21 Notes
Details of serialisation mechanisms vary among DOM implementations
Note that when fast operation performance (log(n) etc.) is not needed, the DOM tree data model may also be useful "as such", without the idea of serialising it as XML at all (cf. java.util.TreeMap)
o Very useful in bookkeeping e.g. program properties
Again, the central role of XSLT is clearly visible...
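Because serialisation details vary, it can help to state the wanted output explicitly instead of relying on defaults. The sketch below serialises a small DOM tree to a string with the JAXP identity transformer and sets two standard output properties; the class SerialiseSketch and the method helloAsXml are hypothetical names, and the exact whitespace of the result may still depend on the transformer implementation in use.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class SerialiseSketch {
    /** Builds <data><title>Hello world!</title></data> in memory and serialises it to a string. */
    static String helloAsXml() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("data");
            doc.appendChild(root);
            Element title = doc.createElement("title");
            root.appendChild(title);
            title.appendChild(doc.createTextNode("Hello "));
            title.appendChild(doc.createTextNode("world!"));
            root.normalize(); // merges the two adjacent text nodes

            Transformer t = TransformerFactory.newInstance().newTransformer(); // identity transformation
            t.setOutputProperty(OutputKeys.METHOD, "xml");                // do not infer from the root name
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(helloAsXml());
    }
}
```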
5.22 Yet another simple DOM example in Python...
# Count tracks and add @total
import xml.dom.minidom

doc = xml.dom.minidom.parse("music.xml")
nodes = doc.getElementsByTagName("album")
# Some serious assumptions about track format (what?)
for album in nodes:
    total = 0
    for node in album.getElementsByTagName("track"):
        tlen = node.getAttribute("len")
        if len(tlen) < 1:
            continue
        # Count in seconds (assuming float input...)
        total = total + float(tlen[:1])*60 + float(tlen[2:4])
    # Whole minutes and remaining seconds
    minutes, seconds = divmod(int(round(total)), 60)
    album.setAttribute("total", "%dM%dS" % (minutes, seconds))  # Modify in-memory object
print(doc.toxml())
doc.unlink()
5.23 Scripting Intuitively "scripting" means lightweight-programming often with implicit variable and type declarations, etc. but can take place w.r.t. several programming languages Typically "small computing problems" may be solved with scripts (issues: scalability, debugging, developer tools,...) Procedural script programming often takes place in some JavaScript (ECMAScript) variant Scripting is not necessarily slow (bytecode compilation etc.), but is more sensitive to the execution environment (consider, e.g., trying to control how users update or choose browsers...) Structured Documents /5 Simple Application Programming (ON 2012)
5.24 Scripting use cases in XML app development (Dynamic) processor extension "scripts" (e.g. extending XSLT processor using Java or Javascript, perhaps embedded in the XML source) Pipeline scripts (e.g. as a pipeline [component] implementation tech) Document initialisation scripts (e.g. SVG onload event manager in Batik rasteriser) Dynamic run-time app scripts (e.g. interactive SVG application) As a consequence, scripts may be processor-like or reactive The latter behaviour is typically based on events Structured Documents /5 Simple Application Programming (ON 2012)
5.25 Events Basic concepts: event type, (registering) event listener, (implementing) event handler Typical event categories in SD (cf. DOM 2 Events) User interface (UI) events UI logical events Mutation events In most cases, application programming focuses on UI events Concurrency may be an issue Again, DOM defines events in IDL (in practice, several specifications are needed), some event types app-specific Structured Documents /5 Simple Application Programming (ON 2012)
5.26 DOM 2 Event IDL (see also more specific events)
// Introduced in DOM Level 2:
interface Event {
  // PhaseType
  const unsigned short CAPTURING_PHASE = 1;
  const unsigned short AT_TARGET = 2;
  const unsigned short BUBBLING_PHASE = 3;

  readonly attribute DOMString type;
  readonly attribute EventTarget target;
  readonly attribute EventTarget currentTarget;
  readonly attribute unsigned short eventPhase;
  readonly attribute boolean bubbles;
  readonly attribute boolean cancelable;
  readonly attribute DOMTimeStamp timeStamp;
  void stopPropagation();
  void preventDefault();
  void initEvent(in DOMString eventTypeArg, in boolean canBubbleArg, in boolean cancelableArg);
};
5.27 Examples Recall the first SVG example Game programming Fractals... Structured Documents /5 Simple Application Programming (ON 2012)
5.28 Conclusion (apps are for people, not computers...)
Good application development may mean positive laziness
Rules of thumb (assuming the true app needs are known)
o Did someone already do it (how, under which license)?
o Is there a workaround using XSLT (using which division of labour)?
o Which component is missing? How to integrate?
o Program if you must, or if you need to minimise errors (how can it be tested?), or if programming+executing takes much less than manual processing (remember to document your apps)
o What is essential (do not let yourself get diverted...)?
o How to minimise the number of technologies? Dependencies?
XML development is about working with XML data, standard components/APIs, and a jigsaw puzzle of techniques/tools
6 Design Notes Introduction Common design activities and processes (Process) examples Design examples Conclusion
6.1 Introduction
Designing is a process of communication, analysis (incl. trial-and-error), choice-making, testing, and documentation
Knowing the activities of typical development is helpful and provides useful models and checklists for hands-on work
A hard fact of commercial development, however, is that customers and contracts tend to dictate how development actually goes; like it or not, this sometimes means e.g.
o Waterfall development model (which is not that bad if requirements really are explicitly known...)
o "Selling first", (detailed) designing second (oh dear...)
A typical design error is aiming for (technical) "perfection"; rather, fulfil the customer needs in a sound and fair way
Structured Documents /6 Design Notes (ON 2012)
6.2 Common activities of (most) design processes
Basic activities, common to all (software) development
o Requirements analysis and definition (formalisation of the need and constraints; correct problem)
o System and software (and content) design (architecture)
o Implementation and unit testing (correct units)
o Integration and system testing (correct system)
o Operation and maintenance (evolution)
Typical process models: waterfall development, (iterative) evolutionary development, component-based development, etc.
[Figure: process model diagrams; labels: Conceptualisation, Problem-solving, Realisation; Problem, Solution; Inception, Elaboration, Construction, Transition; Development Cycle, Product Generation]
6.3 Few words about risks Commercial... Requirements Technology Skills... Political... Structured Documents /6 Design Notes (ON 2012)
6.4 Related, specific development methods Structured systems analysis and design method (SSADM) Logical data modelling, Data flow modelling, Entity behaviour modelling Stages: Feasibility study, Investigation of the current environment, Business systems options, Requirements specification, Technical system options, Logical design, Physical design Object Oriented Hypermedia Design Method (OOHDM) Conceptual design, Navigational design, Abstract interface design, Implementation Dynamic systems development method (DSDM), Agile methods, specific use of UML diagrams, OMG processes,... Structured Documents /6 Design Notes (ON 2012)
6.5 A specific example: Mapping applications A very typical SD use case... Note that test data not only helps in understanding the problem but may also provide e.g. concrete templates (cf. XSLT) Level of documentation case-specific, but needs, test cases, and important design choices should be captured Structured Documents /6 Design Notes (ON 2012)
6.6 A specific example: Document type design Given a test case, analyse the information w.r.t. content, structure, presentation, and functionality, and consider what falls into the DTD, XSLT, CSS, and programming domains etc. Structured Documents /6 Design Notes (ON 2012)
6.7 Notes about EDI integration
Before developing (proprietary) schemas and vocabularies, preliminary studies are usually in order
Think from the perspective of component-based re-use of Content Editors Common tools (e.g. validation, mapping) Viewers Users (Customers) Developers Training...
6.8 XML Schema design briefly revisited
Now, having outlined the basics of functional and procedural XML application programming (which involve the design principles of software design in general), let us revisit some schema design principles using XML DTD
In brief, a document type (schema) describes an interface to structured data
6.9 Basic document type design building blocks Basic idea is capturing information objects in design, with elements, attributes, and references Elements and attributes may have different roles in design Global top-down element structures (placeholder for objects) Complex object-like element structures (main objects) Field-like simple data element or attribute structures (properties of objects) Also linking structures together is important Systematic use of names (of proper scope, local or global) Consistent system of references (in suitable technology) Do not include stylistic parts or boilerplates in schemas Structured Documents /6 Design Notes (ON 2012)
6.10 Useful design pattern: Container elements
A common design task is to group related information
Container elements are placeholders for element groups; an obvious specialisation is compiling joint information (or information that can be inferred) onto the container level
<warehouses canbelocked="yes" owner="Dock Warehouses, Inc.">
  <wh id="wh1" canbelocked="no" address="Fine Road 1B" />
  <wh id="wh2" address="Fine Road 1C" />
  <wh id="wh3" address="Fine Road 2-4" />
</warehouses>
Often related to information normalisation (consider adding elements "goods/item" that refer to particular warehouses)
(If you got interested, look also for more XML Design Patterns)
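A processing consequence of the container pattern is that an application has to resolve the "effective" value of a property: the element's own attribute if present, otherwise the value compiled onto the container. The following sketch assumes exactly this inheritance rule, which the slide implies but does not define formally; the class ContainerSketch and its methods are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ContainerSketch {
    static final String XML =
        "<warehouses canbelocked='yes' owner='Dock Warehouses, Inc.'>"
        + "<wh id='wh1' canbelocked='no' address='Fine Road 1B'/>"
        + "<wh id='wh2' address='Fine Road 1C'/>"
        + "<wh id='wh3' address='Fine Road 2-4'/>"
        + "</warehouses>";

    /** Returns the attribute of the element itself or, if absent, of the nearest ancestor carrying it. */
    static String effectiveAttribute(Element e, String name) {
        for (Node n = e; n instanceof Element; n = n.getParentNode()) {
            Element current = (Element) n;
            if (current.hasAttribute(name)) {
                return current.getAttribute(name);
            }
        }
        return null; // not specified anywhere on the ancestor chain
    }

    /** Maps each warehouse id to its effective canbelocked value. */
    static Map<String, String> lockability(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        Map<String, String> result = new LinkedHashMap<>();
        NodeList whs = doc.getElementsByTagName("wh");
        for (int i = 0; i < whs.getLength(); i++) {
            Element wh = (Element) whs.item(i);
            result.put(wh.getAttribute("id"), effectiveAttribute(wh, "canbelocked"));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(lockability(XML)); // {wh1=no, wh2=yes, wh3=yes}
    }
}
```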
6.11 Notes on references
Complex structures are typically built with references
A good reference is usually
o Correct (points to the right place)
o Supported by some technology (currently applicable)
o Persistent (irrelevant changes do not break it)
o Consistent (can be validated)
o Of the right scope (e.g. element vs. doc vs. site vs. Web)
o Robust against obvious changes (e.g. moving a container)
Not all references need to be formal or machine-understandable, but this usually helps
Good names typically have a useful structure; combining names typically requires an appropriate information architecture (spec)
6.12 Some linking technologies XML DTD (id, idref, idrefs) XML Schema (also element structures that suffice for keys, but often only ad hoc ways are used for referencing them) XLink (simple and multi-directional extended links, behaviours, traversal rules, roles, links may be defined outside the documents they reference, etc.) Specific application-specific system (e.g. container management systems, sitemaps, general-purpose data item management systems) URL, URI, IRI (central for Web integration, also crucial in namespaces and semantic descriptions) Structured Documents /6 Design Notes (ON 2012)
6.13 Hierarchical vs. relational (or tabular) structures
Hierarchical and relational modelling are the basic approaches in modelling (other kinds of models: graphs and hybrids)
Consider a simple hierarchical (tree-like) model:
<data>
  <warehouses>
    <wh canbelocked="no" address="Fine Road 1B">
      <item name="car" amount="10" />
    </wh>
    <wh canbelocked="yes" address="Fine Road 1C">
      <item name="piano" amount="20" />
      <item name="drum set" amount="40" />
    </wh>
    <wh canbelocked="yes" address="Fine Road 2-4" />
  </warehouses>
</data>
6.14 Hierarchical vs. relational structures (Cont'd)
Consider a simple relational model:
<data>
  <warehouses>
    <wh id="wh1" canbelocked="no" address="Fine Road 1B" />
    <wh id="wh2" canbelocked="yes" address="Fine Road 1C" />
    <wh id="wh3" canbelocked="yes" address="Fine Road 2-4" />
  </warehouses>
  <items>
    <item whlocation="wh1" name="car" amount="10" />
    <item whlocation="wh2" name="piano" amount="20" />
    <item whlocation="wh2" name="drum set" amount="40" />
  </items>
</data>
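In the relational model the link between an item and its warehouse has to be resolved by the application (or by a query language). A minimal sketch of such a "join" with DOM and a lookup table follows; the class RelationalJoinSketch and the method itemLocations are hypothetical, and treating a dangling whlocation as an error is an assumption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RelationalJoinSketch {
    static final String XML =
        "<data><warehouses>"
        + "<wh id='wh1' canbelocked='no' address='Fine Road 1B'/>"
        + "<wh id='wh2' canbelocked='yes' address='Fine Road 1C'/>"
        + "<wh id='wh3' canbelocked='yes' address='Fine Road 2-4'/>"
        + "</warehouses><items>"
        + "<item whlocation='wh1' name='car' amount='10'/>"
        + "<item whlocation='wh2' name='piano' amount='20'/>"
        + "<item whlocation='wh2' name='drum set' amount='40'/>"
        + "</items></data>";

    /** Resolves each item's whlocation reference and returns "name @ address" strings in document order. */
    static List<String> itemLocations(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        // Index the referenced side by key
        Map<String, String> addressById = new HashMap<>();
        NodeList whs = doc.getElementsByTagName("wh");
        for (int i = 0; i < whs.getLength(); i++) {
            Element wh = (Element) whs.item(i);
            addressById.put(wh.getAttribute("id"), wh.getAttribute("address"));
        }
        // Follow the reference from each item
        List<String> result = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String address = addressById.get(item.getAttribute("whlocation"));
            if (address == null) {
                throw new IllegalArgumentException("Dangling reference: " + item.getAttribute("whlocation"));
            }
            result.add(item.getAttribute("name") + " @ " + address);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(itemLocations(XML));
    }
}
```

In the hierarchical model of the previous slide the same information is available simply as the parent of each item.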
6.15 Hierarchical vs. relational structures (Cont'd) Implications How "information objects" are realised Use of names and references (hierarchies yield implicit names...) Processing and programming methods (e.g. iterations vs. loops) Normalisation methods and techniques to physically chunk into modules It is usually easier to map relational structure into hierarchical than vice versa (a good hierarchy is a hierarchy because it needs to be, not because a better model was not thought of...) Structured Documents /6 Design Notes (ON 2012)
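The claim that mapping a relational structure into a hierarchy is usually the easier direction can be tried out with a few lines of DOM code. The sketch below moves each item under the warehouse it refers to and drops the now redundant reference; the class HierarchySketch and its methods are hypothetical names, and the input is assumed to follow the relational example above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class HierarchySketch {
    static final String XML =
        "<data><warehouses><wh id='wh1'/><wh id='wh2'/><wh id='wh3'/></warehouses>"
        + "<items><item whlocation='wh1' name='car'/><item whlocation='wh2' name='piano'/>"
        + "<item whlocation='wh2' name='drum set'/></items></data>";

    /** Moves every item under the warehouse it references and removes the emptied items container. */
    static Document toHierarchy(Document doc) {
        Map<String, Element> whById = new HashMap<>();
        NodeList whs = doc.getElementsByTagName("wh");
        for (int i = 0; i < whs.getLength(); i++) {
            Element wh = (Element) whs.item(i);
            whById.put(wh.getAttribute("id"), wh);
        }
        // Copy first: the NodeList is live and the items are about to move
        List<Element> items = new ArrayList<>();
        NodeList live = doc.getElementsByTagName("item");
        for (int i = 0; i < live.getLength(); i++) {
            items.add((Element) live.item(i));
        }
        for (Element item : items) {
            Element wh = whById.get(item.getAttribute("whlocation"));
            if (wh == null) {
                throw new IllegalArgumentException("Dangling reference: " + item.getAttribute("whlocation"));
            }
            item.removeAttribute("whlocation"); // the position in the tree now carries the reference
            wh.appendChild(item);               // appendChild moves an already attached node
        }
        Node container = doc.getElementsByTagName("items").item(0);
        if (container != null) {
            container.getParentNode().removeChild(container);
        }
        return doc;
    }

    /** Parses the relational form, converts it, and counts the items now nested in each warehouse. */
    static Map<String, Integer> itemsPerWarehouse(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        toHierarchy(doc);
        Map<String, Integer> counts = new LinkedHashMap<>();
        NodeList whs = doc.getElementsByTagName("wh");
        for (int i = 0; i < whs.getLength(); i++) {
            Element wh = (Element) whs.item(i);
            counts.put(wh.getAttribute("id"), wh.getElementsByTagName("item").getLength());
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(itemsPerWarehouse(XML)); // {wh1=1, wh2=2, wh3=0}
    }
}
```

The reverse mapping would need to invent identifiers and decide where shared items belong, which hints at why it is the harder direction.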
6.16 Data normalisation revisited... Well-established normalisation definitions and methods exist for relational data Free the database of modification anomalies (update, insert, delete) Minimize redesign when extending the database structure Make the data model more informative to users Avoid bias towards any particular pattern of querying Cf. 1NF-6NF; in database design "Codd's normal forms" 1NF-3NF are particularly widely used (which are typically free from update, insert and delete anomalies) Structured Documents /6 Design Notes (ON 2012)
6.17 Controlled terms and vocabularies
Also, one needs to pay careful attention to what the important information in an application is; sometimes people use "data" when a set of controlled terms is really needed:
<items>
  <item whlocation="wh1" type="it1" amount="10" />
  <item whlocation="wh2" type="it2" amount="20" />
  <item whlocation="wh2" type="it3" amount="40" />
</items>
<itemtypes>
  <type id="it1" name="car" />
  <type id="it2" name="piano" />
  <type id="it3" name="drum set" />
</itemtypes>
When anomalies seem to appear in terms of "data", more structures (subject of normalisation) may be needed
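One practical benefit of a controlled vocabulary is that references to it can be checked mechanically. The sketch below reports item types that are used but not defined; the class VocabularySketch, the method undefinedTypes and the wrapping data element are assumptions made for illustration (with a DTD, ID/IDREF validation could do a similar check).

```java
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class VocabularySketch {
    static final String XML =
        "<data><items>"
        + "<item whlocation='wh1' type='it1' amount='10'/>"
        + "<item whlocation='wh2' type='it2' amount='20'/>"
        + "<item whlocation='wh2' type='it9' amount='40'/>"   // it9 is deliberately undefined
        + "</items><itemtypes>"
        + "<type id='it1' name='car'/><type id='it2' name='piano'/><type id='it3' name='drum set'/>"
        + "</itemtypes></data>";

    /** Returns the type identifiers used by items that have no matching type definition (sorted). */
    static Set<String> undefinedTypes(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        Set<String> defined = new HashSet<>();
        NodeList types = doc.getElementsByTagName("type");
        for (int i = 0; i < types.getLength(); i++) {
            defined.add(((Element) types.item(i)).getAttribute("id"));
        }
        Set<String> missing = new TreeSet<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            String used = ((Element) items.item(i)).getAttribute("type");
            if (!defined.contains(used)) {
                missing.add(used);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        System.out.println(undefinedTypes(XML)); // [it9]
    }
}
```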
6.18 Conclusion: Good design... follows specification (which hopefully captures the need...) is complete and sound (clear scope, no unspecified bits) is understandable (for developers, authors, administrators, and where appropriate, for other users) is simple and uses natural, well-defined concepts (for people's sake) favours well-defined modules and content (exploits common components but without redundancy/unnecessary couplings) is error-free (documents errors and plans for correcting them) can be tested (requirements and test cases) can be traced (documents also major versions/choices) provides a good, working solution with reasonable costs Structured Documents /6 Design Notes (ON 2012)