RUT developers handbook 9.51 Introduction to XML and DOM, with applications in Matlab v. 2.0




2004-01-05
LiTH
Eric Karlsson

Abstract

An XML document is an excellent intermediate storage format for data that needs to be categorized and later reused and published in different forms. Examples of such data are test reports, source code documentation and logs. An XML document can be described with DOM, an abstract model that represents documents as a tree structure. An XML parser interprets an XML file as a DOM object. This document first describes the concepts XML and DOM and how they are realized in the XML parser Xerces. It then shows how XML documents can be validated using DTD and XML Schema. Finally, it gives examples in Matlab, which uses Xerces.


1 Field of application

This RUT gives an overview of DOM, XML and the validation of XML using XML Schema and DTD. It also describes the XML parser Xerces and finally shows an example that uses all these parts from Matlab. This RUT is primarily intended for PUM projects that want to publish XML data from Matlab 6.5. Since DOM and XML are well-established standards and Xerces is a very common XML parser, used for instance in the web server Apache, the document can also be used in other projects.

Extensive documentation already exists for each of these concepts, so this document is meant as an introduction to the subject as a whole that shows how the parts connect. Such an overview is hard to find in the existing literature, which is rarely written at a level suited to quickly getting a grasp of the concepts in the early stage of a PUM project.

The document can be used in the design phase of a project, to give a notion of the possibilities and advantages of using XML for intermediate data storage, and in the implementation phase, as an introduction and reference.

2 Prerequisites

This document is written at a fairly basic level, but some knowledge of HTML and XML syntax makes it easier to read. The reader should also be familiar with object-oriented design. Using XML in Matlab requires version 6.5, which includes the XML parser Xerces Java 2. This RUT can also be useful for projects that plan to incorporate an XML parser themselves, but it does not describe how an installation is performed. For that, see the documentation on the Xerces homepage [Apache].

3 Realization

This section deals with the basics of XML, DOM and Xerces and how they are made available in Matlab. Both XML and DOM are standards developed by the World Wide Web Consortium (W3C). Xerces is developed by the Apache Software Foundation and Matlab is developed by Mathworks.
3.1 About XML

XML, Extensible Markup Language, is a language used to structure information. It works in the same way as HTML, the format used to structure the layout of a web page. Unlike HTML, XML supports the creation of custom tags. These tags do not contain any layout information; all formatting is done in external style sheets, so data is separated from layout. This leads to portable data, since different style sheets can be used on different platforms and in different applications. An example of an XML document is given in section 7.1.

The advantage of using XML instead of a custom data format is that there are many tools and standards that can ease and improve the project, while relying on a finished and well-documented standard eliminates many sources of error. Thanks to the structured storage of data, different style sheets can be used to filter, organize and present the information in different ways on a web page, or to transform the document to pdf, doc or other file formats.

It is worth noting that XML is not an efficient storage format: it uses a relatively large amount of storage space, and parsing (see section 3.3) and transforming it require a lot of processing power. On the other hand, the advantages of simplicity, portability and the possibility of transformation often outweigh these downsides.

3.2 About DOM

DOM, Document Object Model, is a standardized, platform- and language-independent, object-oriented interface. It defines a number of objects with which a document (in particular HTML or XML) can be described as a hierarchical tree structure. The standardized objects and methods make it easy to manipulate documents and lead to uniform, reusable programs.

The DOM specification is split into a number of levels, some of which are not yet finished at the time of writing. Each level contains several parts, each developed more or less individually.

Level 1: Basic functions for navigation and manipulation of documents. Split into Core, a minimalistic general document representation, and HTML, containing structures of higher order in addition to those specified in Core. A second version of level 1 is being developed.

Level 2: Contains models for style sheets and functions for manipulating them. It also contains functions for traversing documents and for event management. The Core part contains support for namespaces in XML.

Level 3: Functions for opening and saving documents, content models (DTD, XML Schema) and validation of documents. None of the parts of this level are completed at the time of writing.

There are more or less well-developed plans for different query languages, window systems, multithreading, security and other things that belong to higher levels.

The objects and methods needed to handle XML documents are found in level 1 (Core) and to some extent in level 2 (Core). [Christiaanse] gives an overview of those on level 1. For a more complete description of DOM, see [Cover] or [W3Cb].

Structure

DOM describes a document as a tree structure of nodes. There are 12 variants of node objects in level 1 (Core): Document, DocumentFragment, DocumentType, EntityReference, Element, Attr, ProcessingInstruction, Comment, Text, CDATASection, Entity and Notation. Each has different properties and structure, but they all inherit certain attributes and methods from a general node object. A node can have any number of children, but only one parent. Pointers to these are stored in pointer lists, which can be traversed and manipulated to add more children or move nodes.

The figure below shows a simple picture of how an XML document is represented in DOM. Each line is a node: the node type is followed by the node name, if any, and the value of the node, if any, is given in parentheses. Indentation shows which node is the parent. For an explanation of XML syntax, see section 6.1.

    <?xml version="1.0"?>
    <person email="john@doe.com">John Doe</person>

    Document
        ProcessingInstruction version (1.0)
        Element person
            Attr email
                Text (john@doe.com)
            Text (John Doe)

Figure 1. DOM representation

It is important to note that DOM is only a recommendation and that it is up to software developers to follow the standard.

3.3 About Xerces

An XML document is in itself just a text file, a long stream of characters that does not resemble a tree structure. This stream must be parsed before the computer can interpret it, which means sorting it into a number of linked lists in memory. These lists can then be traversed and modified using the standard interface that DOM provides. The program performing this job is called an XML parser.
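As a small illustration of the parsing step and of the tree in Figure 1, the sketch below parses the same one-line document and lists its nodes. It is not Xerces: it assumes Python's standard xml.dom.minidom module as a stand-in, since that module also follows the W3C DOM interface. The helper name describe is made up for this example. One difference from Figure 1 is that minidom does not expose the XML declaration as a node, so no ProcessingInstruction line appears.

```python
from xml.dom import minidom, Node

# Readable names for the node types that occur in this small example.
NODE_TYPE_NAMES = {
    Node.DOCUMENT_NODE: "Document",
    Node.ELEMENT_NODE: "Element",
    Node.ATTRIBUTE_NODE: "Attr",
    Node.TEXT_NODE: "Text",
}


def describe(node, depth=0):
    """Return one indented line per node (hypothetical helper).

    Attributes of an element are listed before its child nodes.
    """
    name = NODE_TYPE_NAMES.get(node.nodeType, "Other")
    line = "  " * depth + name + " " + node.nodeName
    if node.nodeValue is not None:
        line += " = " + repr(node.nodeValue)
    lines = [line]
    attrs = node.attributes  # None for everything except elements
    if attrs is not None:
        for i in range(attrs.length):
            lines.extend(describe(attrs.item(i), depth + 1))
    for child in node.childNodes:
        lines.extend(describe(child, depth + 1))
    return lines


SOURCE = '<?xml version="1.0"?><person email="john@doe.com">John Doe</person>'
document = minidom.parseString(SOURCE)  # the parser builds the DOM tree
tree_lines = describe(document)
print("\n".join(tree_lines))
```

Every node, whatever its type, answers to the same nodeName, nodeValue and childNodes properties, which is the inheritance from a general node object described above.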

Xerces is a validating XML parser (validation is explained in section 6.2) developed as a part of the Apache XML project [Apache]. It consists of software libraries written in C++, Java and to some extent Perl, which are meant to be used in other applications to let them handle XML documents. For example, Xerces is included in Matlab 6.5 and in the Apache web server. In other words, it is possible to download Xerces from [Apache] and incorporate it into a PUM project.

Xerces implements DOM level 1 and 2 (Core). It also supports some of the preliminary recommendations specified for level 3 by W3C.

Note that there are alternatives to Xerces if you want your application to handle XML data. This document focuses on Xerces because it is included in Matlab 6.5 and the Apache web server, because it is well documented and because it is available for C++, Java, (to some extent) Perl and Microsoft COM. For a performance comparison between different XML parsers, see [Cooper] or [Webreference].

3.4 About XML in Matlab

Matlab 6.5 contains the Java implementation of Xerces, which can interpret the XML elements in a text file as nodes in a hierarchical tree. The developers of Matlab, Mathworks, only needed to add three Matlab functions on top of Xerces to read, write and transform XML files: xmlread, xmlwrite and xslt.

xmlread(filename) parses an XML file and returns the DOM Document object, which can be manipulated according to the implementation of DOM that Xerces Java 2 specifies in its JavaDoc documentation [Apache].

xmlwrite(filename, DOMnode) writes the Document object to the XML file. The function can also be used to convert DOM nodes to text strings.

xslt(source, style, destination) performs a transformation from XML to HTML based on a style sheet (XSLT).

It is very simple to start using XML in Matlab 6.5. Section 7.2 illustrates how XML can be used in Matlab. For more information and help on Matlab, see [Mathworks].

4 Result

The process described in this document will save much time and generate a safer, more extensible and more stable product if:

- The documentation should be automatically generated
- The documentation should be displayed on a web page
- The documentation should be fetched from Matlab

5 Templates and forms

No particular templates or forms exist.

6 Verification of results

Using an existing XML parser eliminates many sources of error. What is left is to verify that everything that is fed into the XML document is a text string and not of any other data type.

6.1 Well-formed XML

Well-formed XML means that the XML document fulfils the basic demands of XML: one root element, end tags and quoted values. This is generally no problem if all data is added to the XML document using an XML parser, but if the XML document is written by hand it is good to know the basic rules for writing correct XML.

One root element
Every XML document must have exactly one root element, which encloses all other elements. The only parts of XML that are allowed outside (before) the root element are processing instructions.

End tags
Every element must have an end tag or, if the element contains no text, a slash before the last >. Elements may be nested, but not overlapping. If you start element A and then start element B, B must be ended before A is ended.

Case
XML is case sensitive. Letter, LETTER and letter are considered different tags.

Quotes
Attribute values must be quoted.

A simple way of verifying that your XML documents are well-formed is to open them in an XML-capable browser, like newer versions of Internet Explorer, Mozilla, Opera or Netscape Navigator. All of these will tell you on which line any error occurred.

6.2 Validated XML

If you plan to use extensive XML documents it is a good idea to create documents that are not only well-formed but also validated. The difference is that in the latter case a schema, or specification, is created and then used to validate the XML documents. The validation process is entirely automatic and there are free tools on the Internet that do this online. Today there are two variants of schema for validation: DTD and XML Schema.
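The well-formedness rules of section 6.1 can also be checked by feeding the text to a parser, which reports the line of the first error much like a browser does. The sketch below is only an illustration: it assumes Python's standard xml.dom.minidom module instead of Xerces, and the function name check_well_formed and the sample strings are invented for the example.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, 'line N: message').

    Hypothetical helper: it only tests well-formedness, not validity.
    """
    try:
        minidom.parseString(xml_text)
    except ExpatError as error:
        return False, "line %d: %s" % (error.lineno, error)
    return True, None


# One correct document and one violation of each rule in section 6.1.
samples = {
    "good": '<letter><sender>Jane Doe</sender><paragraph /></letter>',
    "two roots": '<letter></letter><letter></letter>',
    "overlapping": '<a><b></a></b>',
    "unquoted value": '<addressee email=john@doe.com>John Doe</addressee>',
    "case mismatch": '<Letter></letter>',
}

results = {name: check_well_formed(text) for name, text in samples.items()}
for name, (ok, problem) in results.items():
    print(name, "->", "well-formed" if ok else problem)
```

A check like this says nothing about whether the elements are the ones the application expects; that is what validation against a schema adds.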

DTD

DTD, Document Type Definition, is the older variant and consists of a text document in which all valid elements of an XML document are specified. An example of a DTD can be found in section 7.3. Once a DTD is defined, it can be used to validate XML documents using different kinds of validators. Several validators are available, both for download and for online validation. If the XML document is publicly published on the Internet, the Scholarly Technology Group XML validator from Brown University [STG] is recommended.

XML Schema

XML Schema is newer, simpler and offers many more possibilities than DTD, but is at the time of writing not completely implemented. Here, schemas are built like ordinary XML documents, and it is possible to define namespaces with local elements and to check data types (like numbers, dates and custom data types). There is, however, as yet no support for defining entities; these must still be defined in a DTD file. (Entities are used partly for including longer texts, like copyright statements, and partly for including special characters, like the copyright symbol.) An example of XML Schema can be found in section 7.4. [W3Cc] offers a good validator for XML Schema.

7 Examples with explanations

This RUT briefly describes a number of different techniques and standards. This section gives explained examples of these techniques in order to clarify the descriptions and provide a better understanding of how the techniques can be used.

7.1 XML

XML documents are written in normal text files and are given the extension .xml. Below is an example of a simple XML document describing what an email could look like. Only a minimal number of elements are used, for easier reading.

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <letter>
      <addressee email="john@doe.com">John Doe</addressee>
      <sender>Jane Doe</sender>
      <paragraph>Hello John!</paragraph>
      <paragraph />
    </letter>

On the first line of the example we see the XML declaration, specifying that this is XML version 1.0 (the only one so far) and that we are using the character set ISO-8859-1, which makes it possible to use international characters. The question marks denote that this is a processing instruction. Processing instructions can also be used to state which style sheets to use and so on. All of these must be placed at the beginning of the document.

Strings surrounded by < and > are called tags. These should always be matched with end tags, beginning with a slash (as in </sender>). If the tag does not contain any data, it can instead simply end with a slash (as in <paragraph />). Everything between a start tag and an end tag is the contents of an element, regardless of whether it is just text or more elements. The element <letter> above contains all the other elements (except the processing instructions) and is therefore called the root element of the document.

The element <addressee> has an attribute, email, which in turn has a value, which is quoted. These quotes are required. There is no limit on how many attributes an element can have; the only requirement is that their names are all unique.

The second paragraph (<paragraph />) shows how to write an element that does not contain any text, which is done by adding a slash at the end of the tag instead of using an end tag. This kind of element can naturally also contain attributes. The end tag of an element never contains any attributes; its only purpose is to mark the end of the element.

7.2 Xerces in Matlab

Here is an example which opens the XML file letter.xml defined in section 7.1 and adds another addressee. Observe that the functions xmlread and xmlwrite are Matlab functions. The other functions are from the Xerces Java Parser. Rows beginning with % are Matlab comments and explain the function of the rows below.

    % Open letter.xml and retrieve the DOM document node
    docnode = xmlread('letter.xml');
    letter = docnode.getDocumentElement;
    % Create node <addressee>Joe Doe</addressee>
    addr2 = docnode.createElement('addressee');
    addr2.appendChild(docnode.createTextNode('Joe Doe'));
    % Add attributes (email="joe@doe.com")
    addr2.setAttribute('email', 'joe@doe.com');
    % Append <addressee ...> to the document and save
    letter.appendChild(addr2);
    xmlwrite('letter.xml', docnode);
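Because the calls in the Matlab example are plain DOM methods, the same steps carry over to other DOM implementations. The following sketch repeats them with Python's standard xml.dom.minidom module, which is an assumption of this illustration and not part of Matlab or Xerces. The function name add_addressee and the use of a temporary directory are also invented for the example.

```python
import os
import tempfile
from xml.dom import minidom

# The letter from section 7.1, written to disk first so the sketch is self-contained.
LETTER = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    '<letter>\n'
    '  <addressee email="john@doe.com">John Doe</addressee>\n'
    '  <sender>Jane Doe</sender>\n'
    '  <paragraph>Hello John!</paragraph>\n'
    '  <paragraph />\n'
    '</letter>\n'
)


def add_addressee(path, name, email):
    """Append <addressee email="...">name</addressee> to the letter at path."""
    docnode = minidom.parse(path)              # counterpart of xmlread
    letter = docnode.documentElement           # counterpart of getDocumentElement
    addr = docnode.createElement("addressee")
    addr.appendChild(docnode.createTextNode(name))
    addr.setAttribute("email", email)
    letter.appendChild(addr)
    # Counterpart of xmlwrite: serialize the whole document back to the file.
    with open(path, "w", encoding="ISO-8859-1") as handle:
        docnode.writexml(handle, encoding="ISO-8859-1")


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "letter.xml")
    with open(path, "w", encoding="ISO-8859-1") as handle:
        handle.write(LETTER)
    add_addressee(path, "Joe Doe", "joe@doe.com")
    addressees = minidom.parse(path).getElementsByTagName("addressee")
    emails = [a.getAttribute("email") for a in addressees]
    names = [a.firstChild.data for a in addressees]

print(emails, names)
```

As in the Matlab version, appendChild places the new addressee last in the letter, after the paragraphs. If the order required by the DTD in section 7.3 matters, insertBefore is the DOM method for placing it next to the existing addressee.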

7.3 DTD

Here is an example of a DTD that can be used to validate the XML document found in section 7.1.

    <!ELEMENT letter (addressee+, sender, paragraph*)>
    <!ELEMENT addressee (#PCDATA)>
    <!ATTLIST addressee email CDATA #IMPLIED>
    <!ELEMENT sender (#PCDATA)>
    <!ELEMENT paragraph (#PCDATA)>

!ELEMENT specifies an XML element, i.e. a tag. !ATTLIST specifies an attribute of an element, like email in <addressee email="">. Inside the parentheses is a specification of which child nodes are allowed in the element and how many of them. The number is denoted by an asterisk, * (none or more), a plus sign, + (one or more) or a question mark, ? (none or one). If none of these characters is included, the number is exactly one. This means that <letter> must contain at least one addressee, exactly one sender and any number of paragraphs.

The element addressee has the attribute email, which is optional (#IMPLIED) and whose value is normal text (CDATA, character data). #PCDATA stands for parsed character data, meaning that all characters except &, < and ]]> are allowed, since these will be parsed.

7.4 XML Schema

The following XML Schema describes the same letter as the DTD above.

    <?xml version="1.0"?>
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                targetNamespace="http://www.ida.liu.se/~tddc02/"
                xmlns="http://www.ida.liu.se/~tddc02/">
      <xsd:element name="letter" type="letterType" />
      <xsd:element name="addressee" type="addresseeType" />
      <xsd:complexType name="letterType">
        <xsd:all>
          <xsd:element name="sender" type="xsd:string" />
          <xsd:element name="paragraph" type="xsd:string" maxOccurs="unbounded" minOccurs="0" />
          <xsd:element ref="addressee" maxOccurs="unbounded" />
        </xsd:all>
      </xsd:complexType>
      <xsd:complexType name="addresseeType">
        <xsd:attribute name="email" type="xsd:string" />
      </xsd:complexType>
    </xsd:schema>

All elements belonging to XML Schema are given with the namespace prefix xsd, e.g. <xsd:schema>. On the second row a unique namespace has been defined for our email document by borrowing the URL of the PUM course. The following rows define two elements, <letter> and <addressee>, which become global since they are defined at the first level of the tree structure.

After that comes the definition of a complex data type, letterType. The Schema tag all has been used to group the included elements <addressee>, <paragraph> and <sender>, and denotes that they do not have to be given in any special order. Alternatives to all are sequence, denoting that the given order must be followed, and choice, denoting that only one element (or group of elements) may be used. Note that XML Schema 1.0 only allows maxOccurs values of 0 or 1 for elements inside all, so a strict validator requires sequence or choice when an element may repeat.

The elements <sender> and <paragraph> have been defined inside the type of <letter>, whereas <addressee> instead refers to the global element defined above. The attributes minOccurs and maxOccurs denote how many instances of the element may appear. The default value for both is 1. The attribute type defines which type an element has. This can be either one of the built-in types of XML Schema (string, decimal, float, boolean, date, time, anyURI, language etc.) or a custom type like letterType. Custom types can also be specified using regular expressions.

8 Solutions to common problems

Invalid characters
By default, XML documents are assumed to use UTF-8 or UTF-16 for character coding. This means that international characters in a file saved with another character set are reported as invalid unless that character set is declared. To fix this, rewrite the XML declaration on the first row to name the character set that is actually used:

    <?xml version="1.0" encoding="ISO-8859-1"?>

Case
XML is case sensitive. Letter, LETTER and letter are different elements.

9 Adjustment to the PUM course

No particular adjustment is needed to suit the PUM course.

10 Measurement of process

No particular measurements exist for this process.

11 History of the process

Version  Date      Editor         Comment
0.1      03-05-20  Henrik Leion   Document created
1.0      03-06-02  Henrik Leion   Updated after review
2.0      04-01-05  Eric Karlsson  Translated to English

12 Changes not yet attended to

The document became quite extensive and well-presented, but I think there is a benefit in getting a quick overview of the subject and hints on where to find more information. The parts can of course be extended a lot, especially the section about validation, but at the same time there is no major point in describing all the techniques in too much detail, as there is better literature that does this. One part that would be valuable to include is experience from implementing Xerces in a PUM project and any problems this caused.

This document could, especially if it is complemented with experience from a Xerces implementation, possibly be split into two or more parts.

One area which has intentionally been excluded is XSL(T), but nothing prevents it from being mentioned in newer versions.

13 References

[Apache] Apache Software Foundation, http://xml.apache.org/index.html
[Christiaanse] Vance Christiaanse, Visualizing DOM level 1, http://www.xml.com/pub/a/1999/07/dom/index.html
[Cooper] Clark Cooper, XML Parser Performance Testing, http://www.xml.com/lpt/a/benchmark/exec.html
[Cover] Robin Cover, Document Object Model, http://xml.coverpages.org/dom.html
[Mathworks] Matlab documentation, http://www.mathworks.com/matlab
[STG] Scholarly Technology Group, Brown University, XML-Validator, http://www.stg.brown.edu/service/xmlvalid/
[W3Ca] World Wide Web Consortium, XML specification, http://www.w3.org/xml
[W3Cb] W3C, DOM specification, http://www.w3.org/dom/
[W3Cc] W3C, XML-Schema validator, http://www.w3.org/2000/06/webdata/xsv
[Webreference] XML Parser Comparison, http://webreference.com/xml/column22/