Java and XML parsing. EH2745 Lecture #8 Spring 2015. larsno@kth.se



Similar documents
XML and Data Management

Introduction to XML. Data Integration. Structure in Data Representation. Yanlei Diao UMass Amherst Nov 15, 2007

Last Week. XML (extensible Markup Language) HTML Deficiencies. XML Advantages. Syntax of XML DHTML. Applets. Modifying DOM Event bubbling

DTD Tutorial. About the tutorial. Tutorial

Chapter 3: XML Namespaces

04 XML Schemas. Software Technology 2. MSc in Communication Sciences Program in Technologies for Human Communication Davide Eynard

XML: extensible Markup Language. Anabel Fraga

Quiz! Database Indexes. Index. Quiz! Disc and main memory. Quiz! How costly is this operation (naive solution)?

Structured vs. unstructured data. Motivation for self describing data. Enter semistructured data. Databases are highly structured

Languages for Data Integration of Semi- Structured Data II XML Schema, Dom/SAX. Recuperación de Información 2007 Lecture 3.

XML & Databases. Tutorial. 2. Parsing XML. Universität Konstanz. Database & Information Systems Group Prof. Marc H. Scholl

Introduction to XML Applications

Semistructured data and XML. Institutt for Informatikk INF Ahmet Soylu

An XML Based Data Exchange Model for Power System Studies

by LindaMay Patterson PartnerWorld for Developers, AS/400 January 2000

Data Integration through XML/XSLT. Presenter: Xin Gu

XML Processing and Web Services. Chapter 17

Web Services Technologies

Agents and Web Services

Structured vs. unstructured data. Semistructured data, XML, DTDs. Motivation for self-describing data

Overview of DatadiagramML

Firewall Builder Architecture Overview

RUT developers handbook 9.51 Introduction to XML and DOM, with applications in Matlab v. 2.0

6. SQL/XML. 6.1 Introduction. 6.1 Introduction. 6.1 Introduction. 6.1 Introduction. XML Databases 6. SQL/XML. Creating XML documents from a database

Data XML and XQuery A language that can combine and transform data

Extensible Markup Language (XML): Essentials for Climatologists

Exchanger XML Editor - Canonicalization and XML Digital Signatures

Managing XML Documents Versions and Upgrades with XSLT

The Forger s Art Exploiting XML Digital Signature Implementations HITB 2013

Standard Recommended Practice extensible Markup Language (XML) for the Interchange of Document Images and Related Metadata

CST6445: Web Services Development with Java and XML Lesson 1 Introduction To Web Services Skilltop Technology Limited. All rights reserved.

XML Databases 6. SQL/XML

XBRL Processor Interstage XWand and Its Application Programs

XML in programming. Patryk Czarnik. XML and Modern Techniques of Content Management 2012/13

How To Write A Contract Versioning In Wsdl 2.2.2

10CS73:Web Programming

<Namespaces> Core XML Technologies. Why Namespaces? Namespaces - based on unique prefixes. Namespaces. </Person>

High Performance XML Data Retrieval

Extracting data from XML. Wednesday DTL

Connecting to WebSphere ESB and WebSphere Process Server

REDUCING THE COST OF GROUND SYSTEM DEVELOPMENT AND MISSION OPERATIONS USING AUTOMATED XML TECHNOLOGIES. Jesse Wright Jet Propulsion Laboratory,

Interaction Translation Methods for XML/SNMP Gateway Using XML Technologies

DataDirect XQuery Technical Overview

XML and Data Integration

Visualization of GML data using XSLT.

T XML in 2 lessons! %! " #$& $ "#& ) ' */,: -.,0+(. ". "'- (. 1

Modern Databases. Database Systems Lecture 18 Natasha Alechina

WWW. World Wide Web Aka The Internet. dr. C. P. J. Koymans. Informatics Institute Universiteit van Amsterdam. November 30, 2007

Presentation / Interface 1.3

[MS-MDM]: Mobile Device Management Protocol. Intellectual Property Rights Notice for Open Specifications Documentation

T Network Application Frameworks and XML Web Services and WSDL Tancred Lindholm

2. Distributed Handwriting Recognition. Abstract. 1. Introduction

XML Schema Definition Language (XSDL)

Schematron Validation and Guidance

How To Use Xml In A Web Browser (For A Web User)

XML WEB TECHNOLOGIES

CS 378 Big Data Programming. Lecture 9 Complex Writable Types

XML. CIS-3152, Spring 2013 Peter C. Chapin

XSL - Introduction and guided tour

XML- New meta language in e-business

Unified XML/relational storage March The IBM approach to unified XML/relational databases

Markup Languages and Semistructured Data - SS 02

Towards XML-based Network Management for IP Networks

Translating between XML and Relational Databases using XML Schema and Automed

XML - A Practical Application and Design

JavaScript: Client-Side Scripting. Chapter 6

XSLT Mapping in SAP PI 7.1

Lecture 5: Java Fundamentals III

Application of XML Tools for Enterprise-Wide RBAC Implementation Tasks

Deferred node-copying scheme for XQuery processors

AN ENHANCED DATA MODEL AND QUERY ALGEBRA FOR PARTIALLY STRUCTURED XML DATABASE

Chapter 2: Designing XML DTDs

Design Patterns in Parsing

Concrete uses of XML in software development and data analysis.

CS3600 SYSTEMS AND NETWORKS

XML for RPG Programmers: An Introduction

EMC Documentum xdb. Installation and Administration Guide P/N A01. Version 9.0

Lecture 9. Semantic Analysis Scoping and Symbol Table

Managing large sound databases using Mpeg7

Coping with Semantics in XML Document Management

Compiler I: Syntax Analysis Human Thought

READING DATA FROM KEYBOARD USING DATAINPUTSTREAM, BUFFEREDREADER AND SCANNER

Modernize your NonStop COBOL Applications with XML Thunder September 29, 2009 Mike Bonham, TIC Software John Russell, Canam Software

By Nabil ADOUI, member of the 4D Technical Support team

DEVELOPMENT OF THE INTEGRATING AND SHARING PLATFORM OF SPATIAL WEBSERVICES

Ficha técnica de curso Código: IFCAD320a

Multiple electronic signatures on multiple documents

Transcription:

Java and XML parsing EH2745 Lecture #8 Spring 2015 larsno@kth.se

Lecture Outline Quick Review The XML language Parsing Files in Java

Quick Review We have in the first set of Lectures covered the basics of the Java Language. Next is to make something useful with it. Like learning a new human language requires practice, so does programming

I/O and Files in Java My Java Program JVM (Packages) OS (Linux, OSX, Win..) HW (CPU, RAM, HDD, I/O) Accessing I/O is built on the InputStream and FileStream For keyboard input, this comes from System.in to which the InputStream is pointed. InputStream is enhanced in InputStreamreader which maps the read Bytes to Characters For file input, the FileStream is handed a filename

Enhanced reading Both InputStreamreader and FileReader can be used by a BufferedReader, which reads sets of characters (e.g a line) ABC InputStreamReader BufferedReader A B C 001010101010001 InputStream AA F1 08 FileReader fr = new FileReader( readme.txt ); BufferedReader br = new BufferedReader(fr);

Parsing Parsing involves analysing a string of characters (or words) and and breaking them down into components. Likeifapersonreadsthislongparagraphoftextwhic hactuallyhasseveralwordsinitbutsinceitiswritt enwithoutanyseparatorsitisdifficulttoseethese wordsfortunatelyapersonisreasoanblyintelligen tandcanreadthisentenceanywayacomputercannotdo sohoweversinceitisnotintelligentitneedssomewa yoftellingitwhichcharactersshouldbegroupedint ogroups(words)thatcanbestoredinitsmemoryandus edintheprogram

Parsing - How to convert read lines to data the program can work with 5.1,3.5,1.4,0.2,Iris-setosa 4.9,3.0,1.4,0.2,Iris-setosa 4.7,3.2,1.3,0.2,Iris-setosa 4.6,3.1,1.5,0.2,Iris-setosa 5.0,3.6,1.4,0.2,Iris-setosa 5.4,3.9,1.7,0.4,Iris-setosa 4.6,3.4,1.4,0.3,Iris-setosa Comma Separate Values (CSV) is a straifghtforward way to store the data in a file The program(mer) needs to know the structure of the file. String SplitBy = ", ; List<Flower> flowerlist = new ArrayList<Flower>(); br = new BufferedReader(new FileReader(dataFile) while ((line = br.readline())!= null) { String[] flower = line.split(splitby); double[] param = {0.0,0.0,0.0,0.0}; for (int j = 0; j < 4; j++) { param[j] = Double.parseDouble(flower[j]); } flowerlist.add(new Flower(param,flower[4])); }

Is the file format important? l 5.1,3.5,1.4,0.2,Iris-setosa 4.9,3.0,1.4,0.2,Iris-setosa 4.7,3.2,1.3,0.2,Iris-setosa 4.6,3.1,1.5,0.2,Iris-setosa 5.0,3.6,1.4,0.2,Iris-setosa 5.4,3.9,1.7,0.4,Iris-setosa 4.6,3.4,1.4,0.3,Iris-setosa Suppose that you ve got book data that you want to pass between some systems

Is the file format important? <Flower> <Sepal Length>5.0</Sepal Length> <Sepal Width>3.6</Sepal Width> <Petal Length>0.8</PetalLength> <Petal Width>0.2</Petal Width> <Species>Iris-Setosa</Species> </Flower> <Flower> <Sepal Length>5.0</Sepal Length> <Sepal Width>3.6</Sepal Width> <Petal Length>0.8</PetalLength> <Petal Width>0.2</Petal Width> <Species>Iris-Setosa</Species> </Flower> <Flower> <Sepal Length>5.0</Sepal Length> <Sepal Width>3.6</Sepal Width> <Petal Length>0.8</PetalLength> <Petal Width>0.2</Petal Width> <Species>Iris-Setosa</Species> </Flower>

Which is the better choice? l l Comma Separated Values Slash-delimited Advs: Pros Little space Little overhead extra data has to be sent (only the commas) Disadv: Cons: Rigid format Data has (cannot to arrive shuffle in the the right data order around) Tag-delimited (XML) Using Tags(XML) Advs: Flexible Pros: format (can shuffle the data around) Tags enhance Flexible ability format to (data search can for arrive data out of order) Tags enhance Thanks ability to tags to we extract can search subsets for of data data Disadvs: People can read it!!! Verbose (XML Bloat) Cons: Verbose a lot of overhead!

Overhead and filesize What is the overhead in these two examples? 5.1,3.5,1.4,0.2,Iris-setosa 4.9,3.0,1.4,0.2,Iris-setosa 4.7,3.2,1.3,0.2,Iris-setosa 4.6,3.1,1.5,0.2,Iris-setosa 5.0,3.6,1.4,0.2,Iris-setosa 5.4,3.9,1.7,0.4,Iris-setosa 4.6,3.4,1.4,0.3,Iris-setosa <Flower> <Sepal Length>5.0</Sepal Length> <Sepal Width>3.6</Sepal Width> <Petal Length>0.8</PetalLength> <Petal Width>0.2</Petal Width> <Species>Iris-Setosa</Species> </Flower> <Flower> <Sepal Length>5.0</Sepal Length> <Sepal Width>3.6</Sepal Width> <Petal Length>0.8</PetalLength> <Petal Width>0.2</Petal Width> <Species>Iris-Setosa</Species> </Flower> <Flower> <Sepal Length>5.0</Sepal Length> <Sepal Width>3.6</Sepal Width> <Petal Length>0.8</PetalLength> <Petal Width>0.2</Petal Width> <Species>Iris-Setosa</Species> </Flower>

Lecture Outline Quick Review The XML language Parsing Files in Java

XML extensible Markup Language <?xml version="1.0" encoding="utf-8"?> <note> <to>tove</to> <from>jani</from> <heading>reminder</heading> <body>don't forget me this weekend!</body> </note>

XML is (M)a(ny) standard developed by W3C XML Core Working Group: XML 1.0 (Feb 1998), 1.1 (candidate for recommendation) XML Namespaces (Jan 1999) XML Inclusion (candidate for recommendation) XSLT Working Group: XSL Transformations 1.0 (Nov 1999), 2.0 planned XPath 1.0 (Nov 1999), 2.0 planned extensible Stylesheet Language XSL(-FO) 1.0 (Oct 2001) XML Linking Working Group: XLink 1.0 (Jun 2001) XPointer 1.0 (March 2003, 3 substandards) XQuery 1.0 (Nov 2002) plus many substandards XMLSchema 1.0 (May 2001) Source: XML for Beginners, Ralf Schenkel April 2003.

A Simple XML Document <article> <author>gerhard Weikum</author> <title>the Web in Ten Years</title> <text> <abstract>in order to evolve...</ abstract> <section number= 1 title= Introduction > The <index>web</index> provides the universal... </section> </text> </article> Source: XML for Beginners, Ralf Schenkel April 2003.

A Simple XML Document <article> <author>lars Nordström</author> <title>the Meaning of Life</title> <text> <abstract>to begin with<abstract> <section number= 1 title= Introduction > </text> </article> Freely definable tags The <index>meaning</index> of life is.. </section> Source: XML for Beginners, Ralf Schenkel April 2003.

Attribute A Simple XML Document <article> <author>lars Nordström</author> <title>meaning of Life</title> <text> <abstract>to begin with.<abstract> <section number= 1 title= Introduction > The <index>meaning</index> of life is.. </section> </text> </article> End Tag Start Tag Elements Source: XML for Beginners, Ralf Schenkel April 2003.

Elements in XML Documents (Freely definable) tags: article, title, author with start tag: <article> etc. and end tag: </article> etc. Elements: <article>... </article> Elements have a name (article) and a content (...) Elements may be nested. Elements may be empty: <this_is_empty/> Element content is typically parsed character data (PCDATA), i.e., strings with special characters, and/or nested elements Source: XML for Beginners, Ralf Schenkel April 2003.

Elements vs. Attributes Elements may have attributes (in the start tag) that have a name and a value, e.g. <section number= 1 >. What is the difference between elements and attributes? Only one attribute with a given name per element (but an arbitrary number of subelements) Attributes have no structure, simply strings (while elements can have subelements) As a rule of thumb: Content into elements Metadata into attributes Example: <person born= 1912-06-23 died= 1954-06-07 > Alan Turing</person> proved that Source: XML for Beginners, Ralf Schenkel April 2003.

XML Documents as Ordered Trees article author title text Lars Nordstr öm The meaning of Life abstract To start with The section index mea ning number= 1 title= of life is... Source: XML for Beginners, Ralf Schenkel April 2003.

Well-Formed XML Documents A well-formed document must adher to, among others, the following rules: Every start tag has a matching end tag. Elements may nest, but must not overlap. There must be exactly one root element. Attribute values must be quoted. be processed by XML parsers. An element may not have two attributes with the same name. Comments and processing instructions may not appear inside tags. No unescaped < or & signs may occur inside character data. Only well-formed documents can Source: XML for Beginners, Ralf Schenkel April 2003.

Namespaces <library> <description>library of the CS Department</ description> <book bid= HandMS2000 > <title>principles of Data Mining</title> <description> Short introduction to <em>data mining</ em>, useful for the IRDM course </description> </book> Semantics of the description element is ambigous Content may be defined differently </library> Renaming may be impossible (standards!) Disambiguation of separate XML applications using unique prefixes Source: XML for Beginners, Ralf Schenkel April 2003.

Namespace Syntax <dbs:book xmlns:dbs= http://www-dbs/dbs > Prefix as abbrevation of URI Unique URI to identify the namespace Signal that namespace definition happens Source: XML for Beginners, Ralf Schenkel April 2003.

Namespace Example <dbs:book xmlns:dbs= http://www-dbs/dbs > <dbs:description>... </dbs:description> <dbs:text> <dbs:formula> <mathml:math xmlns:mathml= http:// www.w3.org/1998/math/mathml >... </mathml:math> </dbs:formula> </dbs:text> </dbs:book> Source: XML for Beginners, Ralf Schenkel April 2003.

Default Namespace Default namespace may be set for an element and its content (but not its attributes): <book xmlns= http://www-dbs/dbs > <description>...</description> <book> Can be overridden in the elements by specifying the namespace there (using prefix or default namespace) ORGANIZING AND SEARCHING INFORMATION WITH XML APRIL 29TH, 2003 25

Specifying XML files How do you specify what the tags should be in an XML file? DTD was the initial idea - Document Type definition <?xml version="1.0"?> <!DOCTYPE note [ <!ELEMENT note (to,from,heading,body)> <!ELEMENT to (#PCDATA)> <!ELEMENT from (#PCDATA)> <!ELEMENT heading (#PCDATA)> <!ELEMENT body (#PCDATA)> ]> <note> <to>tove</to> <from>jani</from> <heading>reminder</heading> <body>don't forget me this weekend!</body> </note> Source: XML for Beginners, Ralf Schenkel April 2003.

XML Schema Why not use XML to specify the format of the XML file? <xsd:schema xmlns:xsd="http://www.w3.org/2001/xmlschema"> <xsd:annotation> <xsd:documentation xlm:lang="en"> <xsd:element name="bookstore" type="bookstoretype"/> <xsd:complextype name="bookstoretype"> <xsd:sequence> <xsd:element name="name" type="xsd:string"/> <xsd:element name="topic" type="topictype" minoccurs="1"/> </xsd:sequence> </xsd:complextype> </xsd:documentation> </xsd:annotation>

Lecture Outline Quick Review The XML language Parsing Files in Java

Parsing XML We want to read XML files and store them in the working memory so that we can use and manipulate the data. Possible strategies Parse by hand (write from scratch) Parse into generic tree structure Parse as sequence of events Source: XML parsing, Andrew Siegel, University of Chicago

Parsing into generic tree structure Advantages Industry-wide, language neutral standard exists called DOM (Document Object Model) Learning DOM for one language makes it easy to learn for any other Have to write much less code to get XML to something you want to manipulate in your program Disadvantages Non-intuitive API, doesn t take full advantage of Java Still quite a bit of work Source: XML parsing, Andrew Siegel, University of Chicago

DOM abstraction layer in Java -- architecture Emphasis is on allowing vendors to supply their own DOM Implementation without requiring change to source code Returns specific parser implementation Source: org.w3d.dom.document

Sample Code A factory instance is the parser implementation. DocumentBuilderFactor factory = DocumentBuilderFactory.newInstance(); From the factory one obtains DocumentBuilder builder = an instance of the parser factory.newdocumentbuilder(); Document doc = builder.parse(xmlfile); The method parse in the builder (which has been created by the factory (by calling the method newdocumentbuilder in factory creates the object doc by reading the xmlfile, this doc is of class Document doc is the root of the document tree Source: XML parsing, Andrew Siegel, University of Chicago

Document object Once a Document object is obtained, rich API to manipulate. First call is usually Element root = doc.getdocumentelement(); This gets the root element of the Document as an instance of the Element class Note that Element subclasses Node and has methods gettype(), getname(), and getvalue(), and getchildnodes() Source: XML parsing, Andrew Siegel, University of Chicago

Types of Nodes Note that there are many types of Nodes (ie subclasses of Node: Attr, CDATASection, Comment, Document, DocumentFragment, DocumentType, Element, Entity, EntityReference, Notation, ProcessingInstruction, Text Each of these has a special and non-obvious associated type, value, and name. Standards are language-neutral and are specified on chart on following slide Source: XML parsing, Andrew Siegel, University of Chicago

Node nodename() nodevalue() Attributes nodetype() Attr Attr name Value of attribute null 2 CDATASection #cdata-section CDATA cotnent null 4 Comment #comment Comment content null 8 Document #document Null null 9 DocumentFragment #documentfragment null null 11 DocumentType Doc type name null null 10 Element Tag name null NamedNodeMap 1 Entity Entity name null null 6 EntityReference Name entitry referenced null null 5 Notation Notation name null null 1 ProcessingInstruction target Entire string null 7 Text #text Actual text null 3 Source: XML parsing, Andrew Siegel, University of Chicago

Lecture Outline Quick Review The XML language Parsing Files in Java