XML in programming Patryk Czarnik Institute of Informatics University of Warsaw XML and Modern Techniques of Content Management 2012/13
Introduction XML in programming XML in programming what for? To access data in XML format To use XML as data carrier (storage and transmission) To support XML applications (Web, content management, ) To make use of XML-related standards (XML Schema, XInclude, XSLT, XQuery, XLink, ) To develop or make use of XML-based software technology (Web Services etc) Patryk Czarnik 10 Programming XML 2012/13 4 / 1
Introduction XML in programming how? XML in programming Bad way Treat XML as plain text files and write low-level XML support from scratch Better approach Use existing libraries and tools Even better Use standardised interfaces independent from particular suppliers Patryk Czarnik 10 Programming XML 2012/13 5 / 1
XML in Java Introduction XML in programming Java motivation for us One of the most popular programming languages Very good XML support Standards Both included in Java Standard Edition from 60 Java API for XML Processing (JAXP) many interfaces and few classes, factories and pluggability layer support for XML parsing and serialisation (DOM, SAX, StAX) support for XInclude, XML Schema, XPath, XSLT Java API for XML Binding (JAXB) binding between Java objects and XML documents Patryk Czarnik 10 Programming XML 2012/13 6 / 1
Models Classification of XML access models Document read into memory generic interface example: DOM interface depending on document type (schema) example: JAXB Document processed node by node event model (push parsing) example: SAX streaming model (pull parsing) example: StAX Patryk Czarnik 10 Programming XML 2012/13 8 / 1
Models Document tree (DOM) Document Object Model Whole document in memory, as tree of objects W3C Recommendations DOM Level 1 1998 DOM Level 3 2004 Specification split into many modules (separate recommendations) Dom Core the most important DOM module for XML Standard independent of programming language (interfaces given in IDL) Java realisation presented here Patryk Czarnik 10 Programming XML 2012/13 9 / 1
DOM important interfaces Node NodeList NamedNodeMap Element Attr Entity Document Processing Instruction CharacterData Entity Reference DocumentFragment Comment Text DocumentType CDATASection
Models Document tree (DOM) Node interface chosen methods Content access getattributes() getchildnodes() getfirstchild() getlastchild() getnextsibling() getnodename() getnodevalue() getnodetype() getownerdocument() getparentnode() haschildnodes() Content modification appendchild(node) insertbefore(node, Node) removechild(node) replacechild(node, Node) setnodevalue(string) setnodename(string) Copying clonenode(boolean) importnode() Patryk Czarnik 10 Programming XML 2012/13 11 / 1
Example introduction Models Document tree (DOM) Example document <department id="acc"> <name>accountancy</name> <person id="102103" position="specialist"> <first-name>dawid</first-name> <last-name>paszkiewicz</last-name> <phone type="office">+48223213203</phone> <phone type="fax">+48223213200</phone> <email>paszkiewicz@examplecom</email> <salary>2222</salary> </person> <person id="102104" position="chief"> <first-name>monika</first-name> <last-name>domżałowicz</last-name> <salary>3500</salary> </person> Problem: Count sum of salaries of assistants Patryk Czarnik 10 Programming XML 2012/13 12 / 1
DOM example (1) Models Document tree (DOM) Program int r e s u l t = 0; DocumentBuilderFactory f a c t o r y = DocumentBuilderFactory newinstance ( ) ; DocumentBuilder b u i l d e r = f a c t o r y newdocumentbuilder ( ) ; Document doc = b u i l d e r parse ( args [ 0 ] ) ; Node cur = doc g e t F i r s t C h i l d ( ) ; while ( cur getnodetype ( )!= Node ELEMENT NODE) cur = cur getnextsibling ( ) ; cur = cur g e t F i r s t C h i l d ( ) ; while ( cur!= null ) { i f ( cur getnodetype ( ) == Node ELEMENT NODE) { Node a t t r = cur g e t A t t r i b u t e s ( ) getnameditem ( p o s i t i o n ) ; i f ( a t t r!= null && s p e c i a l i s t equals ( a t t r getnodevalue ( ) ) ) { r e s u l t += processperson ( cur ) ; } } cur = cur getnextsibling ( ) ; } Patryk Czarnik 10 Programming XML 2012/13 13 / 1
DOM example (2) Models Document tree (DOM) Method processperson private static int processperson ( Node person ) { Node cur = person g e t F i r s t C h i l d ( ) ; while ( cur!= null ) { i f ( cur getnodetype ( ) == Node ELEMENT NODE && s a l a r y equals ( cur getnodename ( ) ) ) { S t r i n g B u f f e r buf = new S t r i n g B u f f e r ( ) ; Node c h i l d = cur g e t F i r s t C h i l d ( ) ; while ( c h i l d!= null ) { i f ( c h i l d getnodetype ( ) == Node TEXT NODE ) buf append ( c h i l d getnodevalue ( ) ) ; c h i l d = c h i l d getnextsibling ( ) ; } return Integer parseint ( buf t o S t r i n g ( ) ) ; } cur = cur getnextsibling ( ) ; } return 0; / / no s alar y found } Patryk Czarnik 10 Programming XML 2012/13 14 / 1
Models Document tree (DOM) DOM two styles of programming Generic approach use only Node interface nodetype property kind of node properties: nodename, nodevalue, childnodes content access and modification methods: appendchild(node), removechild(node) structure modification Object-oriented approach use specialised interfaces Document: getdocumentelement(), getelementbyid(string) Element: getelementsbytagname(string), getattribute(string), setattribute(string, String) Attribute: boolean getspecified() Text: String substringdata(int, int), insertdata(int, String) Patryk Czarnik 10 Programming XML 2012/13 15 / 1
Models Document tree (DOM) DOM with subclasses example (1) Program int r e s u l t = 0; DocumentBuilderFactory f a c t o r y = DocumentBuilderFactory newinstance ( ) ; DocumentBuilder b u i l d e r = f a c t o r y newdocumentbuilder ( ) ; Document doc = b u i l d e r parse ( args [ 0 ] ) ; Element root = doc getdocumentelement ( ) ; NodeList l i s t = root getelementsbytagname ( person ) ; for ( int i = 0; i < l i s t getlength ( ) ; ++i ) { Element person = ( Element ) l i s t item ( i ) ; i f ( s p e c i a l i s t equals ( person g e t A t t r i b u t e ( p o s i t i o n ) ) ) r e s u l t += processperson ( person ) ; } Patryk Czarnik 10 Programming XML 2012/13 16 / 1
Models Document tree (DOM) DOM with subclasses example (2) Method processperson private static int processperson ( Element person ) { NodeList l i s t = person getelementsbytagname ( s a l ary ) ; i f ( l i s t getlength ( ) > 0) { S t r i n g value = l i s t item ( 0 ) gettextcontent ( ) ; return Integer parseint ( value ) ; } else { return 0; } } Patryk Czarnik 10 Programming XML 2012/13 17 / 1
XML binding idea Models XML binding (JAXB) XML documents and objects (eg Java): schema (type) corresponds to class, document/node corresponds to object Implementations: JAXB (Sun), Castor (Exolab), XML Beans (Apache) Patryk Czarnik 10 Programming XML 2012/13 18 / 1
Models Java API for XML Binding (JAXB) XML binding (JAXB) Inspired by Sun, supported by javanet Current version: 20 Motivated by Web Services Components generic API fragment two-way translation specification customisation and annotations Patryk Czarnik 10 Programming XML 2012/13 19 / 1
JAXB usage scenarios Models XML binding (JAXB) Schema to Java 1 Preparation of schema 2 Schema to Java compilation (XJC) translation customisable result: annotated classes basing on schema types 3 Developing application making use of generic part of JAXB API generated classes Java to Schema 1 Preparation of model classes 2 Adding annotations to classes 3 Making use of generic JAXB API to perform marshalling, unmarshalling etc 4 (Optional) Generating schema basing on classes Patryk Czarnik 10 Programming XML 2012/13 20 / 1
JAXB example (1) Schema and generated classes Schema Models XML binding (JAXB) <xs:element name="department"> <xs:complextype> <xs:sequence> <xs:element name="name" type="xs:string" minoccurs="0"/> <xs:element name="person" type="tperson" maxoccurs="unbounded"/> </xs:sequence> <xs:attribute name="id" type="xs:string"/> </xs:complextype> </xs:element> Generated class @XmlRootElement (name = department ) public class Department { protected S t r i n g name; protected L i s t <TPerson> person ; @XmlAttribute (name = i d ) protected S t r i n g i d ; Patryk Czarnik 10 Programming XML 2012/13 21 / 1
JAXB example (2) Models XML binding (JAXB) Schema <xs:complextype name="tperson"> <xs:sequence> <xs:element name="first-name" type="xs:string"/> <xs:element name="last-name" type="xs:string"/> <xs:element name="phone" type="tphone" minoccurs="0" maxoccurs="unbounded"/> <xs:element name="email" type="xs:token" minoccurs="0" maxoccurs="unbounded"/> <xs:element name="salary" type="xs:int" minoccurs="0"/> </xs:sequence> <xs:attribute name="id" type="xs:integer" use="required"/> <xs:attribute name="position" type="xs:token"/> </xs:complextype> Patryk Czarnik 10 Programming XML 2012/13 22 / 1
JAXB example (3) Models XML binding (JAXB) Generated class public class TPerson { @XmlElement (name = f i r s t name, required = true ) protected S t r i n g firstname ; @XmlElement (name = l a s t name, required = true ) protected S t r i n g lastname ; protected L i s t <TPhone> phone ; @XmlJavaTypeAdapter ( CollapsedStringAdapter class ) protected L i s t <String> email ; protected Integer sala ry ; @XmlAttribute (name = i d, required = true ) protected BigInteger i d ; @XmlAttribute (name = p o s i t i o n ) protected S t r i n g p o s i t i o n ; Patryk Czarnik 10 Programming XML 2012/13 23 / 1
JAXB example(4) Models XML binding (JAXB) Program JAXBContext j c = JAXBContext newinstance ( Department class ) ; Unmarshaller u = j c createunmarshaller ( ) ; Department doc = ( Department ) u unmarshal (new FileInputStream ( args [ 0 ] ) ) ; L i s t <TPerson> employees = doc getperson ( ) ; for ( TPerson person : employees ) { i f ( s p e c i a l i s t equals ( person g e t P o s i t i o n ( ) ) ) { i f ( person getsalary ( )!= null ) r e s u l t += person getsalary ( ) ; } } System out p r i n t l n ( Result : +r e s u l t ) ; Patryk Czarnik 10 Programming XML 2012/13 24 / 1
Models Event model (SAX) Processing documents node by node motivation Whole document in memory (DOM, JAXB) Convenient, high-level approach Cost for efficiency memory usage creating tree with references (DOM) translating data between text and binary format (JAXB) Reading document node by node Efficient possible processing in constant memory less operations to perform Harder to develop complex logic Patryk Czarnik 10 Programming XML 2012/13 25 / 1
Event model idea Models Event model (SAX) Document seen as a stream of events: element beginning, text node, element end, ), Programmer registers code to be run during processing Parser reads document and calls programmer s methods relevant to particular events data (element names, text fragments) given in parameters Separation of responsibility Parser responsible for physical-level processing Programmer responsible for logical-level processing Patryk Czarnik 10 Programming XML 2012/13 26 / 1
SAX typical usage (1) Models Event model (SAX) Status Simple API for XML Version 10 in 1998 Interfaces specified in Java Idea applicable for other programming languages Typical usage 1 Programmer-provided class implementing ContentHandler 2 Optionally classes implementing ErrorHandler, DTDHandler, or EntityResolver one class may implement all of them DefaultHandler convenient base class to start with Patryk Czarnik 10 Programming XML 2012/13 27 / 1
SAX typical usage (2) Models Event model (SAX) Typical application schema 1 Obtain XMLReader (or SAXParser) from factory 2 Create ContentHandler instance 3 Register handler in reader 4 Invoke parse method 5 Parser conducts processing and calls methods of our ContentHandler 6 Use data collected by ContentHandler Patryk Czarnik 10 Programming XML 2012/13 28 / 1
Models SAX example (1) ContentHandler implementation Event model (SAX) class StaffHandler extends DefaultHandler { enum State {EXTERNAL, PERSON, SALARY }; private int r e s u l t = 0; private State state = State EXTERNAL ; private S t r i n g B u f f e r buf ; public int getresult ( ) { return r e s u l t ; } public void startelement ( S t r i n g uri, S t r i n g localname, S t r i n g qname, A t t r i b u t e s a t t r i b u t e s ) throws SAXException { i f ( person equals (qname ) ) { S t r i n g a t t r V a l = a t t r i b u t e s getvalue ( p o s i t i o n ) ; i f ( s p e c i a l i s t equals ( a t t r V a l ) ) state = State PERSON ; } else i f ( s a l a r y equals (qname ) ) { i f ( state == State PERSON) { state = State SALARY ; buf = new S t r i n g B u f f e r ( ) ; } } } Patryk Czarnik 10 Programming XML 2012/13 29 / 1
SAX example (2) Models Event model (SAX) StaffHandler continued public void characters ( char [ ] ch, int s t a r t, int length ) throws SAXException { i f ( state == State SALARY ) buf append ( ch, s t a r t, length ) ; } public void endelement ( S t r i n g uri, S t r i n g localname, S t r i n g qname) throws SAXException { i f ( person equals (qname ) ) { i f ( state == State PERSON) { state = State EXTERNAL ; } } else i f ( s a l a r y equals (qname ) ) { i f ( state == State SALARY ) { state = State PERSON ; r e s u l t += Integer parseint ( buf t o S t r i n g ( ) ) ; } } } } Patryk Czarnik 10 Programming XML 2012/13 30 / 1
SAX example (3) Models Event model (SAX) Main program (Traditional SAX version) XMLReader reader = XMLReaderFactory createxmlreader ( ) ; StaffHandler handler = new StaffHandler ( ) ; reader setcontenthandler ( handler ) ; reader parse (new InputSource ( args [ 0 ] ) ) ; System out p r i n t l n ( Result : +handler getresult ( ) ) ; Main program (JAXP version) SAXParserFactory f a c t o r y = SAXParserFactory newinstance ( ) ; SAXParser parser = f a c t o r y newsaxparser ( ) ; StaffHandler handler = new StaffHandler ( ) ; parser parse ( args [ 0 ], handler ) ; System out p r i n t l n ( Result : +handler getresult ( ) ) ; Patryk Czarnik 10 Programming XML 2012/13 31 / 1
SAX filters Models Event model (SAX) Implement XMLFilter interface, indirectly also XMLReader behave like parser, but receive events from another XMLReader (parser or filter) may be chained Default implementation: XMLFilterImpl: implements ContentHandler, ErrorHandler and so on passes all events through Capabilities: events filtering changing data or structure, by calling subsequent handler methods with changed parameters on-line document processing processing by many modules in one-pass reading Patryk Czarnik 10 Programming XML 2012/13 32 / 1
SAX filter example public void characters ( char [ ] ch, int s t a r t, int length ) throws SAXException { i f ( pass ) super characters ( ch, s t a r t, length ) ; } } public class A s s i s t a n t F i l t e r extends XMLFilterImpl { private boolean pass = true ; public void startelement ( S t r i n g uri, S t r i n g localname, S t r i n g qname, A t t r i b u t e s a t t s ) throws SAXException { i f ( person equals (qname) && a s s i s t a n t equals ( a t t s getvalue ( p o s i t i o n ) ) ) pass = false ; i f ( pass ) super startelement ( uri, localname, qname, a t t s ) ; } public void endelement ( S t r i n g uri, S t r i n g localname, S t r i n g qname) throws SAXException { i f ( pass ) super endelement ( uri, localname, qname ) ; i f ( person equals (qname ) ) czyprzepuszczac = true ; }
Models Stream model (pull parsing) Stream model (StAX) Alternative for event model application pulls events/nodes from parser processing controlled by application, not parser idea analogous to: iterator, cursor, etc Advantages of SAX saved high efficiency possible to process large documents Standardisation Common XmlPull API Java Community Process, JSR 173: Streaming API for XML Patryk Czarnik 10 Programming XML 2012/13 34 / 1
Models StAX Streaming API for XML Stream model (StAX) Important interfaces XMLStreamReader low level parser hasnext(), int next(), int geteventtype(), getname(), getvalue(), getattributevalue(), XMLEventReader (relatively) high level parser XMLEvent next(), XMLEvent peek() XMLEvent represents event in high level approach geteventtype(), isstartelement(), ischaracters(), podinterfejsy StartElement, Characters, XMLStreamWriter, XMLEventWriter XMLStreamFilter, XMLEventFilter Patryk Czarnik 10 Programming XML 2012/13 35 / 1
StAX example (1) Models Stream model (StAX) Program private static XMLStreamReader reader ; public void run ( ) { int r e s u l t = 0; XMLInputFactory f a c t o r y = XMLInputFactory newinstance ( ) ; reader = f a c t o r y createxmlstreamreader (new FileInputStream ( args [ 0 ] ) ) ; } while ( reader hasnext ( ) ) { int eventtype = reader next ( ) ; i f ( eventtype == XMLStreamConstants START ELEMENT ) { i f ( person equals ( reader getlocalname ( ) ) ) { S t r i n g a t t r V a l = reader getattributevalue ( null, p o s i t i o n ) ; i f ( s p e c i a l i s t equals ( a t t r V a l ) ) { r e s u l t += this processgroup ( ) ; } } } } } reader close ( ) ; System out p r i n t l n ( Result : +r e s u l t ) ; Patryk Czarnik 10 Programming XML 2012/13 36 / 1
StAX example (2) Models Stream model (StAX) Method processperson private int processperson ( ) throws XMLStreamException { int r e s u l t = 0; } while ( reader hasnext ( ) ) { int eventtype = reader next ( ) ; switch ( eventtype ) { case XMLStreamConstants START ELEMENT : i f ( s a l a r y equals ( reader getlocalname ( ) ) ) { S t r i n g val = reader getelementtext ( ) ; r e s u l t += Integer parseint ( val ) ; } break ; case XMLStreamConstants END ELEMENT : i f ( person equals ( reader getlocalname ( ) ) ) { return r e s u l t ; } } } return r e s u l t ; Patryk Czarnik 10 Programming XML 2012/13 37 / 1
Models Which model to choose? (1) Comparison Document tree in memory: small documents (must fit in memory) concurrent access to many nodes creating new and editing existing documents in place Generic document model (like DOM): not established or not known structure of documents lower efficiency accepted XML binding (like JAXB): established and known structure of documents XML as data serialisation method Patryk Czarnik 10 Programming XML 2012/13 38 / 1
Models Which model to choose? (2) Comparison Processing node by node potentially large documents relatively simple, local operations efficiency is the key factor Event model (like SAX): using already written logic (SAX is more mature) filtering of events asynchronous events several aspects of processing during one reading of document Streaming model (like StAX): processing depending on context; complex states processing should end after data is found reading several documents simultaneously Patryk Czarnik 10 Programming XML 2012/13 39 / 1
Support for XML-related standards Validation Validation against DTD during parsing Parsing using validating parser SAXParserFactory factory = SAXParserFactorynewInstance(); factorysetvalidating(true); XMLReader reader = factorynewsaxparser()getxmlreader(); readersetcontenthandler(mojconenthandler); readerseterrorhandler(mojerrorhandler); readerparse(args[0]); Note Ṿalidating parser defined formally in XML Recommendation Patryk Czarnik 10 Programming XML 2012/13 41 / 1
Support for XML-related standards Validation Validation against XML Schema during parsing One of possible approaches SchemaFactory schemafactory = SchemaFactorynewInstance(XMLConstantsW3C_XML_SCHEMA_NS_URI); Schema schema = schemafactorynewschema(new StreamSource("staffxsd")); SAXParserFactory factory = SAXParserFactorynewInstance(); factorysetvalidating(false); factorysetschema(schema); factorysetnamespaceaware(true); XMLReader reader = factorynewsaxparser()getxmlreader(); readersetcontenthandler(myconenthandler); readerseterrorhandler(myerrorhandler); readerparse(args[0]); Patryk Czarnik 10 Programming XML 2012/13 42 / 1
Support for XML-related standards Validation Validators, saving DOM tree to file In-memory validation of DOM tree SchemaFactory schemafactory = SchemaFactorynewInstance(XMLConstantsW3C_XML_SCHEMA_NS_URI); Schema schema = schemafactorynewschema(new StreamSource("staffxsd")); Validator validator = schemanewvalidator(); validatorseterrorhandler(myerrorhandler); validatorvalidate(new DOMSource(doc)); DOM tree saved using DOM Load and Save interface DOMImplementationLS lsimpl = (DOMImplementationLS)domImplgetFeature("LS", "30"); LSSerializer ser = lsimplcreatelsserializer(); LSOutput out = lsimplcreatelsoutput(); outsetbytestream(new FileOutputStream("outxml")); serwrite(doc, out); Patryk Czarnik 10 Programming XML 2012/13 43 / 1
Support for XML-related standards Transformers for XSLT Transformers XSLT run for external XML files TransformerFactory t f = TransformerFactory newinstance ( ) ; Transformer transformer = t f newtransformer ( new StreamSource ( s t a f f h t m l x s l ) ) ; Source src = new StreamSource ( s t a f f xml ) ; Result res = new StreamResult ( r e s u l t html ) ; transformer transform ( src, res ) ; Patryk Czarnik 10 Programming XML 2012/13 44 / 1
Support for XML-related standards Transformers Transformers other applications SAX filter + Transformer = on-line document processing SAXParserFactory p a r s e r f a c t = SAXParserFactory newinstance ( ) ; XMLReader reader = p a r s e r f a c t newsaxparser ( ) getxmlreader ( ) ; TransformerFactory t f = TransformerFactory newinstance ( ) ; Transformer transformer = t f newtransformer ( ) ; XMLFilter f i l t e r = new A s s i s t a n t F i l t e r ( ) ; f i l t e r setparent ( reader ) ; InputSource doc = new InputSource ( args [ 0 ] ) ; Source src = new SAXSource ( f i l t e r, doc ) ; Result res = new StreamResult ( out xml ) ; transformer transform ( src, res ) ; Patryk Czarnik 10 Programming XML 2012/13 45 / 1
Support for XML-related standards Transformers Source, Result, and Transformers Source and Result universal wrappers for various Java representations of XML data Transformer translation between them Source DOMSource JAXBSource SAXSource StAXSource StreamSource Result DOMResult JAXBResult SAXResult StAXResult StreamResult SAAJResult Patryk Czarnik 10 Programming XML 2012/13 46 / 1