XML Parsing
Michael Li
Email: jwl@cs.nott.ac.uk

An XML file

    <note>
      <to>jerry</to>
      <from>tom</from>
      <heading>reminder</heading>
      <body>don't forget me this weekend!</body>
    </note>

- XML stands for EXtensible Markup Language
- XML was designed to store and transfer data
- XML tags are not predefined; you must define your own tags
- XML is designed to be self-descriptive
- An XML file represents a tree structure

XML Parsing
- Understand XML notation
- Extracting the tree from serialized XML into a format that the computer can process
- As the document is parsed, the data in the document becomes available to the application using the parser

Generic XML Parsers
- Many possible methods
- Can also perform validation and well-formedness checks
- Common types in use today:
  - Tree-based (DOM)
  - Event-driven (SAX)
  - Pull-based (Microsoft .NET)

Tree-based XML Parser
- Deserializes the XML and builds an in-memory representation of the XML tree
- Provides an API for the user to manipulate the tree
- Slower than event-based parsers
- One-size-fits-all, which can be a never-really-fits-anywhere

Event-based Parser
- Low-level interface
- Tree-based parsers are usually built on top of an event-based parser
- Sends messages to the user's code when it discovers content in the XML file
- Can work in either a push (event-driven) or a pull mode
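To make the tree-based approach concrete, here is a minimal sketch that loads the note document from the "An XML file" slide into an in-memory DOM tree and reads values back out of it. It assumes the standard JAXP DOM API that ships with the JDK; the class name NoteDomDemo and the helper textOf are hypothetical names chosen for illustration only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the original slides.
public class NoteDomDemo {
    // The note document from the "An XML file" slide.
    static final String NOTE_XML =
        "<note><to>jerry</to><from>tom</from>"
        + "<heading>reminder</heading>"
        + "<body>don't forget me this weekend!</body></note>";

    // Parses the whole document into a tree, then returns the text
    // of the first element named `tag` below the root.
    static String textOf(String xml, String tag) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();
        return root.getElementsByTagName(tag).item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("to:   " + textOf(NOTE_XML, "to"));
        System.out.println("from: " + textOf(NOTE_XML, "from"));
        System.out.println("body: " + textOf(NOTE_XML, "body"));
    }
}
```

The whole tree exists in memory before any value is read, which is what makes the API convenient and also what makes it heavier than an event-based parser.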
Trees or Events
- Which parsing is best?
- Depends on the task in hand
- Event-based: good for building your own object structures
- Tree-based: suitable for jobs that change quickly, or have a short lifetime

- Traditionally, the execution of a program is under the programmer's control
- Execution starts at a defined entry point (e.g. main())
- Continues until the program exits, executing the program line after line

- The program sometimes hands control over to the OS (e.g. to read input from the keyboard)
- Event-driven programming operates in the opposite direction
- The program surrenders control to the OS
- The OS then sends messages to the program telling it about events that have happened
- Examples include mouse input, key presses etc.

- User registers a handler for the events with the OS
- The OS calls back into the program when the event occurs

Event-based Parsing
- So how does this work for XML? Using SAX as the example
- Parser is passed a pointer to a user-implemented object
- This object supports a defined interface (ContentHandler)
- As the parser finds certain types of object in the serialized XML stream, it generates calls to the methods
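The register-then-call-back pattern described above can be sketched in a few lines. This is a toy illustration, not a real OS API: the KeyHandler interface, the EventSource class (standing in for the OS) and the method names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo of event-driven control flow; no real OS API is used.
public class CallbackDemo {
    // The contract between the event source and the program.
    interface KeyHandler {
        void keyPressed(char key);
    }

    // Stand-in for the OS: it owns the loop and decides when handlers run.
    static class EventSource {
        private final List<KeyHandler> handlers = new ArrayList<>();

        void register(KeyHandler handler) {
            handlers.add(handler);
        }

        // Simulates key presses arriving and calls back into the program.
        void simulateTyping(String keys) {
            for (char key : keys.toCharArray()) {
                for (KeyHandler handler : handlers) {
                    handler.keyPressed(key);
                }
            }
        }
    }

    // Registers a handler that records every key it is told about.
    static String collect(String keys) {
        StringBuilder seen = new StringBuilder();
        EventSource os = new EventSource();
        os.register(key -> seen.append(key));
        os.simulateTyping(keys);
        return seen.toString();
    }

    public static void main(String[] args) {
        System.out.println(collect("abc"));
    }
}
```

The program never asks for the next key; it only reacts when told. An event-based XML parser plays the role of EventSource, with elements and text taking the place of key presses.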
Parsing
- This is exactly what happens in an event-based parser
- The parser recognizes each XML tag and calls the appropriate method on the interface
- An object that implements the interface will then be notified of the relevant parts of the XML document

    <?xml version="1.0"?>
    <Document>
      Hello World
      <Bold>
        Goodbye Universe!
      </Bold>
    </Document>

    StartDocument();
    StartElement("Document");
    Characters("Hello World");
    StartElement("Bold");
    Characters("Goodbye Universe!");
    EndElement("Bold");
    EndElement("Document");
    EndDocument();

SAX
- SAX (Simple API for XML) defines an interface called ContentHandler
- The parser is passed a reference to an object that implements the interface
- The SAX parser then calls the relevant methods on that object in response to the XML document

    public interface ContentHandler {
        public void setDocumentLocator(Locator locator);
        public void startDocument() throws SAXException;
        public void endDocument() throws SAXException;
        public void startPrefixMapping(String prefix, String uri) throws SAXException;
        public void endPrefixMapping(String prefix) throws SAXException;
        public void startElement(String namespaceURI, String localName,
                                 String rawName, Attributes atts) throws SAXException;
        public void endElement(String namespaceURI, String localName,
                               String rawName) throws SAXException;
        public void characters(char ch[], int start, int length) throws SAXException;
        public void ignorableWhitespace(char ch[], int start, int length) throws SAXException;
        public void processingInstruction(String target, String data) throws SAXException;
        public void skippedEntity(String name) throws SAXException;
    }

- No need to implement unused methods, since SAX provides a default "does nothing" implementation
- Therefore extend DefaultHandler rather than implementing ContentHandler directly

public void startDocument() throws SAXException;
public void endDocument() throws SAXException;
- Self-explanatory
- Called only once, when parsing starts and when it finishes respectively
- endDocument() can be a useful place to put code to process the deserialized data
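The call sequence shown for the Document example can be observed with a small handler that simply records each callback. This is a sketch assuming the JDK's built-in JAXP SAX parser; the names EventTraceDemo, TraceHandler and trace are hypothetical. Whitespace-only text is skipped, and because SAX may deliver text in several chunks, the exact number of characters entries is not guaranteed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original slides.
public class EventTraceDemo {
    static final String DOCUMENT_XML =
        "<?xml version=\"1.0\"?>"
        + "<Document>Hello World<Bold>Goodbye Universe!</Bold></Document>";

    // Records one line per callback; unused callbacks keep DefaultHandler's no-op.
    static class TraceHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override public void startDocument() { events.add("startDocument"); }
        @Override public void endDocument() { events.add("endDocument"); }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            events.add("startElement(" + qName + ")");
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("endElement(" + qName + ")");
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Only the slice [start, start + length) belongs to this call.
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                events.add("characters(" + text + ")");
            }
        }
    }

    static List<String> trace(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TraceHandler handler = new TraceHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : trace(DOCUMENT_XML)) {
            System.out.println(event);
        }
    }
}
```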
public void startElement(String namespaceURI, String localName, String qName, Attributes atts);
public void endElement(String namespaceURI, String localName, String qName);
- Each element in the document invokes these methods
- namespaceURI and qName (rawName) are used when dealing with multiple namespaces and can be ignored for our purposes
- localName contains the name of the element

Attributes
- startElement() methods are passed a reference to an Attributes object
- User can obtain the value of an attribute by using the method getValue(String attName)

    String date = atts.getValue("number");

public void characters(char ch[], int start, int length);
- Called whenever PCDATA is parsed
- Note the start parameter: do not expect your data to start at ch[0]
- SAX does not define how it will pass the PCDATA into this function
- You may get called once for each character!

State
- SAX is stateless
- If you need to know state, then you must maintain it yourself
- Recall Finite State Automata

A SAX example
    // Imports needed by this fragment and by CCountTags below
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // Driver code (place inside a method such as main())
    String file = "memo.xml";

    // try block to create and use the parser
    try {
        // create the parser as a SAX2 parser and set its handlers
        SAXParserFactory fact = SAXParserFactory.newInstance();
        fact.setValidating(false);
        fact.setNamespaceAware(true);
        SAXParser parser = fact.newSAXParser();

        // start the parser
        parser.parse(file, new CCountTags());
    }
    // catch any errors from either the parser or the parser setup
    catch (Exception e) {
        System.err.println(e.getMessage());
    }

    // Handler that counts start and end tags
    public class CCountTags extends DefaultHandler {
        private int m_cStartElements;
        private int m_cEndElements;

        public void startDocument() {
            m_cStartElements = 0;
            m_cEndElements = 0;
        }

        public void startElement(String namespaceURI, String localName,
                                 String rawName, Attributes atts) {
            m_cStartElements++;
        }

        public void endElement(String namespaceURI, String localName,
                               String rawName) {
            m_cEndElements++;
        }

        public void endDocument() {
            System.out.println("Found " + m_cStartElements + " start elements");
            System.out.println("Found " + m_cEndElements + " end elements");
        }
    }
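In the same spirit as CCountTags, the points about characters() and state can be combined in one handler. The sketch below keeps its own "am I inside body?" flag and accumulates text across however many characters() calls the parser chooses to make. It assumes the JDK's built-in SAX parser and reuses the note document from the first slides; BodyTextDemo, BodyHandler and extractBody are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original slides.
public class BodyTextDemo {
    static final String NOTE_XML =
        "<note><to>jerry</to><from>tom</from>"
        + "<heading>reminder</heading>"
        + "<body>don't forget me this weekend!</body></note>";

    static class BodyHandler extends DefaultHandler {
        // SAX keeps no state for us, so we track where we are ourselves.
        private boolean inBody = false;
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if ("body".equals(qName)) {
                inBody = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(qName)) {
                inBody = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Append only this call's slice; the text may arrive in pieces.
            if (inBody) {
                text.append(ch, start, length);
            }
        }

        String bodyText() {
            return text.toString();
        }
    }

    static String extractBody(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        BodyHandler handler = new BodyHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.bodyText();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractBody(NOTE_XML));
    }
}
```

The boolean flag is the simplest possible finite state automaton, with two states. A handler that needs to distinguish more contexts would typically replace it with an enum or a stack of element names.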
Readings
- "SAX, the power API" (Benoît Marchal, developerWorks, August 2001): Learn about when to use the SAX API instead of DOM, plus get an overview of commonly used SAX interfaces and detailed examples in a Java-based application with many code samples.
- "Simplify XML programming with JDOM" (Wes Biggs and Harry Evans, developerWorks, May 2001): Explore an alternate object API that is optimized for the Java language.