Introduction to XML Parsing

Based on Lecture Notes by Qiang Yang, SFU
Thanks: Ethan Cerami, New York University

Road Map
- What is a Parser?
  - Defining Parser Responsibilities
  - Evaluating Parsers
- Validation
  - Validating v. Nonvalidating Parsers
- XML Interfaces
  - Object/Tree v. Event Based Interfaces
  - Interface Standards: DOM, SAX
- Java XML Parsers

The Big Picture

What is an XML Parser?
  [Java Application or Servlet] -- [XML Parser] -- [XML Document]
An XML Parser enables your Java application or Servlet to more easily access XML data.
Defining Parser Responsibilities

Parser Responsibilities
An XML Parser has three main responsibilities:
1. Retrieve and read an XML document
   - For example, the file may reside on the local file system or on another web site.
   - The parser takes care of all the necessary network and/or file connections.
   - This simplifies your work, since you do not need to create your own network connections.
2. Ensure that the document adheres to specific standards
   - Does the document match the DTD?
   - Is the document well-formed?
3. Make the document contents available to your application
   - The parser parses the XML document and makes this data available to your application.

Why use an XML Parser?
- If your application is going to use XML, you could write your own parser.
- But it makes more sense to use a prebuilt XML parser.
- This enables you to build your application much more quickly.

Evaluating Parsers
Questions to ask
When evaluating which XML Parser to use, there are two very important questions to ask:
- Is the parser validating or nonvalidating?
- What interface does the parser provide to the XML document?
We will explore each of these questions in detail.

XML Validation
- Validating parser: a parser that verifies that the XML document adheres to the DTD.
- Nonvalidating parser: a parser that does not check the DTD.
- Lots of parsers provide an option to turn validation on or off.

Performance and Memory
Questions:
- Which parser will have better performance?
- Which parser will take up less memory?
Validating parsers: more useful (robust), slower, take up more memory
Nonvalidating parsers: less useful, faster, take up less memory
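The validation on/off option mentioned above can be sketched with the JAXP API that ships with the JDK. This is only an illustration, not code from the slides: the class name ValidationDemo, the inline DTD, and the deliberately invalid document are all assumptions made for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class ValidationDemo {
    // Hypothetical document: well-formed, but invalid because the
    // inline DTD requires a CITY child that WEATHER does not have.
    static final String XML =
          "<?xml version=\"1.0\"?>"
        + "<!DOCTYPE WEATHER ["
        + "  <!ELEMENT WEATHER (CITY)>"
        + "  <!ELEMENT CITY (#PCDATA)>"
        + "]>"
        + "<WEATHER></WEATHER>";

    /** Parses XML and returns how many validity errors the parser reported. */
    static int countValidityErrors(boolean validating) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(validating); // the on/off switch for DTD validation
        DocumentBuilder builder = factory.newDocumentBuilder();

        final int[] errors = {0};
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignored in this sketch */ }
            public void error(SAXParseException e) { errors[0]++; } // validity errors land here
            public void fatalError(SAXParseException e) throws SAXParseException {
                throw e; // well-formedness errors stop the parse
            }
        });
        builder.parse(new InputSource(new StringReader(XML)));
        return errors[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println("nonvalidating errors: " + countValidityErrors(false));
        System.out.println("validating errors:    " + countValidityErrors(true));
    }
}
```

With validation off, the same document parses without any reported errors, because a nonvalidating parser only checks well-formedness.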
Performance and Memory
Therefore, when high performance and low memory are the most important criteria, use a nonvalidating parser.
Examples:
- Java applets
- Palm Pilot applications

XML Interfaces

General Architecture
  [Java Application or Servlet] -- [XML Parser] -- [XML Document]
- The parser sits between your application and your data.
- What's the best way to extract that data?

XML Interfaces
Broadly, there are two types of interfaces provided by XML Parsers:
- Object/Tree interface
- Event based interface
Let's examine each of these in detail.
Object/Tree Interface
Definition: The parser reads the XML document and creates an in-memory tree of data.
For example: given the sample XML document below, what kind of tree would be produced?

Sample XML Document
  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE WEATHER SYSTEM "Weather.dtd">
  <WEATHER>
    <CITY NAME="Vancouver">
      <HI></HI>
      <LOW></LOW>
    </CITY>
  </WEATHER>

Object Tree for the sample XML Document
  Weather
    City
      Hi
        Text:
      Low
        Text:
- The tree represents the hierarchy of the XML document.
- Note the Text nodes.

Event Based Parser
Definition: The parser reads the XML document and generates an event for each item it encounters while parsing.
For example: given the same XML document, what events would be generated?
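Before moving on to events, the object tree described above can be reproduced with the DOM API in the JDK. This is a minimal sketch under stated assumptions: the class name DomTreeDemo is made up, the DOCTYPE line is omitted so no external Weather.dtd is needed, and the temperatures 25 and 15 are hypothetical placeholder values.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomTreeDemo {
    // Hypothetical values 25 and 15; written on one line so the tree
    // contains no whitespace-only text nodes.
    static final String XML =
        "<WEATHER><CITY NAME=\"Vancouver\"><HI>25</HI><LOW>15</LOW></CITY></WEATHER>";

    /** Renders a node and its descendants as an indented outline. */
    static String outline(Node node, int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append("Text: ").append(node.getNodeValue());
        } else {
            sb.append(node.getNodeName());
        }
        sb.append('\n');
        // Recurse into the children in document order.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            sb.append(outline(child, depth + 1));
        }
        return sb.toString();
    }

    /** Builds the in-memory tree and returns its outline. */
    static String parseAndOutline(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return outline(doc.getDocumentElement(), 0);
    }

    public static void main(String[] args) throws Exception {
        System.out.print(parseAndOutline(XML));
    }
}
```

Running main prints WEATHER, CITY, HI, and LOW at increasing indentation, with a Text line under HI and under LOW. The whole document is held in memory before the application sees any of it.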
Sample XML Document
  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE WEATHER SYSTEM "Weather.dtd">
  <WEATHER>
    <CITY NAME="Vancouver">
      <HI></HI>
      <LOW></LOW>
    </CITY>
  </WEATHER>

XML Parsing Events
Events generated:
1. Start of <WEATHER> element
2. Start of <CITY> element
3. Start of <HI> element
4. Character event
5. End of <HI> element
6. Start of <LOW> element
7. Character event
8. End of <LOW> element
9. End of <CITY> element
10. End of <WEATHER> element

Event Based Interface
- For each of these events, your application implements an event handler.
- Each type of event invokes its own event handler.
- Your application intercepts these events and handles them in any way you want.

Performance and Memory
Questions:
- Which parser is faster?
- Which parser takes up less memory?
Tree based: more useful, slower, takes up more memory
Event based: less useful, faster, takes up much less memory
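To illustrate "handles them in any way you want", here is a small sketch of an event handler that keeps only the HI value and ignores everything else, so no tree is ever built. It uses the SAX classes bundled with the JDK. The class name HighTempHandler, the helper findHi, and the sample values are assumptions for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class HighTempHandler extends DefaultHandler {
    private boolean insideHi = false;                 // true between <HI> and </HI>
    private final StringBuilder hi = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("HI".equals(qName)) {
            insideHi = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so append rather than assign.
        if (insideHi) {
            hi.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("HI".equals(qName)) {
            insideHi = false;
        }
    }

    /** Parses the document and returns the text found inside HI. */
    static String findHi(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        HighTempHandler handler = new HighTempHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.hi.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample values; the DOCTYPE is omitted to avoid needing Weather.dtd.
        String xml = "<WEATHER><CITY NAME=\"Vancouver\"><HI>25</HI><LOW>15</LOW></CITY></WEATHER>";
        System.out.println("HI = " + findHi(xml));
    }
}
```

The handler stores one small buffer regardless of document size, which is the reason event based parsing tends to use much less memory than building a tree.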
" = 543 %$# 3HUIRUPF HPRU\ Therefore, when high performance and lowmemory are the most important criteria, use an eventbased parser Examples: Java applets Palm Pilot Applications ;,QWHUIDF 6WGDUGV ;,QWHUIDF 6WGDUGV ' Standards are important: Easier to create XML applications You can swap parsers as your application evolves There are two main XML Interface standards: Tree Based: Document Object Model (DOM) Event Based: Simple API for XML (SAX) Document Object Model Tree Based Interface Developed by the W3C Supports both XML and HTML Originally specified using an IDL (Interface Definition Language) Hence, DOM Versions exist for Java, JavaScript, C++, Perl, Python :;9< 678 *+), &'(
SAX
- Simple API for XML
- Event based
- Developed by volunteers on the xml-dev mailing list
- http://www.saxproject.org

Java XML Parsers
There are currently dozens of XML Parsers written in Java.

IBM XML Parser for Java
- One of the first parsers
- Widely used
- Validating and nonvalidating options
- Supports DOM and SAX
- http://www.alphaworks.ibm.com/tech/xml4j

Sun (Java XML Pack)
- Validating and nonvalidating options
- Supports DOM, SAX
- http://java.sun.com/xml/javaxmlpack.html

Aelfred XML Parser
- Very lightweight, low memory usage
- Nonvalidating, event-based interface
- Good for building applets
- http://www.opentext.com/microstar
Java XML Parsers

Apache Xerces
- Validating and nonvalidating options
- Supports DOM and SAX
- http://xml.apache.org

For a full list of XML Parsers, go to http://www.xmlsoftware.com/parsers

Note that XML Parsers also exist for lots of other languages: C/C++, JavaScript, Python, Perl, etc.
Most parsers support both DOM and SAX, and most have options for turning validation on or off.

Java XML Parsing
Parsing:
- Enumerates the whole document, and
- Validates the well-formedness
Method: a Simple API for XML (SAX)
- Event driven
- Interfaces must be filled in: startElement(), characters(), endElement()
- Can use parsers to do database selection

SAX Event Trace
  start document
  start element: doc
  start element: para
  characters: Hello, world
  end element: para
  end element: doc
  end document
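A trace like the one above can be produced by filling in the SAX callbacks. The sketch below is not the slides' own program: the class name SaxTraceDemo and the input document (a doc element containing one para element) are assumptions inferred from the trace. Character data is buffered and flushed at the next element boundary, since a SAX parser is allowed to deliver text in more than one characters() call.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxTraceDemo extends DefaultHandler {
    final List<String> trace = new ArrayList<String>();
    private final StringBuilder text = new StringBuilder();

    // Emits any buffered character data as a single trace line.
    private void flushText() {
        if (text.length() > 0) {
            trace.add("characters: " + text);
            text.setLength(0);
        }
    }

    @Override public void startDocument() { trace.add("start document"); }

    @Override public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushText();
        trace.add("start element: " + qName);
    }

    @Override public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override public void endElement(String uri, String localName, String qName) {
        flushText();
        trace.add("end element: " + qName);
    }

    @Override public void endDocument() { trace.add("end document"); }

    /** Parses the document and returns the recorded events in order. */
    static List<String> trace(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        SaxTraceDemo handler = new SaxTraceDemo();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.trace;
    }

    public static void main(String[] args) throws Exception {
        // Assumed input, written on one line so there is no whitespace-only text.
        for (String event : trace("<doc><para>Hello, world</para></doc>")) {
            System.out.println(event);
        }
    }
}
```

For the assumed input, main prints the same seven events as the trace above, in the same order.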
SAX from Sun: SaxDriver.java
[Code listing not recoverable from the source.]

Using SAX: first define a program, EventHandler.java
[Code listing not recoverable from the source.]

Parsing with the IBM package
Generating a parser and validating it against a DTD file referred to in an input XML file.
[Code listing not recoverable from the source.]

A few more remarks on XML

Attributes
- Each tag may be qualified by an arbitrary number of attributes.
Reasons to be cautious about attributes:
- Attributes cannot contain multiple values.
- Attributes are not expandable.
- Attributes are more difficult to manipulate by program code.
- Attribute values are not easy to test against a DTD.
" # $ $ $ $ $ $WWULEXWH ± ZLW ZLWKRXW HDUQLQ RU DERX ; <xml version="" <note <date <day<day <month<month <year99<year <date <topat<to <fromjani<from <headingreminder<heading <bodydon't forget this weekend<body <note <xml version="" <note day="" month="" year="99" to= Pat" from="jani" heading="reminder" body="don't forget this weekend" <note http:wwww3schoolscomdefaultasp Learn XML Learn XSL Learn DTD Learn DOM Learn Schema