Validating XML Data with an XML Schema
Date: May 2007
Version: DRAFT 0.2
Contents
1. XML Validation Concepts
   a. Concepts
   b. Errors
   c. Resources
2. Example: Validation with XMLSpy
   a. Downloading XMLSpy
   b. Creating a new XMLSpy project
   c. Associate the homestead XML Schema with a folder
   d. Open the file in XMLSpy
   e. Add the active file to the folder
   f. Click the "Validate" button
3. Example: Manipulating Large XML Data Sets with Ant & Eclipse
   a. Tools for Records and Metadata vs. Tools for Data
   b. Apache Ant: DOS command line
   c. Eclipse: GUI interface
   d. V - The File Viewer: viewing large files
   e. XML databases
Disclaimer
The information and examples in this document are for demonstration purposes only. They are presented to help counties work with and validate XML datasets against Minnesota Revenue XML schemas.
The Minnesota Department of Revenue does not endorse or support any products mentioned in this presentation. Supporting the tools used within each county is beyond the scope of the mission of the Property Tax Division. Your staff is responsible for ensuring that your tools match your business requirements.
XML Validation Concepts
If you have:
1) an XML file, and
2) a well-defined XML Schema,
you can
3) use any standard XML validation program to check that the file is well-formed XML and contains all the required tags defined by the schema.
This is called validation.
[Diagram: an XML file and an XML Schema feed into an XML validator, which either reports validation errors or validates the file.]
XML Validation Concepts
An XML file is a text file in which well-defined tags surround each data value.
Tag example: <Zip_Code>55101</Zip_Code>
An XML Schema describes which tags are needed and where they must appear in a particular file.

    <xs:element name="Zip_Code">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:pattern value="[0-9]{5}"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:element>

This fragment from an XML Schema defines the Zip_Code tag: its value must be exactly five digits.
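To make the pattern rule concrete, here is a minimal Python sketch of the check that the xs:pattern facet expresses. It is only an illustration, not a schema validator, and the function name is_valid_zip_code is an assumption made for this example.

```python
import re

# Mirrors the schema facet <xs:pattern value="[0-9]{5}"/>.
# An XML Schema pattern must match the whole value, hence fullmatch().
ZIP_CODE_PATTERN = re.compile(r"[0-9]{5}")


def is_valid_zip_code(value: str) -> bool:
    """Return True when value consists of exactly five ASCII digits."""
    return ZIP_CODE_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["55101", "5510", "55101-1234", "ABCDE"]:
        print(candidate, is_valid_zip_code(candidate))
```

A real validator applies every rule in the schema, not only this one, so a sketch like this is useful mainly for understanding why a particular value is rejected.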
XML Validation Errors
Tag example: <Zip_Code>55101</Zip_Code>
[Diagram: an XML file and an XML Schema feed into an XML validator, which either reports validation errors or validates the file.]
If you have:
1) A malformed XML file: you get an "invalid XML", "malformed XML" or content error. Examples are missing tag brackets or other syntax errors.
2) A well-formed XML file with tag errors: you get a list of the tags that are inconsistent with the specific XML Schema being validated against.
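The first kind of error can be reproduced without any schema. The sketch below uses Python's standard xml.etree.ElementTree parser to report whether a piece of text is well-formed XML. It does not check the text against an XML Schema, and the function name check_well_formed is an assumption made for this example.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text: str):
    """Return None if xml_text parses, otherwise a short error description."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple.
        line, column = err.position
        return f"malformed XML at line {line}, column {column}: {err}"
    return None


if __name__ == "__main__":
    print(check_well_formed("<Zip_Code>55101</Zip_Code>"))  # well-formed
    print(check_well_formed("<Zip_Code>55101</Zip_Code"))   # missing bracket
```

A file has to pass this kind of check before a validator can report the second kind of error, the schema-specific tag errors.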
XML Validation Errors for XML Escape Characters
Five characters are used in XML syntax and cannot be used directly in a data value. Each must be escaped by writing its ampersand representation instead.

    Character   Name                          Escape
    <           less than                     &lt;
    >           greater than                  &gt;
    &           ampersand                     &amp;
    '           single quote or apostrophe    &apos;
    "           double quote                  &quot;
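If the XML is generated by a program, the escaping can be done by a library rather than by hand. The following is a minimal sketch using xml.sax.saxutils.escape from the Python standard library. That function escapes &, < and > by default, so the two quote characters are passed in as extra entities. The helper name escape_value is an assumption made for this example.

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > on its own; the quote characters are added here.
QUOTE_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def escape_value(text: str) -> str:
    """Escape the five XML special characters in a data value."""
    return escape(text, QUOTE_ENTITIES)


if __name__ == "__main__":
    owner = "Smith & Sons"
    print(f"<Owner_Name>{escape_value(owner)}</Owner_Name>")
```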
10 Common XML Transmission Errors
1. Malformed XML
2. Missing namespace declarations
3. Invalid document structure
4. Missing required element
5. Missing data in element
6. Invalid document type code values
7. Invalid property type code value
8. Invalid character values
9. Incorrect number of repeating fields
10. Incorrect tax year
For more information about XML errors, please also refer to the document "XML and XML Errors".
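Errors 4 and 5 in the list above can be illustrated with a small Python sketch that looks for required child elements in a single record. The element names (Record, Parcel_ID, Zip_Code) and the function name find_problems are hypothetical and are not taken from the homestead schema. A real XML Schema validator performs these checks, and many more, automatically.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of required child elements, for illustration only.
REQUIRED_CHILDREN = ("Parcel_ID", "Zip_Code")


def find_problems(record_xml: str, required=REQUIRED_CHILDREN):
    """Return a list of messages for missing or empty required elements."""
    record = ET.fromstring(record_xml)
    problems = []
    for name in required:
        child = record.find(name)
        if child is None:
            problems.append(f"missing required element <{name}>")
        elif not (child.text or "").strip():
            problems.append(f"missing data in element <{name}>")
    return problems


if __name__ == "__main__":
    sample = "<Record><Parcel_ID>1234567</Parcel_ID><Zip_Code/></Record>"
    for message in find_problems(sample):
        print(message)
```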
XML & Validation Resources
W3C XML Standards Page: http://www.w3.org/xml/
OASIS XML Cover Pages: http://xml.coverpages.org/xml.html#xmlvalresources (lots of references)
XML.com: http://www.xml.com (up-to-date XML information)
XML.com Schema Tools: http://www.xml.com/pub/a/2000/12/13/schematools.html (older list of schema tools)
XMLSpy: http://www.altova.com (free 30 day evaluation of XML tools and validation)
XMLStar: http://xmlstar.sourceforge.net (free tools and validation)
Example: Validating a Homestead File with XMLSpy
Validating with XMLSpy: Steps
1. Download XMLSpy (30 day free evaluation) and the homestead zip file
2. Create a new XMLSpy project
3. Associate the homestead XML Schema with a folder
4. Open the file in XMLSpy
5. Add the active file to the folder
6. Click the "Validate" button
Download XMLSpy
http://www.altova.com/products/xmlspy/xml_editor.html
Altova will e-mail you a 30 day license key
Download Homestead Files
Start XMLSpy
Double click the XMLSpy icon
Create a new project
New Project Window
Note: if the project window is not visible, use the Window/Project menu to show it
Set the Properties of the XML Folder
Right click the XML files folder in the project view
NOTE: RIGHT CLICK, not left click
Folder Properties
Click the "Validate with:" check box
Browse to the Homestead Schema
Click OK, then double click the XML data file you want to validate
Add This File to Your Project
RIGHT click and select "Add Active File"
Click the green check
View Results in Validation View
If your file is valid, a green check will appear in the validation view
Error messages will appear in this same window
File Size Limitations
XMLSpy tends to have problems validating files over about 25MB on a system with 1GB of RAM
Use Apache Ant and/or Eclipse if you want to validate larger files
Example: Manipulating Large XML Data Sets with Ant & Eclipse
Tips for XML Files Above 25MB
Agenda
Tools for Records and Metadata vs. Tools for Data
Apache Ant: DOS command line
Eclipse: GUI interface
V - The File Viewer: viewing large files
XML databases
Records vs. Databases
XML file viewers (like XMLSpy) are ideal for viewing single records and metadata (XML Schemas)
Visual editing tools tend to stop working when file sizes exceed about 25MB (given 2GB of RAM)
(e.g. we don't use MS Word to edit 100,000 records in a database)
Other tools are more appropriate for debugging large data sets
In Memory vs. Streaming
There are several different approaches to checking large files:
Load the entire file into memory (DOM)
Stream the file through memory (SAX)
Page only relevant sections into memory (chunking, used in V - The File Viewer)
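The streaming (SAX) approach can be sketched in Python with the standard xml.sax module. The parser reports each tag as it reads it, so the whole file never has to be held in memory at once. The class and function names below (ElementCounter, count_elements) are assumptions made for this example.

```python
import io
import xml.sax
from collections import Counter


class ElementCounter(xml.sax.ContentHandler):
    """Counts start tags by name while the file streams through the parser."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def startElement(self, name, attrs):
        self.counts[name] += 1


def count_elements(source):
    """source is a file name or an open binary file object."""
    handler = ElementCounter()
    xml.sax.parse(source, handler)
    return handler.counts


if __name__ == "__main__":
    sample = io.BytesIO(b"<Records><Record/><Record/></Records>")
    print(count_elements(sample))
```

Passing a file name instead of the in-memory sample streams the file from disk. A malformed file stops the parse with an xml.sax.SAXParseException.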
Apache Ant
Open source build manager
The user gives Ant a high-level description of a task
Ant executes the task using dependency analysis (e.g. only validate after extract)
Can be called from a shell (DOS or UNIX)
Can be called from an Integrated Development Environment (IDE)
Download link: http://www.uniontransit.com/apache/ant/binaries/apache-ant-1.7.0-bin.zip
See Wikipedia, "Apache Ant"
Download the .zip file
Adding tools.jar
Apache Ant needs one jar file, called "tools.jar", that is not included in the Ant download
It is freely available from the Java download as part of Sun's Java SDK (JDK) 1.4+, but it is not part of the JRE
A temporary copy is on the Java Open Source User Group (JOSUG) web site (www.josug.org/tools.jar)
The file is about 6MB
It must be on your build classpath
Apache Ant 1.7
Many new features
Simple <schemavalidate> task
Faster execution

    <schemavalidate
        noNamespaceFile="homestead-data_v0.28.xsd"
        file="my-homestead-data.xml">
    </schemavalidate>

noNamespaceFile: path to your XML schema
file: path to your XML data
build.xml: Ant From the DOS Command Line

    <?xml version="1.0" encoding="utf-8"?>
    <project default="validate-homestead">
      <property name="srcdir" value="c:/homestead/stress-test"/>
      <property name="schemadir" value="c:/homestead/schemas"/>
      <target name="validate-homestead">
        <schemavalidate
            noNamespaceFile="${schemadir}/homestead-data_v0.28.xsd"
            file="${srcdir}/100mb-test.xml">
        </schemavalidate>
      </target>
    </project>

Change the two property values to match your local system.
1. Download Apache Ant version 1.7.0
2. Copy the build.xml into a directory
3. Change the file locations in the properties of the build file to match your local files
4. Run ant.bat (using the full path name) in the folder where the build file is located
Apache Ant Tasks
schemavalidate: new Ant 1.7 optional task just for XML Schema
xmlvalidate: very general Ant 1.6 task for validation of XML files
    checks for well-formed files
    checks for validation against an XML Schema
xslt: transforms XML files
replace: replaces specific text in large files
schemavalidate options
http://ant.apache.org/manual/optionaltasks/xmlvalidate.html
http://ant.apache.org/manual/optionaltasks/schemavalidate.html
Example <schemavalidate> task
A 100MB file validates in 10 seconds
Sample Ant 1.6 Validate Script
The script validates only the 100MB-test.xml file
Replace that file name with *.xml and all XML files in the source directory will be validated
Eclipse
Open source integrated development environment (IDE), originally sponsored by IBM
GUI front end to Apache Ant
See http://www.eclipse.org/
Sample Ant Classpath
Complete Ant 1.7 Build File

    <?xml version="1.0" encoding="utf-8"?>
    <project default="validate-homestead">
      <property name="datadir" value="c:/homestead/data-files"/>
      <property name="schemadir" value="c:/homestead/schemas"/>
      <target name="validate-homestead">
        <schemavalidate
            noNamespaceFile="${schemadir}/homestead-data_v0.28.xsd"
            file="${datadir}/my-data-file.xml">
        </schemavalidate>
      </target>
    </project>

Properties can be set once in the file and referenced many times. This makes your build files easier to maintain.
GUI "Point and Click" UI
Sample "point and click" GUI interface
Press Alt+Shift+X, Q to run a task
XML Transform
View the homestead record for a specific parcel ID
[Diagram: a big file (gigabytes) passes through an XML transform with matching rules; records that match go to a very small output file, and records that do not match are dropped.]
Sample XML Transform

    <?xml version="1.0" encoding="utf-8"?>
    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:mn="http://data.state.mn.us"
        xmlns:c="http://niem.gov/niem/common/1.0"
        xmlns:u="http://niem.gov/niem/universal/1.0"
        xmlns:mnr="http://revenue.state.mn.us"
        xmlns:mnr-ptx="http://propertytax.state.mn.us"
        exclude-result-prefixes="mn mnr c u mnr-ptx">
      <xsl:output indent="yes"/>
      <!-- only display the homestead record for this parcel ID -->
      <xsl:template match="/homesteadrecordsdocument/countyhomesteadrecord/homesteadparcels/homesteadparcel/countypropertytaxstatement[mn:parcelid='1234567']">
        <!-- copy the CountyHomesteadRecord that matched this parcel ID to the output -->
        <xsl:copy-of select="../../.."/>
      </xsl:template>
      <!-- do not output anything else -->
      <xsl:template match="@*|node()">
        <xsl:apply-templates select="@*|node()"/>
      </xsl:template>
    </xsl:stylesheet>
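The same "big file in, very small file out" idea can also be sketched outside XSLT. The Python example below is an alternative technique, not the transform shown above: it streams a file with xml.etree.ElementTree.iterparse and returns the first record that contains a given parcel ID. The tag names Record and ParcelID and the function name find_record are hypothetical placeholders. With a namespaced schema, the tags would need to be written in ElementTree's {namespace-uri}name form.

```python
import io
import xml.etree.ElementTree as ET


def find_record(source, parcel_id, record_tag="Record", id_tag="ParcelID"):
    """Return the first matching record as XML text, or None if none matches.

    source is a file name or an open binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != record_tag:
            continue
        # Look at every parcel ID found anywhere inside this record.
        ids = [node.text for node in elem.iter(id_tag)]
        if parcel_id in ids:
            return ET.tostring(elem, encoding="unicode")
        # No match: drop the record's contents before reading the next one.
        elem.clear()
    return None


if __name__ == "__main__":
    sample = io.BytesIO(
        b"<Records>"
        b"<Record><ParcelID>111</ParcelID></Record>"
        b"<Record><ParcelID>1234567</ParcelID><Owner>A</Owner></Record>"
        b"</Records>"
    )
    print(find_record(sample, "1234567"))
```

Clearing each non-matching record as soon as it has been read is what is intended to keep memory use low on large files.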
V - The File Viewer
Opens multi-gigabyte files in a few seconds
$20 application (less in quantity)
Easily allows viewing of files greater than 1GB (uses file "chunking" technology)
Note: read-only tool
See http://www.fileviewer.com/
Use the Goto function (Ctrl-G)
XML Databases
XML databases store XML in its native format
You can associate a column in your database, or a "collection", with the homestead XML Schema
This allows the database itself to validate data before transmission to the state
Examples of XML Databases
IBM DB2 version 9 "pureXML": free and low-cost "Express" versions for development and testing
eXist (open source): native XML database with XML Schema validation
Over 50 other free and low-cost solutions with 30, 60 or 90 day evaluation periods
http://www.rpbourret.com/xml/xmldatabaseprods.htm
DB2
IBM DB2 version 9 supports fast searches on complex XML data sets
Load records into the XML datatype
Records are quickly validated using an XML Schema
Searching is very fast
eXist
Open source
Built-in web administration
Easy to set up and configure
Allows data to be validated on insert
Fast searches
Every XQuery IS a REST web service
Microsoft SQL Server 2005
Supports a native XML datatype
Supports fast indexing
Adds SOAP services to XML documents
Supports XQuery and XQuery updates
Ant Book
Covers Ant 1.7
Questions?