XML Character Encoding and Decoding



Similar documents
Section 1.4 Place Value Systems of Numeration in Other Bases

Base Conversion written by Cathy Saxton

The Unicode Standard Version 8.0 Core Specification

Storing Measurement Data

**Unedited Draft** Arithmetic Revisited Lesson 5: Decimal Fractions or Place Value Extended Part 3: Multiplying Decimals

PDF Primer PDF. White Paper

Bachelors of Computer Application Programming Principle & Algorithm (BCA-S102T)

Independent samples t-test. Dr. Tom Pierce Radford University

CS 1133, LAB 2: FUNCTIONS AND TESTING

Numeration systems. Resources and methods for learning about these subjects (list a few here, in preparation for your research):

Introduction to UNIX and SFTP

**Unedited Draft** Arithmetic Revisited Lesson 4: Part 3: Multiplying Mixed Numbers

EE 261 Introduction to Logic Circuits. Module #2 Number Systems

Lesson 1 Introduction to Rapid Application Development using Visual Basic

Everything you wanted to know about using Hexadecimal and Octal Numbers in Visual Basic 6

An overview of FAT12

Chapter 4: Computer Codes

Mail merge using Eudora - creating customized messages.

FILEMAKER SERVER 12 BACKUPS FREQUENTLY ASKED QUESTIONS

About the Junk Filter

Windows Presentation Foundation (WPF) User Interfaces

The Hexadecimal Number System and Memory Addressing

University of Hull Department of Computer Science. Wrestling with Python Week 01 Playing with Python

One Report, Many Languages: Using SAS Visual Analytics to Localize Your Reports

Translating QueueMetrics into a new language

Qlik REST Connector Installation and User Guide

User manual BS1000 LAN base station

Numbering Systems. InThisAppendix...

Everyone cringes at the words "Music Theory", but this is mainly banjo related and very important to learning how to play.

=

Detecting Anonymous Proxy Usage

Appendix K Introduction to Microsoft Visual C++ 6.0

Performance Testing Web 2.0

Lab 4.4 Secret Messages: Indexing, Arrays, and Iteration

Continuous Integration Part 2

Free Report. My Top 10 Tips to Betting Like a Pro With Zero Risk

Parsing Technology and its role in Legacy Modernization. A Metaware White Paper

HP Business Notebook Password Localization Guidelines V1.0

Using SQL Server Management Studio

2 SYSTEM DESCRIPTION TECHNIQUES

SEPA formats - an introduction to XML. version September

XML Processing and Web Services. Chapter 17

HTML Web Page That Shows Its Own Source Code

FirstClass FAQ's An item is missing from my FirstClass desktop

SCRIPT: Security Training

QaTraq Pro Scripts Manual - Professional Test Scripts Module for QaTraq. QaTraq Pro Scripts. Professional Test Scripts Module for QaTraq

ASCII Encoding. The char Type. Manipulating Characters. Manipulating Characters

Installing C++ compiler for CSc212 Data Structures

Frequently Asked Questions on character sets and languages in MT and MX free format fields

MS Access Lab 2. Topic: Tables

How To Proofread

Online signature API. Terms used in this document. The API in brief. Version 0.20,

Consulting. Personal Attention, Expert Assistance

Iowa State Poll. Page 1

Java Interview Questions and Answers

File Management Where did it go? Teachers College Summer Workshop

Demonstrating a DATA Step with and without a RETAIN Statement

Using CertAgent to Obtain Domain Controller and Smart Card Logon Certificates for Active Directory Authentication

Q Lately I've been hearing a lot about WS-Security. What is it, and how is it different from other security standards?

Security Patch Management and MSMUG. Bill Cotter 3M

SQL Injection. The ability to inject SQL commands into the database engine through an existing application

Computer Science 281 Binary and Hexadecimal Review

Setting Up Database Security with Access 97

Pulse Secure Client. Customization Developer Guide. Product Release 5.1. Document Revision 1.0. Published:

BBBT Podcast Transcript

Today s Outline. Computer Communications. Java Communications uses Streams. Wrapping Streams. Stream Conventions 2/13/2016 CSE 132

Documentation to use the Elia Infeed web services

Number Representation

The programming language C. sws1 1

Select the Crow s Foot entity relationship diagram (ERD) option. Create the entities and define their components.

REACHING YOUR GOALS. Session 4. Objectives. Time. Materials. Preparation. Procedure. wait4sex

SMPP protocol analysis using Wireshark (SMS)

A Comparison of Programming Languages for Graphical User Interface Programming

THE MAKING OF THE CONSTITUTION LESSON PLANS

Product Internationalization of a Document Management System

Frequently Asked Questions For CAISO Digital Certificates

Smooks Dev Tools Reference Guide. Version: GA

Merkle Hash Trees for Distributed Audit Logs

Address Phone & Fax Internet

HOMEWORK # 2 SOLUTIO

A Process is Not Just a Flowchart (or a BPMN model)

When the program is complied, built, and sent to the programmer, the return is an ERROR 27 message and the programming stops.

This explains why the mixed number equivalent to 7/3 is 2 + 1/3, also written 2

Rotorcraft Health Management System (RHMS)

A Case Study of ebay UTF-8 Database Migration

PHP Tutorial From beginner to master

How to register and use our Chat System

Cobol. By: Steven Conner. COBOL, COmmon Business Oriented Language, one of the. oldest programming languages, was designed in the last six

Roadmap for Ph.D. Students Aiming for a Successful Career in Science

Moven Studio realtime. streaming

The full setup includes the server itself, the server control panel, Firebird Database Server, and three sample applications with source code.

Table Of Contents. iii

The Forensic Analysis of the Microsoft Windows Vista Recycle Bin. By Mitchell Machor

C# and Other Languages

Exchanger XML Editor - Canonicalization and XML Digital Signatures

BBC LEARNING ENGLISH 6 Minute English Never too old

Transcription:

XML Character Encoding and Decoding January 2013 Table of Contents 1. Excellent quotes 2. Lots of character conversions taking place inside our computers and on the Web 3. Well-formedness error when encoding="..." does not match actual character encoding 4. How to generate an encoding well-formedness error 5. If everyone used UTF-8 would that be best for everyone? 6. Interoperability of XML (i.e., Character Encoding Interoperability) 7. Acknowledgements 1. Excellent quotes... around 1972... "Why is character coding so complicated?"... it hasn't become any simpler in the intervening 40 years. To be able to track end-to-end the path of conversions and validate that your application from authoring through to storage through to search and retrieval is completely correct is amazingly difficult... it's a skill far too few programmers have, or even recognize that they do not have. We can't really tell what's going on without access to your entire tool chain. It's possible that your editor changed the character encoding of your text when you changed the XML declaration (emacs does this)! It's unlikely that the encoding of the characters in this email is byte-identical with the files you created.... my preferred solution is to stick to a single encoding everywhere

... I vote for UTF-8...... make sure every single link in the chain uses that encoding. 2. Lots of character conversions taking place inside our computers and on the Web All of the characters in the following XML are encoded in iso-8859-1: <?xml version="1.0" encoding="iso-8859-1"?> Now consider this problem: Suppose we search the XML for the string López, where the characters in López are encoded in UTF-8. Will the search application find a match? I did the search and it found a match. How did it find a match? The underlying byte sequence for the iso-8859-1 López is: 4C F3 70 65 7A (one byte -- F3 -- is used to encode ó). The underlying byte sequence for the UTF-8 López is: 4C C3 B3 70 65 7A (two bytes -- C3 B3 -- are used to encode ó). The search application cannot be doing a byte-for-byte match, else it would find no match. The answer is that the search application either converted the XML to UTF-8 or converted the search string (López) to iso-8859-1. Inside our computers, inside our software applications, and on the Web there is a whole lot of character encoding conversions occurring... transparently... without our knowing.

3. Well-formedness error when encoding="..." does not match actual character encoding Create an XML document and encode all the characters in it using UTF-8: Add an XML declaration and specify that the encoding is iso-8859-1: <?xml version="1.0" encoding="iso-8859-1"?> There is a mismatch between what encoding="..." says and the actual encoding of the characters in the document. Check the XML document for well-formedness and you will NOT get an error. Next, create the same XML document but encode all the characters using iso-8859-1: Add an XML declaration and specify the encoding is UTF-8: <?xml version="1.0" encoding="utf-8"?> Again there is a mismatch between what encoding="..." says and the actual encoding of the characters in the document. But this time when you check the XML document for well-formedness you WILL get an error. Here's why. In UTF-8 the ó symbol is encoded using these two bytes: C3 B3 In iso-8859-1, C3 and B3 represent two perfectly fine characters, so the UTF-8 encoded XML is a fine encoding="iso-8859-1" document. In iso-8859-1 the ó symbol is encoded using one byte: F3 F3 is not a legal UTF-8 byte, so the iso-8859-1 encoded XML fails as an encoding="utf-8" document.

4. How to generate an encoding well-formedness error George Cristian Bina from oxygen XML gave the scoop on how things work inside oxygen. a. Create an XML document and encode all the characters in it using iso-8859-1: <?xml version="1.0" encoding="iso-8859-1"?> b. Using a hex editor, change encoding="iso-8859-1" to encoding="utf-8": <?xml version="1.0" encoding="utf-8"?> c. Drag and drop the file into oxygen. d. oxygen will generate an encoding exception: Cannot open the specified file. Got a character encoding exception [snip] George described the encoding conversions that occur inside oxygen behind-the-scenes: If you have an iso-8859-1 encoded XML file loaded into oxygen and change encoding="iso-8859-1" to encoding="utf-8" then oxygen will automatically change the encoding of every character in the document to UTF-8. George also made this important comment: Please note that the encoding is important only when the file is loaded and saved. When the file is loaded the bytes are converted to characters and then the application works only with characters. When the file is saved then those characters need to be converted to bytes and the encoding used will be determined from the XML header with a default to UTF-8 if no encoding can be detected. 5. If everyone used UTF-8 would that be best for everyone? There are a multiplicity of character encodings and a huge number of character encoding conversions taking place behind-the-scene. If everyone used UTF-8 then there would be no need to convert encodings.

Suppose every application, every IDE, every text editor, and every system worldwide used one character encoding, UTF-8. Would that be a good thing? The trouble with that is that UTF-8 makes larger files than UTF-16 for great numbers of people who use ideographic scripts such as Chinese. The real choice for them is between 16 and 32. UTF-16 is also somewhat harder to process in some older programming languages, most notably C and C++, where a zero-valued byte (NUL, as opposed to the zero-valued machine address, NULL) is used as a string terminator. So UTF-8 isn't a universal solution. There isn't a single solution today that's best for everyone. See this excellent discussion on StackOverflow titled, Should UTF-16 be Considered Harmful? http://programmers.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful In that discussion one person wrote: After long research and discussions, the development conventions at my company ban using UTF-16 anywhere except OS API calls... 6. Interoperability of XML (i.e., Character Encoding Interoperability) Remember not long ago you would visit a web page and see strange characters like this: â œgood morning, Daveâ You don't see that much anymore. Why? The answer is this: Interoperability is getting better. In the context of character encoding and decoding, what does that mean?

Interoperability means that you and I interpret (decode) the bytes in the same way. Example: I create an XML file, encode all the characters in it using UTF-8, and send the XML file to you. Here is a graphical depiction (i.e., glyphs) of the bytes that I send to you: You receive my XML document and interpret the bytes as iso-8859-1. In UTF-8 the ó symbol is a graphical depiction of the "LATIN SMALL LETTER O WITH ACUTE" character and it is encoded using these two bytes: C3 B3 But in iso-8859-1, the two bytes C3 B3 is the encoding of two characters: C3 is the encoding of the à character B3 is the encoding of the ³ character Thus you interpret my XML as: <Name>López</Name> We are interpreting the same XML (i.e., the same set of bytes) differently. Interoperability has failed. So when we say: Interoperability is getting better. we mean that the number of incidences of senders and receivers interpreting the same bytes differently is decreasing. Let's revisit our first example, you open a document and see this: â œgood morning, Daveâ Here's how that happened: I created a document, using a UTF-8 editor, containing this text: Good morning, Dave

The left quote character is U+201C and the right quote character is U+201D. You open the document using Microsoft Word (or any Windows-1252 editor) and see: â œgood morning, Daveâ Here s why: In UTF-8 the left smart quote is codepoint 201C, which is encoded inside the computer as these hex values: E2 80 9C In Windows-1252 the byte E2 is displayed as â and the byte 80 is displayed as and the byte 9C is displayed as œ In UTF-8 the right smart quote is codepoint 201D, which is encoded inside the computer as these hex values: E2 80 9D In Windows-1252 the byte E2 is displayed as â and the byte 80 is displayed as and the byte 9D has no representation 7. Acknowledgements Thanks to the following people for their contributions to this paper: David Allen Roger Costello John Cowan Jim DeLaHunt Ken Holman Michael Kay David Lee Chris Maloney Liam Quin Michael Sokolov Andrew Welch Hermann Stam-Wilbrandt