Talk 2008. Encoding Issues. An overview to understand and be able to handle encoding issues in a better way. Susanne Ebrecht



Similar documents
URL encoding uses hex code prefixed by %. Quoted Printable encoding uses hex code prefixed by =.

Memory is implemented as an array of electronic switches

ASCII Code. Numerous codes were invented, including Émile Baudot's code (known as Baudot

Voyager 9520/40 Voyager GS9590 Eclipse 5145

Xi2000 Series Configuration Guide

BARCODE READER V 2.1 EN USER MANUAL

Chapter 5. Binary, octal and hexadecimal numbers

DEBT COLLECTION SYSTEM ACCOUNT SUBMISSION FILE

Symbols in subject lines. An in-depth look at symbols

The ASCII Character Set

This is great when speed is important and relatively few words are necessary, but Max would be a terrible language for writing a text editor.

Chapter 1. Binary, octal and hexadecimal numbers

Numeral Systems. The number twenty-five can be represented in many ways: Decimal system (base 10): 25 Roman numerals:

ASCII CODES WITH GREEK CHARACTERS

BAR CODE 39 ELFRING FONTS INC.

plc numbers Encoded values; BCD and ASCII Error detection; parity, gray code and checksums

Create!form Barcodes. User Guide

BI-300. Barcode configuration and commands Manual

CHAPTER 8 BAR CODE CONTROL

Control Functions for Coded Character Sets

Representação de Caracteres

Part No. : MUL PROGRAMMING GUIDE

DL910 SERIES. Instruction Manual

Barcode Magstripe. Decoder & Scanner. Programming Manual

TELOCATOR ALPHANUMERIC PROTOCOL (TAP)

Model 200 / 250 / 260 Programming Guide

Command Emulator STAR Line Mode Command Specifications

7-Bit coded Character Set

ESPA Nov 1984 PROPOSAL FOR SERIAL DATA INTERFACE FOR PAGING EQUIPMENT CONTENTS 1. INTRODUCTION 2. CHARACTER DESCRIPTION

MINIMAG. Magnetic Stripe Reader Keyboard Wedge. User s Manual

Barcode Scanning Made Easy. Programming Guide

ASCII control characters (character code 0-31)

Barcode Scanning Made Easy. WWS500 Programming Guide

BRMO 80 / ETH-IP. User Manual. Réf : MU-BRMO 80-ETH-IP-1.4-EN

Teletypewriter Communication Codes

Scanner Configuration

S302D. Programming Guide. 2D Imaging Barcode Scanner. Advanced Handheld High-Speed Laser Scanner

Enter/Exit programming

HANDHELD LASER SCANNER

Characters & Strings Lesson 1 Outline

Index...1. Introduction...3. Installation- Keyboard Wedge...3 RS USB...3. Default Setting for each barcode shown as below:...

Security Protection of Software Programs by Information Sharing and Authentication Techniques Using Invisible ASCII Control Codes

MK-SERIE 1000/1500/2000 AllOfBarcode.de Michael Krug Traunstein BARCODE SCANNER

Applied Data Communication Lecture 14


INTERNATIONAL STANDARD

MK D Imager Barcode Scanner Configuration Guide

Barcode reader setup manual

NVT (Network Virtual Terminal) description

CD-3860 Bar Code Scanner User s Manual

ASCII Characters. 146 CHAPTER 3 Information Representation. The sign bit is 1, so the number is negative. Converting to decimal gives

Digital Logic Design. Introduction

This guide specifies the required and supported system elements for the application.

Bar Code CCD Scanner OPERATION MANUAL

C Examples! Jennifer Rexford!

PRINTED MANUAL AGG Software (

PROPERTY MANAGEMENT SYSTEM

How to represent characters?

Preservation Handbook

TAP Interface Specifications

How To Use A Microsoft Powerbook With A Microtron 2 (Ios) On A Microsatellite (Ipl) On An Iphone Or Ipro (Iphones) On Your Computer Or Ipo (Iphone)

Reference Manual. IQ Administrator Pro. and. PostgreSQL Database Server Installation Guide

How To Use A Powerpoint On A Microsoft Powerpoint 2.5 (Powerpoint 2) With A Microsatellite 2.2 (Powerstation 2) (Powerplant 2.3) (For Microsonde) (Micros

Revision History. Advanced Handheld CCD/Laser Scanner

Character Code Structure and Extension Techniques

ESC/POS Command Specifications

Chapter 4: Computer Codes

!"#$$$$First in Document Technology BARCODE User Guide & Programming Manual

S PT-H500LI ELECTRONIC E C LABELING L SYSTEM INTRODUCTION EDITING A LABEL LABEL PRINTING USING THE FILE MEMORY USING P-TOUCH SOFTWARE

The use of binary codes to represent characters

Import and Export User Guide. PowerSchool 7.x Student Information System

XR-500 [Receipt Printer User s Manual ]

QuickScan i. QD2100 Barcode Imager. Product Reference Guide

IBM Emulation Mode Printer Commands

Electronic Business Course - SA 2008 Project : E-Ticketing

1 Introduction FrontBase is a high performance, scalable, SQL 92 compliant relational database server created in the for universal deployment.

A short description to the VABus protocol

Tested configuration for Major versions of Primavera:-

Code. Barc. ber 20100

Install BA Server with Your Own BA Repository

Using VMware vfabric Postgres

ASCII Character Set and Numeric Values The American Standard Code for Information Interchange

Data Hiding in s and Applications Using Unused ASCII Control Codes

White paper FUJITSU Software Enterprise Postgres

SQL Databases Course. by Applied Technology Research Center. This course provides training for MySQL, Oracle, SQL Server and PostgreSQL databases.

How To Write A Domain Name In Unix (Unicode) On A Pc Or Mac (Windows) On An Ipo (Windows 7) On Pc Or Ipo 8.5 (Windows 8) On Your Pc Or Pc (Windows

FileMaker Server 9. Custom Web Publishing with PHP

FileMaker 13. ODBC and JDBC Guide

Module 7. Backup and Recovery

ROBO CYLINDER. Serial Communications Protocol. Intelligent Actuator, Inc.

Configuring an Alternative Database for SAS Web Infrastructure Platform Services

Banana is a native application for Windows, Linux and Mac and includes functions that allow the user to manage different types of accounting files:

Settlement (CO) HELP.COABR. Release4.6C

Still Aren't Doing. Frank Kim

HP Service Manager Compatibility Matrix

Parallels Containers for Windows 6.0

Data Integrator. Encoding Reference. Pervasive Software, Inc B Riata Trace Parkway Austin, Texas USA

Business Interaction Server. Configuration Guide Rev A

vtiger CRM Database UTF-8 Configuration (For MySQL)

Advanced Handheld High-Speed Laser Scanner

Transcription:

Talk 2008 Encoding Issues An overview to understand and be able to handle encoding issues in a better way Susanne Ebrecht PostgreSQL Usergroup Germany PostgreSQL European User Group PostgreSQL Project February, 2008

Definition Character Set A collection of signs... ŧ ðđŋħ ĸł~ The Greek alphabet A-Z ABCDEFGHIJKLMNOPQRSTUVWXYZ Roman numbers I V X L C D M A 1-9 1 2 3 4 5 6 7 8 9 ISO-8859-15 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US SP! " # $ % & ' ( ) * +, -. / 0 1 2 3 4 5 6 7 8 9 : ; < = >? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` The German alphabet AaÄäBbCcDdEeFfGgHhIiJjKkLlMmNnO oööppqqrrssßttuuüüvvwwxxyyzz UNICODE a b c d e f g h i j k l m n o p q r s t u v w x y z { } ~ DEL PAD HOP BPH NBH IND NEL SSA ESA HTS HTJ VTS PLD PLU RI SS2 SS3 DCS PU1 PU2 STS CCH MW SPA EPA SOS SGCI SCI CSI ST OSC PM APC NBSP Š š ª «SHY ± ² ³ Ž µ ž ¹ º» Œ œ Ÿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý þ ÿ ß 2

Definition Encoding Implementation of abstract signs, bits and bytes UTF-32 KOI8-R UTF-8 KOI8-U UTF-7 ISO-8859-15 A => 1 B => 2 C => 3 D => 4... UTF-16 ASCII EUC-JP BIG5 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI 1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US 2 SP! " # $ % & ' ( ) * +, -. / 3 0 1 2 3 4 5 6 7 8 9 : ; < = >? 4 @ A B C D E F G H I J K L M N O 5 P Q R S T U V W X Y Z [ \ ] ^ _ 6 ` a b c d e f g h i j k l m n o 7 p q r s t u v w x y z { } ~ DEL 8 PAD HOP BPH NBH IND NEL SSA ESA HTS HTJ VTS PLD PLU RI SS2 SS3 9 DCS PU1 PU2 STS CCH MW SPA EPA SOS SGCI SCI CSI ST OSC PM APC A NBSP Š š ª «SHY B ± ² ³ Ž µ ž ¹ º» Œ œ Ÿ C À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï D Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ ß E à á â ã ä å æ ç è é ê ë ì í î ï F ð ñ ò ó ô õ ö ø ù ú û ü ý þ ÿ 3

Encoding Names in PostgreSQL Encoding names are partially defined by the SQL standard Encoding names are SQL identifiers Spaces are not allowed Most of all languages UTF8 or UNICODE Japanese EUC_JP Turkish LATIN5 or ISO_8859_9 or ISO88599 Western European LATIN1 or ISO_8859_1 or ISO88591 Greek ISO_8859_7 LATIN1 with Euro and accents LATIN9 or ISO_8859_15 or ISO885915 More informations: http://www.postgresql.org/docs/current/static/multibyte.html 4

Definition Collation DIN 5007-1, Duden ä is equivalent to a ö is equivalent to o ü is equivalent to u ß is equivalent to s sort sequence configuration which guideline is used for sorting UPPER(), LOWER() LIKE DIN 5007-2, phone book ä is equivalent to ae ö is equivalent to oe ü is equivalent to ue ß is equivalent to ss DIN 5007-2, Austria ä after az ö after oz ü after uz ß is equivalent to ss DIN 5007-2, British DIN 5007-2, Sweden, Finl. å after z ä after å ö after ä ü is equivalent to y Example for capitalisation a:a, b:b, c:c, ä:ä, ö:ö, ü:ü, ß:SZ, å:å, ä after a ö after o ü after u ß after s Mc is treated as Mac 5

Collation What is important? The encoding type has to match the collation type There are no rules in an ISO collation for UTF-8 You are able to choose the collation type for your system when you are making the initdb: $ initdb lc_collate=de_de Usually initdb will get the collation type from the locale Changing the collation type after initdb is not possible 6

Currency sign or EUR $ or USD or JPY or MLT or GBP 元 or HKD... System messages Definition Locale collection of political, cultural or language specific computerised rules Capitalisation rules Sheet size DIN A4 LETTER A No space left on device Auf dem Gerät ist kein Speicherplatz mehr verfügbar Aucun espace disponible sur le périphérique Geen ruimte meer over op apparaat Spazio insufficiente sul dispositivo Inget utrymme kvar på enheten Ikke mere plads på enheden Laitteella ei ole tilaa jäljellä No queda espacio libre en el dispositivo...... Sorting rules Date Numbers 1618.03 1618,03 1.618,03 1,618.03 1 618,03 1'618.03 1'618,03... 2008-02-24 24.02.2008 02/24/2008 2008/02/24 24. Feb. 2008 Feb, 24 th 2008... 7

Locale How to figure out the locale Unix: $ locale Which locales are possible on the system: Windows: $ locale -a Examples: C/POSIX means no locale de_de.utf-8 de_de.iso8859-15 en_en.utf-8 tr_tr.iso8859-9 System language setting 8

Locale Categories lc_ctype classification of signs What is a letter? lc_collate sort sequence rules capitalisation rules lc_messages language of the system messages lc_numeric number format (i.e. to_char) lc_monetary currency sign (i.e. to_char) lc_time date format (not used at the moment) 9

Locale Be careful Automatically, the system gets all values from the locale of the user who builds the cluster (made the initdb). Usually, this is the user: postgres. After initialising you can only change: lc_monetary, lc_messages, lc_numeric You can change them by editing postgresql.conf or using SET 10

Locale initdb Before making initdb you should take care of the locale of your corresponding user. You can add the locale or the single values to initdb: $ initdb locale=utf8 $ initdb --lc_collate=de_de --lc_messages=en_us... 11

Encoding Server Management of data storage on the server (on the disk) Default is defined by initdb Default set up can be seen by using \l in psql It is the encoding that is listed for the databases: template0 and template1 Encoding definition (i.e. LATIN9) for a new database: $ createdb -E LATIN9 dbname CREATE DATABASE dbname ENCODING 'LATIN9'; Changing database encoding later is impossible. 12

Encoding Client Defines the interpretation of the data that are sent/received from the client The actual binary data are defined by the client software i.e. psql, PGAdminIII, own software The client software has to inform the server about the encoding of the sent data about the encoding that received data should have Changing client encoding is possible The client encoding has to fit to the environment 13

Encoding Client encoding definition Default: Shell: psql: libpq: PHP: JDBC: server encoding $ export PGCLIENTENCODING=UTF8 \encoding UTF8 PQsetClientEncoding() pg_set_client_encoding() automatic (always UTF-8) and similar more... 14

Encoding Automatic conversion During transfer the data will be converted from client encoding to server encoding and vice versa. This is automatic and transparent if client and server encoding match. 15

Encoding Client encoding identification psql \encoding Console $ locale charmap Java/JDBC software Doesn't matter/automatic Web software (PHP, Perl,...) Form data encoding will be negotiated between browser and web server Web server encoding is the database client encoding Other development environments Should be documented 16

Encoding Mismatch ISO encoding always use 1 byte for characters UTF8 encoding use 1-4 byte for characters One of the famous mistakes occurs during INSERT/UPDATE The function length() displays the byte length of the text The other famous mistake is during SELECT: You will recognise this because of weird outputs: Examples (ISO/UTF8 mismatch): ö => à or üß => Ìà Grüße => Gr or Café => Caf Output like: Grüße => Gre usually is a mismatch between ASCII and something else. 17

Because of LATIN9 the byte length should be: 4, 5 and 3 Data are stored wrong in the database Reason: wrong environment (terminal) encoding during insert Repairing this needs a huge effort. i.e. dump => recode => restore Solution that this won't happens: Take care of environment and client encoding Switch environment (i.e. terminal) encoding to ISO or Switch client encoding to UTF8 (i.e. \encoding UTF8) Mismatch Stored data example Terminal encoding: UTF8 $ createdb -E LATIN9 dbname dbname=# \encoding => LATIN9 dbname=# create table t(id serial, txt text); dbname=# insert into t(txt) values ('Café'),('Grüße'),('Bär'); dbname=# select length(txt) from t; => 5, 7 and 4 18

Mismatch Error message example Default database settings: UTF8 Terminal: ISO-8859-15 $ createdb dbname dbname=# \encoding => UTF8 dbname=# create table t(id serial, txt text); dbname=# insert into t(txt) values ('Café'); ERROR: invalid byte sequence for encoding "UTF8": 0xe92729 Reason: environment and client encoding don't match Solution that this won't happens: Take care of environment and client encoding Switch environment (i.e. terminal) encoding to UTF8 or Switch client encoding to LATIN9 (i.e. \encoding LATIN9) 19

Mismatch Output example Database: UTF8 Terminal: ISO-8859-15 dbname=# \encoding => UTF8 dbname=# select txt from t; ------- Cafà GrÃŒÃÃe Bà r Reason: environment and client encoding don't match Solution that this won't happens: Take care of environment and client encoding Switch environment (i.e. terminal) encoding to UTF8 or Switch client encoding to LATIN9 (i.e. \encoding LATIN9) 20

Mismatch Output example Database: LATIN9 Terminal: UTF8 dbname=# \encoding => LATIN9 dbname=# select txt from t; ------- Caf Gr B Reason: environment and client encoding don't match Solution that this won't happens: Take care of environment and client encoding Switch environment (i.e. terminal) encoding to ISO or Switch client encoding to UTF8 (i.e. \encoding UTF8) 21

Recommendation Which encoding? Always recommended: UTF8 Locale: i.e. de_de.utf-8 or fr_fr.utf-8 Server encoding: UTF8 Caution! No Windows UTF8 support before PostgreSQL 8.1 Also recommended: LATIN9/ISO-8859-15 (if UTF8 occurs trouble) Locale: i.e. de_de.iso8859-15 or fr_fr.iso8859-15 Server encoding: LATIN9 Be careful with SQL_ASCII It is advised not to use it Asian encoding Ask a specialist or look at the documentation Recommendation for special languages: MULE_INTERNAL 22

Summary Dependency Encoding/Locale Sort sequence is defined by locale libc (OS libraries) requires a special encoding for sorting This is defined by locale Server encoding and locale settings has to match If not => byte chaos during sorting Server encoding and lc_collate has to match Server encoding should be the same for all databases 23

Summary The right way Think about encoding and locale before initialise PostgreSQL Elect the locale for initdb which kind of sort sequence is necessary for my software? Automatically intidb will elect the matching server encoding Don't use database specific encodings Always convert client encoding or make sure that client and server environment are equal Make sure that environment and client encoding are equal 24

Summary Summary Specify locale for the initdb process Server encoding is managing the data storage Client encoding and environment encoding has to match 25

Encoding Issues Closing Words Thank you Peter for once let me in on this topic Thank you Wikipedia for existing Thank you PostgreSQL project for the excellent documentation Thanks for listening 26