MySQL and Unicode Daniël van Eeden Percona Live Amsterdam 23 September 2015

Welcome. My name is Daniël and I work for Booking.com. This presentation is about MySQL and Unicode.

Booking.com is available in more than 40 languages, so Unicode is of critical importance to us.

This is the Booking.com website in Arabic. Note that the text also flows from right to left.

Also, my name is Daniël, not Daniel. There are dots on the 'e', and those are important to me.

Also, my name is Daniël, not a mangled variant of it. The image here shows examples from letters I got in the mail. This happens quite often, also on websites. Marking every special character as illegal is not a solution.

First some history: let's start with the history of character sets.

ASCII. Before ASCII there was Baudot (5-bit, 1870). ASCII was invented in 1963 to allow communication between systems from different vendors. A little-known fact is that ASCII was not made for computers, but for teleprinters. One of the interesting design decisions of ASCII is that it does not require state: it does not encode a 'shift'.

Encodes characters as 7 bits. In ASCII, 1 character equals 1 byte.

The 8th bit can be used as parity, but that was never common.

Fun fact: the asteroid 3568 ASCII is named after it.

ISO-8859. ISO-8859 was created in 1985 and is a set of 16 character sets. The best known one is ISO-8859-1, which covers more than just English. When the euro was introduced they replaced ¤ with € and named the result ISO-8859-15.

Uses the extra bit to be able to store a second set of 128 values. ISO-8859 replaces ISO 646 (1972), which was a 7-bit mess. ECMA, the European Computer Manufacturers Association, was involved in its development.

The base characters (0 to 127) are shared between ASCII and every ISO-8859-x character set.

The other characters differ per country/region

Windows-1252 (CP1252) is mostly identical with ISO-8859-1

ISO-8859-1 is also known as Latin1

Latin1 in MySQL is not ISO-8859-1, but CP1252.

mysql> SHOW CHARSET LIKE 'latin1';
+---------+----------------------+-------------------+--------+
| Charset | Description          | Default collation | Maxlen |
+---------+----------------------+-------------------+--------+
| latin1  | cp1252 West European | latin1_swedish_ci |      1 |
+---------+----------------------+-------------------+--------+
1 row in set (0.00 sec)
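The difference between ISO-8859-1 and CP1252 shows up in the byte range 0x80 to 0x9F, where ISO-8859-1 has control codes and CP1252 has printable characters. The short Python sketch below only illustrates this with Python's own codecs; it does not talk to MySQL, and the choice of byte 0x80 is just one example.

```python
# Byte 0x80 means different things in ISO-8859-1 and in Windows-1252 (CP1252).
raw = b"\x80"

as_iso = raw.decode("iso-8859-1")  # a C1 control character, U+0080
as_cp1252 = raw.decode("cp1252")   # the euro sign, U+20AC

print(f"ISO-8859-1: U+{ord(as_iso):04X}")
print(f"CP1252:     U+{ord(as_cp1252):04X}")
```

So a euro sign stored in a MySQL latin1 column is a single byte, which a strict ISO-8859-1 reader would treat as a control character.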

Unicode. The work on Unicode started in 1987.

Allows you to store text in any language, both living and dead.

Allows you to store text combining multiple languages in the same file

Each character gets a number (a.k.a. code point) and a description.
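As a small illustration of "a number and a description", the Python sketch below looks up the code point and the official character name for a few characters. It uses Python's standard unicodedata module, not anything MySQL-specific, and the sample characters are arbitrary.

```python
import unicodedata

# "a", "e with diaeresis" and the euro sign, written as escapes to avoid
# any ambiguity about how this source file itself is encoded.
samples = ["a", "\u00eb", "\u20ac"]

for ch in samples:
    # ord() gives the code point; unicodedata.name() gives the character name.
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```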

That doesn't guarantee your font will display it. You might see a replacement character instead. This can be a question mark or some square.

UTF-8: Unicode Transformation Format.

This is a character encoding for Unicode. It is not the only Unicode encoding: there are also UTF-16 (fixed-width variant: ucs2) and UTF-32 (ucs4).

This translates from code points to a binary string.

UTF-8 and ASCII share the same encoding for code points 0 to 127.

Non-ASCII characters are stored as 2, 3 or 4 bytes.

[Figure: bar chart comparing the size of one character, on an axis from 0 to 32 bits, for UTF-32, UTF-16, UTF-8, ISO-8859-1, ASCII and Baudot.] Here you can see the minimum and maximum storage required for one character. Blue shows the minimum and red shows the variable part. This shows that UTF-8 is efficient in terms of storage for Latin scripts.

If a byte starts with '0xxxxxxx' then it is a 1-byte character

If a byte starts with '110' it is the start of a 2-byte character.

If a byte starts with '10' then it is a continuation of a multibyte character.

If a byte starts with '1110' it is the start of a 3-byte character.

If a byte starts with '11110' it is the start of a 4-byte character.

Examples: a = 01100001, ë = 11000011 10101011. Here you can see the letter a and the letter e with the dots (diaeresis, trema).
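The leading-bit rules above can be checked with a few lines of Python. This is only a sketch: the helper names utf8_bits and byte_role are made up for illustration, and the classification looks at a single byte without validating a whole sequence.

```python
def utf8_bits(ch: str) -> str:
    """Return the UTF-8 encoding of one character as space-separated bit strings."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))


def byte_role(byte: int) -> str:
    """Classify a single byte by its leading bits, following the rules above."""
    if byte >> 7 == 0b0:
        return "1-byte character"
    if byte >> 6 == 0b10:
        return "continuation byte"
    if byte >> 5 == 0b110:
        return "start of 2-byte character"
    if byte >> 4 == 0b1110:
        return "start of 3-byte character"
    if byte >> 3 == 0b11110:
        return "start of 4-byte character"
    return "not valid in UTF-8"


print(utf8_bits("a"))       # 01100001
print(utf8_bits("\u00eb"))  # 11000011 10101011

# Walk over the two bytes of "e with diaeresis" and name each one.
for b in "\u00eb".encode("utf-8"):
    print(f"{b:08b} -> {byte_role(b)}")
```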

UTF-8 and MySQL. Now we get into MySQL specifics.

Some reasons to use UTF-8 in MySQL

Non-English scripts like Chinese, Cyrillic or Greek. These show up in names, comments, URLs and e-mail addresses.

Emoji (including in the help text of your mobile app), for example the hamburger icon.

utf8 in MySQL is an alias for utf8mb3

utf8mb3 can store UTF-8 characters of up to 3 bytes.

utf8mb4 can store UTF-8 characters of up to 4 bytes. utf8mb4 has existed since MySQL 5.5.3.

Best practice: Always use utf8mb4, don't use utf8
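To see which characters need the fourth byte, the Python sketch below prints the UTF-8 length of a few sample characters and checks whether a string would fit in a 3-byte-only encoding. The function names are hypothetical helpers, and the check (code point at most U+FFFF) is a simplification that ignores MySQL itself.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes one character takes in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_utf8mb3(text: str) -> bool:
    """True if every character needs at most 3 bytes in UTF-8 (code point <= U+FFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


# ASCII letter, e with diaeresis, euro sign, and an emoji (U+1F600).
for ch in ["a", "\u00eb", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}: {utf8_len(ch)} byte(s)")

print(fits_utf8mb3("Dani\u00ebl"))            # True
print(fits_utf8mb3("Dani\u00ebl \U0001F600"))  # False
```

Text for which fits_utf8mb3 returns False is the kind of data that needs utf8mb4.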

Where to set the encoding?

It is set on a per-column basis

There is a per-table default

There is a per-database default. It is stored in db.opt; use ALTER DATABASE to change it.

There is a per-server default: character_set_server

Connections also have a character set. Set the character set in your connection properties. If that isn't possible, use SET NAMES utf8mb4.

Drawbacks of UTF-8

The question we want to answer is: should we just set everything to utf8mb4?

The answer is: it depends.

Does your application support it? Think of input validation, character length and security.

CHAR(10) suddenly needs 40 bytes!

TINYTEXT has a size limit in bytes. With utf8mb4 you can store between 63 and 255 characters. The same byte-based limits apply to the other TEXT and BLOB types.

The MEMORY storage engine expands VARCHAR(10) to 40 bytes. This affects both user-created tables and internal temporary tables.

With InnoDB your index can grow beyond the 767-byte limit. Use innodb_large_prefix with the COMPRESSED or DYNAMIC row format.
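The numbers on these slides all follow from the maximum bytes per character. The Python sketch below redoes that arithmetic; the dictionary and function names are illustrative only, and the limits (255 bytes for TINYTEXT, 767 bytes for the index prefix) are the ones mentioned above.

```python
# Worst-case bytes per character for the character sets discussed here.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8mb3": 3, "utf8mb4": 4}


def worst_case_bytes(chars: int, charset: str) -> int:
    """Bytes to reserve for `chars` characters in the worst case."""
    return chars * MAX_BYTES_PER_CHAR[charset]


def guaranteed_chars(byte_limit: int, charset: str) -> int:
    """Characters that always fit in `byte_limit` bytes, whatever the text is."""
    return byte_limit // MAX_BYTES_PER_CHAR[charset]


print(worst_case_bytes(10, "utf8mb4"))   # CHAR(10): 40 bytes
print(guaranteed_chars(255, "utf8mb4"))  # TINYTEXT: 63 characters
print(guaranteed_chars(767, "utf8mb3"))  # index prefix: 255 characters
print(guaranteed_chars(767, "utf8mb4"))  # index prefix: 191 characters
```

The last two lines show why an index on a 255-character column fits the 767-byte limit with utf8mb3 but not with utf8mb4.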

Best practice: Use latin1 for the server, database and table defaults. Enable Unicode on the columns which need it. Or use utf8mb4 all the way if you don't need the efficiency and performance of latin1. Changing everything to VARBINARY and BLOB will not solve your issue.

Converting your data

How to convert from latin1 to utf8mb4?

ALTER TABLE t1 MODIFY COLUMN c1 VARCHAR(100) CHARACTER SET utf8mb4;

But I have many columns!

Use CONCAT() and information_schema to generate the statements

Or convert all columns: ALTER TABLE t1 CONVERT TO CHARACTER SET utf8mb4; Also for INSERTs!
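Generating the statements can also be done outside the database. The Python sketch below builds one ALTER statement per column from rows of (table, column, column type), which is the kind of data the TABLE_NAME, COLUMN_NAME and COLUMN_TYPE columns of information_schema.COLUMNS provide. The function name and the sample rows are hypothetical, and the sketch deliberately ignores attributes such as NOT NULL, DEFAULT and comments, which MODIFY COLUMN would drop unless they are restated.

```python
def modify_statements(columns, charset="utf8mb4"):
    """Build one ALTER TABLE ... MODIFY COLUMN statement per (table, column, type) row."""
    statements = []
    for table, column, column_type in columns:
        statements.append(
            f"ALTER TABLE `{table}` MODIFY COLUMN `{column}` {column_type} "
            f"CHARACTER SET {charset};"
        )
    return statements


# Hypothetical rows, as they might come back from information_schema.COLUMNS.
columns = [
    ("t1", "c1", "varchar(100)"),
    ("t1", "c2", "text"),
]

for statement in modify_statements(columns):
    print(statement)
```

Review the generated statements before running them, since the column definitions here are incomplete by design.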

Change defaults

Set character_set_server

ALTER SCHEMA s1 DEFAULT CHARACTER SET utf8mb4;

ALTER TABLE t1 DEFAULT CHARACTER SET utf8mb4;

Common failures

Application looks okay, but in MySQL the data looks wrong. 'Search' in the application might not function correctly.

The latin1 column was holding utf8 data already

Wrong conversion == garbage. Don't just convert this data. Running a latin1-to-UTF-8 conversion on data which already was UTF-8 will result in garbage.

Change the column to VARBINARY first and then to utf8mb4, so that the bytes are reinterpreted instead of converted.
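The effect of the wrong conversion can be reproduced outside MySQL. The Python sketch below takes UTF-8 bytes, reads them as CP1252 (what a latin1 column does), and then compares converting that text again with simply reinterpreting the original bytes. It is an illustration with Python codecs under that assumption, not a trace of what MySQL does internally.

```python
original = "Dani\u00ebl"
utf8_bytes = original.encode("utf-8")  # what the application actually stored

# Reading those bytes as CP1252 turns the two bytes of the e with diaeresis
# into two separate characters (A with tilde, left guillemet).
seen_as_cp1252 = utf8_bytes.decode("cp1252")

# Wrong fix: convert that text to UTF-8. The garbage is now stored as valid UTF-8.
double_encoded = seen_as_cp1252.encode("utf-8")

# Right fix: leave the bytes alone and only change how they are interpreted,
# which is what the detour through VARBINARY achieves.
reinterpreted = utf8_bytes.decode("utf-8")

print(seen_as_cp1252)
print(double_encoded)
print(reinterpreted == original)
```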

The conversion fails and eats your data. MySQL tries really hard to convert your data, but this might not be possible.

Use sql_mode='strict_all_tables'

Now the operation will fail instead of truncating your data. This also applies to INSERTs.

Connection set to utf8, but data is 4-byte UTF-8. You can't insert or retrieve 4-byte characters.

Collation support

There is no utf8mb4_general_cs (case sensitive).

There is utf8mb4_unicode_ci, utf8mb4_unicode_520_ci and utf8mb4_bin.

unicode_ci is based on UCA 4.0.0 and unicode_520_ci on UCA 5.2.0. The latest UCA version is 8.0.0.

Here we compare the sun and moon emoji.

Special collations get lost during conversion. A collation defines sorting and equality.

ALTER TABLE CONVERT TO only supports one collation

Save the collation before the ALTER and then restore it for columns which have a non-default collation.

Collation mismatch

Here MySQL does not know which collation to use.

Use COLLATE to set the desired collation for the operation.

é and e + combining acute accent (U+0301) are not identical. The second form uses a combining character.

Unicode normalization forms: NFC (composed), NFD (decomposed), NFKC (compatibility, composed) and NFKD (compatibility, decomposed). The NFK forms remove compatibility distinctions and will lose information, but this is useful for search etc.

Best practice: Normalize strings in your application
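One way to normalize in the application is shown in the Python sketch below, using the standard unicodedata module. The sample strings are arbitrary; which form to pick (NFC for storage, an NFK form for search) is a design choice for your application, not something MySQL prescribes.

```python
import unicodedata

composed = "\u00e9"     # e with acute as a single code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

# The two look the same on screen but are different strings.
print(composed == decomposed)

# After normalizing both to the same form they compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)

# The compatibility forms go further and lose information: the "fi" ligature
# (U+FB01) becomes the two plain letters f and i.
ligature = "\ufb01"
print(unicodedata.normalize("NFC", ligature) == ligature)
print(unicodedata.normalize("NFKC", ligature))
```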

4-byte characters get silently lost on dump/restore

Set utf8mb4 as default charset for the connection mysqldump uses

This shows what we can do with a patched mysql client. It uses Unicode drawing characters.

This shows the Unicode character database imported into MySQL.

Questions? Daniel.vanEeden@booking.com @dveeden

Did you know a BOM can be in the middle of a string? Also, MySQL doesn't handle BOMs well.
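One plausible way for a BOM to end up mid-string is concatenating two pieces of text that each start with one. The Python sketch below shows that case; the scenario is an assumption for illustration and is not taken from the talk. Python's "utf-8-sig" codec only strips a BOM at the very start.

```python
import codecs

# Two byte strings, each written with a leading UTF-8 BOM (EF BB BF).
part1 = codecs.BOM_UTF8 + "abc".encode("utf-8")
part2 = codecs.BOM_UTF8 + "def".encode("utf-8")

# "utf-8-sig" removes only the first BOM; the second one survives as U+FEFF.
joined = (part1 + part2).decode("utf-8-sig")
print(len(joined))

# Removing the stray U+FEFF explicitly before storing the string.
cleaned = joined.replace("\ufeff", "")
print(cleaned)
```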

Fulltext search & CJK

Not every character has the same width

Not even if we use a monospace font

A character can have a width of 0, 1, 2 or -1 positions

Punycode

Ligatures, Glyphs, Characters

Using Unicode

Fonts

Typing

Virtual keyboards

Control characters

Charset is not a constraint

Replacement characters