MySQL and Unicode Daniël van Eeden Percona Live Amsterdam 23 September 2015

Booking.com is available in more than 40 languages, so Unicode is important to us.

Also my name is Daniël, not Daniel

Also my name is Daniël, not DaniÃ«l

First some history

Encodes characters as 7-bit

The 8th bit can be used as parity, but that was never common.

The asteroid 3568 ASCII is named after it.

ISO-8859

Uses the extra bit to be able to store a second set of 128 characters

The base characters (0-127) are shared between ASCII and ISO-8859-?

The other characters differ per country/region

Windows-1252 (CP1252) is mostly identical with ISO-8859-1

ISO-8859-1 is also known as Latin1

Latin1 in MySQL is not ISO-8859-1, but CP1252.

Unicode

Allows you to store text in any language

Allows you to store text combining multiple languages in the same file

Each character gets a number (a.k.a. code point) and a description.

That doesn't guarantee your font will display it.

UTF-8 is a character encoding for Unicode.

This translates from code points to a binary string.

UTF-8 and ASCII share the same characters for code points 0-127.

Non-ASCII characters are stored as 2, 3 or 4 bytes.

[Chart: bits per character (axis from 0 to 32) for UTF-32, UTF-16, UTF-8, ISO-8859-1, ASCII and Baudot]

If a byte starts with '0xxxxxxx' then it is a 1-byte character

If a byte starts with '110' it is the start of a 2-byte character.

If a byte starts with '10' then it is a continuation of a multibyte character.

If a byte starts with '1110' it is the start of a 3-byte character.

Examples: a = 01100001 ë = 11000011 10101011

UTF-8 And MySQL

Some reasons to use UTF-8 in MySQL

Non-English scripts like Chinese, Cyrillic or Greek.

Emoji (including in the help text of your mobile app)

utf8 in MySQL is an alias for utf8mb3

utf8mb3 can store 3-byte UTF-8

Best practice: Always use utf8mb4, don't use utf8

Where to set the encoding?

It is set on a per-column basis

There is a per-table default

There is a per-database default

There is a per-server default: character_set_server

Connections also have a character set

Drawbacks of UTF-8

So just set everything to utf8mb4?

It depends

Does your application support it?

CHAR(10) suddenly needs 40 bytes!

TINYTEXT has a size limit in bytes

The MEMORY storage engine expands VARCHAR(10) to 40 bytes

With InnoDB your index grows over 767 bytes.

Best practice: Use latin1 for server, database and table default. Enable Unicode on columns which need it.

Converting your data

How to convert from latin1 to utf8mb4?

ALTER TABLE t1 MODIFY COLUMN c1 VARCHAR(100) CHARACTER SET utf8mb4;

But I have many columns!

Use CONCAT() and information_schema to generate the statements

Or convert all columns: ALTER TABLE t1 CONVERT TO CHARACTER SET utf8mb4;

Change defaults

Set character_set_server

ALTER SCHEMA s1 DEFAULT CHARACTER SET utf8mb4;

ALTER TABLE t1 DEFAULT CHARACTER SET utf8mb4;

Common failures

Application looks okay, but in MySQL the data looks wrong

The latin1 column was holding utf8 data already

Wrong conversion == garbage

Change column to varbinary and then to utf8mb4 to not convert the data.

The conversion fails and eats your data

Use sql_mode='strict_all_tables'

Now the operation will fail instead of truncating your data

Connection set to utf8, but data is 4-byte UTF-8.

Collation support
- There is no utf8mb4_general_cs (case sensitive)
- There is utf8mb4_unicode_ci
- And utf8mb4_unicode_520_ci
- And utf8mb4_bin

Special collations get lost during conversion

ALTER TABLE CONVERT TO only supports one collation

Save the collation before the ALTER and then restore it for columns which have a non-default collation.

Collation mismatch

Use COLLATE to set the desired collation for the operation.

é and e + combining acute accent (U+0301) are not identical

Unicode normalization forms
- NFC: composed
- NFD: decomposed
- NFKC: composed, compatibility
- NFKD: decomposed, compatibility
The NFKC and NFKD forms remove compatibility distinctions and will lose information. But this is useful for search etc.

Best practice: Normalize strings in your application

4-byte characters get silently lost on dump/restore

Set utf8mb4 as default charset for the connection mysqldump uses

Questions? Daniel.vanEeden@booking.com @dveeden

Did you know a BOM can be in the middle of a string?

Fulltext search & CJK

Not every character has the same width

Not even if we use a monospace font

A character can have a width of 0, 1, 2 or -1 positions

Punycode

Ligatures, Glyphs, Characters

Using Unicode

Typing

Virtual keyboards

Control characters

Charset is not a constraint

Replacement characters

MySQL and Unicode Daniël van Eeden Percona Live Amsterdam 23 September 2015 Welcome. My name is Daniël and I work for Booking.com. This presentation is about MySQL and Unicode.

Booking.com is available in more than 40 languages, so Unicode is of critical importance to us.

This is the Booking.com website in Arabic. Note that the text also flows from right to left.

Also my name is Daniël, not Daniel. There are dots on the 'e'. Those are important to me.

Also my name is Daniël, not DaniÃ«l. The image here shows examples from letters I got in the mail. This happens quite often, also on websites. Marking every special character as illegal is not a solution.

First some history. Let's start with some history of character sets.

Before ASCII: Baudot (5-bit, 1870). ASCII: Let's start with ASCII. ASCII was invented in 1963 to allow communication between systems of different vendors. A little-known fact is that ASCII was not made for computers, but for teleprinters. One of the interesting decisions of ASCII was that it does not require state: it does not encode the 'shift'.

Encodes characters as 7-bit. In ASCII, 1 character equals 1 byte.

The 8th bit can be used as parity, but that was never common.

The asteroid 3568 ASCII is named after it. Fun fact.

ISO-8859. ISO-8859 was created in 1985 and is a set of 16 character sets. The best known one is ISO-8859-1. This included more than just English. When the euro was introduced, the currency sign ¤ was replaced with € and the result was named ISO-8859-15.

Uses the extra bit to be able to store a second set of 128 characters. ISO-8859 replaces ISO 646 (1972), which was a 7-bit mess. ECMA: the European Computer Manufacturers Association.

The base characters (0-127) are shared between ASCII and all ISO-8859 variants.

The other characters differ per country/region

Windows-1252 (CP1252) is mostly identical with ISO-8859-1

ISO-8859-1 is also known as Latin1

Latin1 in MySQL is not ISO-8859-1, but CP1252.

mysql> SHOW CHARSET LIKE 'latin1';
+---------+----------------------+-------------------+--------+
| Charset | Description          | Default collation | Maxlen |
+---------+----------------------+-------------------+--------+
| latin1  | cp1252 West European | latin1_swedish_ci |      1 |
+---------+----------------------+-------------------+--------+
1 row in set (0.00 sec)
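The difference between the two character sets can be seen from any language that ships both codecs. The following is a minimal Python sketch, not MySQL code: Python's "latin-1" codec is the real ISO-8859-1, and the byte 0x80 is picked here only as an illustration.

```python
# One byte, two interpretations.
sample = b"\x80"

as_cp1252 = sample.decode("cp1252")        # the euro sign, U+20AC
as_iso_8859_1 = sample.decode("latin-1")   # a C1 control character, U+0080

# ascii() keeps the output printable on any terminal.
print(ascii(as_cp1252), ascii(as_iso_8859_1))

# The range 0xA0-0xFF is identical in both character sets.
upper_range = bytes(range(0xA0, 0x100))
same_upper_range = upper_range.decode("cp1252") == upper_range.decode("latin-1")
print(same_upper_range)
```

The two only disagree in the range 0x80-0x9F, which is why the mismatch often goes unnoticed until a euro sign or a curly quote shows up.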

Unicode. Work on Unicode started in 1987.

Allows you to store text in any language, both living and dead.

Allows you to store text combining multiple languages in the same file

Each character gets a number (a.k.a. code point) and a description.

That doesn't guarantee your font will display it. You might see a replacement character instead. This can be a question mark or some square.

UTF-8: Unicode Transformation Format.

This is a character encoding for Unicode. It is not the only Unicode encoding; there are also UTF-16 (fixed-width variant: ucs2) and UTF-32 (ucs4).

This translates from code points to a binary string.

UTF-8 and ASCII share the same characters for code points 0-127.

Non-ASCII characters are stored as 2, 3 or 4 bytes.

[Chart: storage per character (axis in bits, from 0 to 32) for UTF-32, UTF-16, UTF-8, ISO-8859-1, ASCII and Baudot] Here you can see the minimum and maximum amount of storage required for one character. The blue part shows the minimum and the red part shows the variable part. This shows that UTF-8 is efficient in terms of storage for Latin scripts.

If a byte starts with '0xxxxxxx' then it is a 1-byte character

If a byte starts with '110' it is the start of a 2-byte character.

If a byte starts with '10' then it is a continuation of a multibyte character.

Examples: a = 01100001 ë = 11000011 10101011 Here you can see the letter a and the letter e with the dots (diaeresis, trema)
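The bit patterns above can be checked with a few lines of Python. This is only a sketch; the helper names utf8_bits and sequence_length are made up for this example, and the '11110' prefix for 4-byte characters is added from the UTF-8 definition.

```python
def utf8_bits(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def sequence_length(byte: int) -> int:
    """Classify one byte by its leading bits.

    Returns the length of the character this byte starts,
    or 0 if the byte is a continuation byte.
    """
    if byte >> 7 == 0b0:
        return 1            # 0xxxxxxx: 1-byte character
    if byte >> 6 == 0b10:
        return 0            # 10xxxxxx: continuation of a multibyte character
    if byte >> 5 == 0b110:
        return 2            # 110xxxxx: start of a 2-byte character
    if byte >> 4 == 0b1110:
        return 3            # 1110xxxx: start of a 3-byte character
    if byte >> 3 == 0b11110:
        return 4            # 11110xxx: start of a 4-byte character
    raise ValueError("byte cannot occur in UTF-8")


print(utf8_bits("a"))        # 01100001
print(utf8_bits("\u00eb"))   # 11000011 10101011  (the letter e with diaeresis)
```

Because continuation bytes are always recognisable, a reader that lands in the middle of a string can find the next character boundary without scanning from the start.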

UTF-8 And MySQL Now we get into MySQL specifics

Some reasons to use UTF-8 in MySQL

Non-English scripts like Chinese, Cyrillic or Greek. Examples: names, comments, URLs, e-mail addresses.

Emoji (including in the help text of your mobile app), e.g. the hamburger icon.

utf8 in MySQL is an alias for utf8mb3

utf8mb4 can store 4-byte UTF-8. utf8mb4 has existed since MySQL 5.5.3.

Best practice: Always use utf8mb4, don't use utf8
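To find out whether existing text would fit in a utf8 (utf8mb3) column, it is enough to look for code points above U+FFFF, since those are the ones that need 4 bytes in UTF-8. A minimal application-side sketch in Python; the function name needs_utf8mb4 is hypothetical and the sample strings are only illustrations.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if text contains a character that takes 4 bytes in UTF-8.

    Code points above U+FFFF (outside the Basic Multilingual Plane)
    are the ones that do not fit in a 3-byte encoding.
    """
    return any(ord(ch) > 0xFFFF for ch in text)


print(needs_utf8mb4("Dani\u00ebl"))        # False: 2-byte character at most
print(needs_utf8mb4("\u20ac 10"))          # False: the euro sign takes 3 bytes
print(needs_utf8mb4("ok \U0001F600"))      # True: this emoji takes 4 bytes
```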

Where to set the encoding?

It is set on a per-column basis

There is a per-table default

There is a per-database default. It is stored in db.opt; use ALTER DATABASE to change it.

There is a per-server default: character_set_server

Connections also have a character set. Set the character set in your connection properties. If that isn't possible, use SET NAMES utf8mb4.

Drawbacks of UTF-8

So just set everything to utf8mb4? The question we want to answer is...

It depends The answer is...

Does your application support it? Consider input validation, character length and security.

CHAR(10) suddenly needs 40 bytes!

TINYTEXT has a size limit in bytes. With utf8mb4 you can store between 63 and 255 characters. This also applies to the other TEXT types and BLOB types.

The MEMORY storage engine expands VARCHAR(10) to 40 bytes. This affects:
- User created tables
- Internal temporary tables

With InnoDB your index grows over 767 bytes. Use innodb_large_prefix with ROW_FORMAT COMPRESSED or DYNAMIC.
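The numbers on the last few slides all follow from the worst case of 4 bytes per character. The arithmetic can be sketched as follows; the helper names reserved_bytes and guaranteed_chars are made up for this example.

```python
def reserved_bytes(char_count: int, bytes_per_char: int = 4) -> int:
    """Bytes needed if every character takes the maximum width."""
    return char_count * bytes_per_char


def guaranteed_chars(limit_bytes: int, bytes_per_char: int = 4) -> int:
    """Characters guaranteed to fit in a limit expressed in bytes."""
    return limit_bytes // bytes_per_char


print(reserved_bytes(10))         # 40: CHAR(10) with utf8mb4
print(guaranteed_chars(255))      # 63: TINYTEXT filled with 4-byte characters
print(guaranteed_chars(255, 1))   # 255: TINYTEXT filled with 1-byte characters
print(guaranteed_chars(767))      # 191: longest utf8mb4 prefix within 767 bytes
```

The last line is where the commonly seen VARCHAR(191) comes from when an indexed utf8mb4 column has to stay within the 767-byte limit.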

Best practice: Use latin1 for server, database and table default. Enable Unicode on columns which need it. Or use utf8mb4 all the way if you don't need the efficiency and performance of latin1. Changing everything to VARBINARY and BLOB will not solve your issue.

Converting your data

How to convert from latin1 to utf8mb4?

ALTER TABLE t1 MODIFY COLUMN c1 VARCHAR(100) CHARACTER SET utf8mb4;

But I have many columns!

Use CONCAT() and information_schema to generate the statements
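The same idea can also be sketched outside the server. The Python example below builds one ALTER statement per column from rows shaped like those in information_schema.COLUMNS (TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLUMN_TYPE). The Column type, the function name and the sample row are hypothetical. The sketch is deliberately simplified: MODIFY COLUMN replaces the whole column definition, so a real script would also have to restate NOT NULL, defaults and comments.

```python
from typing import Iterable, Iterator, NamedTuple


class Column(NamedTuple):
    """One row as it could be selected from information_schema.COLUMNS."""
    table_schema: str
    table_name: str
    column_name: str
    column_type: str   # e.g. "varchar(100)"


def alter_statements(columns: Iterable[Column],
                     charset: str = "utf8mb4") -> Iterator[str]:
    """Yield one ALTER TABLE ... MODIFY COLUMN statement per column."""
    for col in columns:
        yield (
            f"ALTER TABLE `{col.table_schema}`.`{col.table_name}` "
            f"MODIFY COLUMN `{col.column_name}` {col.column_type} "
            f"CHARACTER SET {charset};"
        )


# Hypothetical input; in practice these rows would come from a query that
# selects the latin1 text columns from information_schema.COLUMNS.
rows = [Column("s1", "t1", "c1", "varchar(100)")]
for statement in alter_statements(rows):
    print(statement)
```

Generating the statements first and reviewing them before running them keeps the conversion explicit for each column.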

Or convert all columns: ALTER TABLE t1 CONVERT TO CHARACTER SET utf8mb4; Also for INSERTs!

Change defaults

Set character_set_server

ALTER SCHEMA s1 DEFAULT CHARACTER SET utf8mb4;

ALTER TABLE t1 DEFAULT CHARACTER SET utf8mb4;

Common failures

Application looks okay, but in MySQL the data looks wrong. 'Search' in the application might not function correctly.

The latin1 column was holding utf8 data already

Wrong conversion == garbage. Don't just convert this data. Running a latin1 to UTF-8 conversion on data which already was UTF-8 will result in garbage.
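What such a wrong conversion does to the bytes can be reproduced outside MySQL. A minimal Python sketch, using cp1252 to stand in for MySQL's latin1 and the name from the title slide as sample data:

```python
name = "Dani\u00ebl"

# The application stored UTF-8 bytes in a column labelled latin1.
stored_bytes = name.encode("utf-8")

# A latin1 -> UTF-8 conversion first reads those bytes as cp1252 ...
misread = stored_bytes.decode("cp1252")

# ... and then encodes every misread character as UTF-8 again.
double_encoded = misread.encode("utf-8")

print(len(stored_bytes), len(double_encoded))
print(ascii(misread))
```

The two bytes of the original ë have become four bytes, and reading them back as UTF-8 yields the familiar 'Ã«' pair instead of the intended letter.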

Change column to varbinary and then to utf8mb4 to not convert the data.
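The detour via VARBINARY works because it changes the label on the bytes without touching the bytes themselves. The Python sketch below is an analogy of that idea, not MySQL code; the function name relabel_as_utf8 is made up, and cp1252 again stands in for MySQL's latin1.

```python
def relabel_as_utf8(mislabelled: str) -> str:
    """Reinterpret text that was read as cp1252 but really is UTF-8.

    Step 1 recovers the raw bytes (the 'varbinary' step),
    step 2 reads those same bytes with the correct character set.
    """
    raw_bytes = mislabelled.encode("cp1252")
    return raw_bytes.decode("utf-8")


garbled = "Dani\u00c3\u00abl"          # how the name looks when read as cp1252
repaired = relabel_as_utf8(garbled)
print(ascii(repaired))
```

If the data is not valid UTF-8 after all, the second step raises an error instead of silently producing more garbage, which is a useful check before converting a whole table.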

The conversion fails and eats your data. MySQL tries really hard to convert your data but this might not be possible.

Use sql_mode='strict_all_tables'

Now the operation will fail instead of truncating your data. Also for inserts.
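The difference between a lossy and a strict conversion can be illustrated with Python's codec error handlers. This is only an analogy for the two behaviours described above and does not reproduce what MySQL does internally; ASCII is used as the target character set purely to force a failure.

```python
name = "Dani\u00ebl"

# Lossy: the character that does not fit is silently replaced.
lossy = name.encode("ascii", errors="replace")
print(lossy)   # b'Dani?l'

# Strict: the conversion is refused, so nothing is silently lost.
try:
    name.encode("ascii", errors="strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)   # True
```

A failed statement is noisy but recoverable; a silently altered value usually is not.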

Connection set to utf8, but data is 4-byte UTF-8. You can't insert or request 4-byte characters.

Collation support
- There is no utf8mb4_general_cs (case sensitive)
- There is utf8mb4_unicode_ci (UCA 4.0.0)
- And utf8mb4_unicode_520_ci (UCA 5.2.0)
- And utf8mb4_bin
The latest UCA version is 8.0.0.

Here we compare the sun and moon emoji.

Special collations get lost during conversion Collation = Sorting & Equality

ALTER TABLE CONVERT TO only supports one collation

Save the collation before the ALTER and then restore it for columns which have a non-default collation.

Collation mismatch

Here MySQL does not know which collation to use.

Use COLLATE to set the desired collation for the operation.

é and e + combining acute accent (U+0301) are not identical. Combining characters.

Unicode normalization forms
- NFC: composed
- NFD: decomposed
- NFKC: composed, compatibility
- NFKD: decomposed, compatibility
The NFKC and NFKD forms remove compatibility distinctions and will lose information. But this is useful for search etc.

Best practice: Normalize strings in your application
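As one possible way to do this in the application, Python's standard unicodedata module implements all four normalization forms. A minimal sketch; the sample strings are chosen only for illustration.

```python
import unicodedata

composed = "\u00e9"        # é as a single code point
decomposed = "e\u0301"     # e followed by a combining acute accent

# They look the same but do not compare equal.
equal_before = composed == decomposed

# After normalizing both sides to the same form they do.
equal_after = (unicodedata.normalize("NFC", composed)
               == unicodedata.normalize("NFC", decomposed))
print(equal_before, equal_after)   # False True

# Compatibility forms lose information: the 'fi' ligature becomes 'f' + 'i'.
ligature = "\ufb01"
print(unicodedata.normalize("NFC", ligature) == ligature)   # True
print(unicodedata.normalize("NFKC", ligature))              # fi
```

Normalizing once, at the point where text enters the application, means everything stored in the database can be compared without further conversion.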

4-byte characters get silently lost on dump/restore

Set utf8mb4 as default charset for the connection mysqldump uses

This shows what we can do with a patched mysql client. This uses unicode drawing characters

This shows the unicode character database imported into MySQL

Questions? Daniel.vanEeden@booking.com @dveeden

Did you know a BOM can be in the middle of a string? Also, MySQL doesn't handle BOMs well.

Fulltext search & CJK

Not every character has the same width

Not even if we use a monospace font

A character can have a width of 0, 1, 2 or -1 positions
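The four widths follow the convention of the POSIX wcwidth() function, where -1 means "not printable". Python's standard library has no wcwidth, so the sketch below is only a rough approximation built on unicodedata; the function name display_width is made up, and other zero-width characters (such as format characters) are not handled.

```python
import unicodedata


def display_width(ch: str) -> int:
    """Approximate the number of terminal columns one character occupies."""
    if unicodedata.category(ch) == "Cc":
        return -1    # control character: not printable
    if unicodedata.combining(ch):
        return 0     # combining mark: drawn on top of the previous character
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2     # wide or fullwidth, e.g. CJK ideographs
    return 1


print(display_width("a"))        # 1
print(display_width("\u0301"))   # 0: combining acute accent
print(display_width("\u4e2d"))   # 2: a CJK ideograph
print(display_width("\x07"))     # -1: the BEL control character
```

This is why padding table output by character count, as a command-line client might do, breaks as soon as the data contains CJK text or combining marks.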

Punycode

Ligatures, Glyphs, Characters

Using Unicode

Typing

Virtual keyboards

Control characters

Charset is not a constraint

Replacement characters