MySQL and Unicode Daniël van Eeden Percona Live Amsterdam 23 September 2015
Booking.com is available in more than 40 languages So Unicode is important to us.
Also my name is Daniël, not Daniel
Also my name is Daniël, not DaniÃ«l
First some history
ASCII
Encodes characters as 7-bit
The 8th bit can be used as parity, but that was never common.
The asteroid 3568 ASCII is named after it.
ISO-8859
Uses the extra bit to be able to store a second set of 128 characters
The base characters (0-127) are shared between ASCII and ISO-8859-?
The other characters differ per country/region
Windows-1252 (CP1252) is mostly identical to ISO-8859-1
ISO-8859-1 is also known as Latin1
Latin1 in MySQL is not ISO-8859-1, but CP1252.
Unicode
Allows you to store text in any language
Allows you to store text combining multiple languages in the same file
Each character gets a number (a.k.a. code point) and a description.
That doesn't guarantee your font will display it.
UTF-8
This is a character encoding for Unicode.
This translates from code points to a binary string.
UTF-8 and ASCII share the same characters for 0-127.
Non-ASCII characters are stored as 2, 3 or 4 bytes.
[Chart: bits per character on a 0 to 32 scale for UTF-32, UTF-16, UTF-8, ISO-8859-1, ASCII and Baudot]
If a byte starts with '0' then it is a 1-byte character.
If a byte starts with '110' it is the start of a 2-byte character.
If a byte starts with '10' then it is a continuation of a multibyte character.
If a byte starts with '1110' it is the start of a 3-byte character.
If a byte starts with '11110' it is the start of a 4-byte character.
Examples: a = 01100001 ë = 11000011 10101011
UTF-8 And MySQL
Some reasons to use UTF-8 in MySQL
Non-English scripts like Chinese, Cyrillic or Greek.
Emoji (including in the help text of your mobile app)
utf8 in MySQL is an alias for utf8mb3
utf8mb3 can store 3-byte UTF-8
utf8mb4 can store 4-byte UTF-8
Best practice: Always use utf8mb4, don't use utf8
Where to set the encoding?
It is set on a per-column basis
There is a per-table default
There is a per-database default
There is a per-server default: character_set_server
Connections also have a character set
Drawbacks of UTF-8
So just set everything to utf8mb4?
It depends
Does your application support it?
CHAR(10) suddenly needs 40 bytes!
TINYTEXT has a size limit in bytes
The MEMORY storage engine expands VARCHAR(10) to 40 bytes
With InnoDB your index grows over 767 bytes.
Best practice: Use latin1 for server, database and table default. Enable Unicode on columns which need it.
Converting your data
How to convert from latin1 to utf8mb4?
ALTER TABLE t1 MODIFY COLUMN c1 VARCHAR(100) CHARACTER SET utf8mb4;
But I have many columns!
Use CONCAT() and information_schema to generate the statements
Or convert all columns: ALTER TABLE t1 CONVERT TO CHARACTER SET utf8mb4;
Change defaults
Set character_set_server
ALTER SCHEMA s1 DEFAULT CHARACTER SET utf8mb4;
ALTER TABLE t1 DEFAULT CHARACTER SET utf8mb4;
Common failures
Application looks okay, but in MySQL the data looks wrong
The latin1 column was holding utf8 data already
Wrong conversion == garbage
Change column to varbinary and then to utf8mb4 to not convert the data.
The conversion fails and eats your data
Use sql_mode='strict_all_tables'
Now the operation will fail instead of truncating your data
Connection set to utf8, but data is 4-byte UTF-8.
Collation support
There is no utf8mb4_general_cs (case sensitive)
There is utf8mb4_unicode_ci
And utf8mb4_unicode_520_ci
And utf8mb4_bin
Special collations get lost during conversion
ALTER TABLE CONVERT TO only supports one collation
Save the collation before the ALTER and then restore it for columns which have a non-default collation.
Collation mismatch
Use COLLATE to set the desired collation for the operation.
é and e + ´ (e followed by a combining accent) are not identical
Unicode normalization forms
NFC: composed
NFD: decomposed
NFKC: compatibility, composed
NFKD: compatibility, decomposed
The NFK forms remove compatibility distinctions and so lose information, but this is useful for search etc.
Best practice: Normalize strings in your application
4-byte characters get silently lost on dump/restore
Set utf8mb4 as default charset for the connection mysqldump uses
Questions? Daniel.vanEeden@booking.com @dveeden
Did you know a BOM can be in the middle of a string?
Fulltext search & CJK
Not every character has the same width
Not even if we use a monospace font
A character can have a width of 0, 1, 2 or -1 positions
Punycode
Ligatures, Glyphs, Characters
Using Unicode
Fonts
Typing
Virtual keyboards
Control characters
Charset is not a constraint
Replacement characters
MySQL and Unicode Daniël van Eeden Percona Live Amsterdam 23 September 2015 Welcome. My name is Daniël and I work for Booking.com. This presentation is about MySQL and Unicode.
Booking.com is available in more than 40 languages So Unicode is important to us. Booking.com is available in more than 40 languages, so Unicode is of critical importance to us.
This is the Booking.com website in Arabic. Note that the text also flows from right to left.
Also my name is Daniël, not Daniel My name is Daniël. There are dots on the 'e'. Those are important to me.
Also my name is Daniël, not DaniÃ«l The image here shows examples from letters I got in the mail. This happens quite often, also on websites. Marking every special character as illegal is not a solution.
First some history Let's first start with some history about character sets
ASCII Before ASCII there was Baudot (5-bit, 1870). Let's start with ASCII. ASCII was invented in 1963 to allow communication between systems from different vendors. A little known fact is that ASCII was made not for computers but for teleprinters. One of the interesting decisions of ASCII was that it does not require state: it does not encode the 'shift'.
Encodes characters as 7-bit In ASCII 1 character equals 1 byte.
The 8th bit can be used as parity, but that was never common.
The asteroid 3568 ASCII is named after it. Fun fact.
ISO-8859 ISO-8859 was created in 1985 and is a set of 16 character sets. The best known one is ISO-8859-1. It covers more than just English. When the euro was introduced, the currency sign ¤ was replaced with € and the result was named ISO-8859-15.
Uses the extra bit to be able to store a second set of 128 characters ISO-8859 replaces ISO 646 (1972), which was a 7-bit mess. ECMA: the European Computer Manufacturers Association.
The base characters (0-127) are shared between ASCII and ISO-8859-? The base characters are shared between ASCII and ISO-8859.
The other characters differ per country/region
Windows-1252 (CP1252) is mostly identical to ISO-8859-1
ISO-8859-1 is also known as Latin1
Latin1 in MySQL is not ISO-8859-1, but CP1252.
    mysql> SHOW CHARSET LIKE 'latin1';
    +---------+----------------------+-------------------+--------+
    | Charset | Description          | Default collation | Maxlen |
    +---------+----------------------+-------------------+--------+
    | latin1  | cp1252 West European | latin1_swedish_ci |      1 |
    +---------+----------------------+-------------------+--------+
    1 row in set (0.00 sec)
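The difference between the two encodings can be seen without a database. The following is a minimal Python sketch, not MySQL itself: it assumes Python's "cp1252" codec is a fair stand-in for what MySQL calls latin1, and uses Python's "latin-1" codec for ISO-8859-1.

```python
# Byte 0x80 is one of the places where ISO-8859-1 and Windows-1252 disagree.
raw = b"\x80"

as_iso_8859_1 = raw.decode("latin-1")  # Python's "latin-1" is ISO-8859-1
as_cp1252 = raw.decode("cp1252")       # assumed stand-in for MySQL's latin1

print(repr(as_iso_8859_1))  # a C1 control character, U+0080
print(repr(as_cp1252))      # the euro sign, U+20AC
```

So a euro sign stored in a MySQL latin1 column would not survive a tool that treats the bytes as strict ISO-8859-1.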
Unicode The work on Unicode started in 1987
Allows you to store text in any language Both alive and dead
Allows you to store text combining multiple languages in the same file
Each character gets a number (a.k.a. code point) and a description.
That doesn't guarantee your font will display it. You might see a replacement character instead. This can be a question mark or some square.
UTF-8 Unicode Transformation Format
This is a character encoding for Unicode. It is not the only Unicode encoding: there are also UTF-16 (fixed-width variant: ucs2) and UTF-32 (ucs4).
This translates from code points to a binary string.
UTF-8 and ASCII share the same characters for 0-127.
Non-ASCII characters are stored as 2, 3 or 4 bytes.
[Chart: bits per character on a 0 to 32 scale for UTF-32, UTF-16, UTF-8, ISO-8859-1, ASCII and Baudot] Here you can see the minimum and maximum storage required for one character. The blue part shows the minimum and the red part shows the variable part. This shows that UTF-8 is efficient in terms of storage for Latin scripts.
If a byte starts with '0' then it is a 1-byte character.
If a byte starts with '110' it is the start of a 2-byte character.
If a byte starts with '10' then it is a continuation of a multibyte character.
If a byte starts with '1110' it is the start of a 3-byte character.
If a byte starts with '11110' it is the start of a 4-byte character.
Examples: a = 01100001 ë = 11000011 10101011 Here you can see the letter a and the letter e with the dots (diaeresis, trema)
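The byte patterns above can be checked with a few lines of Python. This is only an illustrative sketch; the helper names utf8_bits and byte_role are made up for this example.

```python
def utf8_bits(ch):
    """Return the UTF-8 encoding of one character as groups of 8 bits."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))


def byte_role(byte):
    """Classify a single byte by its leading bits, following the rules above."""
    if byte >> 7 == 0b0:
        return "1-byte character"
    if byte >> 6 == 0b10:
        return "continuation"
    if byte >> 5 == 0b110:
        return "start of 2-byte character"
    if byte >> 4 == 0b1110:
        return "start of 3-byte character"
    if byte >> 3 == 0b11110:
        return "start of 4-byte character"
    return "invalid in UTF-8"


print(utf8_bits("a"))        # 01100001
print(utf8_bits("\u00eb"))   # ë: 11000011 10101011
for byte in "\u00eb".encode("utf-8"):
    print(f"{byte:08b}", byte_role(byte))
```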
UTF-8 And MySQL Now we get into MySQL specifics
Some reasons to use UTF-8 in MySQL
Non-English scripts like Chinese, Cyrillic or Greek. Names, comments, URLs, e-mail addresses.
Emoji (including in the help text of your mobile app) Hamburger icon
utf8 in MySQL is an alias for utf8mb3
utf8mb3 can store 3-byte UTF-8
utf8mb4 can store 4-byte UTF-8 utf8mb4 has existed since MySQL 5.5.3.
Best practice: Always use utf8mb4, don't use utf8
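To get a feel for which text needs utf8mb4, one can look at how many bytes each character takes in UTF-8. The sketch below is plain Python and does not talk to MySQL; needs_utf8mb4 is a hypothetical helper name, and the sample strings are only illustrations.

```python
def needs_utf8mb4(text):
    """True if any character takes 4 bytes in UTF-8 (code point above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)


samples = ["Dani\u00ebl", "\u4e2d\u6587", "\U0001F354"]  # Daniël, 中文, hamburger emoji
for sample in samples:
    sizes = [len(ch.encode("utf-8")) for ch in sample]
    print(sample, sizes, needs_utf8mb4(sample))
```

Only the emoji needs a 4-byte sequence, so it is the one that would not fit in a utf8 (utf8mb3) column.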
Where to set the encoding?
It is set on a per-column basis
There is a per-table default
There is a per-database default Stored in db.opt. Use ALTER DATABASE to change it.
There is a per-server default: character_set_server
Connections also have a character set Set the character set in your connection properties. If that isn't possible, use SET NAMES utf8mb4.
Drawbacks of UTF-8
So just set everything to utf8mb4? The question we want to answer is...
It depends The answer is...
Does your application support it? Input validation, character length, security.
CHAR(10) suddenly needs 40 bytes!
TINYTEXT has a size limit in bytes With utf8mb4 you can store between 63 and 255 characters. This also applies to the other TEXT types and BLOB types.
The MEMORY storage engine expands VARCHAR(10) to 40 bytes This affects: - User created tables - Internal temporary tables
With InnoDB your index grows over 767 bytes. Use innodb_large_prefix with the COMPRESSED or DYNAMIC row format.
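The numbers on these slides follow from simple arithmetic on the maximum bytes per character. Here is a small sketch of that arithmetic in Python; the function names are invented for the example, and the byte limits (255 for TINYTEXT, 767 for the index prefix) are the ones mentioned above.

```python
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8mb3": 3, "utf8mb4": 4}


def reserved_bytes(char_count, charset):
    """Worst-case bytes for a column declared with char_count characters."""
    return char_count * MAX_BYTES_PER_CHAR[charset]


def guaranteed_chars(byte_limit, charset):
    """Characters that always fit within a byte limit, whatever the text."""
    return byte_limit // MAX_BYTES_PER_CHAR[charset]


print(reserved_bytes(10, "utf8mb4"))     # CHAR(10), or VARCHAR(10) in MEMORY
print(guaranteed_chars(255, "utf8mb4"))  # TINYTEXT
print(guaranteed_chars(767, "utf8mb4"))  # index prefix limit
print(guaranteed_chars(767, "utf8mb3"))
```

This gives 40, 63, 191 and 255, which is why an index on a VARCHAR(255) column fits within 767 bytes with utf8 but not with utf8mb4.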
Best practice: Use latin1 for server, database and table default. Enable Unicode on columns which need it. Or use utf8mb4 all the way if you don't need the efficiency and performance of latin1. Changing everything to VARBINARY and BLOB will not solve your issue.
Converting your data
How to convert from latin1 to utf8mb4?
ALTER TABLE t1 MODIFY COLUMN c1 VARCHAR(100) CHARACTER SET utf8mb4;
But I have many columns!
Use CONCAT() and information_schema to generate the statements
Or convert all columns: ALTER TABLE t1 CONVERT TO CHARACTER SET utf8mb4; Also for INSERTs!
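The slides suggest generating the statements in SQL with CONCAT(). As an alternative illustration, the same idea can be sketched in Python: read column metadata, then print one ALTER per column. Everything here is an assumption made for the example: the helper name alter_statement, the sample rows, and the schema name 's1' borrowed from a later slide. The query uses standard information_schema.COLUMNS column names but is not executed here.

```python
# Query a script could run first (shown for context only, not executed here).
COLUMN_QUERY = """
SELECT TABLE_NAME, COLUMN_NAME, COLUMN_TYPE
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 's1' AND CHARACTER_SET_NAME = 'latin1'
"""


def alter_statement(table, column, column_type, charset="utf8mb4"):
    """Build one ALTER TABLE ... MODIFY COLUMN statement."""
    return (f"ALTER TABLE `{table}` MODIFY COLUMN `{column}` "
            f"{column_type} CHARACTER SET {charset};")


# Hypothetical rows, shaped like the result of COLUMN_QUERY.
rows = [("t1", "c1", "varchar(100)"), ("t1", "c2", "text")]
for row in rows:
    print(alter_statement(*row))
```

Note that MODIFY COLUMN restates the whole column definition, so a real script would also have to carry over attributes such as NOT NULL, DEFAULT and the column comment. Review the generated statements before running them.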
Change defaults
Set character_set_server
ALTER SCHEMA s1 DEFAULT CHARACTER SET utf8mb4;
ALTER TABLE t1 DEFAULT CHARACTER SET utf8mb4;
Common failures
Application looks okay, but in MySQL the data looks wrong 'Search' in the application might not function correctly
The latin1 column was holding utf8 data already
Wrong conversion == garbage Don't just convert this data. Running a latin1 to UTF-8 conversion on data which was already UTF-8 will result in garbage.
Change column to varbinary and then to utf8mb4 to not convert the data.
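What goes wrong, and why the varbinary detour helps, can be simulated on the byte level. This is a Python sketch under the assumption that the "cp1252" codec approximates MySQL's latin1; the variable names are made up for the illustration.

```python
name = "Dani\u00ebl"  # Daniël, with a precomposed ë

# The application wrote UTF-8 bytes into a column labelled latin1.
stored = name.encode("utf-8")

# Reading those bytes as latin1 shows the familiar garbage.
seen_by_mysql = stored.decode("cp1252")

# A real latin1 -> utf8mb4 conversion encodes that garbage again.
double_encoded = seen_by_mysql.encode("utf-8")

# The varbinary detour keeps the bytes and only changes the label.
relabelled = stored.decode("utf-8")

print(seen_by_mysql)
print(double_encoded)
print(relabelled)
```

The first line printed is the mangled name, the second shows that the two bytes of ë have become four, and the third is the intact name.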
The conversion fails and eats your data MySQL tries really hard to convert your data but this might not be possible.
Use sql_mode='strict_all_tables'
Now the operation will fail instead of truncating your data Also for inserts
Connection set to utf8, but data is 4-byte UTF-8. You can't insert or request 4-byte characters
Collation support
There is no utf8mb4_general_cs (case sensitive)
There is utf8mb4_unicode_ci
And utf8mb4_unicode_520_ci
And utf8mb4_bin
utf8mb4_unicode_ci is based on UCA 4.0.0 and utf8mb4_unicode_520_ci on UCA 5.2.0. The latest UCA version is 8.0.0.
Here we compare the sun and moon emoji.
Special collations get lost during conversion Collation = Sorting & Equality
ALTER TABLE CONVERT TO only supports one collation
Save the collation before the ALTER and then restore it for columns which have a non-default collation.
Collation mismatch
Here MySQL does not know which collation to use.
Use COLLATE to set the desired collation for the operation.
é and e + ´ (e followed by a combining accent) are not identical Combining characters
Unicode normalization forms
NFC: composed
NFD: decomposed
NFKC: compatibility, composed
NFKD: compatibility, decomposed
The NFK forms remove compatibility distinctions and so lose information, but this is useful for search etc.
Best practice: Normalize strings in your application
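As an illustration of normalizing in the application, here is a short sketch using Python's standard unicodedata module. The choice of Python is only for the example; most languages have an equivalent library.

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True

ligature = "\ufb01"     # the 'fi' ligature, a compatibility character
print(unicodedata.normalize("NFC", ligature) == ligature)    # True: NFC keeps it
print(unicodedata.normalize("NFKC", ligature))               # fi: information lost
```

Normalizing to one form (for example NFC) before storing means both spellings of é end up as the same bytes in the database.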
4-byte characters get silently lost on dump/restore
Set utf8mb4 as default charset for the connection mysqldump uses
This shows what we can do with a patched mysql client. It uses Unicode drawing characters.
This shows the unicode character database imported into MySQL
Questions? Daniel.vanEeden@booking.com @dveeden
Did you know a BOM can be in the middle of a string? Also MySQL doesn't handle BOMs well
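One way a BOM can end up in the middle of a string is by concatenating two pieces of text that each start with one. The following Python sketch shows this with hypothetical data; it says nothing about how MySQL itself treats the result.

```python
import codecs

part_one = codecs.BOM_UTF8 + "first".encode("utf-8")
part_two = codecs.BOM_UTF8 + "second".encode("utf-8")

# "utf-8-sig" strips a BOM only at the very start of the input.
joined = (part_one + part_two).decode("utf-8-sig")

print(repr(joined))
print(joined.index("\ufeff"))  # position of the BOM that survived
```

The leading BOM is removed, but the second one stays in the text as U+FEFF, invisible in most output.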
Fulltext search & CJK
Not every character has the same width
Not even if we use a monospace font
A character can have a width of 0, 1, 2 or -1 positions
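The four widths can be approximated with Python's unicodedata module. This is a simplified sketch in the spirit of the C function wcwidth(), not a reimplementation of it; display_width is a hypothetical helper and the category rules are a rough assumption.

```python
import unicodedata


def display_width(ch):
    """Rough terminal width of one character, in the style of wcwidth()."""
    category = unicodedata.category(ch)
    if category == "Cc":
        return -1  # control character, not printable
    if category in ("Mn", "Me", "Cf"):
        return 0   # combining marks and format characters
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2   # wide and fullwidth characters
    return 1


for ch in ["a", "\u4e2d", "\u0301", "\x07"]:
    print(repr(ch), display_width(ch))
```

Here the letter a takes one position, the CJK character two, the combining accent none, and the bell control character is reported as -1.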
Punycode
Ligatures, Glyphs, Characters
Using Unicode
Fonts
Typing
Virtual keyboards
Control characters
Charset is not a constraint
Replacement characters