MySQL/MariaDB Unicode Issues

TLDR: use utf8mb4 as the character set for tables because utf8 is broken in MySQL.

Recently while attempting to load the Unihan character database into a MySQL database using Django, but I found that I was getting encoding errors. To cut a long story short, it turns out that in MySQL, the character encoding utf8 != utf8!

The long version of the story is that when creating the database, I had used the default “utf8” encoding, thinking that this would enable the full use of unicode. Unfortunately this is not the case, as in MySQL “utf8” does not fully implemnet UTF8.

The solution to this problem is to use the “utf8mb4” encoding instead.

CREATE DATABASE blog CHARACTER SET utf8mb4;

But this is not enough, you also need to inform Django to use utf8mb4 when connecting to MySQL. To do this add the following to Django database options

'OPTIONS': {'charset': 'utf8mb4'},

One more problem happened, I had set the “hanzi” field to be unique but then part way though loading in the data, the script returned a “duplicate entry” error for hanzi field (this was for the 𠀁 character). This is due to the collation settings for MySQL, which sets the rules MySQL uses for comparing characters.

The collation setting I needed is utf8mb4_bin, which compares the bytes of the character.

I did not want to change the collation setting for the whole database, as this could break other things. So I decided to just change that column. This means I needed to create a custom migration in Django. The first step is to create an empty migration.

python3 manage.py makemigrations --empty zhongwen

Then add the following code to the list of operations to run for that migration.

migrations.RunSQL(
    'ALTER TABLE `zhongwen_hanzi` CHANGE `hanzi` `hanzi` VARCHAR(1) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL;'
)

Then we can run the migration, and it will change the hanzi field to use utf8mb4_bin for the collation.