database – James North

MySQL/MariaDB Unicode Issues

Posted 2022-07-26 in Software

TLDR: use utf8mb4 as the character set for tables because utf8 is broken in MySQL.

Recently while attempting to load the Unihan character database into a MySQL database using Django, but I found that I was getting encoding errors. To cut a long story short, it turns out that in MySQL, the character encoding utf8 != utf8!

The long version of the story is that when creating the database, I had used the default “utf8” encoding, thinking that this would enable the full use of unicode. Unfortunately this is not the case, as in MySQL “utf8” does not fully implemnet UTF8.

The solution to this problem is to use the “utf8mb4” encoding instead.

CREATE DATABASE blog CHARACTER SET utf8mb4;

But this is not enough, you also need to inform Django to use utf8mb4 when connecting to MySQL. To do this add the following to Django database options

'OPTIONS': {'charset': 'utf8mb4'},

One more problem happened, I had set the “hanzi” field to be unique but then part way though loading in the data, the script returned a “duplicate entry” error for hanzi field (this was for the 𠀁 character). This is due to the collation settings for MySQL, which sets the rules MySQL uses for comparing characters.

The collation setting I needed is utf8mb4_bin, which compares the bytes of the character.

I did not want to change the collation setting for the whole database, as this could break other things. So I decided to just change that column. This means I needed to create a custom migration in Django. The first step is to create an empty migration.

python3 manage.py makemigrations --empty zhongwen

Then add the following code to the list of operations to run for that migration.

migrations.RunSQL(
    'ALTER TABLE `zhongwen_hanzi` CHANGE `hanzi` `hanzi` VARCHAR(1) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL;'
)

Then we can run the migration, and it will change the hanzi field to use utf8mb4_bin for the collation.

database django mysql python software engineering sql unicode

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.