MySQL/MariaDB Unicode Issues

TLDR: use utf8mb4 as the character set for tables because utf8 is broken in MySQL.

Recently, while attempting to load the Unihan character database into a MySQL database using Django, I found that I was getting encoding errors. To cut a long story short, it turns out that in MySQL, the character encoding utf8 != UTF-8!

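The long version of the story is that when creating the database, I had used the default “utf8” encoding, thinking that this would enable the full use of Unicode. Unfortunately this is not the case: in MySQL, “utf8” does not fully implement UTF-8. It stores at most three bytes per character, so characters outside the Basic Multilingual Plane, which need four bytes in UTF-8, cannot be stored.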
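To see which characters are affected, here is a minimal Python 3 sketch. The helper name needs_utf8mb4 is made up for this illustration; it simply checks for code points above U+FFFF, which are the ones that take four bytes in UTF-8.

```python
def needs_utf8mb4(text):
    """Return True if any character in text needs four bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)


# The last sample is the character U+20001 from the Unihan data.
for ch in ["A", "中", "\U00020001"]:
    encoded = ch.encode("utf-8")
    print("U+%04X" % ord(ch), len(encoded), "bytes", needs_utf8mb4(ch))
```

Running something like this over the data before loading it is a quick way to tell whether a three-byte character set would be enough.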

The solution to this problem is to use the “utf8mb4” encoding instead.

CREATE DATABASE blog CHARACTER SET utf8mb4;

But this is not enough: you also need to tell Django to use utf8mb4 when connecting to MySQL. To do this, add the following to the Django database options.

'OPTIONS': {'charset': 'utf8mb4'},
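For context, this is roughly where that line sits in settings.py. It is only a sketch: the engine string is Django's standard MySQL backend and NAME matches the database created above, but USER, PASSWORD, HOST and PORT are placeholder values, not taken from my real setup.

```python
# settings.py (fragment)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "blog",
        "USER": "blog_user",      # placeholder
        "PASSWORD": "change-me",  # placeholder
        "HOST": "localhost",      # placeholder
        "PORT": "3306",           # placeholder
        # Make the connection itself use the full four-byte encoding
        "OPTIONS": {"charset": "utf8mb4"},
    }
}
```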

One more problem came up. I had set the “hanzi” field to be unique, but partway through loading the data, the script returned a “duplicate entry” error for the hanzi field (this was for the 𠀁 character). This is due to the collation setting in MySQL, which determines the rules MySQL uses for comparing characters.

The collation setting I needed is utf8mb4_bin, which compares the bytes of the character.
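The idea behind a binary comparison can be imitated in Python 3. This is only an illustration of the concept, not how MySQL implements collations, and bin_equal is a name invented for this sketch: two strings are equal only if their encoded bytes are identical, so neighbouring characters such as U+20000 and U+20001 stay distinct.

```python
def bin_equal(a, b):
    """Compare two strings by their UTF-8 bytes, like a binary collation."""
    return a.encode("utf-8") == b.encode("utf-8")


first, second = "\U00020000", "\U00020001"

# The encodings differ only in the final byte
print(first.encode("utf-8").hex(), second.encode("utf-8").hex())
print(bin_equal(first, second))
```

A non-binary collation, by contrast, is allowed to treat different characters as equal when comparing, and that is what trips up a unique constraint.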

I did not want to change the collation setting for the whole database, as this could break other things. So I decided to just change that column. This means I needed to create a custom migration in Django. The first step is to create an empty migration.

python3 manage.py makemigrations --empty zhongwen

Then add the following code to the list of operations to run for that migration.

migrations.RunSQL(
    'ALTER TABLE `zhongwen_hanzi` CHANGE `hanzi` `hanzi` VARCHAR(1) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL;'
)

Then we can run the migration, and it will change the hanzi field to use utf8mb4_bin for the collation.

             

Profiling Python Applications

I’ve been working on a project written in Python that continuously communicates with some external industrial equipment. It polls the status of this equipment 4 times per second and also sends commands to it when requested. My job this week was to raise the update rate to 5 Hz, so I needed to make sure I had enough time to do this!

I decided that before doing anything I should profile the code to find out how long the main loop takes to run and which methods take the longest. That way I’d know whether the code can support 5 Hz and, if not, what I can do about it.
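The arithmetic is simple: at 4 Hz each pass through the loop has 250 ms, and at 5 Hz it has 200 ms. A rough sketch of that check in Python 3 is below. The function names are invented for this example, and step stands in for one pass of the real main loop.

```python
import time


def loop_budget_ms(rate_hz):
    """Time available per loop iteration, in milliseconds."""
    return 1000.0 / rate_hz


def time_iteration_ms(step):
    """Run step() once and return how long it took, in milliseconds."""
    start = time.perf_counter()
    step()
    return (time.perf_counter() - start) * 1000.0


def fits_rate(step, rate_hz):
    """True if a single timed run of step() fits inside the budget."""
    return time_iteration_ms(step) <= loop_budget_ms(rate_hz)


print(loop_budget_ms(4), loop_budget_ms(5))
```

A single timing is noisy, so this only gives a first impression; the profiler tells you where the time actually goes.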

Once again the Python standard library comes to the rescue: the cProfile module will monitor the execution of your program and generate a report. Below is an example of how to use it.

import cProfile
cProfile.run("main()")
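If the default report is too long, the pstats module from the standard library can sort and trim it. The sketch below (Python 3) is one way to do that; busy_work is a stand-in for the real main loop and profile_report is a helper name made up for this example.

```python
import cProfile
import io
import pstats


def busy_work(n=10000):
    """Stand-in for the real main loop body."""
    return sum(i * i for i in range(n))


def profile_report(func, top=10):
    """Profile func() and return the `top` entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return stream.getvalue()


report = profile_report(busy_work)
print(report)
```

Sorting by cumulative time puts the functions that account for most of the loop time, including everything they call, at the top of the list.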

The next thing I did was write a simple bit of code that prints the current update rate of my application to stdout every second. It’s pretty much a Python port of the JavaScript library stats.js.

from __future__ import division, print_function
import time


def now_ms():
    """Return the current time in milliseconds, rounded to the nearest integer
    """
    return int(time.time()*1000 + 0.5)


class stats(object):
    def __init__(self):
        self.msMin = 1000
        self.msMax = 0
        self.msTime = 0
        self.fpsMin = 1000
        self.fpsMax = 0
        self.fps = 0
        self.updates = 0
        self.startTime = now_ms()
        self.prevTime = now_ms()

    def begin(self):
        """Calling this method signifies the start of a frame
        """
        self.startTime = now_ms()

    def end(self):
        """Calling this method signifies the end of a frame
        """
        now = now_ms()

        # Duration of this frame, plus the running minimum and maximum
        self.msTime = now-self.startTime
        self.msMax = max(self.msMax, self.msTime)
        self.msMin = min(self.msMin, self.msTime)

        #print("ms: %i (%i - %i)" % (self.msTime, self.msMin, self.msMax))

        self.updates = self.updates + 1

        # Once a second, work out the update rate over the last interval
        if now > (self.prevTime + 1000.0):
            self.fps = round((self.updates*1000.0)/float(now-self.prevTime))
            self.fpsMax = max(self.fpsMax, self.fps)
            self.fpsMin = min(self.fpsMin, self.fps)

            print("stats: %i fps (%i fps - %i fps)" % (self.fps, self.fpsMin, self.fpsMax))

            self.prevTime = now
            self.updates = 0

        return now

    def update(self):
        """End the current frame and immediately start the next one
        """
        self.startTime = self.end()

Simply add a call to begin() at the start of your loop and a corresponding call to end() at the end of your loop. Now I get a nice counter that tells me how the code is performing. As I work on the code I can see how each change affects the main loop performance.

       

Plotting a Text File with Matplotlib and IPython

Earlier I was converting a Scilab simulation into C, and I had the code emit a load of text files containing the data because I didn’t want to do any plotting or UI stuff in C. But I still wanted to plot the data so I could quickly check everything was working. I also wanted to do some post-processing of the data too… well, that’s where Python really shines in my opinion. I fired up IPython and used NumPy and Matplotlib.

import numpy as np
import matplotlib.pyplot as pyplot
pyplot.plot( np.loadtxt("data/somedatafile.dat") )
pyplot.show()
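The post-processing side is just as short. As a rough sketch, the snippet below loads a column of numbers and summarises it. The values are made up and fed in through io.StringIO so the example runs without a data file; with a real file you would pass its path to np.loadtxt instead.

```python
import io

import numpy as np

# Made-up sample standing in for the contents of a data file
sample = io.StringIO("0.0\n1.0\n4.0\n9.0\n")

data = np.loadtxt(sample)
summary = {
    "min": float(data.min()),
    "max": float(data.max()),
    "mean": float(data.mean()),
}
print(summary)
```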

Done! In four lines of code I have my plot. I love IPython, NumPy and Matplotlib; they allow you to get things done really fast. I also love that IPython even auto-completes file paths, which is very handy :-).