MySQL/MariaDB Unicode Issues

TLDR: use utf8mb4 as the character set for tables because utf8 is broken in MySQL.

Recently while attempting to load the Unihan character database into a MySQL database using Django, but I found that I was getting encoding errors. To cut a long story short, it turns out that in MySQL, the character encoding utf8 != utf8!

The long version of the story is that when creating the database, I had used the default “utf8” encoding, thinking that this would enable the full use of unicode. Unfortunately this is not the case, as in MySQL “utf8” does not fully implemnet UTF8.

The solution to this problem is to use the “utf8mb4” encoding instead.

CREATE DATABASE blog CHARACTER SET utf8mb4;

But this is not enough, you also need to inform Django to use utf8mb4 when connecting to MySQL. To do this add the following to Django database options

'OPTIONS': {'charset': 'utf8mb4'},

One more problem happened, I had set the “hanzi” field to be unique but then part way though loading in the data, the script returned a “duplicate entry” error for hanzi field (this was for the 𠀁 character). This is due to the collation settings for MySQL, which sets the rules MySQL uses for comparing characters.

The collation setting I needed is utf8mb4_bin, which compares the bytes of the character.

I did not want to change the collation setting for the whole database, as this could break other things. So I decided to just change that column. This means I needed to create a custom migration in Django. The first step is to create an empty migration.

python3 manage.py makemigrations --empty zhongwen

Then add the following code to the list of operations to run for that migration.

migrations.RunSQL(
    'ALTER TABLE `zhongwen_hanzi` CHANGE `hanzi` `hanzi` VARCHAR(1) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL;'
)

Then we can run the migration, and it will change the hanzi field to use utf8mb4_bin for the collation.

             

Second Day of Learning Tagalog

Technically this is still day 1, yesterday I completed Core Lession 1.1 and 1.2. This time I completed Core Lesson 1.3.

New Words

  • siya – he/she
  • sila – they
  • nagluluto – cooking
  • lumalangoy – swimming
  • nagsusulat – studying
  • paalam – bye

Example Sentences

nagluluto siya – he/she is cooking

lumalangoy ang mga batang lalaki => the boys are swimming

     

First Lesson in Tagalog

I’m planning a trip out to the Philippines soon and want to have some knowelege of Tagalog before I go. A couple of years ago I purchased a license for Rosetta Stone for learning Chinese, and as it turns out it also supports Tagalog.

Rosetta Stone doesn’t always give a lot of explaination of the phrases it teaches, I think this is deliberate. I plan to write about each lesson, researching what I’m learning to help solidify it in my mind.

New Words

  • umiinom – to drink
  • kumain – to eat
  • tumatakbo – to run
  • nagbabasa – to read
  • babae – woman
  • lalaki – man
  • batang lalaki – young boy
  • batang babae – young girl
  • ang
  • ang mga

Example Sentances

kumain ang babae – the woman is eating

kumain ang batang lalaki – the boy is eating

nagbabasa ang mga batang babae – the girls are reading

Note that red means i’m unsure of the translation, Rosettsa Stone only uses pictures to represent the concept, there’s not translation.

Grammar Notes

In the 1.2 section of lesson 1, they introduced “mga”, it seems that mga is used to make common nouns plural.

       

Solving Problem 1 Using Windows Batch files

As crazy a it sounds, it turns out it is possible to solve Project Euler problem 1 using Windows Batch files! It is possible because modern versions for cmd support delayed expansion of variables, and arithmetic expressions using “SET /A”.

Delayed expansion means that variables are expanded at runtime, rather than parse time. Normally %VARIABLE% will be expanded once with whatever value it contains before the script is run, however with delayed expansion we can use !VARIABLE!. This variable is expanded whenever that line is executed, which with a for loop could be multiple times.

@echo off
setlocal enabledelayedexpansion

set /a sum = 0
FOR /L %%L IN (1, 1, 999) DO (
    set /a mod = %%L %% 3
    IF NOT !mod! EQU 0 set /a mod = %%L %% 5
    IF !mod! EQU 0 set /a sum = sum + %%L
)

echo %sum%
endlocal
       

Enabling Case Sensitivity in Windows

I recently needed to download a large number of files from a Linux server using SCP, which would take nearly 12 hours. When i checked the next day this transfer had stopped to ask if I wanted to overwrite a file. This was because two files had the same name, differing only in case.

I remembered there was a registry setting that could set the NTFS filesystem to be case sensitive[0], but I’d rather not do that on this workstation. I did a Google search anyway, but then I found there is a newer way to do this which is on a per-folder basis instead of globally[1].

There is a command called “fsutil.exe” that can adjust a lot of filesystem settings, one of them is “SetCaseSensitiveInfo”, which controls if the folder is case sensitive or not.

fsutil.exe file SetCaseSensitiveInfo C:\folder enable

When I restarted the download, it still didn’t work, because that command only applies it to the specified folder, and any new folders created after. It does not apply to existing sub-folders. The following PowerShell command will apply the setting to any existing folder.

Get-ChildItem C:\folder -Recurse | 
	? {$_.PSIsContainer} |
	% { fsutil.exe file SetCaseSensitiveInfo $_.FullName enable }

References

[0] https://superuser.com/questions/266110/how-do-you-make-windows-7-fully-case-sensitive-with-respect-to-the-filesystem

[1] https://docs.microsoft.com/en-us/windows/wsl/case-sensitivity

     

Managing Multiple Versions of Visual Studio

In my previous post, I described how to build an old version of GNU Make for Windows. While working on that I wanted to be able to test out different versions of Visual Studio to see if it builds successfully. Quickly switching between versions of Visual Studio was difficult so I created a batch file to help make it a lot easier.

This script takes a single argument that specifies which version of visual studio you want to set up and it then calls the appropriate vsvars32.bat file for that version of Visual Studio.

@echo off

if "%1" == "vs4" goto vs4
if "%1" == "vs6" goto vs6
if "%1" == "vs2003" goto vs2003

goto argerror

:vs4
pushd C:\msdev\bin
call vcvars32 x86
popd
goto done


:vs6
pushd C:\Program Files\Microsoft Visual Studio\VC98\Bin
call vcvars32
popd
goto done


:vs2003
pushd "C:\Program Files\Microsoft Visual Studio .NET 2003\Common7\Tools\"
call vsvars32.bat
popd
goto done

:argerror
echo no Visual Studio version specified!
echo usage: setupenv [version]
echo    where version is one of the following: vs4, vs6, vs2003

:done

             

Compiling Old Versions of GNU Make for Windows

I needed to build GNU Make v3.8 for Windows, turns out that this is not straightforward and I needed to patch the build script to get it to correctly build. GNU Make 3.8 is a very old version of make, the release note dates back to 2002.

The first issue is it needs an old version of Visual C++, trying to build using VS 2019 you will get a lot of warnings about deprecated flags. Reading the README.W32 file it mentions MSVC 5.x and MSVC 6.x. I opted for MSVC 6 which I happened to have a copy of in a Windows 2000 VM.

The first issue we need to resolve is a linking error caused by a missing library.

Looking through the build output we can see that it cannot find "config.h", which prevents the "subproc.lib" library from compiling which subsequently causes the error we just saw.

C:\BUILD\xxx\make-3.80\w32\subproc>cl.exe /nologo /MT /W3 /GX /Z7 /YX /Od /I .. /I . /I ../include /I ../.. /D WIN32 /D
WINDOWS32 /D _DEBUG /D _WINDOWS /FR.\WinDebug/ /Fp.\WinDebug/subproc.pch /Fo.\WinDebug/ /c sub_proc.c
sub_proc.c
sub_proc.c(9) : fatal error C1083: Cannot open include file: 'config.h': No such file or directory

The reason that "config.h" does not exist is that the file is not created by the build script. The line in the build script that creates it doesn’t get run, this is because there is a “+” at the beginning of the line which stops this line from running.

set make=gnumake
+if not exist config.h copy config.h.W32 config.h
cd w32\subproc

Removing the plus at the beginning of the line allows "subproc.lib" to be compiled and linked, but we still get linker errors.

C:\BUILD\xxx\make-3.80>echo WinRel\pathstuff.obj  1>>link.rel

C:\BUILD\xxx\make-3.80>echo off
"Linking WinRel/gnumake.exe"
function.obj : error LNK2001: unresolved external symbol _hash_init
variable.obj : error LNK2001: unresolved external symbol _hash_init
file.obj : error LNK2001: unresolved external symbol _hash_init
dir.obj : error LNK2001: unresolved external symbol _hash_init
read.obj : error LNK2001: unresolved external symbol _hash_init
variable.obj : error LNK2001: unresolved external symbol _hash_insert_at
file.obj : error LNK2001: unresolved external symbol _hash_insert_at
dir.obj : error LNK2001: unresolved external symbol _hash_insert_at
read.obj : error LNK2001: unresolved external symbol _hash_insert_at
variable.obj : error LNK2001: unresolved external symbol _hash_deleted_item
file.obj : error LNK2001: unresolved external symbol _hash_deleted_item
dir.obj : error LNK2001: unresolved external symbol _hash_deleted_item
read.obj : error LNK2001: unresolved external symbol _hash_deleted_item
variable.obj : error LNK2001: unresolved external symbol _hash_find_slot
file.obj : error LNK2001: unresolved external symbol _hash_find_slot
dir.obj : error LNK2001: unresolved external symbol _hash_find_slot
read.obj : error LNK2001: unresolved external symbol _hash_find_slot
variable.obj : error LNK2001: unresolved external symbol _hash_find_item
file.obj : error LNK2001: unresolved external symbol _hash_find_item
dir.obj : error LNK2001: unresolved external symbol _hash_find_item
function.obj : error LNK2001: unresolved external symbol _hash_find_item
variable.obj : error LNK2001: unresolved external symbol _hash_free
read.obj : error LNK2001: unresolved external symbol _hash_free
function.obj : error LNK2001: unresolved external symbol _hash_free
variable.obj : error LNK2001: unresolved external symbol _hash_map
file.obj : error LNK2001: unresolved external symbol _hash_map
variable.obj : error LNK2001: unresolved external symbol _hash_delete
file.obj : error LNK2001: unresolved external symbol _hash_delete
variable.obj : error LNK2001: unresolved external symbol _hash_print_stats
file.obj : error LNK2001: unresolved external symbol _hash_print_stats
variable.obj : error LNK2001: unresolved external symbol _hash_map_arg
file.obj : error LNK2001: unresolved external symbol _hash_dump
dir.obj : error LNK2001: unresolved external symbol _hash_insert
function.obj : error LNK2001: unresolved external symbol _hash_insert
function.obj : error LNK2001: unresolved external symbol _hash_load
.\WinRel/gnumake.exe : fatal error LNK1120: 13 unresolved externals
"WinRel build failed"

C:\BUILD\xxx\make-3.80>

There’s a lot of unresolved symbols, I searched for "hash_insert_at" and found the definition for it in "hash.c". Looking through the build script it turns out that this file is not included in the build. I added the following two lines to "build_w32.bat" just after "implicit.c" is compiled.

cl.exe /nologo /MT /W3 /GX /YX /O2 /I . /I glob /I w32/include /D NDEBUG /D WINDOWS32 /D WIN32 /D _CONSOLE /D HAVE_CONFIG_H /FR.\WinRel/ /Fp.\WinRel/%make%.pch /Fo.\WinRel/ /c hash.c
echo WinRel\hash.obj >>link.rel

Running the build script again, and voilà…..

C:\BUILD\build\make-3.80>echo off
"Linking WinRel/gnumake.exe"
LINK : warning LNK4089: all references to "ADVAPI32.dll" discarded by /OPT:REF
"WinRel build succeeded!"

C:\BUILD\build\make-3.80>

YAY!!! It now compiles, and we have gnumake.exe.

         

Translating a Restaurant Sign

I visited Cambridge at the weekend with my friend, she noticed a sign in a Chinese restaurant. She told me the sign said that this restaurant is Stephen Hawking’s favourite restaurant. This is my attempt at translating it.

The Chinese on the sign is:

物理学家霍金先生最喜欢的中国餐厅

Wùlǐ xué jiā huòjīn xiānshēng zuì xǐhuān de zhōngguó cāntīng

I know most of the s characters in this sentence, however there are new characters and words for me. The new words for me are physicist and Hawking.

  • 物理学家 – Physicist
  • 霍金 – Hawking

The translation is:

The favourite Chinese restaurant of Physicist Mr Hawking.

     

Fixing Code Block Alignment

I’m using the Enlighter plugin for WordPress to syntax highlight code snippets in posts. When using the TwentyTwenty theme the code snippets are left-aligned instead of centered in the post. This is a known compatibility issue for this plugin, and there is a fix detailed on Github.

.enlighter-default{
    margin: 0 auto 1.25em auto;
}

I’ve already created a child theme based on TwentyTwenty so it was easy to add the above CSS to the stylesheet, and et viola, the code snippets are correctly aligned now.

   

Wasabi Skin Care!

芥末味的护肤品😱😂?英国人吃日料的时候都不吃芥末,他们会喜欢芥末味在护肤品里吗。我想知道销量怎样。
我喜欢图二的味道,这个牌子在英国有,但买不到这些亚洲口味,除非去中国超市买,大部分英国人都不知道这两个味道的存在。

       
1 2 3 5