Support for Emojis With Full UTF-8 Support in Moodle and MySQL

Leon Stringer
Revised 14 May 2020, originally published 3 September 2018).

Introduction

Moodle Docs excerpt

If you’re installing Moodle using MySQL (or MariaDB) following the documentation you may have encountered the following:

It’s recommended that you have full UTF-8 support configured in MySQL. If this is not done some character sets, notably emojis, cannot be used. It is possible to do this after your site is installed but it is much easier before installation.

Or when installing Moodle you may have run into some of the following messages:

The current setup of MySQL or MariaDB is using ‘utf8’. This character set does not support four byte characters which include some emoji. Trying to use these characters will result in an error when updating a record, and any information being sent to the database will be lost. Please consider changing your settings to ‘utf8mb4’.

Your database uses Antelope as the file format. Full UTF-8 support in MySQL and MariaDB requires the Barracuda file format. Please switch to the Barracuda file format.

For full support of UTF-8 both MySQL and MariaDB require you to change your MySQL setting ‘innodb_file_per_table’ to ‘ON’.

For full support of UTF-8 both MySQL and MariaDB require you to change your MySQL setting ‘innodb_large_prefix’ to ‘ON’.

This blog post explains what these messages mean and the best practices to avoid them when installing your Moodle site.

Character sets

At the heart of this topic is how characters – A, B, a, b, 1, 2, @, #, etc. – are stored by computers. In olden times (the 1970s), storing all characters needed for the English language could fit into eight bits (binary digits): one byte, which allows up to 255 characters. With the limited computer storage of the time this was adequate for IT systems in English-speaking and western countries. But clearly this approach couldn’t cope with languages such as those requiring more than 255 characters, e.g. Chinese or Japanese, or displaying characters for multiple languages at once.

As markets opened up between countries and cultures such basic support for different languages became a barrier not just to sales but also for interoperability and cooperation. Not only was your product unattractive if it didn’t support the language for a target market, increasingly such support became a mandatory requirement in formal procurement processes.

Enter Unicode

To address these shortcomings a scheme was devised called Unicode to standardise support for characters from any language: Unicode. Of course our one byte per character couldn’t support this, additional bytes would be needed. Unicode separates the characters themselves (A, a, €, ç, Д, 서) from how bytes are used to store them. With Unicode you must decide on one of the available character encodings which will determine the number of bytes needed for each character. Two of the most common character encodings are UTF-8 and UTF-16. In UTF-16 storing ‘A’ uses two bytes and so does ‘£’. However with UTF-8, ‘£’ still uses two bytes, but ‘A’ now uses only one byte. Systems storing and exchanging Unicode data, e.g. Moodle and MySQL, must know which character encoding is being used.

MySQL uses UTF-8 to encode characters which uses a different number of bytes to store different characters. Storing “hello” uses five bytes (five 1-byte characters) whereas “你好” (nǐ hǎo – “hello” in Chinese) uses six (two 3-byte characters). The setting MySQL introduced for this was utf8 which allowed up to three bytes per character which is enough to store all characters needed for the majority of languages across the globe: the characters in Unicode’s Basic Multilingual Plane (BMP).

Despite this widespread language support MySQL’s approach proved to be inadequate in a surprisingly short space of time.

Adding emojis 🚀☕😀

New sets of characters are added to Unicode over time but what could happen to cause supporting the BMP and the majority of the world’s languages to suddenly be insufficient? The widespread adoption of emojis, particularly on mobile devices meant that, in a relatively short period of time, these pictographic representations of feelings and everyday objects are expected by users to be supported.

Emojis were consequently added to Unicode but are not in the BMP so three bytes aren’t enough to store these with UTF-8 encoding. So MySQL introduced a new setting, utf8mb4, supporting up to four bytes per character, the maximum UTF-8 can use (maybe they should have done this in the first place). With utf8mb4 you can store emojis in MySQL so when installing Moodle you should create your database using this option, e.g.: CREATE DATABASE moodle DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci.

Create database with utf8mb4_unicode_ci collation in phpMyAdmin

However even if you create you database with utf8mb4 you may still get an error saying you don’t have full UTF-8 support.

Database character encoding errors in Moodle environment checks

We’ve used the right setting so why do these messages appear?

More bytes, bigger indexes

Indexes are an important technique in databases used to retrieve values very quickly from database tables. When designing a system like Moodle you consider the searches needed to retrieve items from the database, e.g. users’ names or email addresses, and create indexes as appropriate. Indexes increase the storage required so you only add these when the improvement to performance justifies doing so.

As the amount of space needed to store a character has increased from up to three bytes to up to four bytes so the index’s size has increased. A Moodle course shortname can be up to 255 characters. Previous versions of MySQL (<= 5.7.6) allowed up to 767 bytes for an indexed field which is enough to store 255 characters up to three bytes each but not enough if we’re now using up to four bytes. So we need larger index sizes which MySQL supports but we need to make sure it’s enabled.

There’s a MySQL setting allowing indexes to handle up to 3,072 bytes. But this only applies if the newer “Barracuda” file format is in use. And this file format is only available if MySQL is configured to use a separate file for each database table. You can check these settings by running the following query in MySQL:

mysql> SHOW GLOBAL VARIABLES WHERE variable_name IN ('innodb_file_format', 'innodb_large_prefix', 'innodb_file_per_table');
+-----------------------+-----------+
| Variable_name         | Value     |
+-----------------------+-----------+
| innodb_file_format    | Barracuda |
| innodb_file_per_table | ON        |
| innodb_large_prefix   | ON        |
+-----------------------+-----------+
3 rows in set (0.02 sec)

SHOW GLOBAL VARIABLES in phpMyAdmin

If any of your settings do not match those shown in the Value column you need to edit your MySQL configuration file (typically my.cnf on Linux servers, my.ini on Microsoft Windows) with the following settings:

[mysqld]
innodb_file_format = Barracuda
innodb_file_per_table = 1
innodb_large_prefix

and then restart the MySQL service to apply these. You can also change these settings without restarting MySQL:

SET GLOBAL innodb_file_format = Barracuda;
SET GLOBAL innodb_file_per_table = 1;
SET GLOBAL innodb_large_prefix = 1;

but you need to make sure the configuration file has these settings as changes made with SET GLOBAL will be lost when MySQL or your server is restarted. After making any necessary changes you can re-run the SHOW GLOBAL VARIABLES query to verify the settings are now correct.

What if you can’t change MySQL?

But what if the required settings – innodb_file_format=Barracuda, innodb_file_per_table=1, innodb_large_prefix=1 – cannot be changed? Maybe you’re using a shared host managed by a third party or you don’t have the access needed to change MySQL’s configuration. Will you still be able to install Moodle? You can, but you must manually create your database before installing Moodle and specify the older utf8 character encoding, e.g.:

mysql> CREATE DATABASE moodle DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Query OK, 1 row affected (0.03 sec)

Create database with utf8_unicode_ci collation in phpMyAdmin

As your database will be using 3-byte character encoding your site will not be able to display emojis: this may result in characters being lost and an increase in support calls regarding this lack of functionality. You will also see a warning message on your site’s Environment report (Site administrationServerEnvironment). Moodle recommend against using utf8 when installing your site.

It’s possible to migrate to utf8mb4 after installation if you are able to apply the MySQL configuration at a later date, but this requires downtime for your site so it’s always better to make sure you meet the recommendation when installing rather than fixing it later.

Conclusion

Hopefully this blog post has given you a better understanding of the importance of getting full UTF-8 support configured in MySQL when installing Moodle: