Unicode

Always use the Unicode conversion functions offered in carb/extras/Unicode.h when you must convert between representations. Fortunately this is a rare occurrence because we follow the “Unicode sandwich” methodology in this code base. Read on for details.

Unicode was unfortunately late to the character encoding party. When it finally arrived, all kinds of half-baked solutions already existed, including code pages and wide character encodings. This is a legacy that we must still deal with, and there are many pitfalls. Even in newer code, like the C++ Standard Library, it is easy to trigger an exception when converting between Unicode representations.

Hopefully this introduction gets your attention. We’ve developed some simple rules for this project that should keep you out of harm’s way and make programming simple when dealing with text.

Unicode Sandwich

There are multiple Unicode representations, such as UTF-8, UTF-16LE, and UTF-32. For this project we have chosen to use the same representation everywhere, so all text is stored, and therefore interpreted, the same way. This approach is called “the Unicode sandwich”: when we interact with external APIs (OS or third-party libraries) that don’t offer our internal encoding, we convert on the boundary between whatever encoding they use and our internal encoding. This happens on the input and output boundaries, which act as the two slices of bread in the sandwich analogy. Inside the sandwich everything is in the same encoding. This uniformity makes things simpler for all of us.

UTF-8 is our Encoding

We use UTF-8 encoding for all of our text in Carbonite, both in interfaces and inside plugins. This means that all the text is stored in char arrays. The reasons for choosing UTF-8 over other encodings are the following:

  1. Existing tools support; browsers, editors, etc. support it directly and it is most often the default.

  2. Existing format support; JSON, XML, etc. are all UTF-8 encoded by default.

  3. Many string operations written for single-byte ASCII text (for example, searching for an ASCII delimiter or concatenating strings) work equally well on UTF-8 text.

  4. The English alphabet and many other special characters are directly represented in the first 7 bits (7-bit ASCII), making debugging easier when viewing raw memory.

  5. All code points are supported by UTF-8 because it’s inherently extendable.
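To make point 3 concrete, here is a minimal sketch of byte-oriented operations that keep working on UTF-8 text. The helper names `countCodePoints` and `beforeDelimiter` are hypothetical and not part of Carbonite, and the code assumes its input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points by skipping UTF-8 continuation
// bytes, which always have the bit pattern 10xxxxxx. Assumes valid UTF-8.
std::size_t countCodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8)
    {
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

// Hypothetical helper: returns everything before the first ASCII delimiter.
// Bytes below 0x80 never occur inside a multi-byte UTF-8 sequence, so a
// plain byte search cannot split a character in half.
std::string beforeDelimiter(const std::string& utf8, char asciiDelimiter)
{
    return utf8.substr(0, utf8.find(asciiDelimiter));
}

// Usage, with the non-ASCII letter spelled as explicit UTF-8 bytes
// (U+00F6 is encoded as C3 B6).
void utf8ByteOperationsExample()
{
    const std::string volcano = "Eyjafjallaj\xC3\xB6kull";
    assert(volcano.size() == 17);           // bytes (UTF-8 code units)
    assert(countCodePoints(volcano) == 16); // code points
    assert(beforeDelimiter(volcano + "/Iceland", '/') == volcano);
}
```

Note that `size()` reports bytes, not characters, so operations that index by character position still need care.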

Entering UTF-8 Characters in Source Files

We use .editorconfig to set the encoding of all text files to UTF-8. This means that the source files are already UTF-8 and you can therefore paste text directly into them and it just works!

const char* volcano = "Eyjafjallajökull";
const char* mountain = "Everest";

In C++20, the char8_t type was introduced as a distinct type from char. This changes the type of a u8-prefixed string literal (added in C++11) from char const[N] to char8_t const[N]. Because there is no implicit conversion from char8_t const* to char const*, code that assigns u8-prefixed string literals to char const* fails to compile when C++20 is enabled.

Because of this, it is not advised to use u8 prefixes. Instead, assume that every char represents a valid UTF-8 code unit. In other words, just write "kæstur hákarl" instead of u8"kæstur hákarl", even though æ and á are not ASCII.
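As a rough illustration, the sketch below compares a plain literal against the same text spelled as explicit UTF-8 bytes. It assumes the compiler reads the source file as UTF-8 and uses UTF-8 as its execution character set (the default for GCC and Clang; MSVC needs the /utf-8 switch). The names `dish`, `dishBytes`, and `literalIsUtf8` are hypothetical.

```cpp
#include <cassert>
#include <cstring>

// Plain literal: the type is char const[N] in every C++ version.
const char* dish = "kæstur hákarl";

// The u8-prefixed form compiles up to C++17 but is an error in C++20,
// because char8_t const* does not convert to char const*:
// const char* broken = u8"kæstur hákarl";

// The same text as explicit UTF-8 bytes (æ is C3 A6, á is C3 A1).
const char* dishBytes = "k\xC3\xA6stur h\xC3\xA1karl";

// True when the compiler stored the plain literal as UTF-8
// (see the assumptions stated above).
bool literalIsUtf8()
{
    return std::strcmp(dish, dishBytes) == 0;
}

void literalExample()
{
    assert(literalIsUtf8());
    assert(std::strlen(dish) == 15); // 13 characters, 15 bytes
}
```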

Note

If you choose to use an editor or merge tool that either:

  • doesn’t support character encoding coming from .editorconfig or,

  • doesn’t use UTF-8 by default

then you are responsible for saving files in UTF-8 encoding if you push their content beyond 7-bit ASCII.

Viewing UTF-8 Characters in Windows Console

You need to enable the right code page in your Windows console to get the UTF-8 encoded characters output by Carbonite FrameworkLogger to render correctly. You can do this temporarily by executing:

chcp 65001

You can also do it permanently by following these steps:

  1. Start -> Run -> regedit

  2. Go to HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor

  3. Create a string value named Autorun (if it doesn’t already exist)

  4. Change the value to @chcp 65001>nul

  5. Launch a new Windows console

Some Unicode characters (for example, Chinese characters) will still be displayed as boxes, question marks, or other placeholder symbols, which means the console font has no glyph for the requested character. To fix this, select a different font for the console: right click on the console title bar -> Properties -> Font, and select NSimSun. MS Gothic and MS Mincho are other fonts with greater Unicode coverage, but they are known to cause artifacts when displaying certain ASCII symbols.

Converting To and From those Tricky wchar_t Types

Yes, we used the plural for wchar_t on purpose. There isn’t a single wchar_t type. On Windows it is a 16-bit type that holds UTF-16LE code units, while on Linux it is a 32-bit type that holds UTF-32 code points. The two don’t have the same capabilities and are not the same size. This has some serious implications.
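The difference is easy to observe. The sketch below is only an illustration: the names `wideEncodingName` and `wideUnitsForVolcano` are hypothetical, and it assumes that a 2-byte wchar_t means UTF-16 and a 4-byte wchar_t means UTF-32. That holds on Windows and Linux but is not guaranteed by the C++ standard.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical helper: names the encoding that wchar_t is assumed to carry
// on this platform, based purely on its size.
const char* wideEncodingName()
{
    return sizeof(wchar_t) == 2 ? "UTF-16" : "UTF-32";
}

// Hypothetical helper: number of wchar_t code units needed for U+1F30B
// (VOLCANO), a code point outside the Basic Multilingual Plane.
std::size_t wideUnitsForVolcano()
{
    return std::wcslen(L"\U0001F30B");
}

void wideCharExample()
{
    if (sizeof(wchar_t) == 2)
        assert(wideUnitsForVolcano() == 2); // surrogate pair (Windows)
    else
        assert(wideUnitsForVolcano() == 1); // single code point (Linux)
}
```

Code that assumes one wchar_t per character, or that always treats wchar_t data as UTF-16, therefore behaves differently on each platform.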

Don’t write code that converts between UTF-16 and UTF-8 when the input is wchar_t. That code will (probably) work on Windows but it will mysteriously fail on Linux. Instead you must be careful to convert from the actual type, not from what you think it contains. The C++ Standard Library supports converting from both UTF-16 and wchar_t. Avoid the UTF-16 conversions unless you positively know the input is in UTF-16. The fact that the type is wchar_t does not guarantee this (unless you are writing Windows-specific code). Even then, it is safer to convert from wchar_t, since this works correctly on all operating systems, regardless of what the wchar_t type is on that particular OS.

The default settings for the C++ Standard Library provided conversion objects (<codecvt>) will throw range exceptions when code points are encountered that are outside the range for the source or target encoding. Few are aware of this and even fewer know how to work around it, so please use the utility functions provided in Unicode.h and never roll your own.
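The following sketch illustrates the pitfall. It is not a recommended pattern, and the helper names are hypothetical. It uses std::wstring_convert with std::codecvt_utf8, both of which are deprecated since C++17, so compilers may emit deprecation warnings. The failing input here is an invalid UTF-8 byte, which is one of the ways such a conversion can fail.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

using Utf8ToWide = std::wstring_convert<std::codecvt_utf8<wchar_t>>;

// A default-constructed converter reports failure by throwing std::range_error.
bool defaultConverterThrows(const std::string& bytes)
{
    Utf8ToWide converter;
    try
    {
        converter.from_bytes(bytes);
        return false;
    }
    catch (const std::range_error&)
    {
        return true;
    }
}

// A converter constructed with error strings returns the wide error string
// on failure instead of throwing.
std::wstring convertOrFallback(const std::string& bytes)
{
    Utf8ToWide converter("?", L"?");
    return converter.from_bytes(bytes);
}

void codecvtExample()
{
    assert(defaultConverterThrows("\xFF")); // 0xFF never appears in UTF-8
    assert(!defaultConverterThrows("Everest"));
    assert(convertOrFallback("\xFF") == L"?");
}
```

The functions in Unicode.h exist so that callers do not have to remember details like these.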

Don’t use the std::experimental::filesystem::path facilities. First, they are not supported on all of Carbonite’s target platforms (Tegra). Second, they are error prone: by default these facilities use wide strings with backslashes on Windows but UTF-8 with forward slashes on Linux. You need to be careful to call the right u8string accessor and u8path constructors in order to not bungle it up. Just avoid this mess altogether and use Path.h.

Fortunately, we only need to deal with the wide characters on Windows so the conversion functions to/from those are only available on the Windows platform. In general just write:

#include <carb/extras/Unicode.h>

and use the functions provided for your platform, knowing that you are doing it right.