Transcoding algorithm

The algorithm is a generalization of Punycode, known as bootstring. There are references to how the algorithm work in the RFC-3492, the following document summarize it and may offer a more friendly explanation of the concepts.

Encoding procedure

Separation of basic codes

In general in Bootstring algorithm we need to differentiate between basic codes and extended codes:

Basic codes are known a priori and are the valid codes supported in our application domain.
Extended codes are every other UTF-8 code which is not in the basic code set.

For OpenUSD versions older than 24.03, the basic codes would be the alphanumerics and underscore. The extended code is virtually every other UTF-8 codes except those 63 characters.

The first step in encoding is separating the basic codes from the extended codes. The basic codes will be copied directly to the string, since they already belong to the domain of valid characters and will not cause any problem. If no extended code exists, then the algorithm finishes here.

Encoding of extended codes

If there are extended codes, we start by appending a delimiter character. A delimiter character is a character which belongs to the set of basic codes and help to differentiate between basic and extended codes. The original specification uses dash (-), however in our implementation we will use underscore (_).

Knowing all the extended codes and their positions in our original string we can proceed to encode them using a combination of:

Delta encoding
Variable length integer encoding
Mixed radix representation

Delta encoding

Delta encoding is the process of encoding differences of values, instead of encoding directly the values. In transcoding this is useful as it exploits the fact that UTF-8 characters of the same language appear close to each other. For example for japanese, Hiragana syllabary appears from 0x3040 to 0x309f while Katakana appears from 0x30a0 to 0x30ff. This helps to reduce the number of encoded bytes.

One difference in this implementation is that delta encoding starts at character 0. In Punycode, all ASCII characters are valid, as such delta encoding start with value 128. However, this is not true in OpenUSD where we still have invalid ASCII codes (i.e. / for separating paths).

Variable length integer encoding

Variable length integer encoding allow us to concatenate integers without having to mark the limits between each of them. UTF-8 itself is an example of a variable length encoding. The way variable length encoding works is as follows:

Take a value, if the value is smaller than threshold, print the value and exit.
if the value is bigger than threshold:
- get the least significant digit. In any base encoding, we would do this as digit = value % base, however we will adjust digit to be threshold <= adjusted digit < base. Print the adjusted digit.
- Decrease the value. In any other base encoding we would do this as value = value / base, however we just removed an adjusted amount, which we should take into consideration.
Repeat.

In this specific implementation we increase the base from the original 36 to 62, with this we can also set our threshold to be a constant without affecting performance.

Mixed radix representation

Above encodings let us represent single numbers, however we intend to store both the extended code and its position. Although we could store a sequence of two integers, that would expand our encoding representation. Another way is to represent the extended code and position as a single number. We can do this by expressing it as extendedCode * (string size + 1) + position (the string size + 1 is to take into account that 0 <= position <= string size). If we know the string size we can find the extended code and position by simply:

extendedCode = value / (string size + 1)
position = value % (string size + 1)

Mixed radix representation is also used in variable length integer encoding, since every position of our integer has a different base determined by the threshold.

Decoding procedure

The decoding procedure follows the reverse process:

Copy the basic codes.
For the encoded part:
- Let value be the variable length integer read.
- Increase code value code by value / (string size + 1) (due to delta encoding).
- Let position pos be value % (string size + 1).
- Insert code value code at position pos.
- Increase the size of the string by the size of the code.

Differences to Punycode

A summary of the differences against Bootstring are:

The separating character changes from - to _. Since - is invalid character in OpenUSD.
Delta encoding starts from 0 instead of 128. This to account the fact not all ASCII’s are allowed in OpenUSD.
The base representation changes from 36 in Punycode to 62 in this implementation (to represent more information in less characters).
Threshold is constant in this implementation. There is no loss of performance or memory representation, since our base is also increased, and simplifies implementation.