About Transcoding

Punycode, defined in RFC 3492, is a specialization of the Bootstring algorithm. Bootstring would perform better than base encodings and URL encodings. However, a basic Punycode implementation has some initial limitations for OpenUSD:

Punycode (and Bootstring in general) begins by segregating the basic codes. In the definition of Punycode, basic codes are all ASCII characters whose value is less than the parameter initial_n. For Punycode this value is 128. However, OpenUSD does not accept every ASCII character below 128 as valid. Moreover, the set of valid characters is non-contiguous (0-9 occupy 48-57, A-Z occupy 65-90, and a-z occupy 97-122). For example, --> $1.00 <-- consists only of basic codes, so Punycode copies its characters through unchanged; however, the resulting identifier is invalid for multiple reasons.
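The pass-through behaviour described above can be observed with Python's built-in punycode codec. This is only an illustrative sketch: is_valid_identifier is a hypothetical helper, and the ASCII rule it checks (a letter or underscore first, then letters, digits, or underscores) is an assumption about the identifier rules, not code taken from OpenUSD.

```python
import re

# Assumed ASCII identifier rule: a letter or underscore first,
# followed by letters, digits, or underscores.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_identifier(name: str) -> bool:
    """Hypothetical check, used only for this illustration."""
    return _IDENTIFIER.fullmatch(name) is not None


def punycode(text: str) -> str:
    # Python ships a "punycode" codec implementing the RFC 3492 algorithm.
    return text.encode("punycode").decode("ascii")


encoded = punycode("$1.00")
print(encoded)                       # the basic codes $1.00 are copied through
print(is_valid_identifier(encoded))  # False
```

Note that the codec also appends the Punycode delimiter after the basic codes, which does not help: the output still contains $ and ., so it remains invalid as an identifier.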

An implementation of Punycode (or another Bootstring algorithm) would therefore require a specific IsBasicCode function, and several steps of the algorithm would need to be adapted, because initial_n is no longer a single clean threshold.
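A minimal sketch of what such a predicate and the segregation step could look like follows. The names is_basic_code and segregate are hypothetical, and treating the underscore as a basic code alongside letters and digits is an assumption made for this example.

```python
import string

# Assumption: basic codes are the ASCII letters, the digits, and underscore.
BASIC_CODES = frozenset(string.ascii_letters + string.digits + "_")


def is_basic_code(ch: str) -> bool:
    """Set-membership test replacing the simple `ord(ch) < initial_n` check."""
    return ch in BASIC_CODES


def segregate(text: str) -> tuple[str, list[tuple[int, str]]]:
    """Split text into its basic codes and (position, character) pairs
    for everything that still has to be encoded."""
    basic = []
    non_basic = []
    for index, ch in enumerate(text):
        if is_basic_code(ch):
            basic.append(ch)
        else:
            non_basic.append((index, ch))
    return "".join(basic), non_basic


print(segregate("hello world"))  # ('helloworld', [(5, ' ')])
```

Because the basic set is no longer a contiguous range, a membership test stands in for the numeric comparison. The later steps of the algorithm, which rely on initial_n as the starting code point, are not shown here.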

Another, less impactful problem with Punycode (but not with Bootstring in general) is the way non-basic code characters are encoded and decoded. Punycode uses base 36 digits by default. Each base 36 digit carries less information than a base 62 digit, so Punycode's processing of non-basic code characters is less efficient than a base 62 encoding.
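The difference between the two radices can be made concrete with a little arithmetic. The sketch below compares only the raw radix; it ignores the variable-length thresholds and bias adaptation that Bootstring applies on top, and the value 200000 is an arbitrary example, not a figure from the specification.

```python
import math


def digits_needed(value: int, base: int) -> int:
    """Number of digits needed to write a non-negative integer in `base`."""
    count = 1
    while value >= base:
        value //= base
        count += 1
    return count


for base in (36, 62):
    bits_per_digit = round(math.log2(base), 2)
    print(base, bits_per_digit, digits_needed(200000, base))
# 36 5.17 4
# 62 5.95 3
```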

Finally, just as with other encodings, special care must be taken with leading digits. In the case of Punycode this should be easier, as any leading digits can be treated as non-basic codes and added after the delimiter.
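The detection step for leading digits might be sketched as follows. split_leading_digits is a hypothetical helper; it only separates the digits that would be routed to the non-basic portion and does not perform the encoding itself.

```python
def split_leading_digits(text: str) -> tuple[str, str]:
    """Return (leading_digits, rest) for an ASCII string."""
    rest = text.lstrip("0123456789")
    return text[: len(text) - len(rest)], rest


print(split_leading_digits("123abc"))  # ('123', 'abc')
print(split_leading_digits("abc123"))  # ('', 'abc123')
```

Only the leading run is split off; digits that appear later in the string remain ordinary basic codes.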

These concerns can be addressed with a custom implementation of Bootstring. This strategy has the following advantages:

  • Efficiency: 100% for basic code characters, and at worst 72% for non-basic code characters. This is because variable-length encoding is more efficient than a simple bit-shifting encoding (such as base 62).

  • Readability: a valid identifier is encoded without any change, e.g. hello is encoded to hello. Non-valid identifiers consisting mostly of valid characters are only partially encoded, e.g. hello world is encoded to tn__helloworld_lA, where only the space is encoded. Only non-valid identifiers consisting mostly of invalid characters become unreadable, e.g. ->$.<- is encoded to tn__a0I26g1D. This is an improvement over other encoding methods, such as base 62, where every case is obfuscated.

  • Querying: for identifiers made up only of basic code characters, querying works the same as before.

Disadvantages:

  • Querying: unfortunately, encoding the search term and doing a character comparison will not work, as this is not a byte-aligned encoding. All paths will therefore have to be decoded as they are traversed.