carb::extras::Utf8Parser

class Utf8Parser

Static helper class to allow for the processing of UTF-8 strings.

This can walk individual codepoints in a string, decode and encode codepoints, and count the number of codepoints in a UTF-8 string. Error checking is minimal, and most functions assume that the input string is valid. The only failures that are checked are ones that prevent further decoding (e.g., not enough space left in a buffer, unexpected sequences, or misaligned strings).

Public Types

enum class SurrogateMember

Names to classify a decoded codepoint’s membership in a UTF-16 surrogate pair.

The UTF-16 encoding allows a single Unicode codepoint to be represented using two 16-bit codepoints. These two codepoints are referred to as the ‘high’ and ‘low’ codepoints. The high codepoint always comes first in the pair, and the low codepoint is expected to follow immediately after. The high codepoint is always in the range 0xd800-0xdbff. The low codepoint is always in the range 0xdc00-0xdfff. These ranges are reserved in the Unicode set specifically for the encoding of UTF-16 surrogate pairs and will never appear in an encoded UTF-8 string that contains UTF-16 encoded pairs.

Values:

enumerator eNone

The codepoint is not part of a UTF-16 surrogate pair.

This codepoint is outside the 0xd800-0xdfff range and is either represented directly in UTF-8 or is small enough to not require a UTF-16 encoding.

enumerator eHigh

The codepoint is a high value in the surrogate pair.

This must be in the range reserved for UTF-16 high surrogates (i.e., 0xd800-0xdbff). This codepoint must come first in the pair.

enumerator eLow

The codepoint is a low value in the surrogate pair.

This must be in the range reserved for UTF-16 low surrogates (i.e., 0xdc00-0xdfff). This codepoint must come second in the pair.

using CodePoint = char32_t

The base type for a single Unicode codepoint value.

This represents a decoded UTF-8 codepoint.

using Utf16CodeUnit = char16_t

The base type for a single UTF-16 code unit.

This represents a decoded UTF-8 codepoint that fits in 16 bits, or a single member of a surrogate pair.

using CodeByte = char

The base type for a point in a UTF-8 string.

Ideally these values should point to the start of an encoded codepoint in a string.

using Flags = uint32_t: Base type for flags to various encoding and decoding functions.

Public Static Functions

static inline const CodeByte *nextCodePoint(const CodeByte *str, size_t lengthInBytes = kNullTerminated, CodePoint *codepoint = nullptr, Flags flags = 0)

Finds the start of the next UTF-8 codepoint in a string.

Remark

This attempts to walk a UTF-8 encoded string to find the start of the next valid codepoint. This can be used to walk from one codepoint to another and modify the string as needed, or to just count its length in codepoints.

Parameters

str – [in] The UTF-8 string to find the next codepoint in. This may not be nullptr. This is assumed to be a valid UTF-8 encoded string and the passed in address is assumed to be the start of another codepoint. If the fDecodeSkipInvalid flag is used in flags, an attempt will be made to find the start of the next valid codepoint in the string before failing. If a valid lead byte is found within the bounds of the string, that will be returned instead of failing in this case.
lengthInBytes – [in] The remaining number of bytes in the string. This may be kNullTerminated if the string is well known to be null terminated. This operation will not walk the string beyond this number of bytes. Note that the operation may still end before this many bytes have been scanned if a null terminator is encountered.
codepoint – [out] Receives the decoded codepoint. This may be nullptr if the decoded codepoint is not needed. If this is nullptr, none of the work to decode the codepoint will be done. If the next UTF-8 codepoint is part of a UTF-16 surrogate pair, the full codepoint will be decoded.
flags – [in] Flags to control the behavior of this operation. This may be 0 or one or more of the fDecode* flags.

Returns

The address of the start of the next codepoint in the string if one is found.

Returns

nullptr if the string is empty, a null terminator is found, or there are no more bytes remaining in the string.
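The walk described above can be illustrated with a standalone sketch. It does not call Utf8Parser. The names `sequenceLengthSketch` and `nextCodePointSketch` are hypothetical, the explicit length replaces the kNullTerminated default, and only standard one- to four-byte sequences are assumed. Continuation bytes are not validated, which mirrors the "input is assumed valid" stance but may differ from the library in detail.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: length of the UTF-8 sequence implied by a lead byte,
// or 0 if the byte cannot start a sequence (continuation or invalid byte).
inline std::size_t sequenceLengthSketch(unsigned char lead)
{
    if (lead < 0x80) return 1;           // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;
}

// Hypothetical stand-in for the walk: returns the address just past the
// codepoint that starts at `str`, or nullptr if the string is empty, a null
// terminator is found, the lead byte is invalid, or too few bytes remain.
inline const char* nextCodePointSketch(const char* str, std::size_t lengthInBytes)
{
    assert(str != nullptr);
    if (lengthInBytes == 0 || *str == '\0')
        return nullptr;
    const std::size_t n = sequenceLengthSketch(static_cast<unsigned char>(*str));
    if (n == 0 || n > lengthInBytes)
        return nullptr;
    return str + n;
}
```

A caller would typically loop until nullptr is returned, reducing the remaining length by the distance advanced on each step.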

static inline const CodeByte *lastCodePoint(const CodeByte *str, size_t lengthInBytes = kNullTerminated, CodePoint *codepoint = nullptr, Flags flags = fDecodeUseDefault)

Finds the start of the last UTF-8 codepoint in a string.

Remark

This function attempts to walk a UTF-8 encoded string to find the start of the last valid codepoint.

Parameters

str – [in] The UTF-8 string to find the last codepoint in. This is assumed to be a valid UTF-8 encoded string and the passed in address is assumed to be the start of another codepoint. If the fDecodeSkipInvalid flag is used in flags, an attempt will be made to find the start of the last valid codepoint in the string before failing. If a valid lead byte is found within the bounds of the string, that will be returned instead of failing in this case.
lengthInBytes – [in] The remaining number of bytes in the string. This may be kNullTerminated if the string is well known to be null terminated. This operation will not walk the string beyond this number of bytes. Note that the operation may still end before this many bytes have been scanned if a null terminator is encountered.
codepoint – [out] Receives the decoded codepoint. This may be nullptr if the decoded codepoint is not needed. If this is nullptr, none of the work to decode the codepoint will be done. If the last UTF-8 codepoint is part of a UTF-16 surrogate pair, the full codepoint will be decoded.
flags – [in] Flags to control the behavior of this operation. This may be 0 or one or more of the fDecode* flags.

Returns

The address of the start of the last codepoint in the string if one is found.

Returns

nullptr if the string is empty, a null terminator is found, or there are no more bytes remaining in the string.
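One common way to find the last codepoint is to scan backward from the end of the buffer until a byte that is not a continuation byte is found. The sketch below shows that idea only. `isContinuationSketch` and `lastCodePointSketch` are hypothetical names, the function requires an explicit byte length, and it is not claimed to match how lastCodePoint() is implemented.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: true for UTF-8 continuation bytes (10xxxxxx).
inline bool isContinuationSketch(char c)
{
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Hypothetical stand-in: returns the start of the last codepoint in
// str[0, lengthInBytes), or nullptr if the range is empty or holds only
// continuation bytes.
inline const char* lastCodePointSketch(const char* str, std::size_t lengthInBytes)
{
    assert(str != nullptr);
    std::size_t i = lengthInBytes;
    while (i > 0)
    {
        --i;
        if (!isContinuationSketch(str[i]))
            return str + i; // first non-continuation byte from the end
    }
    return nullptr;
}
```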

static inline size_t getLengthInCodePoints(const CodeByte *str, size_t maxLengthInBytes = kNullTerminated, Flags flags = 0)

Calculates the length of a UTF-8 string in codepoints.

Remark

This can be used to count the number of codepoints in a UTF-8 string. The count will only include valid codepoints that are found in the string. It will not include a count for any invalid code bytes that are skipped over when the fDecodeSkipInvalid flag is used. This means that it’s not necessarily safe to use this result (when the flag is used) to allocate a decode buffer, then decode the codepoints in the string with getCodePoint() using the fDecodeUseDefault flag.

Parameters

str – [in] The string to count the number of codepoints in. This may not be nullptr. This is expected to be a valid UTF-8 string.
maxLengthInBytes – [in] The maximum number of bytes to parse in the string. This can be kNullTerminated if the string is well known to be null terminated.
flags – [in] Flags to control the behavior of this operation. This may be 0 or fDecodeSkipInvalid.

Returns

The number of valid codepoints in the given UTF-8 string.

Returns

0 if the string is empty or no valid codepoints are found.
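For a valid UTF-8 string, the codepoint count equals the number of bytes that are not continuation bytes. The following standalone sketch counts that way. `countCodePointsSketch` is a hypothetical name, and the sketch neither validates sequences nor models fDecodeSkipInvalid.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in: counts codepoints in a valid UTF-8 string, stopping
// at maxLengthInBytes or at the first null terminator, whichever comes first.
inline std::size_t countCodePointsSketch(const char* str, std::size_t maxLengthInBytes)
{
    assert(str != nullptr);
    std::size_t count = 0;
    for (std::size_t i = 0; i < maxLengthInBytes && str[i] != '\0'; ++i)
    {
        // Every byte that is not a continuation byte (10xxxxxx) starts a codepoint.
        if ((static_cast<unsigned char>(str[i]) & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

Passing a very large maximum length makes the loop rely on the terminator alone, which is the role kNullTerminated plays in the real signature.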

static inline size_t getLengthInCodeBytes(const CodePoint *str, size_t maxLengthInCodePoints = kNullTerminated, Flags flags = 0)

Calculates the length of a Unicode string in UTF-8 code bytes.

Remark

This can be used to count the number of UTF-8 code bytes required to encode the given Unicode string. The count will only include valid codepoints that are found in the string. Note that if the fEncodeUseUtf16 flag is not used here to calculate the size of a buffer, it should also not be used when converting codepoints. Otherwise the buffer could overflow.

Note

For the 32-bit codepoint variant of this function, it is assumed that UTF-16 surrogate pairs do not exist in the source string. In the 16-bit codepoint variant, surrogate pairs are supported.

Parameters

str – [in] The string to count the number of code bytes that will be required to store it in UTF-8. This may not be nullptr. This is expected to be a valid Unicode string.
maxLengthInCodePoints – [in] The maximum number of codepoints to parse in the string. This can be kNullTerminated if the string is well known to be null terminated.
flags – [in] Flags to control the behavior of this operation. This may be 0 or fEncodeUseUtf16.

Returns

The number of UTF-8 code bytes required to encode this Unicode string, not including the null terminator.

Returns

0 if the string is empty or no valid codepoints are found.
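The buffer-size warning above can be made concrete with a standalone sketch. It assumes standard UTF-8 lengths for codepoints up to 0x10FFFF, and it assumes that a surrogate-pair encoding writes each surrogate as its own three-byte sequence (six bytes total). Both assumptions, and the names `utf8LengthSketch` and `lengthInCodeBytesSketch`, are illustrative rather than taken from the library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: bytes needed for one codepoint (assumes cp <= 0x10FFFF).
// When useUtf16Pairs is true, codepoints above 0xFFFF are assumed to be stored
// as two surrogates of three bytes each instead of one four-byte sequence.
inline std::size_t utf8LengthSketch(char32_t cp, bool useUtf16Pairs)
{
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return useUtf16Pairs ? 6 : 4;
}

// Hypothetical stand-in: total bytes for a UTF-32 string, not counting the
// terminator. Stops at maxLengthInCodePoints or at the first zero codepoint.
inline std::size_t lengthInCodeBytesSketch(const char32_t* str,
                                           std::size_t maxLengthInCodePoints,
                                           bool useUtf16Pairs)
{
    assert(str != nullptr);
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < maxLengthInCodePoints && str[i] != U'\0'; ++i)
        bytes += utf8LengthSketch(str[i], useUtf16Pairs);
    return bytes;
}
```

Under these assumptions a string containing U+1F600 needs two more bytes with pairs enabled than without, which is why the same flag choice has to be used for both sizing and encoding.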

static inline size_t getLengthInCodeBytes(const Utf16CodeUnit *str, size_t maxLengthInCodePoints = kNullTerminated, Flags flags = 0)

Calculates the length of a Unicode string in UTF-8 code bytes.

Remark

This can be used to count the number of UTF-8 code bytes required to encode the given Unicode string. The count will only include valid codepoints that are found in the string. Note that if the fEncodeUseUtf16 flag is not used here to calculate the size of a buffer, it should also not be used when converting codepoints. Otherwise the buffer could overflow.

Note

For the 32-bit codepoint variant of this function, it is assumed that UTF-16 surrogate pairs do not exist in the source string. In the 16-bit codepoint variant, surrogate pairs are supported.

Parameters

str – [in] The string to count the number of code bytes that will be required to store it in UTF-8. This may not be nullptr. This is expected to be a valid Unicode string.
maxLengthInCodePoints – [in] The maximum number of codepoints to parse in the string. This can be kNullTerminated if the string is well known to be null terminated.
flags – [in] Flags to control the behavior of this operation. This may be 0 or fEncodeUseUtf16.

Returns

The number of UTF-8 code bytes required to encode this Unicode string, not including the null terminator.

Returns

0 if the string is empty or no valid codepoints are found.
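For 16-bit input, a high surrogate followed by a low surrogate forms one codepoint above 0xFFFF, which takes four bytes in standard UTF-8. The standalone sketch below counts bytes on that basis. `lengthInCodeBytesUtf16Sketch` is a hypothetical name, the fEncodeUseUtf16 flag is not modeled, and counting an unpaired surrogate as three bytes is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in: UTF-8 bytes needed for a UTF-16 string, not counting
// the terminator. Stops at maxLengthInCodeUnits or at the first zero unit.
inline std::size_t lengthInCodeBytesUtf16Sketch(const char16_t* str,
                                                std::size_t maxLengthInCodeUnits)
{
    assert(str != nullptr);
    std::size_t bytes = 0;
    std::size_t i = 0;
    while (i < maxLengthInCodeUnits && str[i] != u'\0')
    {
        const char16_t unit = str[i];
        const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        if (isHigh && i + 1 < maxLengthInCodeUnits &&
            str[i + 1] >= 0xDC00 && str[i + 1] <= 0xDFFF)
        {
            bytes += 4; // a complete pair is one codepoint above 0xFFFF
            i += 2;
            continue;
        }
        // Single unit (an unpaired surrogate is counted as three bytes here).
        bytes += (unit < 0x80) ? 1 : (unit < 0x800) ? 2 : 3;
        ++i;
    }
    return bytes;
}
```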

static inline CodePoint getCodePoint(const CodeByte *str, size_t lengthInBytes = kNullTerminated, Flags flags = 0)

Decodes a single codepoint from a UTF-8 string.

Remark

This decodes the next codepoint in a UTF-8 string. The returned codepoint may be part of a UTF-16 surrogate pair. The classifyUtf16SurrogateMember() function can be used to determine if this is the case. If this is part of a surrogate pair, the caller should decode the next codepoint then decode the full pair into a single codepoint using decodeUtf16CodePoint().

Parameters

str – [in] The string to decode the first codepoint from. This may not be nullptr. The string is expected to be aligned to the start of a valid codepoint.
lengthInBytes – [in] The number of bytes remaining in the string. This can be set to kNullTerminated if the string is well known to be null terminated.
flags – [in] Flags to control the behavior of this operation. This may be 0 or fDecodeUseDefault.

Return values

kDefaultCodePoint – if the end of the string is encountered, the string is empty, or there are not enough bytes left in the string to decode a full codepoint, and the fDecodeUseDefault flag is used.

Returns

The decoded codepoint if successful.

Returns

0 if the end of the string is encountered, the string is empty, or there are not enough bytes left in the string to decode a full codepoint, and the flags parameter is zero.
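The two failure results (0 versus the default codepoint) can be seen in a standalone decoder sketch. `decodeOneSketch` and `kReplacementSketch` are hypothetical names, and a bool stands in for the fDecodeUseDefault flag. The sketch handles only one- to four-byte sequences and does not reject overlong forms, so it is an illustration rather than the library's decoder.

```cpp
#include <cassert>
#include <cstddef>

// Same value as the documented kDefaultCodePoint (U+FFFD).
constexpr char32_t kReplacementSketch = 0xFFFD;

// Hypothetical stand-in: decodes the codepoint at `str`. On failure returns
// kReplacementSketch when useDefault is true and 0 otherwise.
inline char32_t decodeOneSketch(const char* str, std::size_t lengthInBytes, bool useDefault)
{
    assert(str != nullptr);
    const char32_t failure = useDefault ? kReplacementSketch : 0;
    if (lengthInBytes == 0)
        return failure;
    const unsigned char lead = static_cast<unsigned char>(str[0]);
    if (lead == 0)
        return failure; // end of string
    if (lead < 0x80)
        return lead; // ASCII
    std::size_t n = 0;
    char32_t cp = 0;
    if ((lead & 0xE0) == 0xC0)      { n = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { n = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { n = 4; cp = lead & 0x07; }
    else
        return failure; // not a lead byte
    if (n > lengthInBytes)
        return failure; // not enough bytes for a full codepoint
    for (std::size_t i = 1; i < n; ++i)
    {
        const unsigned char c = static_cast<unsigned char>(str[i]);
        if ((c & 0xC0) != 0x80)
            return failure; // expected a continuation byte
        cp = (cp << 6) | (c & 0x3F);
    }
    return cp;
}
```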

static inline CodeByte *getCodeBytes(CodePoint cp, CodeByte *str, size_t lengthInBytes, size_t *bytesWritten, Flags flags = 0)

Encodes a single Unicode codepoint to UTF-8.

Parameters

cp – [in] The codepoint to be encoded into UTF-8. This may be any valid Unicode codepoint.
str – [out] Receives the encoded UTF-8 codepoint. This may not be nullptr. This could need to be up to seven bytes to encode any possible Unicode codepoint.
lengthInBytes – [in] The size of the output buffer in bytes. No more than this many bytes will be written to the str buffer.
bytesWritten – [out] Receives the number of bytes that were written to the output buffer. This may not be nullptr.
flags – [in] Flags to control the behavior of this operation. This may be 0 or fEncodeUseUtf16.

Returns

The output buffer if the codepoint is successfully encoded.

Returns

nullptr if the output buffer was not large enough to hold the encoded codepoint.
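The contract above (write at most lengthInBytes bytes, report the count, return nullptr when the buffer is too small) can be sketched without the library as follows. `encodeOneSketch` is a hypothetical name. The sketch assumes cp is at most 0x10FFFF, ignores fEncodeUseUtf16, and chooses to set the written count to 0 on failure, which the reference text does not specify.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in: encodes one codepoint (assumed <= 0x10FFFF) as
// standard UTF-8. Returns `str` on success, nullptr if the buffer is too small.
inline char* encodeOneSketch(char32_t cp, char* str, std::size_t lengthInBytes,
                             std::size_t* bytesWritten)
{
    assert(str != nullptr && bytesWritten != nullptr);
    const std::size_t n = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    *bytesWritten = 0; // this sketch reports 0 on failure
    if (n > lengthInBytes)
        return nullptr;
    if (n == 1)
    {
        str[0] = static_cast<char>(cp);
    }
    else
    {
        // Lead byte markers indexed by sequence length.
        static const unsigned char kLeadMarks[] = { 0x00, 0x00, 0xC0, 0xE0, 0xF0 };
        // Fill continuation bytes from the end, six payload bits each.
        for (std::size_t i = n - 1; i > 0; --i)
        {
            str[i] = static_cast<char>(0x80 | (cp & 0x3F));
            cp >>= 6;
        }
        str[0] = static_cast<char>(kLeadMarks[n] | cp);
    }
    *bytesWritten = n;
    return str;
}
```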

static inline SurrogateMember classifyUtf16SurrogateMember(CodePoint cp)

Classifies a codepoint as being part of a UTF-16 surrogate pair or otherwise.

Parameters

cp – [in] The codepoint to classify. This may be any valid Unicode codepoint.

Return values

SurrogateMember::eNone – if the codepoint is not part of a UTF-16 surrogate pair.
SurrogateMember::eHigh – if the codepoint is a ‘high’ UTF-16 surrogate pair codepoint.
SurrogateMember::eLow – if the codepoint is a ‘low’ UTF-16 surrogate pair codepoint.
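The classification reduces to two range checks against the surrogate ranges given for SurrogateMember. A standalone sketch, with the hypothetical names `SurrogateMemberSketch` and `classifySurrogateSketch` standing in for the real enum and function:

```cpp
#include <cassert>

// Hypothetical mirror of the documented SurrogateMember enum.
enum class SurrogateMemberSketch { eNone, eHigh, eLow };

// Hypothetical stand-in: classifies a codepoint by surrogate range.
inline SurrogateMemberSketch classifySurrogateSketch(char32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDBFF)
        return SurrogateMemberSketch::eHigh;
    if (cp >= 0xDC00 && cp <= 0xDFFF)
        return SurrogateMemberSketch::eLow;
    return SurrogateMemberSketch::eNone;
}

// Example: U+1F600 is stored in UTF-16 as the pair 0xD83D, 0xDE00.
inline void classifySurrogateExample()
{
    assert(classifySurrogateSketch(0xD83D) == SurrogateMemberSketch::eHigh);
    assert(classifySurrogateSketch(0xDE00) == SurrogateMemberSketch::eLow);
    assert(classifySurrogateSketch(U'A') == SurrogateMemberSketch::eNone);
}
```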

static inline CodePoint decodeUtf16CodePoint(CodePoint high, CodePoint low)

Decodes a UTF-16 surrogate pair to a Unicode codepoint.

Parameters

high – [in] The codepoint for the ‘high’ member of the UTF-16 surrogate pair.
low – [in] The codepoint for the ‘low’ member of the UTF-16 surrogate pair.

Returns

The decoded codepoint if the two input codepoints were a UTF-16 surrogate pair.

Returns

0 if either of the input codepoints were not part of a UTF-16 surrogate pair.
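The standard UTF-16 arithmetic for combining a pair is: subtract the range bases, join the two 10-bit halves, and add 0x10000. The standalone sketch below applies it and returns 0 for inputs outside the surrogate ranges, as described above. `decodePairSketch` is a hypothetical name, not the library function.

```cpp
#include <cassert>

// Hypothetical stand-in: combines a high and low surrogate into one codepoint.
// Returns 0 if either value is outside its surrogate range.
inline char32_t decodePairSketch(char32_t high, char32_t low)
{
    const bool highOk = high >= 0xD800 && high <= 0xDBFF;
    const bool lowOk = low >= 0xDC00 && low <= 0xDFFF;
    if (!highOk || !lowOk)
        return 0;
    // Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + (((high - 0xD800) << 10) | (low - 0xDC00));
}

// Example: the pair 0xD83D, 0xDE00 decodes to U+1F600.
inline void decodePairExample()
{
    assert(decodePairSketch(0xD83D, 0xDE00) == 0x1F600);
}
```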

static inline size_t encodeUtf16CodePoint(CodePoint cp, CodePoint *out)

Encodes a Unicode codepoint into a UTF-16 codepoint.

Parameters

cp – [in] The UTF-32 codepoint to encode. This may be any valid codepoint.
out – [out] Receives the equivalent codepoint encoded in UTF-16. This will either be a single UTF-32 codepoint if its value is less than the UTF-16 encoding size of 16 bits, or it will be a UTF-16 surrogate pair for codepoint values larger than 16 bits. In the case of a single codepoint being written, it will occupy the lower 16 bits of this buffer. If two codepoints are written, the ‘high’ surrogate pair member will be in the lower 16 bits, and the ‘low’ surrogate pair member will be in the upper 16 bits. This is suitable for use as a UTF-16 buffer to pass to other functions that expect a surrogate pair. The number of codepoints written can be differentiated by the return value of this function. This may be nullptr if the encoded UTF-16 codepoint is not needed but only the number of codepoints is of interest. This may safely be the same buffer as cp.

Returns

1 if the requested codepoint was small enough for a direct encoding into UTF-16. This can be interpreted as a single codepoint being written to the output codepoint buffer.

Returns

2 if the requested codepoint was too big for direct encoding in UTF-16 and had to be encoded as a surrogate pair. This can be interpreted as two codepoints being written to the output codepoint buffer.
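The packing described for the out parameter (high surrogate in the lower 16 bits, low surrogate in the upper 16 bits) can be sketched as follows. `encodeUtf16Sketch` is a hypothetical name, and the sketch assumes cp is at most 0x10FFFF. On a little-endian machine this layout places the high surrogate first in memory, which is presumably why the packed value can be handed on as a two-unit UTF-16 buffer.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in: writes either the codepoint itself or a packed
// surrogate pair to `out` (which may be nullptr) and returns the unit count.
inline std::size_t encodeUtf16Sketch(char32_t cp, char32_t* out)
{
    if (cp < 0x10000)
    {
        if (out != nullptr)
            *out = cp; // fits in a single 16-bit unit
        return 1;
    }
    const char32_t v = cp - 0x10000;           // 20 payload bits
    const char32_t high = 0xD800 + (v >> 10);  // upper 10 bits
    const char32_t low = 0xDC00 + (v & 0x3FF); // lower 10 bits
    if (out != nullptr)
        *out = high | (low << 16);
    return 2;
}

// Example: U+1F600 becomes the pair 0xD83D (high) and 0xDE00 (low).
inline void encodeUtf16Example()
{
    char32_t packed = 0;
    assert(encodeUtf16Sketch(0x1F600, &packed) == 2);
    assert((packed & 0xFFFF) == 0xD83D);
    assert((packed >> 16) == 0xDE00);
}
```

Because cp is passed by value, the output pointer may refer to the same variable that supplied cp.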

static inline bool isSpaceCodePoint(CodePoint cp)

Checks if the provided code point corresponds to a whitespace character.

Parameters

cp – [in] The UTF-32 codepoint to check.

Returns

true if the codepoint is a whitespace character, false otherwise.

Public Static Attributes

static constexpr Flags fDecodeUseDefault = 0x00000001

Flag to indicate that the default codepoint should be returned instead of just a zero when attempting to decode an invalid UTF-8 sequence.

static constexpr Flags fDecodeSkipInvalid = 0x00000002

Flag to indicate that invalid code bytes should be skipped over in a string when searching for the start of the next codepoint.

The default behavior is to fail the operation when an invalid sequence is encountered.
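Skipping invalid bytes amounts to resynchronizing on the next byte that can start a sequence. The standalone sketch below shows one way to do that. `skipInvalidSketch` is a hypothetical name, and the lead byte test uses the strict range from the current UTF-8 definition (ASCII or 0xC2 through 0xF4). The library may accept a wider range, so treat this as an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in: returns the first position at or after `str` that
// holds a plausible lead byte, or nullptr if none is found before the end of
// the buffer or a null terminator.
inline const char* skipInvalidSketch(const char* str, std::size_t lengthInBytes)
{
    assert(str != nullptr);
    for (std::size_t i = 0; i < lengthInBytes && str[i] != '\0'; ++i)
    {
        const unsigned char c = static_cast<unsigned char>(str[i]);
        // ASCII, or a lead byte of a two- to four-byte sequence.
        const bool isLead = c < 0x80 || (c >= 0xC2 && c <= 0xF4);
        if (isLead)
            return str + i;
    }
    return nullptr;
}
```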

static constexpr Flags fEncodeUseUtf16 = 0x00000004

Flag to indicate that UTF-16 surrogate pairs should be used when encoding large codepoints instead of directly representing them in UTF-8.

Using this flag makes UTF-8 strings more friendly to other UTF-8 systems on Windows (though Windows decoding facilities would still be able to decode the directly stored codepoints as well).

static constexpr Flags fEncodeIgnoreSurrogatePairs = 0x00000008

Flag for nextCodePoint() which tells the function to ignore surrogate pairs when decoding and just return both elements of the pair as code points.

This exists mainly for internal use.

static constexpr size_t kNullTerminated = ~0ull

The string buffer is effectively null terminated.

This allows the various decoding functions to bypass some range checking with the assumption that there is a null terminating character at some point in the buffer.

static constexpr CodePoint kInvalidCodePoint = ~0u

An invalid Unicode codepoint.

static constexpr size_t kMaxSequenceLength = 7

The minimum buffer size that is guaranteed to be large enough to hold an encoded UTF-8 codepoint.

This does not include space for a null terminator codepoint.

static constexpr CodePoint kDefaultCodePoint = 0x0000fffd

The codepoint reserved in the Unicode standard to represent the decoded result of an invalid UTF-8 sequence.