carb::extras::Utf8Iterator

Defined in carb/extras/Utf8Parser.h

class Utf8Iterator

A simple iterator class for walking a UTF-8 string.

This is built on top of the Utf8Parser static class and uses its functionality. Strings can only be walked forward. Random access to codepoints in the string is not possible. If needed, the pointer to the start of the next codepoint or the codepoint index can be retrieved.
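
The forward-only restriction follows from the encoding itself: a UTF-8 codepoint occupies one to four bytes, and the length is only known from its lead byte. The following is a minimal sketch using only the standard library. It is not the Carbonite implementation, and `sequenceLength` and `decodeAll` are hypothetical names introduced here for illustration. It assumes well-formed input and stops at the first malformed byte.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Carbonite): returns the byte length of the
// UTF-8 sequence that starts with `lead`, or 0 if `lead` cannot start one.
inline std::size_t sequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;           // 0xxxxxxx: single-byte (ASCII)
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                            // continuation byte or invalid lead
}

// Hypothetical helper: decodes a null-terminated UTF-8 string front to back.
// Reaching the N-th codepoint requires visiting every codepoint before it,
// which is why random access is not available.
inline std::vector<std::uint32_t> decodeAll(const char* s)
{
    std::vector<std::uint32_t> out;
    while (s != nullptr && *s != '\0')
    {
        const unsigned char lead = static_cast<unsigned char>(*s);
        const std::size_t len = sequenceLength(lead);
        if (len == 0)
            break; // malformed input: stop rather than guess

        // Keep only the payload bits of the lead byte.
        std::uint32_t cp = (len == 1) ? lead : (lead & (0xFFu >> (len + 1)));
        for (std::size_t i = 1; i < len; ++i)
        {
            const unsigned char c = static_cast<unsigned char>(s[i]);
            if ((c & 0xC0) != 0x80)
                return out; // truncated or malformed sequence
            cp = (cp << 6) | (c & 0x3Fu);
        }
        out.push_back(cp);
        s += len; // the only way forward is past the whole sequence
    }
    return out;
}
```

This sketch does not reject overlong encodings or surrogate values; a production parser would need those checks.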

Public Types

using CodeByte = Utf8Parser::CodeByte

Reference the types used in Utf8Parser for more convenient use locally.

The base type for a point in a UTF-8 string.

Ideally these values should point to the start of an encoded codepoint in a string.

using CodePoint = Utf8Parser::CodePoint

The base type for a single Unicode codepoint value.

This represents a decoded UTF-8 codepoint.

using Flags = Utf8Parser::Flags

Base type for flags to various encoding and decoding functions.

Public Functions

inline Utf8Iterator()
inline Utf8Iterator(const CodeByte *string, size_t lengthInBytes = kNullTerminated, Flags flags = 0)

Constructor: initializes a new iterator for a given string.

Parameters
  • string[in] The string to walk. This should be a UTF-8 encoded string. This can be nullptr, but the iterator will not be valid if so.

  • lengthInBytes[in] The maximum number of bytes to walk in the string. This may be kNullTerminated if the string is null terminated. If the string is unterminated or only a portion of it needs to be iterated over, this may be the size of the buffer in bytes.

  • flags[in] Flags to control the behavior of the UTF-8 parser. This may be zero or more of the Utf8Parser::fDecode* flags.

Returns

No return value.
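
The difference between a null-terminated walk and a length-limited walk can be sketched as below. This is a standalone illustration, not the library's code: `kSketchNullTerminated`, `leadLength`, and `countCodepoints` are hypothetical names, and the sentinel value is an assumption since the value of `kNullTerminated` is not given in this reference. The sketch stops before a codepoint that would cross the byte limit, and also stops at an embedded null; the library's behavior in those edge cases may differ.

```cpp
#include <cstddef>

// Stand-in sentinel for this sketch only (assumed value).
constexpr std::size_t kSketchNullTerminated = static_cast<std::size_t>(-1);

// Hypothetical helper: byte length implied by a UTF-8 lead byte, 0 if invalid.
inline std::size_t leadLength(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

// Hypothetical helper: counts the codepoints in `s`, reading at most
// `lengthInBytes` bytes unless the sentinel is passed.
inline std::size_t countCodepoints(const char* s,
                                   std::size_t lengthInBytes = kSketchNullTerminated)
{
    if (s == nullptr)
        return 0; // mirrors "the iterator will not be valid" for nullptr

    const bool limited = lengthInBytes != kSketchNullTerminated;
    std::size_t count = 0;
    std::size_t offset = 0;
    while ((!limited || offset < lengthInBytes) && s[offset] != '\0')
    {
        const std::size_t len = leadLength(static_cast<unsigned char>(s[offset]));
        if (len == 0 || (limited && offset + len > lengthInBytes))
            break; // invalid lead byte, or the sequence crosses the limit

        // Verify the continuation bytes before accepting the codepoint.
        std::size_t n = 1;
        while (n < len && (static_cast<unsigned char>(s[offset + n]) & 0xC0) == 0x80)
            ++n;
        if (n != len)
            break;

        ++count;
        offset += len;
    }
    return count;
}
```

Passing a byte limit is what makes it safe to walk an unterminated buffer or only a prefix of a longer string.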

inline Utf8Iterator(const Utf8Iterator &it)

Copy constructor: copies another iterator into this one.

Parameters

it[in] The iterator to be copied. Note that if it is invalid, this iterator will also become invalid.

Returns

No return value.

inline operator bool() const

Checks if this iterator is still valid.

Returns

true if this iterator still has at least one more codepoint to walk.

Returns

false if there is no more string data to walk and decode.

inline bool operator!() const

Checks if this iterator is invalid.

Returns

true if there is no more string data to walk and decode.

Returns

false if this iterator still has at least one more codepoint to walk.

inline CodePoint operator*() const

Retrieves the codepoint at this iterator’s current location.

Returns

The codepoint at the current location in the string. The decoded codepoint is cached, so calling this multiple times does not repeat the decoding work.

Returns

0 if there are no more codepoints to walk in the string.

inline const CodeByte *operator&() const

Retrieves the address of the start of the current codepoint.

Returns

The address of the start of the current codepoint for this iterator. This can be used as a way of copying, editing, or reworking the string during iteration. It is the caller’s responsibility to ensure the string is still properly encoded after any change.

Returns

nullptr if there is no more string data to walk.

inline Utf8Iterator &operator++()

Pre increment operator: walk to the next codepoint in the string.

Returns

A reference to this iterator. Note that if the end of the string is reached, the new state of this iterator will first point to the null terminator in the string (for null terminated strings), then after another increment will return the address nullptr from the ‘&’ operator. For length limited strings, reaching the end will immediately return nullptr from the ‘&’ operator.

inline Utf8Iterator operator++(int32_t)

Post increment operator: walk to the next codepoint in the string.

Returns

A new iterator object representing the state of this object before the increment operation.

template<typename T>
inline Utf8Iterator &operator+=(T count)

Increment operator: skip over zero or more codepoints in the string.

Parameters

count[in] The number of codepoints to skip over. This may be zero or larger. Negative values will be ignored and the iterator will not advance.

Returns

A reference to this iterator.

template<typename T>
inline Utf8Iterator operator+(T count) const

Addition operator: create a new iterator that skips zero or more codepoints.

Parameters

count[in] The number of codepoints to skip over. This may be zero or larger. Negative values will be ignored and the iterator will not advance.

Returns

A new iterator that has skipped over the next count codepoints in the string starting from the location of this iterator.
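
The skip semantics described for `+=` and `+` (zero is allowed, negative counts are ignored) can be sketched on a raw pointer as follows. This is an assumption-laden illustration using only the standard library: `leadLength` and `skipCodepoints` are hypothetical names, and stopping at the null terminator when `count` exceeds the remaining codepoints is a choice made for this sketch, not documented library behavior.

```cpp
#include <cstddef>

// Hypothetical helper: byte length implied by a UTF-8 lead byte, 0 if invalid.
inline std::size_t leadLength(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

// Hypothetical helper: returns a pointer `count` codepoints past `s`.
// The template parameter accepts signed and unsigned counts alike; a zero or
// negative count never enters the loop, so the position is left unchanged.
template <typename T>
inline const char* skipCodepoints(const char* s, T count)
{
    if (s == nullptr)
        return nullptr;

    for (T i = 0; i < count && *s != '\0'; ++i)
    {
        const std::size_t len = leadLength(static_cast<unsigned char>(*s));
        if (len == 0)
            break; // invalid lead byte: stop instead of guessing

        // Make sure the whole sequence is present before stepping over it.
        std::size_t n = 1;
        while (n < len && (static_cast<unsigned char>(s[n]) & 0xC0) == 0x80)
            ++n;
        if (n != len)
            break;

        s += len;
    }
    return s;
}
```

Each skipped codepoint still has to be inspected, so the cost of a skip grows with `count` rather than being constant.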

inline bool operator==(const Utf8Iterator &it) const

Equality comparison operator.

Remark

This object is treated as the left side of the comparison. Only the offset into the string contributes to this result. It is the caller’s responsibility to ensure both iterators refer to the same string; otherwise the results are undefined.

Parameters

it[in] The iterator to compare this one to.

Returns

true if the string position represented by it satisfies the requested comparison versus this object.

Returns

false if the string position represented by it does not satisfy the requested comparison versus this object.

inline bool operator!=(const Utf8Iterator &it) const

Inequality comparison operator.

Remark

This object is treated as the left side of the comparison. Only the offset into the string contributes to this result. It is the caller’s responsibility to ensure both iterators refer to the same string; otherwise the results are undefined.

Parameters

it[in] The iterator to compare this one to.

Returns

true if the string position represented by it satisfies the requested comparison versus this object.

Returns

false if the string position represented by it does not satisfy the requested comparison versus this object.

inline bool operator<(const Utf8Iterator &it) const

Less-than comparison operator.

Remark

This object is treated as the left side of the comparison. Only the offset into the string contributes to this result. It is the caller’s responsibility to ensure both iterators refer to the same string; otherwise the results are undefined.

Parameters

it[in] The iterator to compare this one to.

Returns

true if the string position represented by it satisfies the requested comparison versus this object.

Returns

false if the string position represented by it does not satisfy the requested comparison versus this object.

inline bool operator<=(const Utf8Iterator &it) const

Less-than-or-equal comparison operator.

Remark

This object is treated as the left side of the comparison. Only the offset into the string contributes to this result. It is the caller’s responsibility to ensure both iterators refer to the same string; otherwise the results are undefined.

Parameters

it[in] The iterator to compare this one to.

Returns

true if the string position represented by it satisfies the requested comparison versus this object.

Returns

false if the string position represented by it does not satisfy the requested comparison versus this object.

inline bool operator>(const Utf8Iterator &it) const

Greater-than comparison operator.

Remark

This object is treated as the left side of the comparison. Only the offset into the string contributes to this result. It is the caller’s responsibility to ensure both iterators refer to the same string; otherwise the results are undefined.

Parameters

it[in] The iterator to compare this one to.

Returns

true if the string position represented by it satisfies the requested comparison versus this object.

Returns

false if the string position represented by it does not satisfy the requested comparison versus this object.

inline bool operator>=(const Utf8Iterator &it) const

Greater-than-or-equal comparison operator.

Remark

This object is treated as the left side of the comparison. Only the offset into the string contributes to this result. It is the caller’s responsibility to ensure both iterators refer to the same string; otherwise the results are undefined.

Parameters

it[in] The iterator to compare this one to.

Returns

true if the string position represented by it satisfies the requested comparison versus this object.

Returns

false if the string position represented by it does not satisfy the requested comparison versus this object.

inline Utf8Iterator &operator=(const Utf8Iterator &it)

Copy assignment operator: copies another iterator into this one.

Parameters

it[in] The iterator to copy.

Returns

A reference to this object.

inline Utf8Iterator &operator=(const CodeByte *str)

String assignment operator: resets this iterator to the start of a new string.

Parameters

str[in] The new string to start walking. This must be a null terminated string. If this is nullptr, the iterator will become invalid. Any previous flags and length limits on this iterator will be cleared out.

Returns

A reference to this object.

inline size_t getIndex() const

Retrieves the current codepoint index of the iterator.

Returns

The number of codepoints that have been walked so far by this iterator in the current string. This will always start at 0 and will only increase when a new codepoint is successfully decoded.

inline size_t getCodepointSize() const

Retrieves the size of the current codepoint in bytes.

Returns

The size of the current codepoint (i.e., the one returned with the ‘*’ operator) in bytes. This can be used along with the results of the ‘&’ operator to copy this encoded codepoint into another buffer or modify the string in place.
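
Pairing a codepoint's start address with its size in bytes is enough to copy it verbatim without re-encoding. The sketch below shows that pattern on a raw string using only the standard library; it does not call the Carbonite API, and `leadLength` and `copyFirstCodepoints` are hypothetical names. With the real iterator, the address and size would presumably come from the ‘&’ operator and `getCodepointSize()` respectively.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: byte length implied by a UTF-8 lead byte, 0 if invalid.
inline std::size_t leadLength(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

// Hypothetical helper: copies at most `maxCodepoints` whole codepoints from a
// null-terminated UTF-8 string, so the result is never cut mid-sequence.
inline std::string copyFirstCodepoints(const char* s, std::size_t maxCodepoints)
{
    std::string out;
    if (s == nullptr)
        return out;

    for (std::size_t index = 0; index < maxCodepoints && *s != '\0'; ++index)
    {
        const std::size_t size = leadLength(static_cast<unsigned char>(*s));
        if (size == 0)
            break; // invalid lead byte

        // Confirm all continuation bytes exist before copying.
        std::size_t n = 1;
        while (n < size && (static_cast<unsigned char>(s[n]) & 0xC0) == 0x80)
            ++n;
        if (n != size)
            break;

        out.append(s, size); // (start address, size in bytes) copies the encoded codepoint
        s += size;
    }
    return out;
}
```

Truncating by codepoint count rather than byte count keeps the output well-formed even when the input contains multi-byte sequences.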

Public Static Attributes

static constexpr size_t kNullTerminated = Utf8Parser::kNullTerminated

The string buffer is effectively null terminated.

This allows the various decoding functions to bypass some range checking with the assumption that there is a null terminating character at some point in the buffer.