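In programming, counting the length of a string looks simple but is actually quite tricky, because there are several candidate units to count in. Taking the use of Unicode for granted, the four that come up most often are probably bytes, code points (符号位置, or コードポイント), code units (符号単位, or コード単位), and grapheme clusters. The quickest way to get a definition of each is the Glossary of Unicode Terms.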

Byte. (1) The minimal unit of addressable storage for a particular computer architecture. (2) An octet. Note that many early computer architectures used bytes larger than 8 bits in size, but the industry has now standardized almost uniformly on 8-bit bytes. The Unicode Standard follows the current industry practice in equating the term byte with octet and using the more familiar term byte in all contexts. (See octet.)

Code Point. (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4, Characters and Encoding.) Not all code points are assigned to encoded characters. See code point type. (2) A value, or position, for a character, in any coded character set.

Code Unit. The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form. (See definition D77 in Section 3.9, Unicode Encoding Forms.)

Grapheme Cluster. The text between grapheme cluster boundaries as specified by Unicode Standard Annex #29, “Unicode Text Segmentation.” (See definition D60 in Section 3.6, Combination.) A grapheme cluster represents a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it.
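
To make the first three units concrete, here is a minimal Python sketch that counts the same string in several ways. The function name `count_units` is made up for illustration. It relies on the fact that Python 3's `len()` on a `str` counts code points, and it derives the code unit counts by encoding the text and dividing by the code unit size. Grapheme clusters are left out here because the Python standard library has no segmentation API for them.

```python
def count_units(text: str) -> dict:
    """Count one string in bytes, code units, and code points.

    `count_units` is a hypothetical helper for illustration only.
    """
    return {
        # UTF-8 uses 8-bit code units, so bytes and code units coincide.
        "utf8_bytes": len(text.encode("utf-8")),
        # UTF-16 uses 16-bit code units: 2 bytes each ("-le" avoids a BOM).
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # UTF-32 uses 32-bit code units: 4 bytes each.
        "utf32_code_units": len(text.encode("utf-32-le")) // 4,
        # In Python 3, len() of a str is the number of code points.
        "code_points": len(text),
    }


if __name__ == "__main__":
    # U+3042 HIRAGANA LETTER A and U+1F600 GRINNING FACE
    for sample in ["a", "\u3042", "\U0001F600"]:
        print(repr(sample), count_units(sample))
```

For U+1F600 this gives 4 UTF-8 bytes, 2 UTF-16 code units, 1 UTF-32 code unit, and 1 code point, which is exactly the kind of disagreement that makes "length" ambiguous.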

Of these, the grapheme cluster is probably the trickiest. UAX #29 covers it in detail.

It is important to recognize that what the user thinks of as a “character”—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + grave-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.
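
The "G" + grave-accent example can be reproduced in Python. The sketch below is a deliberately crude approximation and not an implementation of UAX #29: it treats each nonspacing or enclosing mark (general category `Mn` or `Me`) as belonging to the preceding character, and ignores everything else the real rules handle, such as Hangul syllable sequences, emoji ZWJ sequences, regional indicator pairs, and CR LF. The name `approx_grapheme_count` is hypothetical.

```python
import unicodedata


def approx_grapheme_count(text: str) -> int:
    """Roughly count user-perceived characters.

    Assumption: a "cluster" is one base code point followed by any number
    of nonspacing (Mn) or enclosing (Me) marks. This is NOT the full
    UAX #29 algorithm; it is only meant to illustrate the idea.
    """
    count = 0
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Me")
        # A mark extends the previous cluster, unless the text starts with it.
        if is_mark and count > 0:
            continue
        count += 1
    return count


if __name__ == "__main__":
    g_grave = "G\u0300"  # "G" + U+0300 COMBINING GRAVE ACCENT
    print(len(g_grave))                    # number of code points
    print(approx_grapheme_count(g_grave))  # approximate user-perceived characters
```

Here `len()` reports two code points while the approximation reports one user-perceived character. For real text, a library that implements the UAX #29 boundary rules is the safer choice.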

Grapheme Cluster Boundaries

For the Japanese names of these terms, see Unicode Terminology English - Japanese. Note, however, that the MDN Web Docs glossary appears to render code point as コードポイント and code unit as コード単位, and those renderings may be more widely understood.