Xchars and Unicode (Gforth Manual)

6.9.2 Xchars and Unicode

An xchar is represented as a single cell on the stack and as a sequence of one or more chars (as string) in memory.¹⁴

Which xchars are actually supported depends on the encoding, which is determined automatically from the environment variables LC_CTYPE, LC_ALL, or LANG (see Environment variables). If any of them contains “UTF-8” on Gforth startup, Gforth uses UTF-8 encoding (for Unicode), otherwise it uses fixed-width 8-bit encoding. The encoding cannot be changed after Gforth startup, and if any non-ASCII characters are stored in an image, the image must be invoked in a way that produces the same encoding and charset setting.

Any text I/O is expected to happen in the encoding and character set specified in the environment variables, so if any encoding or character set conversions are needed, perform them outside Gforth, with tools such as recode or iconv.

The fixed-width 8-bit encoding is used for 8-bit encodings of legacy character sets, such as ISO Latin-1, ISO Latin-2, or KOI8-R.

Whatever the environment, an xchar corresponds to a code point of the character set. In some character sets, each code point represents a character, but in Unicode a user-perceived character may consist of a sequence of code points. Gforth currently provides no facilities for dealing with user-perceived characters, and dealing with code points rarely provides any benefit, so the usual way to deal with text is as strings of chars.

The only thing that’s in common between Unicode and all the charsets for which Gforth supports a fixed-width encoding is ASCII: These character sets all have the code points 0–127 which have the same meaning, have the same on-stack representation (a number in the range 0–127), and have the same in-memory representation (a single byte (aka char) that has a value in the range 0–127).

When using UTF-8 encoding, all other code points take more than one byte/char. In most cases, you can just treat such characters as strings in memory and do not need to use the following words, but if you want to deal with individual code points, the following words are useful.

When using the fixed-width encoding, all other code points take only one byte, but you can still use the xchar words to access the code points, and your code will also continue to work when you switch to UTF-8 encoding.

The xchar words add a few data types:

xc is an extended char (xchar) on the stack. It occupies one cell, and is a subset of unsigned cell.
xc-addr is the address of an xchar in memory. Alignment requirements are the same as c-addr. The memory representation of an xchar differs from the stack representation, and depends on the encoding used. An xchar may use a variable number of chars in memory.
xc-addr u is a string (or buffer) of xchars in memory, starting at xc-addr, u chars (i.e., bytes, not xchars) long.

xc-size ( xc – u  ) xchar “x-c-size”

The xchar xc occupies u chars in memory.
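As a small illustration of how the in-memory size depends on the code point, the following lines can be typed at the Gforth prompt. The results in the comments assume that Gforth was started with UTF-8 encoding; the code points are arbitrary examples chosen here.

```forth
'a'    xc-size .   \ ASCII: 1 char
$E4    xc-size .   \ U+00E4: 2 chars under UTF-8
$20AC  xc-size .   \ U+20AC: 3 chars under UTF-8
$1F600 xc-size .   \ U+1F600: 4 chars under UTF-8
```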

x-size ( xc-addr u1 – u2  ) xchar

The first xchar at xc-addr occupies u2 chars; if xc-addr u1 does not contain a complete xchar, u2 is u1.
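One possible use of x-size is stepping through a string one xchar at a time without running past its end. The following is a minimal sketch; the name xc-count is hypothetical and not a Gforth word.

```forth
\ Hypothetical helper: count the xchars in a string.
: xc-count ( xc-addr u -- n )
  0 >r                        \ keep the running count on the return stack
  begin dup while             \ as long as chars remain
    2dup x-size /string       \ skip the first xchar (or the incomplete rest)
    r> 1+ >r
  repeat
  2drop r> ;
```

Because x-size returns u1 for an incomplete xchar, a truncated xchar at the end of the string is counted as one unit and the loop still terminates.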

xc@ ( xc-addr – xc  ) xchar-ext “x-c-fetch”

xc is the xchar starting at xc-addr.

xc@+ ( xc-addr1 – xc-addr2 xc  ) xchar “x-c-fetch-plus”

xc is the xchar starting at xc-addr1. xc-addr2 points to the first memory location after xc.
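A typical use of xc@+ is walking over the code points of a string. The sketch below assumes that the string consists of complete, valid xchars, because xc@+ itself does not know where the string ends; the name .codepoints is hypothetical.

```forth
\ Hypothetical helper: print the code point of every xchar in a string.
: .codepoints ( xc-addr u -- )
  over + >r                   \ R: end address of the string
  begin dup r@ u< while       \ stop when the end address is reached
    xc@+ u.                   \ fetch one xchar, advance, print its code point
  repeat
  drop r> drop ;
```

For example, `s" abc" .codepoints` prints 97 98 99 when the number base is decimal.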

xc@+? ( xc-addr1 u1 – xc-addr2 u2 xc  ) gforth-experimental “x-c-fetch-plus-query”

xc is the xchar starting at xc-addr1. xc-addr2 u2 is the remaining string after xc. If the start of xc-addr1 u1 contains no valid xchar, xc is invalid-char, and xc-addr2 u2 is the remaining string after skipping at least one byte. If u1=0, the current behaviour does not make much sense and may change in the future: xc-addr2=xc-addr1+1, u2=MAX-U, and xc is either 0 or invalid-char.

xc!+? ( xc xc-addr1 u1 – xc-addr2 u2 f  ) xchar “x-c-store-plus-query”

Stores the xchar xc into the buffer starting at address xc-addr1, u1 chars large. xc-addr2 points to the first memory location after xc, u2 is the remaining size of the buffer. If the xchar xc did fit into the buffer, f is true, otherwise f is false, and xc-addr2 u2 equal xc-addr1 u1. XC!+? is safe against buffer overflows, and therefore preferred over XC!+.
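The following sketch shows the flag being checked after a store. The buffer name buf, its size of 8 chars, and the word store-euro are assumptions made for this example only.

```forth
create buf 8 allot            \ hypothetical 8-char buffer

\ Store U+20AC at the start of buf; u is the number of chars used.
: store-euro ( -- u )
  $20AC buf 8 xc!+?           ( xc-addr2 u2 f )
  0= abort" buffer too small"
  nip 8 swap - ;              \ chars used = buffer size - remaining size
```

Under UTF-8 encoding, `store-euro .` prints 3.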

xc!+ ( xc xc-addr1 – xc-addr2  ) xchar “x-c-store-plus”

Stores the xchar xc at xc-addr1. xc-addr2 is the next unused address in the buffer. Note that this writes up to 4 bytes, so you need at least 3 bytes of padding after the end of the buffer to avoid overwriting useful data if you only check the address against the end of the buffer.

xchar+ ( xc-addr1 – xc-addr2  ) xchar “x-char-plus”

xc-addr2 is the address of the xchar following the one pointed to by xc-addr1.

xchar- ( xc-addr1 – xc-addr2  ) xchar-ext “x-char-minus”

xc-addr2 is the address of the xchar preceding the one pointed to by xc-addr1.
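Since xchar- steps backwards, it can be used to reach the last xchar of a string from the string's end address. A minimal sketch, assuming a non-empty string of complete, valid xchars; the name last-xchar is hypothetical.

```forth
\ Hypothetical helper: fetch the last xchar of a string.
: last-xchar ( xc-addr u -- xc )
  + xchar- xc@ ;              \ end address, back one xchar, fetch it
```

For example, `s" abc" last-xchar emit` displays c.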

+x/string ( xc-addr1 u1 – xc-addr2 u2  ) xchar-ext “plus-x-slash-string”

xc-addr1 u1 is a string of u1 chars. xc-addr2 is the address of the xchar following the one pointed to by xc-addr1. u2 is the size (in chars) of the rest of the string.

x\string- ( xc-addr u1 – xc-addr u2  ) xchar-ext “x-backslash-string-minus”

xc-addr u1 is a string of u1 chars. u2 is the size (in chars) of the string without its last xchar.

-trailing-garbage ( xc-addr u1 – xc-addr u2  ) xchar-ext “minus-trailing-garbage”

xc-addr u1 is a string of u1 chars. u2 is the size of the string after removing the chars from the end that do not constitute a complete, valid xchar.
The idea here is that if you read a fixed number of chars, e.g., with read-file, there may be an incomplete xchar at the end; you eliminate that with -trailing-garbage, leaving a valid xchar string for processing (if the string starts with a complete xchar and only contains valid xchars). You prepend the eliminated chars to the next block of chars you read, so that you do not miss any parts.
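The block-splitting step described above might be sketched as follows. The word split-block is hypothetical; it only separates the block into the part that is safe to process and the leftover chars, and leaves reading and copying to the caller.

```forth
\ Hypothetical helper: split a block into a processable string
\ c-addr u1 and the leftover chars c-addr2 u2 at its end.
: split-block ( c-addr u -- c-addr u1 c-addr2 u2 )
  2dup -trailing-garbage      ( c-addr u c-addr u1 )
  2swap 2 pick /string ;      \ leftover = original string minus u1 chars
```

The caller would process c-addr u1, move the u2 leftover chars to the start of the buffer, and read the next block behind them.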

x-width ( xc-addr u – n  ) xchar-ext

n is the number of monospace ASCII chars that occupy the same width as the string xc-addr u when shown on a monospaced display.

xc-width ( xc – n  ) xchar-ext “x-c-width”

xc has a width of n times the width of a normal fixed-width glyph.
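Display width rather than byte count is what matters when aligning text in columns. The following sketch pads a string to a given number of columns; the name pad-to is hypothetical.

```forth
\ Hypothetical helper: type a string, then pad with spaces
\ until n display columns have been used.
: pad-to ( xc-addr u n -- )
  >r 2dup type                \ display the string
  x-width r> swap -           \ columns still missing
  0 max spaces ;              \ pad; nothing if the string is too wide
```

For example, `s" abc" 5 pad-to` displays abc followed by two spaces.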

xhold ( xc –  ) xchar-ext “x-hold”

Used between <<# and #>. Prepend xc to the pictured numeric output string. We recommend that you use holds instead.
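As an illustration, the sketch below puts a currency sign in front of a number. The word .euro and the choice of U+20AC are assumptions for this example; the output is only meaningful if the active encoding can represent that code point (e.g., UTF-8).

```forth
\ Hypothetical helper: display u with a leading euro sign.
: .euro ( u -- )
  0 <<#                       \ make a double number, start pictured output
  #s                          \ convert all digits
  $20AC xhold                 \ prepend the xchar
  #> type #>> ;               \ display the result, release the hold area
```

Pictured numeric output is built from right to left, so the xchar added after #s ends up in front of the digits.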

xc, ( xc –  ) xchar “x-c-comma”

Reserve data space for xc, and store xc in that space.
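For example, an xchar can be laid down in data space behind a created name. The name euro is hypothetical, and the size in the comment assumes UTF-8 encoding.

```forth
create euro  $20AC xc,        \ store U+20AC in encoded form
here euro - .                 \ chars reserved: 3 under UTF-8
euro xc@ u.                   \ fetches the code point again
```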

invalid-char ( – xc  ) gforth-experimental

Unicode code point returned for cases where the string does not contain a valid Unicode encoding. Current value: the Unicode replacement character U+FFFD.
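One way to use invalid-char is to scan a string with xc@+? and stop at the first decoding failure. This is a sketch only: the name valid-xchars? is hypothetical, both words it relies on are marked gforth-experimental, and a string that legitimately contains U+FFFD is also reported as invalid.

```forth
\ Hypothetical helper: f is true if no xchar in the string
\ decodes to invalid-char.
: valid-xchars? ( xc-addr u -- f )
  begin dup while             \ never call xc@+? with u=0
    xc@+? invalid-char = if 2drop false exit then
  repeat
  2drop true ;
```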

toupper ( xc1 – xc2 ) gforth-0.2 “toupper”

If xc1 is a lower-case ASCII character, xc2 is the equivalent upper-case character, otherwise xc2 is xc1.
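A short interactive illustration; the inputs are arbitrary examples.

```forth
'a' toupper emit              \ displays A
'1' toupper emit              \ displays 1 (not a lower-case letter)
```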

See also xemit (see Displaying characters and strings) and xkey (see Single-key input).

Footnotes

(14)

Of course, you can also store the xchar cell in memory, but Gforth has no words for dealing with sequences of such cells.