[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

63. MULE

MULE is the name originally given to the version of GNU Emacs extended for multi-lingual (and in particular Asian-language) support. "MULE" is short for "MUlti-Lingual Emacs". It is an extension and complete rewrite of Nemacs ("Nihon Emacs" where "Nihon" is the Japanese word for "Japan"), which only provided support for Japanese. XEmacs refers to its multi-lingual support as MULE support since it is based on MULE.

63.1 Internationalization Terminology  Definition of various internationalization terms.
63.2 Charsets  Sets of related characters.
63.3 MULE Characters  Working with characters in XEmacs/MULE.
63.4 Composite Characters  Making new characters by overstriking other ones.
63.5 Coding Systems  Ways of representing a string of chars using integers.
63.7 CCL  A special language for writing fast converters.
63.8 Category Tables  Subdividing charsets into groups.
63.9 Unicode Support  The universal coded character set.
63.10 Character Set Unification  Handling overlapping character sets.
63.12.5 Charsets and Coding Systems  Tables and reference information.



63.1 Internationalization Terminology

In internationalization terminology, a string of text is divided up into characters, which are the printable units that make up the text. A single character is (for example) a capital `A', the number `2', a Katakana character, a Hangul character, a Kanji ideograph (an ideograph is a "picture" character, such as is used in Japanese Kanji, Chinese Hanzi, and Korean Hanja; typically there are thousands of such ideographs in each language), etc. The basic property of a character is that it is the smallest unit of text with semantic significance in text processing--i.e., characters are abstract units defined by their meaning, not by their exact appearance.

Human beings normally process text visually, so to a first approximation a character may be identified with its shape. Note that the same character may be drawn by two different people (or in two different fonts) in slightly different ways, although the "basic shape" will be the same. But consider the works of Scott Kim; human beings can recognize hugely variant shapes as the "same" character. Sometimes, especially where characters are extremely complicated to write, completely different shapes may be defined as the "same" character in national standards. The Taiwanese variant of Hanzi is generally the most complicated; over the centuries, the Japanese, Koreans, and the People's Republic of China have adopted simplifications of the shape, but the line of descent from the original shape is recorded, and the meanings and pronunciation of different forms of the same character are considered to be identical within each language. (Of course, it may take a specialist to recognize the related form; the point is that the relations are standardized, despite the differing shapes.)

In some cases, the differences will be significant enough that it is actually possible to identify two or more distinct shapes that both represent the same character. For example, the lowercase letters `a' and `g' each have two distinct possible shapes--the `a' can optionally have a curved tail projecting off the top, and the `g' can be formed either of two loops, or of one loop and a tail hanging off the bottom. Such distinct possible shapes of a character are called glyphs. The important characteristic of two glyphs making up the same character is that the choice between one or the other is purely stylistic and has no linguistic effect on a word (this is the reason why a capital `A' and lowercase `a' are different characters rather than different glyphs--e.g. `Aspen' is a city while `aspen' is a kind of tree).

Note that character and glyph are used differently here than elsewhere in XEmacs.

A character set is essentially a set of related characters. ASCII, for example, is a set of 94 characters (or 128, if you count non-printing characters). Other character sets are ISO8859-1 (ASCII plus various accented characters and other international symbols), JIS X 0201 (ASCII, more or less, plus half-width Katakana), JIS X 0208 (Japanese Kanji), JIS X 0212 (a second set of less-used Japanese Kanji), GB2312 (Mainland Chinese Hanzi), etc.

The definition of a character set will implicitly or explicitly give it an ordering, a way of assigning a number to each character in the set. For many character sets, there is a natural ordering, for example the "ABC" ordering of the Roman letters. But it is not clear whether digits should come before or after the letters, and in fact different European languages treat the ordering of accented characters differently. It is useful to use the natural order where available, of course. The number assigned to any particular character is called the character's code point. (Within a given character set, each character has a unique code point. Thus the word "set" is ill-chosen; different orderings of the same characters are different character sets. Identifying characters is simple enough for alphabetic character sets, but the difference in ordering can cause great headaches when the same thousands of characters are used by different cultures as in the Hanzi.)

It's important to understand that a character is defined not by any number attached to it, but by its meaning. For example, ASCII and EBCDIC are two charsets containing exactly the same characters (lowercase and uppercase letters, numbers 0 through 9, particular punctuation marks) but with different numberings. The `comma' character in ASCII and EBCDIC, for instance, is the same character despite having a different numbering. Conversely, when comparing ASCII and JIS-Roman, which look the same except that the latter has a yen sign substituted for the backslash, we would say that the backslash and yen sign are not the same characters, despite having the same number (92) and despite the fact that all other characters are present in both charsets, with the same numbering. ASCII and JIS-Roman, then, do not have exactly the same characters in them (ASCII has a backslash character but no yen-sign character, and vice-versa for JIS-Roman), unlike ASCII and EBCDIC, even though the numberings in ASCII and JIS-Roman are closer.
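
The ASCII/EBCDIC case can be demonstrated concretely. The following sketch uses Python's standard codec library (purely as an illustration -- it has nothing to do with XEmacs), where `cp500' is one of the built-in EBCDIC variants:

```python
# The same character under two numberings: comma is code point 44 in
# ASCII but 0x6B in EBCDIC (here the built-in "cp500" EBCDIC codec).
ascii_byte = ",".encode("ascii")     # bytes([44])
ebcdic_byte = ",".encode("cp500")    # bytes([0x6B])

assert ascii_byte == bytes([44])
assert ebcdic_byte == bytes([0x6B])

# Decoding each byte under its own charset yields the identical character:
# the comma is one character, however it happens to be numbered.
assert ascii_byte.decode("ascii") == ebcdic_byte.decode("cp500") == ","
```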

Sometimes, a code point is not a single number, but instead a group of numbers, called position codes. In such cases, the number of position codes required to index a particular character in a character set is called the dimension of the character set. Character sets indexed by more than one position code typically use byte-sized position codes. Small character sets, e.g. ASCII, invariably use a single position code, but for larger character sets, the choice of whether to use multiple position codes or a single large (16-bit or 32-bit) number is arbitrary. Unicode typically uses a single large number, but language-specific or "national" character sets often use multiple (usually two) position codes. For example, JIS X 0208, i.e. Japanese Kanji, has thousands of characters, and is of dimension two -- every character is indexed by two position codes, each in the range 1 through 94. (This number "94" is not a coincidence; it is the same as the number of printable characters in ASCII, and was chosen so that JIS characters could be directly encoded using two printable ASCII characters.) Note that the choice of the range here is somewhat arbitrary -- it could just as easily be 0 through 93, 2 through 95, etc. In fact, the range for JIS position codes (and for other character sets modeled after it) is often given as range 33 through 126, so as to directly match ASCII printing characters.

An encoding is a way of numerically representing characters from one or more character sets into a stream of like-sized numerical values called words -- typically 8-bit bytes, but sometimes 16-bit or 32-bit quantities. In a context where dealing with Japanese motivates much of XEmacs' design in this area, it's important to clearly distinguish between charsets and encodings. For a simple charset like ASCII, there is only one encoding normally used -- each character is represented by a single byte, with the same value as its code point. For more complicated charsets, however, or when a single encoding needs to represent more than one charset, things are not so obvious. Unicode version 2, for example, is a large charset with thousands of characters, each indexed by a 16-bit number, often represented in hex, e.g. 0x05D0 for the Hebrew letter "aleph". One obvious encoding (actually two encodings, depending on which of the two possible byte orderings is chosen) simply uses two bytes per character. This encoding is convenient for internal processing of Unicode text; however, it is incompatible with ASCII, and thus external text (files, e-mail, etc.) that is encoded this way is completely uninterpretable by programs lacking Unicode support. For this reason, a different, ASCII-compatible encoding, e.g. UTF-8, is usually used for external text. UTF-8 represents Unicode characters with one to three bytes (often extended to six bytes to handle characters with up to 31-bit indices). Unicode characters 00 to 7F (identical with ASCII) are directly represented with one byte, and other characters with two or more bytes, each in the range 80 to FF. Applications that don't understand Unicode will still be able to process ASCII characters represented in UTF-8-encoded text, and will typically ignore (and hopefully preserve) the high-bit characters.
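
These properties can be checked directly. The sketch below uses Python's standard codecs (an illustration only, external to XEmacs), with the same aleph character used above:

```python
# The Hebrew aleph, U+05D0, from the example in the text.
aleph = "\u05D0"

# Fixed two-byte encoding (big-endian byte order): exactly the 16-bit index.
assert aleph.encode("utf-16-be") == b"\x05\xd0"

# UTF-8: the same character takes two bytes, both in the range 80-FF.
utf8 = aleph.encode("utf-8")
assert utf8 == b"\xd7\x90"
assert all(b >= 0x80 for b in utf8)

# ASCII text is unchanged under UTF-8, so a program with no Unicode
# support still sees plain ASCII in UTF-8-encoded text.
assert "Hello".encode("utf-8") == b"Hello"
```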

Similarly, Shift-JIS and EUC-JP are different encodings normally used to encode the same character set(s), these character sets being subsets of Unicode. However, the obvious approach of unifying XEmacs' internal encoding across character sets, as was part of the motivation behind Unicode, was not taken. This means that characters that occur in more than one character set -- for example, the Greek alphabet appears in the large Japanese character sets as well as in at least one European character set -- are unfortunately represented internally as distinct, disjoint characters.

Naive use of code points is also not possible if more than one character set is to be used in the encoding. For example, printed Japanese text typically requires characters from multiple character sets -- ASCII, JIS X 0208, and JIS X 0212, to be specific. Each of these is indexed using one or more position codes in the range 1 through 94 (or 33 through 126), so the position codes could not be used directly or there would be no way to tell which character was meant. Different Japanese encodings handle this differently -- JIS uses special escape characters to denote different character sets; EUC sets the high bit of the position codes for JIS X 0208 and JIS X 0212, and puts a special extra byte before each JIS X 0212 character; etc.
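
The contrast between the escape-sequence approach (JIS) and the high-bit approach (EUC) can be seen by encoding the same two JIS X 0208 characters both ways. The sketch below uses Python's standard codecs as a stand-in (it is not part of XEmacs):

```python
text = "\u6f22\u5b57"  # the two kanji of the word "kanji", both in JIS X 0208

jis = text.encode("iso2022_jp")  # escape sequences switch charsets
euc = text.encode("euc_jp")      # high bit marks JIS X 0208 bytes

# JIS brackets the kanji with "ESC $ B" ... "ESC ( B", and the position
# codes themselves stay in the printable ASCII range 33-126.
assert jis == b"\x1b$B4A;z\x1b(B"

# EUC instead sets the high bit on every position-code byte.
assert euc == b"\xb4\xc1\xbb\xfa"
assert all(b >= 0x80 for b in euc)

# The two differ by exactly that high bit (0x80) on the four code bytes.
assert bytes(b - 0x80 for b in euc) == jis[3:7]
```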

The encodings described above are all 7-bit or 8-bit encodings. The fixed-width Unicode encoding previously described, however, is sometimes considered to be a 16-bit encoding, in which case the issue of byte ordering does not come up. (Imagine, for example, that the text is represented as an array of shorts.) Similarly, Unicode version 3 (which has characters with indices above 0xFFFF), and other very large character sets, may be represented internally as 32-bit encodings, i.e. arrays of ints. However, it does not make too much sense to talk about 16-bit or 32-bit encodings for external data, since nowadays 8-bit data is a universal standard -- the closest you can get is fixed-width encodings using two or four bytes to encode 16-bit or 32-bit values. (A "7-bit" encoding is used when it cannot be guaranteed that the high bit of 8-bit data will be correctly preserved. Some e-mail gateways, for example, strip the high bit of text passing through them. These same gateways often handle non-printable characters incorrectly, and so 7-bit encodings usually avoid using bytes with such values.)

A general method of handling text using multiple character sets (whether for multilingual text, or simply text in an extremely complicated single language like Japanese) is defined in the international standard ISO 2022. ISO 2022 will be discussed in more detail later (see section 63.6 ISO 2022), but for now suffice it to say that text needs control functions (at least spacing), and if escape sequences are to be used, an escape sequence introducer. It was decided to make all text streams compatible with ASCII in the sense that the codes 0--31 (and 128-159) would always be control codes, never graphic characters, and where defined by the character set the `SPC' character would be assigned code 32, and `DEL' would be assigned 127. Thus there are 94 code points remaining if 7 bits are used. This is the reason that most character sets are defined using position codes in the range 1 through 94. Then ISO 2022 compatible encodings are produced by shifting the position codes 1 to 94 into character codes 33 to 126, or (if 8 bit codes are available) into character codes 161 to 254.
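
The two shifts can be written out as arithmetic. The helper names `to_gl' and `to_gr' below are hypothetical (chosen after the ISO 2022 "graphic left" and "graphic right" areas), and the code is a sketch rather than anything XEmacs contains:

```python
# ISO 2022 shifts: position codes 1-94 land in the 7-bit range 33-126
# ("GL"), or in 161-254 ("GR") when 8-bit codes are available.
def to_gl(pc):
    assert 1 <= pc <= 94
    return pc + 32

def to_gr(pc):
    return to_gl(pc) + 128   # same code with the high bit set

assert (to_gl(1), to_gl(94)) == (33, 126)
assert (to_gr(1), to_gr(94)) == (161, 254)

# For example, JIS X 0208 row 20, cell 33 (the kanji meaning "China;
# Han"): its 7-bit JIS bytes and its 8-bit EUC bytes.
assert bytes(map(to_gl, (20, 33))) == b"4A"
assert bytes(map(to_gr, (20, 33))) == b"\xb4\xc1"
```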

Encodings are classified as either modal or non-modal. In a modal encoding, there are multiple states that the encoding can be in, and the interpretation of the values in the stream depends on the current global state of the encoding. Special values in the encoding, called escape sequences, are used to change the global state. JIS, for example, is a modal encoding. The bytes `ESC $ B' indicate that, from then on, bytes are to be interpreted as position codes for JIS X 0208, rather than as ASCII. This effect is cancelled using the bytes `ESC ( B', which mean "switch from whatever the current state is to ASCII". To switch to JIS X 0212, the escape sequence `ESC $ ( D' is used. (Note that here, as is common, the escape sequences do in fact begin with `ESC'. This is not necessarily the case, however. Some encodings use control characters called "locking shifts" (effect persists until cancelled) to switch character sets.)

A non-modal encoding has no global state that extends past the character currently being interpreted. EUC, for example, is a non-modal encoding. Characters in JIS X 0208 are encoded by setting the high bit of the position codes, and characters in JIS X 0212 are encoded by doing the same but also prefixing the character with the byte 0x8F.

The advantage of a modal encoding is that it is generally more space-efficient, and is easily extendible because there are essentially an arbitrary number of escape sequences that can be created. The disadvantage, however, is that it is much more difficult to work with if it is not being processed in a sequential manner. In the non-modal EUC encoding, for example, the byte 0x41 always refers to the letter `A'; whereas in JIS, it could either be the letter `A', or one of the two position codes in a JIS X 0208 character, or one of the two position codes in a JIS X 0212 character. Determining exactly which one is meant could be difficult and time-consuming if the previous bytes in the string have not already been processed, or impossible if they are drawn from an external stream that cannot be rewound.

Non-modal encodings are further divided into fixed-width and variable-width formats. A fixed-width encoding always uses the same number of words per character, whereas a variable-width encoding does not. EUC is a good example of a variable-width encoding: one to three bytes are used per character, depending on the character set. 16-bit and 32-bit encodings are nearly always fixed-width, and this is in fact one of the main reasons for using an encoding with a larger word size. The advantages of fixed-width encodings should be obvious. The advantages of variable-width encodings are that they are generally more space-efficient and allow for compatibility with existing 8-bit encodings such as ASCII. (For example, in a fixed-width two-byte Unicode encoding, ASCII characters are simply promoted to a 16-bit representation. That means that the representation of every ASCII character contains a `NUL' byte; consequently, all of the standard string manipulation functions that treat `NUL' as an end-of-string marker will lose badly in a fixed-width Unicode environment.)
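
The `NUL'-byte problem is easy to demonstrate. Here UTF-16 (big-endian) stands in for the fixed-width two-byte Unicode encoding described above, again using Python's codecs purely as an illustration:

```python
# Fixed-width two bytes per character: every ASCII character picks up
# a NUL byte, so C-style strlen() would stop at the first character.
fixed = "Hi".encode("utf-16-be")
assert fixed == b"\x00H\x00i"
assert b"\x00" in fixed

# A variable-width, ASCII-compatible encoding has no such problem:
# ASCII text is byte-for-byte unchanged.
assert "Hi".encode("utf-8") == b"Hi"
```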

The bytes in an 8-bit encoding are often referred to as octets rather than simply as bytes. This terminology dates back to the days before 8-bit bytes were universal, when some computers had 9-bit bytes, others had 10-bit bytes, etc.



63.2 Charsets

A charset in MULE is an object that encapsulates a particular character set as well as an ordering of those characters. Charsets are permanent objects and are named using symbols, like faces.

Function: charsetp object
This function returns non-nil if object is a charset.

63.2.1 Charset Properties  Properties of a charset.
63.2.2 Basic Charset Functions  Functions for working with charsets.
63.2.3 Charset Property Functions  Functions for accessing charset properties.
63.2.4 Predefined Charsets  Predefined charset objects.



63.2.1 Charset Properties

Charsets have the following properties:

name
A symbol naming the charset. Every charset must have a different name; this allows a charset to be referred to using its name rather than the actual charset object.
doc-string
A documentation string describing the charset.
registry
A regular expression matching the font registry field for this character set. For example, both the ascii and latin-iso8859-1 charsets use the registry "ISO8859-1". This field is used to choose an appropriate font when the user gives a general font specification such as `-*-courier-medium-r-*-140-*', i.e. a 14-point upright medium-weight Courier font.
dimension
Number of position codes used to index a character in the character set. XEmacs/MULE can only handle character sets of dimension 1 or 2. This property defaults to 1.
chars
Number of characters in each dimension. In XEmacs/MULE, the only allowed values are 94 or 96. (There are a couple of pre-defined character sets, such as ASCII, that do not follow this, but you cannot define new ones like this.) Defaults to 94. Note that if the dimension is 2, the character set thus described is 94x94 or 96x96.
columns
Number of columns used to display a character in this charset. Only used in TTY mode. (Under X, the actual width of a character can be derived from the font used to display the characters.) If unspecified, defaults to the dimension. (This is almost always the correct value, because character sets with dimension 2 are usually ideograph character sets, which need two columns to display the intricate ideographs.)
direction
A symbol, either l2r (left-to-right) or r2l (right-to-left). Defaults to l2r. This specifies the direction that the text should be displayed in, and will be left-to-right for most charsets but right-to-left for Hebrew and Arabic. (Right-to-left display is not currently implemented.)
final
Final byte of the standard ISO 2022 escape sequence designating this charset. Must be supplied. Each combination of (dimension, chars) defines a separate namespace for final bytes, and each charset within a particular namespace must have a different final byte. Note that ISO 2022 restricts the final byte to the range 0x30 - 0x7E if dimension == 1, and 0x30 - 0x5F if dimension == 2. Note also that final bytes in the range 0x30 - 0x3F are reserved for user-defined (not official) character sets. For more information on ISO 2022, see 63.5 Coding Systems.
graphic
0 (use left half of font on output) or 1 (use right half of font on output). Defaults to 0. This specifies how to convert the position codes that index a character in a character set into an index into the font used to display the character set. With graphic set to 0, position codes 33 through 126 map to font indices 33 through 126; with it set to 1, position codes 33 through 126 map to font indices 161 through 254 (i.e. the same number but with the high bit set). For example, for a font whose registry is ISO8859-1, the left half of the font (octets 0x20 - 0x7F) is the ascii charset, while the right half (octets 0xA0 - 0xFF) is the latin-iso8859-1 charset.
ccl-program
A compiled CCL program used to convert a character in this charset into an index into the font. This is in addition to the graphic property. If a CCL program is defined, the position codes of a character will first be processed according to graphic and then passed through the CCL program, with the resulting values used to index the font.

This is used, for example, in the Big5 character set (used in Taiwan). This character set is not ISO-2022-compliant, and its size (94x157) does not fit within the maximum 96x96 size of ISO-2022-compliant character sets. As a result, XEmacs/MULE splits it (in a rather complex fashion, so as to group the most commonly used characters together) into two charset objects (big5-1 and big5-2), each of size 94x94, and each charset object uses a CCL program to convert the modified position codes back into standard Big5 indices to retrieve a character from a Big5 font.

Most of the above properties can only be set when the charset is initialized, and cannot be changed later. See section 63.2.3 Charset Property Functions.



63.2.2 Basic Charset Functions

Function: find-charset charset-or-name
This function retrieves the charset of the given name. If charset-or-name is a charset object, it is simply returned. Otherwise, charset-or-name should be a symbol; if a charset with that name exists, the associated charset object is returned, and otherwise nil is returned.

Function: get-charset name
This function retrieves the charset of the given name. Same as find-charset except an error is signalled if there is no such charset instead of returning nil.

Function: charset-list
This function returns a list of the names of all defined charsets.

Function: make-charset name doc-string props
This function defines a new character set. This function is for use with MULE support. name is a symbol, the name by which the character set is normally referred. doc-string is a string describing the character set. props is a property list, describing the specific nature of the character set. The recognized properties are registry, dimension, columns, chars, final, graphic, direction, and ccl-program, as previously described.

Function: make-reverse-direction-charset charset new-name
This function makes a charset equivalent to charset but which goes in the opposite direction. new-name is the name of the new charset. The new charset is returned.

Function: charset-from-attributes dimension chars final &optional direction
This function returns a charset with the given dimension, chars, final, and direction. If direction is omitted, both directions will be checked (left-to-right will be returned if character sets exist for both directions).

Function: charset-reverse-direction-charset charset
This function returns the charset (if any) with the same dimension, number of characters, and final byte as charset, but which is displayed in the opposite direction.



63.2.3 Charset Property Functions

All of these functions accept either a charset name or charset object.

Function: charset-property charset prop
This function returns property prop of charset. See section 63.2.1 Charset Properties.

Convenience functions are also provided for retrieving individual properties of a charset.

Function: charset-name charset
This function returns the name of charset. This will be a symbol.

Function: charset-description charset
This function returns the documentation string of charset.

Function: charset-registry charset
This function returns the registry of charset.

Function: charset-dimension charset
This function returns the dimension of charset.

Function: charset-chars charset
This function returns the number of characters per dimension of charset.

Function: charset-width charset
This function returns the number of display columns per character (in TTY mode) of charset.

Function: charset-direction charset
This function returns the display direction of charset---either l2r or r2l.

Function: charset-iso-final-char charset
This function returns the final byte of the ISO 2022 escape sequence designating charset.

Function: charset-iso-graphic-plane charset
This function returns either 0 or 1, depending on whether the position codes of characters in charset map to the left or right half of their font, respectively.

Function: charset-ccl-program charset
This function returns the CCL program, if any, for converting position codes of characters in charset into font indices.

The two properties of a charset that can currently be set after the charset has been created are the CCL program and the font registry.

Function: set-charset-ccl-program charset ccl-program
This function sets the ccl-program property of charset to ccl-program.

Function: set-charset-registry charset registry
This function sets the registry property of charset to registry.



63.2.4 Predefined Charsets

The following charsets are predefined in the C code.

 
Name                    Type  Fi Gr Dir Registry
--------------------------------------------------------------
ascii                    94    B  0  l2r ISO8859-1
control-1                94       0  l2r ---
latin-iso8859-1          94    A  1  l2r ISO8859-1
latin-iso8859-2          96    B  1  l2r ISO8859-2
latin-iso8859-3          96    C  1  l2r ISO8859-3
latin-iso8859-4          96    D  1  l2r ISO8859-4
cyrillic-iso8859-5       96    L  1  l2r ISO8859-5
arabic-iso8859-6         96    G  1  r2l ISO8859-6
greek-iso8859-7          96    F  1  l2r ISO8859-7
hebrew-iso8859-8         96    H  1  r2l ISO8859-8
latin-iso8859-9          96    M  1  l2r ISO8859-9
thai-tis620              96    T  1  l2r TIS620
katakana-jisx0201        94    I  1  l2r JISX0201.1976
latin-jisx0201           94    J  0  l2r JISX0201.1976
japanese-jisx0208-1978   94x94 @  0  l2r JISX0208.1978
japanese-jisx0208        94x94 B  0  l2r JISX0208.19(83|90)
japanese-jisx0212        94x94 D  0  l2r JISX0212
chinese-gb2312           94x94 A  0  l2r GB2312
chinese-cns11643-1       94x94 G  0  l2r CNS11643.1
chinese-cns11643-2       94x94 H  0  l2r CNS11643.2
chinese-big5-1           94x94 0  0  l2r Big5
chinese-big5-2           94x94 1  0  l2r Big5
korean-ksc5601           94x94 C  0  l2r KSC5601
composite                96x96    0  l2r ---

The following charsets are predefined in the Lisp code.

 
Name                     Type  Fi Gr Dir Registry
--------------------------------------------------------------
arabic-digit             94    2  0  l2r MuleArabic-0
arabic-1-column          94    3  0  r2l MuleArabic-1
arabic-2-column          94    4  0  r2l MuleArabic-2
sisheng                  94    0  0  l2r sisheng_cwnn\|OMRON_UDC_ZH
chinese-cns11643-3       94x94 I  0  l2r CNS11643.1
chinese-cns11643-4       94x94 J  0  l2r CNS11643.1
chinese-cns11643-5       94x94 K  0  l2r CNS11643.1
chinese-cns11643-6       94x94 L  0  l2r CNS11643.1
chinese-cns11643-7       94x94 M  0  l2r CNS11643.1
ethiopic                 94x94 2  0  l2r Ethio
ascii-r2l                94    B  0  r2l ISO8859-1
ipa                      96    0  1  l2r MuleIPA
vietnamese-viscii-lower  96    1  1  l2r VISCII1.1
vietnamese-viscii-upper  96    2  1  l2r VISCII1.1

For all of the above charsets, the dimension and number of columns are the same.

Note that ASCII, Control-1, and Composite are handled specially. This is why some of the fields are blank; and some of the filled-in fields (e.g. the type) are not really accurate.



63.3 MULE Characters

Function: make-char charset arg1 &optional arg2
This function makes a multi-byte character from charset and octets arg1 and arg2.

Function: char-charset character
This function returns the character set of character.

Function: char-octet character &optional n
This function returns the octet (i.e. position code) numbered n (which should be 0 or 1) of character. n defaults to 0 if omitted.

Function: find-charset-region start end &optional buffer
This function returns a list of the charsets in the region between start and end. buffer defaults to the current buffer if omitted.

Function: find-charset-string string
This function returns a list of the charsets in string.



63.4 Composite Characters

Composite characters are not yet completely implemented.

Function: make-composite-char string
This function converts a string into a single composite character. The character is the result of overstriking all the characters in the string.

Function: composite-char-string character
This function returns a string of the characters comprising a composite character.

Function: compose-region start end &optional buffer
This function composes the characters in the region from start to end in buffer into one composite character. The composite character replaces the composed characters. buffer defaults to the current buffer if omitted.

Function: decompose-region start end &optional buffer
This function decomposes any composite characters in the region from start to end in buffer. This converts each composite character into one or more characters, the individual characters out of which the composite character was formed. Non-composite characters are left as-is. buffer defaults to the current buffer if omitted.



63.5 Coding Systems

A coding system is an object that defines how text containing multiple character sets is encoded into a stream of (typically 8-bit) bytes. The coding system is used to decode the stream into a series of characters (which may be from multiple charsets) when the text is read from a file or process, and is used to encode the text back into the same format when it is written out to a file or process.

For example, many ISO-2022-compliant coding systems (such as Compound Text, which is used for inter-client data under the X Window System) use escape sequences to switch between different charsets -- Japanese Kanji, for example, is invoked with `ESC $ ( B'; ASCII is invoked with `ESC ( B'; and Cyrillic is invoked with `ESC - L'. See make-coding-system for more information.
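
As a rough analogy for what a coding system does at the file or process boundary -- decode external bytes into characters on read, encode characters back into bytes on write -- here is a round trip through an ISO-2022-style encoding using Python's codecs (which are external to XEmacs and merely implement a similar encoding):

```python
# "Read": decode external ISO-2022-JP bytes into characters.
text = "\u65e5\u672c\u8a9e"          # "Japanese (language)", three kanji
external = text.encode("iso2022_jp")  # what would live in a file on disk

assert external.decode("iso2022_jp") == text   # decoding recovers the text

# "Write": encoding the characters reproduces the external byte stream,
# so a read/write cycle through the same coding system is lossless.
assert text.encode("iso2022_jp") == external
```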

Coding systems are normally identified using a symbol, and the symbol is accepted in place of the actual coding system object whenever a coding system is called for. (This is similar to how faces and charsets work.)

Function: coding-system-p object
This function returns non-nil if object is a coding system.

63.5.1 Coding System Types  Classifying coding systems.
63.6 ISO 2022  An international standard for charsets and encodings.
63.6.1 EOL Conversion  Dealing with different ways of denoting the end of a line.
63.6.2 Coding System Properties  Properties of a coding system.
63.6.3 Basic Coding System Functions  Working with coding systems.
63.6.4 Coding System Property Functions  Retrieving a coding system's properties.
63.6.5 Encoding and Decoding Text  Encoding and decoding text.
63.6.6 Detection of Textual Encoding  Determining how text is encoded.
63.6.7 Big5 and Shift-JIS Functions  Special functions for these non-standard encodings.
63.6.8 Coding Systems Implemented  Coding systems implemented by MULE.



63.5.1 Coding System Types

The coding system type determines the basic algorithm XEmacs will use to decode or encode a data stream. Character encodings will be converted to the MULE encoding, escape sequences processed, and newline sequences converted to XEmacs's internal representation. There are three basic classes of coding system type: no-conversion, ISO-2022, and special.

No conversion allows you to look at the file's internal representation. Since XEmacs is basically a text editor, "no conversion" does convert newline conventions by default. (Use the 'binary coding-system if this is not desired.)

ISO 2022 (see section 63.6 ISO 2022) is the basic international standard regulating use of "coded character sets for the exchange of data", i.e., text streams. ISO 2022 contains functions that make it possible to encode text streams to comply with restrictions of the Internet mail system and de facto restrictions of most file systems (e.g., use of the separator character in file names). Coding systems which are not ISO 2022 conformant can be difficult to handle. Perhaps more importantly, they are not adaptable to multilingual information interchange, with the obvious exception of ISO 10646 (Unicode). (Unicode is partially supported by XEmacs with the addition of the Lisp package ucs-conv.)

The special class of coding systems includes automatic detection, CCL (a "little language" embedded as an interpreter, useful for translating between variants of a single character set), non-ISO-2022-conformant encodings like Unicode, Shift JIS, and Big5, and MULE internal coding. (NB: this list is based on XEmacs 21.2. Terminology may vary slightly for other versions of XEmacs and for GNU Emacs 20.)
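
The practical difference between these classes shows up in the bytes they produce. Assuming Python's standard codecs as stand-ins for the corresponding XEmacs coding systems, the same character encodes quite differently under Shift JIS, EUC (an ISO-2022-family encoding), and UTF-8:

```python
# One character, three encodings: the byte-level output differs even
# though the abstract character is the same.
ch = "漢"  # U+6F22, JIS X 0208 kuten 20-33

print(ch.encode("shift_jis"))  # non-ISO-2022: shifted code points
print(ch.encode("euc_jp"))     # ISO-2022 family: JIS bytes with high bit set
print(ch.encode("utf-8"))      # Unicode transformation format

assert ch.encode("shift_jis") == b"\x8a\xbf"
assert ch.encode("euc_jp") == b"\xb4\xc1"
assert ch.encode("utf-8") == b"\xe6\xbc\xa2"
```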

no-conversion
No conversion, for binary files, and a few special cases of non-ISO-2022 coding systems where conversion is done by hook functions (usually implemented in CCL). On output, graphic characters that are not in ASCII or Latin-1 will be replaced by a `?'. (For a no-conversion-encoded buffer, these characters will only be present if you explicitly insert them.)
iso2022
Any ISO-2022-compliant encoding. Among others, this includes JIS (the Japanese encoding commonly used for e-mail), national variants of EUC (the standard Unix encoding for Japanese and other languages), and Compound Text (an encoding used in X11). You can specify more specific information about the conversion with the flags argument.
ucs-4
ISO 10646 UCS-4 encoding. A 31-bit fixed-width superset of Unicode.
utf-8
ISO 10646 UTF-8 encoding. A "file system safe" transformation format that can be used with both UCS-4 and Unicode.
undecided
Automatic conversion. XEmacs attempts to detect the coding system used in the file.
shift-jis
Shift-JIS (a Japanese encoding commonly used in PC operating systems).
big5
Big5 (the encoding commonly used for Taiwanese).
ccl
The conversion is performed using a user-written pseudo-code program. CCL (Code Conversion Language) is the name of this pseudo-code. For example, CCL is used to map KOI8-R characters (an encoding for Russian Cyrillic) to ISO8859-5 (the form used internally by MULE).
internal
Write out or read in the raw contents of the memory representing the buffer's text. This is primarily useful for debugging purposes, and is only enabled when XEmacs has been compiled with DEBUG_XEMACS set (the `--debug' configure option). Warning: Reading in a file using internal conversion can result in an internal inconsistency in the memory representing a buffer's text, which will produce unpredictable results and may cause XEmacs to crash. Under normal circumstances you should never use internal conversion.



63.6 ISO 2022

This section briefly describes the ISO 2022 encoding standard. A more thorough treatment is available in the original document of ISO 2022 as well as various national standards (such as JIS X 0202).

Character sets (charsets) are classified into the following four categories, according to the number of characters in the charset: 94-charset, 96-charset, 94x94-charset, and 96x96-charset. This means that although an ISO 2022 coding system may have variable width characters, each charset used is fixed-width (in contrast to the MULE character set and UTF-8, for example).

ISO 2022 provides for switching between character sets via escape sequences. This switching is somewhat complicated, because ISO 2022 provides for both legacy applications like Internet mail that accept only 7 significant bits in some contexts (RFC 822 headers, for example), and more modern "8-bit clean" applications. It also provides for compact and transparent representation of languages like Japanese which mix ASCII and a national script (even outside of computer programs).

First, ISO 2022 codified prevailing practice by dividing the code space into "control" and "graphic" regions. The code points 0x00-0x1F and 0x80-0x9F are reserved for "control characters", while "graphic characters" must be assigned to code points in the regions 0x20-0x7F and 0xA0-0xFF. The positions 0x20 and 0x7F are special, and under some circumstances must be assigned the graphic character "ASCII SPACE" and the control character "ASCII DEL" respectively.

The various regions are given the name C0 (0x00-0x1F), GL (0x20-0x7F), C1 (0x80-0x9F), and GR (0xA0-0xFF). GL and GR stand for "graphic left" and "graphic right", respectively, because of the standard method of displaying graphic character sets in tables with the high byte indexing columns and the low byte indexing rows. I don't find it very intuitive, but these are called "registers".
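
A hypothetical helper makes the four regions concrete; in Python:

```python
def iso2022_region(byte):
    """Classify a byte into the ISO 2022 code regions C0/GL/C1/GR."""
    if byte <= 0x1F:
        return "C0"   # control characters, low half
    if byte <= 0x7F:
        return "GL"   # graphic left (0x20 and 0x7F are special)
    if byte <= 0x9F:
        return "C1"   # control characters, high half
    return "GR"       # graphic right

assert iso2022_region(0x1B) == "C0"  # ESC
assert iso2022_region(0x41) == "GL"  # 'A'
assert iso2022_region(0x8E) == "C1"  # SS2
assert iso2022_region(0xB4) == "GR"  # e.g. a byte of EUC-JP kanji
```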

An ISO 2022-conformant encoding for a graphic character set must use a fixed number of bytes per character, and the values must fit into a single register; that is, each byte must range over either 0x20-0x7F, or 0xA0-0xFF. It is not allowed to extend the range of the repertoire of a character set by using both ranges at the same time. This is why a standard character set such as ISO 8859-1 is actually considered by ISO 2022 to be an aggregation of two character sets, ASCII and LATIN-1, and why it is technically incorrect to refer to ISO 8859-1 as "Latin 1". Also, a single character's bytes must all be drawn from the same register; this is why Shift JIS (for Japanese) and Big 5 (for Chinese) are not ISO 2022-compatible encodings.

The reason for this restriction becomes clear when you attempt to define an efficient, robust encoding for a language like Japanese. Like ISO 8859, Japanese encodings are aggregations of several character sets. In practice, the vast majority of characters are drawn from the "JIS Roman" character set (a derivative of ASCII; it won't hurt to think of it as ASCII) and the JIS X 0208 standard "basic Japanese" character set including not only ideographic characters ("kanji") but syllabic Japanese characters ("kana"), a wide variety of symbols, and many alphabetic characters (Roman, Greek, and Cyrillic) as well. Although JIS X 0208 includes the whole Roman alphabet, as a 2-byte code it is not suited to programming; thus the inclusion of ASCII in the standard Japanese encodings.

For normal Japanese text such as in newspapers, a broad repertoire of approximately 3000 characters is used. Evidently this won't fit into one byte; two must be used. But much of the text processed by Japanese computers is computer source code, nearly all of which is ASCII. A not insignificant portion of ordinary text is English (as such or as borrowed Japanese vocabulary) or other languages which can be represented at least approximately in ASCII, as well. It seems reasonable then to represent ASCII in one byte, and JIS X 0208 in two. And this is exactly what the Extended Unix Code for Japanese (EUC-JP) does. ASCII is invoked to the GL register, and JIS X 0208 is invoked to the GR register. Thus, each byte can be tested for its character set by looking at the high bit; if set, it is Japanese, if clear, it is ASCII. Furthermore, since control characters like newline can never be part of a graphic character, even in the case of corruption in transmission the stream will be resynchronized at every line break, on the order of 60-80 bytes. This coding system requires no escape sequences or special control codes to represent 99.9% of all Japanese text.

Note carefully the distinction between the character sets (ASCII and JIS X 0208), the encoding (EUC-JP), and the coding system (ISO 2022). The JIS X 0208 character set is used in three different encodings for Japanese, but in ISO-2022-JP it is invoked into GL (so the high bit is always clear), in EUC-JP it is invoked into GR (setting the high bit in the process), and in Shift JIS the high bit may be set or reset, and the significant bits are shifted within the 16-bit character so that the two main character sets can coexist with a third (the "halfwidth katakana" of JIS X 0201). As the name implies, the ISO-2022-JP encoding is also a version of the ISO-2022 coding system.
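
The GL/GR distinction can be verified directly: the JIS X 0208 bytes of a character in ISO-2022-JP (invoked into GL) and in EUC-JP (invoked into GR) differ by exactly the high bit. A Python check, assuming the standard codecs as stand-ins:

```python
ch = "漢"

jis = ch.encode("iso2022_jp")   # b'\x1b$B' + GL bytes + b'\x1b(B'
euc = ch.encode("euc_jp")       # the same JIS bytes, invoked into GR

gl_bytes = jis[3:-3]            # strip the two designation escape sequences
assert gl_bytes == b"4A"        # JIS code 0x34 0x41, high bit clear
assert euc == bytes(b | 0x80 for b in gl_bytes)  # GR = GL with high bit set
```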

In order to systematically treat subsidiary character sets (like the "halfwidth katakana" already mentioned, and the "supplementary kanji" of JIS X 0212), four further registers are defined: G0, G1, G2, and G3. Unlike GL and GR, they are not logically distinguished by internal format. Instead, the process of "invocation" mentioned earlier is broken into two steps: first, a character set is designated to one of the registers G0-G3 by use of an escape sequence of the form:

 
        ESC [I] I F

where I is an intermediate character or characters in the range 0x20 - 0x3F, and F, from the range 0x30-0x7F, is the final character identifying this charset. (Final characters in the range 0x30-0x3F are reserved for private use and will never have a publicly registered meaning.)

Then that register is invoked to either GL or GR, either automatically (designations to G0 normally involve invocation to GL as well), or by use of shifting (affecting only the following character in the data stream) or locking (effective until the next designation or locking) control sequences. An encoding conformant to ISO 2022 is typically defined by designating the initial contents of the G0-G3 registers, specifying a 7 or 8 bit environment, and specifying whether further designations will be recognized.

Some examples of character sets and the registered final characters F used to designate them:

94-charset
ASCII (B), left (J) and right (I) half of JIS X 0201, ...
96-charset
Latin-1 (A), Latin-2 (B), Latin-3 (C), ...
94x94-charset
GB2312 (A), JIS X 0208 (B), KSC5601 (C), ...
96x96-charset
none for the moment

The meanings of the various characters in these sequences, where not specified by the ISO 2022 standard (such as the ESC character), are assigned by ECMA, the European Computer Manufacturers Association.

The meanings of the intermediate characters are:

 
        $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
        ( [0x28]: designate to G0 a 94-charset whose final byte is F.
        ) [0x29]: designate to G1 a 94-charset whose final byte is F.
        * [0x2A]: designate to G2 a 94-charset whose final byte is F.
        + [0x2B]: designate to G3 a 94-charset whose final byte is F.
        , [0x2C]: designate to G0 a 96-charset whose final byte is F.
        - [0x2D]: designate to G1 a 96-charset whose final byte is F.
        . [0x2E]: designate to G2 a 96-charset whose final byte is F.
        / [0x2F]: designate to G3 a 96-charset whose final byte is F.

The comma may be used in files read and written only by MULE, as a MULE extension, but this is illegal in ISO 2022. (The reason is that in ISO 2022 G0 must be a 94-member character set, with 0x20 assigned the value SPACE, and 0x7F assigned the value DEL.)

Here are examples of designations:

 
        ESC ( B :              designate to G0 ASCII
        ESC - A :              designate to G1 Latin-1
        ESC $ ( A or ESC $ A : designate to G0 GB2312
        ESC $ ( B or ESC $ B : designate to G0 JISX0208
        ESC $ ) C :            designate to G1 KSC5601

(The short forms used to designate GB2312 and JIS X 0208 are for backwards compatibility; the long forms are preferred.)
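
A sketch of a parser for these designation sequences, under the simplifying assumption that at most one intermediate byte follows ESC (plus `$` for two-dimensional sets); the function and its return shape (register, set width, dimension, final byte) are hypothetical, not part of any XEmacs API:

```python
# Map intermediate bytes to (register, charset width), per the table above.
DESIGNATORS = {
    0x28: ("G0", 94), 0x29: ("G1", 94), 0x2A: ("G2", 94), 0x2B: ("G3", 94),
    0x2C: ("G0", 96), 0x2D: ("G1", 96), 0x2E: ("G2", 96), 0x2F: ("G3", 96),
}

def parse_designation(seq):
    """Parse ESC [$] I F into (register, width, dimension, final byte)."""
    assert seq[0] == 0x1B
    i = 1
    dimension = 1
    if seq[i] == 0x24:          # '$': two-dimensional (94x94 / 96x96) set
        dimension = 2
        i += 1
        if i == len(seq) - 1:   # short form, e.g. ESC $ B: implies G0, 94x94
            return ("G0", 94, dimension, chr(seq[i]))
    register, width = DESIGNATORS[seq[i]]
    return (register, width, dimension, chr(seq[i + 1]))

assert parse_designation(b"\x1b(B") == ("G0", 94, 1, "B")    # ASCII
assert parse_designation(b"\x1b-A") == ("G1", 96, 1, "A")    # Latin-1
assert parse_designation(b"\x1b$(B") == ("G0", 94, 2, "B")   # JIS X 0208
assert parse_designation(b"\x1b$B") == ("G0", 94, 2, "B")    # short form
assert parse_designation(b"\x1b$)C") == ("G1", 94, 2, "C")   # KSC5601
```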

To use a charset designated to G2 or G3, and to use a charset designated to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3 into GL. There are two types of invocation, Locking Shift (forever) and Single Shift (one character only).

Locking Shift is done as follows:

 
        LS0 or SI (0x0F): invoke G0 into GL
        LS1 or SO (0x0E): invoke G1 into GL
        LS2:  invoke G2 into GL
        LS3:  invoke G3 into GL
        LS1R: invoke G1 into GR
        LS2R: invoke G2 into GR
        LS3R: invoke G3 into GR

Single Shift is done as follows:

 
        SS2 or ESC N: invoke G2 into GL
        SS3 or ESC O: invoke G3 into GL

The shift functions (such as LS1R and SS3) are represented by control characters (from C1) in 8 bit environments and by escape sequences in 7 bit environments.

(#### Ben says: I think the above is slightly incorrect. It appears that SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and ESC O behave as indicated. The above definitions will not parse EUC-encoded text correctly, and it looks like the code in mule-coding.c has similar problems.)
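
Ben's reading matches observable EUC-JP data: Python's `euc_jp` codec (used here as a stand-in) encodes a halfwidth katakana as the C1 control SS2 (0x8E) followed by a byte in the GR range, i.e. in this 8-bit environment SS2 invokes G2 into GR, not GL:

```python
# Halfwidth katakana lives in G2 under EUC-JP; it is reached with SS2.
ka = "ｱ"  # U+FF71, JIS X 0201 katakana code 0xB1
data = ka.encode("euc_jp")

assert data == b"\x8e\xb1"     # SS2 (0x8E), then a GR byte, not a GL byte
assert data[0] == 0x8E         # the C1 single-shift control
assert (data[1] & 0x80) != 0   # following byte has the high bit set (GR)
```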

Evidently there are a lot of ISO-2022-compliant ways of encoding multilingual text. Many coding systems in actual use, such as X11's Compound Text, the Japanese JUNET code, and so-called EUC (Extended Unix Code), are variants of ISO 2022.

In MULE, we characterize a version of ISO 2022 by the following attributes:

  1. The character sets initially designated to G0 thru G3.
  2. Whether short form designations are allowed for Japanese and Chinese.
  3. Whether ASCII should be designated to G0 before control characters.
  4. Whether ASCII should be designated to G0 at the end of line.
  5. 7-bit environment or 8-bit environment.
  6. Whether Locking Shifts are used or not.
  7. Whether to use ASCII or the variant JIS X 0201-1976-Roman.
  8. Whether to use JIS X 0208-1983 or the older version JIS X 0208-1976.

(The last two are only for Japanese.)

By specifying these attributes, you can create any variant of ISO 2022.

Here are several examples:

 
ISO-2022-JP -- Coding system used in Japanese email (RFC 1468).
        1. G0 <- ASCII, G1..3 <- never used
        2. Yes.
        3. Yes.
        4. Yes.
        5. 7-bit environment
        6. No.
        7. Use ASCII
        8. Use JIS X 0208-1983

ctext -- X11 Compound Text
        1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used.
        2. No.
        3. No.
        4. Yes.
        5. 8-bit environment.
        6. No.
        7. Use ASCII.
        8. Use JIS X 0208-1983.

euc-china -- Chinese EUC.  Often called the "GB encoding", but that is
technically incorrect.
        1. G0 <- ASCII, G1 <- GB 2312, G2,3 <- never used.
        2. No.
        3. Yes.
        4. Yes.
        5. 8-bit environment.
        6. No.
        7. Use ASCII.
        8. Use JIS X 0208-1983.

ISO-2022-KR -- Coding system used in Korean email.
        1. G0 <- ASCII, G1 <- KSC 5601, G2,3 <- never used.
        2. No.
        3. Yes.
        4. Yes.
        5. 7-bit environment.
        6. Yes.
        7. Use ASCII.
        8. Use JIS X 0208-1983.

MULE creates all of these coding systems by default.



63.6.1 EOL Conversion

The eol-type property of a coding system specifies how line breaks in external data are converted to and from XEmacs's internal representation (LF). It may take one of the following values:

nil
Automatically detect the end-of-line type (LF, CRLF, or CR). Also generate subsidiary coding systems named name-unix, name-dos, and name-mac, that are identical to this coding system but have an EOL-TYPE value of lf, crlf, and cr, respectively.
lf
The end of a line is marked externally using ASCII LF. Since this is also the way that XEmacs represents an end-of-line internally, specifying this option results in no end-of-line conversion. This is the standard format for Unix text files.
crlf
The end of a line is marked externally using ASCII CRLF. This is the standard format for MS-DOS text files.
cr
The end of a line is marked externally using ASCII CR. This is the standard format for Macintosh text files.
t
Automatically detect the end-of-line type but do not generate subsidiary coding systems. (This value is converted to nil when stored internally, and coding-system-property will return nil.)
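
The conversions these values name amount to a simple byte-level rewrite on decode. A minimal Python sketch of the normalization direction (the one XEmacs performs when reading a file); the helper name is illustrative only:

```python
def normalize_eol(data: bytes) -> bytes:
    """Convert CRLF (dos) and bare CR (mac) line breaks to LF (unix),
    the internal representation; order matters -- CRLF must go first."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

assert normalize_eol(b"one\r\ntwo\r\n") == b"one\ntwo\n"   # dos
assert normalize_eol(b"one\rtwo\r") == b"one\ntwo\n"       # mac
assert normalize_eol(b"one\ntwo\n") == b"one\ntwo\n"       # unix, unchanged
```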



63.6.2 Coding System Properties

mnemonic
String to be displayed in the modeline when this coding system is active.

eol-type
End-of-line conversion to be used. It should be one of the types listed in 63.6.1 EOL Conversion.

eol-lf
The coding system which is the same as this one, except that it uses the Unix line-breaking convention.

eol-crlf
The coding system which is the same as this one, except that it uses the DOS line-breaking convention.

eol-cr
The coding system which is the same as this one, except that it uses the Macintosh line-breaking convention.

post-read-conversion
Function called after a file has been read in, to perform the decoding. Called with two arguments, start and end, denoting a region of the current buffer to be decoded.

pre-write-conversion
Function called before a file is written out, to perform the encoding. Called with two arguments, start and end, denoting a region of the current buffer to be encoded.

The following additional properties are recognized if type is iso2022:

charset-g0
charset-g1
charset-g2
charset-g3
The character set initially designated to the G0 - G3 registers. The value should be a charset object, nil (do not ever use this register), or t (no character set is initially designated to the register, but other character sets may be designated to it later).

force-g0-on-output
force-g1-on-output
force-g2-on-output
force-g3-on-output
If non-nil, send an explicit designation sequence on output before using the specified register.

short
If non-nil, use the short forms `ESC $ @', `ESC $ A', and `ESC $ B' on output in place of the full designation sequences `ESC $ ( @', `ESC $ ( A', and `ESC $ ( B'.

no-ascii-eol
If non-nil, don't designate ASCII to G0 at each end of line on output. Setting this to non-nil also suppresses other state-resetting that normally happens at the end of a line.

no-ascii-cntl
If non-nil, don't designate ASCII to G0 before control chars on output.

seven
If non-nil, use 7-bit environment on output. Otherwise, use 8-bit environment.

lock-shift
If non-nil, use locking-shift (SO/SI) instead of single-shift or designation by escape sequence.

no-iso6429
If non-nil, don't use ISO6429's direction specification.

escape-quoted
If non-nil, literal control characters that are the same as the beginning of a recognized ISO 2022 or ISO 6429 escape sequence (in particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E), SS3 (0x8F), and CSI (0x9B)) are "quoted" with an escape character so that they can be properly distinguished from an escape sequence. (Note that doing this results in a non-portable encoding.) This encoding flag is used for byte-compiled files. Note that ESC is a good choice for a quoting character because there are no escape sequences whose second byte is a character from the Control-0 or Control-1 character sets; this is explicitly disallowed by the ISO 2022 standard.

input-charset-conversion
A list of conversion specifications, specifying conversion of characters in one charset to another when decoding is performed. Each specification is a list of two elements: the source charset, and the destination charset.

output-charset-conversion
A list of conversion specifications, specifying conversion of characters in one charset to another when encoding is performed. The form of each specification is the same as for input-charset-conversion.

The following additional properties are recognized (and required) if type is ccl:

decode
CCL program used for decoding (converting to internal format).

encode
CCL program used for encoding (converting to external format).

The following properties are used internally: eol-cr, eol-crlf, eol-lf, and base.
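
The decode and encode CCL programs above are essentially small byte-translation routines. As an illustration of the idea only (not the CCL language itself), a table-driven KOI8-R to ISO 8859-5 remapping, the example conversion mentioned in 63.5.1, can be sketched in Python:

```python
# Build a 256-entry translation table from KOI8-R to ISO 8859-5 for the
# Cyrillic letters.  KOI8-R orders the letters differently, so this is a
# genuine remapping, not a constant offset.
table = bytearray(range(256))  # identity for bytes we do not remap
for cp in list(range(0x0410, 0x0450)) + [0x0401, 0x0451]:  # А..я, Ё, ё
    ch = chr(cp)
    table[ch.encode("koi8_r")[0]] = ch.encode("iso8859_5")[0]

koi8_text = "Привет".encode("koi8_r")
converted = koi8_text.translate(bytes(table))
assert converted == "Привет".encode("iso8859_5")
```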



63.6.3 Basic Coding System Functions

Function: find-coding-system coding-system-or-name
This function retrieves the coding system of the given name.

If coding-system-or-name is a coding-system object, it is simply returned. Otherwise, coding-system-or-name should be a symbol. If there is no such coding system, nil is returned. Otherwise the associated coding system object is returned.

Function: get-coding-system name
This function retrieves the coding system of the given name. Same as find-coding-system except an error is signalled if there is no such coding system instead of returning nil.

Function: coding-system-list
This function returns a list of the names of all defined coding systems.

Function: coding-system-name coding-system
This function returns the name of the given coding system.

Function: coding-system-base coding-system
This function returns the base coding system of coding-system, i.e. the corresponding coding system with an undecided EOL convention.

Function: make-coding-system name type &optional doc-string props
This function registers symbol name as a coding system.

type describes the conversion method used and should be one of the types listed in 63.5.1 Coding System Types.

doc-string is a string describing the coding system.

props is a property list, describing the specific nature of the character set. Recognized properties are as in 63.6.2 Coding System Properties.

Function: copy-coding-system old-coding-system new-name
This function copies old-coding-system to new-name. If new-name does not name an existing coding system, a new one will be created.

Function: subsidiary-coding-system coding-system eol-type
This function returns the subsidiary coding system of coding-system with eol type eol-type.



63.6.4 Coding System Property Functions

Function: coding-system-doc-string coding-system
This function returns the doc string for coding-system.

Function: coding-system-type coding-system
This function returns the type of coding-system.

Function: coding-system-property coding-system prop
This function returns the prop property of coding-system.



63.6.5 Encoding and Decoding Text

Function: decode-coding-region start end coding-system &optional buffer
This function decodes the text between start and end which is encoded in coding-system. This is useful if you've read in encoded text from a file without decoding it (e.g. you read in a JIS-formatted file but used the binary or no-conversion coding system, so that it shows up as `^[$B!<!+^[(B'). The length of the encoded text is returned. buffer defaults to the current buffer if unspecified.

Function: encode-coding-region start end coding-system &optional buffer
This function encodes the text between start and end using coding-system. This will, for example, convert Japanese characters into stuff such as `^[$B!<!+^[(B' if you use the JIS encoding. The length of the encoded text is returned. buffer defaults to the current buffer if unspecified.
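
The Lisp functions above operate on buffer regions; the same encode/decode round trip can be shown on a string, assuming Python's codecs as a stand-in for the coding-system machinery:

```python
# Encoding then decoding with the same coding system is lossless for
# text representable in that coding system.
text = "Mixed ASCII と日本語"
encoded = text.encode("iso2022_jp")   # 7-bit, escape-sequence encoded

assert max(encoded) < 0x80                    # a 7-bit environment, as in email
assert encoded.decode("iso2022_jp") == text   # round trip restores the text
```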



63.6.6 Detection of Textual Encoding

Function: coding-category-list
This function returns a list of all recognized coding categories.

Function: set-coding-priority-list list
This function changes the priority order of the coding categories. list should be a list of coding categories, in descending order of priority. Unspecified coding categories will be lower in priority than all specified ones, in the same relative order they were in previously.

Function: coding-priority-list
This function returns a list of coding categories in descending order of priority.

Function: set-coding-category-system coding-category coding-system
This function changes the coding system associated with a coding category.

Function: coding-category-system coding-category
This function returns the coding system associated with a coding category.

Function: detect-coding-region start end &optional buffer
This function detects the coding system of the text in the region between start and end. The returned value is a list of possible coding systems ordered by priority. If only ASCII characters are found, it returns autodetect or one of its subsidiary coding systems, according to the detected end-of-line type. The optional argument buffer defaults to the current buffer.
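
Priority order matters because many byte sequences are valid in several encodings. A trial-decoding sketch in Python (a much cruder heuristic than XEmacs's detector, which also uses statistical and structural hints; the function name is illustrative):

```python
def detect(data, priority):
    """Return the first coding system in `priority` that decodes cleanly."""
    for name in priority:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            pass
    return None

data = "漢".encode("euc_jp")          # b'\xb4\xc1'
# The same two bytes also decode as two halfwidth katakana under Shift
# JIS, so the answer depends entirely on the priority order:
assert detect(data, ["euc_jp", "shift_jis"]) == "euc_jp"
assert detect(data, ["shift_jis", "euc_jp"]) == "shift_jis"
```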



63.6.7 Big5 and Shift-JIS Functions

These are special functions for working with the non-standard Shift-JIS and Big5 encodings.

Function: decode-shift-jis-char code
This function decodes a JIS X 0208 character of the Shift-JIS coding system. code is the character code in Shift-JIS, given as a cons of two bytes. The corresponding character is returned.

Function: encode-shift-jis-char character
This function encodes a JIS X 0208 character character to SHIFT-JIS coding-system. The corresponding character code in SHIFT-JIS is returned as a cons of two bytes.
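
The arithmetic behind these functions is a standard, if fiddly, transformation. A Python sketch of the Shift-JIS-to-JIS direction, checked against the standard codecs (the JIS code of a character equals its EUC-JP bytes with the high bit cleared):

```python
def sjis_to_jis(s1, s2):
    """Convert a 2-byte Shift-JIS code to the underlying JIS X 0208 code."""
    # The row is packed two-to-a-byte in Shift JIS; undo the packing.
    j1 = ((s1 - (0x70 if s1 < 0xA0 else 0xB0)) << 1) - (1 if s2 < 0x9F else 0)
    # The cell is offset to dodge the control regions and 0x7F.
    j2 = s2 - (0x7E if s2 >= 0x9F else (0x20 if s2 >= 0x80 else 0x1F))
    return j1, j2

for ch in "漢字あ亜":
    s1, s2 = ch.encode("shift_jis")
    e1, e2 = ch.encode("euc_jp")
    assert sjis_to_jis(s1, s2) == (e1 & 0x7F, e2 & 0x7F)
```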

Function: decode-big5-char code
This function decodes a Big5 character code of BIG5 coding-system. code is the character code in BIG5. The corresponding character is returned.

Function: encode-big5-char character
This function encodes the Big5 character character to BIG5 coding-system. The corresponding character code in Big5 is returned.



63.6.8 Coding Systems Implemented

MULE initializes most of the commonly used coding systems at XEmacs's startup. A few others are initialized only when the relevant language environment is selected and support libraries are loaded. (NB: The following list is based on XEmacs 21.2.19, the development branch at the time of writing. The list may be somewhat different for other versions. Recent versions of GNU Emacs 20 implement a few more rare coding systems; work is being done to port these to XEmacs.)

Unfortunately, there is not a consistent naming convention for character sets, and for practical purposes coding systems often take their name from their principal character sets (ASCII, KOI8-R, Shift JIS). Others take their names from the coding system (ISO-2022-JP, EUC-KR), and a few from their non-text usages (internal, binary). To provide for this, and for the fact that many coding systems have several common names, an aliasing system is provided. Finally, some effort has been made to use names that are registered as MIME charsets (this is why the name 'shift_jis contains that un-Lisp-y underscore).

There is a systematic naming convention regarding end-of-line (EOL) conventions for different systems. A coding system whose name ends in "-unix" forces the assumption that lines are broken by newlines (0x0A). A coding system whose name ends in "-mac" forces the assumption that lines are broken by ASCII CRs (0x0D). A coding system whose name ends in "-dos" forces the assumption that lines are broken by CRLF sequences (0x0D 0x0A). These subsidiary coding systems are automatically derived from a base coding system. Use of the base coding system implies autodetection of the text file convention. (The fact that the -unix, -mac, and -dos variants are derived from a base system results in them showing up as "aliases" in `list-coding-systems'.) These subsidiaries have a consistent modeline indicator as well: "-dos" coding systems have ":T" appended to their modeline indicator, while "-mac" coding systems have ":t" appended (e.g., "ISO8:t" for iso-2022-8-mac).

In the following table, each coding system is given with its mode line indicator in parentheses. Non-textual coding systems are listed first, followed by textual coding systems and their aliases. (The coding system subsidiary modeline indicators ":T" and ":t" will be omitted from the table of coding systems.)

### SJT 1999-08-23 Maybe should order these by language? Definitely need language usage for the ISO-8859 family.

Note that although true coding system aliases have been implemented for XEmacs 21.2, the coding system initialization has not yet been converted as of 21.2.19. So coding systems described as aliases have the same properties as the aliased coding system, but will not be equal as Lisp objects.

automatic-conversion
undecided
undecided-dos
undecided-mac
undecided-unix

Modeline indicator: Auto. A type undecided coding system. Attempts to determine an appropriate coding system from file contents or the environment.

raw-text
no-conversion
raw-text-dos
raw-text-mac
raw-text-unix
no-conversion-dos
no-conversion-mac
no-conversion-unix

Modeline indicator: Raw. A type no-conversion coding system, which converts only line-break-codes. An implementation quirk means that this coding system is also used for ISO8859-1.

binary
Modeline indicator: Binary. A type no-conversion coding system which does no character coding or EOL conversions. An alias for raw-text-unix.

alternativnyj
alternativnyj-dos
alternativnyj-mac
alternativnyj-unix

Modeline indicator: Cy.Alt. A type ccl coding system used for Alternativnyj, an encoding of the Cyrillic alphabet.

big5
big5-dos
big5-mac
big5-unix

Modeline indicator: Zh/Big5. A type big5 coding system used for BIG5, the most common encoding of traditional Chinese as used in Taiwan.

cn-gb-2312
cn-gb-2312-dos
cn-gb-2312-mac
cn-gb-2312-unix

Modeline indicator: Zh-GB/EUC. A type iso2022 coding system used for simplified Chinese (as used in the People's Republic of China), with the ascii (G0), chinese-gb2312 (G1), and sisheng (G2) character sets initially designated. Chinese EUC (Extended Unix Code).

ctext-hebrew
ctext-hebrew-dos
ctext-hebrew-mac
ctext-hebrew-unix

Modeline indicator: CText/Hbrw. A type iso2022 coding system with the ascii (G0) and hebrew-iso8859-8 (G1) character sets initially designated for Hebrew.

ctext
ctext-dos
ctext-mac
ctext-unix

Modeline indicator: CText. A type iso2022 8-bit coding system with the ascii (G0) and latin-iso8859-1 (G1) character sets initially designated. X11 Compound Text Encoding. Often mistakenly recognized instead of EUC encodings; usual cause is inappropriate setting of coding-priority-list.

escape-quoted

Modeline indicator: ESC/Quot. A type iso2022 8-bit coding system with the ascii (G0) and latin-iso8859-1 (G1) character sets initially designated and escape quoting. Unix EOL conversion (i.e., no conversion). It is used for .ELC files.

euc-jp
euc-jp-dos
euc-jp-mac
euc-jp-unix

Modeline indicator: Ja/EUC. A type iso2022 8-bit coding system with ascii (G0), japanese-jisx0208 (G1), katakana-jisx0201 (G2), and japanese-jisx0212 (G3) initially designated. Japanese EUC (Extended Unix Code).

euc-kr
euc-kr-dos
euc-kr-mac
euc-kr-unix

Modeline indicator: ko/EUC. A type iso2022 8-bit coding system with ascii (G0) and korean-ksc5601 (G1) initially designated. Korean EUC (Extended Unix Code).

hz-gb-2312
Modeline indicator: Zh-GB/Hz. A type no-conversion coding system with Unix EOL convention (i.e., no conversion) using post-read-decode and pre-write-encode functions to translate the Hz/ZW coding system used for Chinese.

iso-2022-7bit
iso-2022-7bit-unix
iso-2022-7bit-dos
iso-2022-7bit-mac
iso-2022-7

Modeline indicator: ISO7. A type iso2022 7-bit coding system with ascii (G0) initially designated. Other character sets must be explicitly designated to be used.

iso-2022-7bit-ss2
iso-2022-7bit-ss2-dos
iso-2022-7bit-ss2-mac
iso-2022-7bit-ss2-unix

Modeline indicator: ISO7/SS. A type iso2022 7-bit coding system with ascii (G0) initially designated. Other character sets must be explicitly designated to be used. SS2 is used to invoke a 96-charset, one character at a time.

iso-2022-8
iso-2022-8-dos
iso-2022-8-mac
iso-2022-8-unix

Modeline indicator: ISO8. A type iso2022 8-bit coding system with ascii (G0) and latin-iso8859-1 (G1) initially designated. Other character sets must be explicitly designated to be used. No single-shift or locking-shift.

iso-2022-8bit-ss2
iso-2022-8bit-ss2-dos
iso-2022-8bit-ss2-mac
iso-2022-8bit-ss2-unix

Modeline indicator: ISO8/SS. A type iso2022 8-bit coding system with ascii (G0) and latin-iso8859-1 (G1) initially designated. Other character sets must be explicitly designated to be used. SS2 is used to invoke a 96-charset, one character at a time.

iso-2022-int-1
iso-2022-int-1-dos
iso-2022-int-1-mac
iso-2022-int-1-unix

Modeline indicator: INT-1. A type iso2022 7-bit coding system with ascii (G0) and korean-ksc5601 (G1) initially designated. ISO-2022-INT-1.

iso-2022-jp-1978-irv
iso-2022-jp-1978-irv-dos
iso-2022-jp-1978-irv-mac
iso-2022-jp-1978-irv-unix

Modeline indicator: Ja-78/7bit. A type iso2022 7-bit coding system. For compatibility with old Japanese terminals; if you need to know, look at the source.

iso-2022-jp
iso-2022-jp-2 (ISO7/SS)
iso-2022-jp-dos
iso-2022-jp-mac
iso-2022-jp-unix
iso-2022-jp-2-dos
iso-2022-jp-2-mac
iso-2022-jp-2-unix

Modeline indicator: MULE/7bit. A type iso2022 7-bit coding system with ascii (G0) initially designated, and complex specifications to ensure backward compatibility with old Japanese systems. Used for communication with mail and news in Japan. The "-2" versions also use SS2 to invoke a 96-charset one character at a time.

iso-2022-kr
Modeline indicator: Ko/7bit. A type iso2022 7-bit coding system with ascii (G0) and korean-ksc5601 (G1) initially designated. Used for e-mail in Korea.

iso-2022-lock
iso-2022-lock-dos
iso-2022-lock-mac
iso-2022-lock-unix

Modeline indicator: ISO7/Lock. A type iso2022 7-bit coding system with ascii (G0) initially designated, using Locking-Shift to invoke a 96-charset.

iso-8859-1
iso-8859-1-dos
iso-8859-1-mac
iso-8859-1-unix

Due to the way it is implemented, this is not a type iso2022 coding system, but rather an alias for the raw-text coding system.

iso-8859-2
iso-8859-2-dos
iso-8859-2-mac
iso-8859-2-unix

Modeline indicator: MIME/Ltn-2. A type iso2022 coding system with ascii (G0) and latin-iso8859-2 (G1) initially invoked.

iso-8859-3
iso-8859-3-dos
iso-8859-3-mac
iso-8859-3-unix

Modeline indicator: MIME/Ltn-3. A type iso2022 coding system with ascii (G0) and latin-iso8859-3 (G1) initially invoked.

iso-8859-4
iso-8859-4-dos
iso-8859-4-mac
iso-8859-4-unix

Modeline indicator: MIME/Ltn-4. A type iso2022 coding system with ascii (G0) and latin-iso8859-4 (G1) initially invoked.

iso-8859-5
iso-8859-5-dos
iso-8859-5-mac
iso-8859-5-unix

Modeline indicator: ISO8/Cyr. A type iso2022 coding system with ascii (G0) and cyrillic-iso8859-5 (G1) initially invoked.

iso-8859-7
iso-8859-7-dos
iso-8859-7-mac
iso-8859-7-unix

Modeline indicator: Grk. A type iso2022 coding system with ascii (G0) and greek-iso8859-7 (G1) initially invoked.

iso-8859-8
iso-8859-8-dos
iso-8859-8-mac
iso-8859-8-unix

Modeline indicator: MIME/Hbrw. A type iso2022 coding system with ascii (G0) and hebrew-iso8859-8 (G1) initially invoked.

iso-8859-9
iso-8859-9-dos
iso-8859-9-mac
iso-8859-9-unix

Modeline indicator: MIME/Ltn-5. A type iso2022 coding system with ascii (G0) and latin-iso8859-9 (G1) initially invoked.

koi8-r
koi8-r-dos
koi8-r-mac
koi8-r-unix

Modeline indicator: KOI8. A type ccl coding-system used for KOI8-R, an encoding of the Cyrillic alphabet.

shift_jis
shift_jis-dos
shift_jis-mac
shift_jis-unix

Modeline indicator: Ja/SJIS. A type shift-jis coding-system implementing the Shift-JIS encoding for Japanese. The underscore is present to match the name of the MIME charset for this encoding.

tis-620
tis-620-dos
tis-620-mac
tis-620-unix

Modeline indicator: TIS620. A type ccl coding-system for Thai. The external encoding is defined by TIS620; the internal encoding is peculiar to MULE and is called thai-xtis.

viqr

Modeline indicator: VIQR. A type no-conversion coding system with Unix EOL convention (ie, no conversion) using post-read-decode and pre-write-encode functions to translate the VIQR coding system for Vietnamese.

viscii
viscii-dos
viscii-mac
viscii-unix

Modeline indicator: VISCII. A type ccl coding-system used for VISCII 1.1 for Vietnamese. Differs slightly from VSCII; VISCII is given priority by XEmacs.

vscii
vscii-dos
vscii-mac
vscii-unix

Modeline indicator: VSCII. A type ccl coding-system used for VSCII 1.1 for Vietnamese. Differs slightly from VISCII, which is given priority by XEmacs. Use (prefer-coding-system 'vietnamese-vscii) to give priority to VSCII.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

63.7 CCL

CCL (Code Conversion Language) is a simple structured programming language designed for character coding conversions. A CCL program is compiled to CCL code (represented by a vector of integers) and executed by the CCL interpreter embedded in Emacs. The CCL interpreter implements a virtual machine with 8 registers called r0, ..., r7, a number of control structures, and some I/O operators. Take care when using registers r0 (used in implicit set statements) and especially r7 (used internally by several statements and operations, especially for multiple return values and I/O operations).

CCL is used for code conversion during process I/O and file I/O for non-ISO2022 coding systems. (It is the only way for a user to specify a code conversion function.) It is also used for calculating the code point of an X11 font from a character code. However, since CCL is designed as a powerful programming language, it can be used for more generic calculation where efficiency is demanded. A combination of three or more arithmetic operations can be calculated faster by CCL than by Emacs Lisp.

Warning: The code in `src/mule-ccl.c' and `$packages/lisp/mule-base/mule-ccl.el' is the definitive description of CCL's semantics. The previous version of this section contained several typos and obsolete names left from earlier versions of MULE, and many may remain. (I am not an experienced CCL programmer; the few who know CCL well find writing English painful.)

A CCL program transforms an input data stream into an output data stream. The input stream, held in a buffer of constant bytes, is left unchanged. The buffer may be filled by an external input operation, taken from an Emacs buffer, or taken from a Lisp string. The output buffer is a dynamic array of bytes, which can be written by an external output operation, inserted into an Emacs buffer, or returned as a Lisp string.

A CCL program is a (Lisp) list containing two or three members. The first member is the buffer magnification, which indicates the required minimum size of the output buffer as a multiple of the input buffer. It is followed by the main block which executes while there is input remaining, and an optional EOF block which is executed when the input is exhausted. Both the main block and the EOF block are CCL blocks.

A CCL block is either a CCL statement or list of CCL statements. A CCL statement is either a set statement (either an integer or an assignment, which is a list of a register to receive the assignment, an assignment operator, and an expression) or a control statement (a list starting with a keyword, whose allowable syntax depends on the keyword).

63.7.1 CCL Syntax  CCL program syntax in BNF notation.
63.7.2 CCL Statements  Semantics of CCL statements.
63.7.3 CCL Expressions  Operators and expressions in CCL.
63.7.4 Calling CCL  Running CCL programs.
63.7.5 CCL Example  A trivial program to transform the Web's URL encoding.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

63.7.1 CCL Syntax

The full syntax of a CCL program in BNF notation:

 
CCL_PROGRAM :=
        (BUFFER_MAGNIFICATION
         CCL_MAIN_BLOCK
         [ CCL_EOF_BLOCK ])

BUFFER_MAGNIFICATION := integer
CCL_MAIN_BLOCK := CCL_BLOCK
CCL_EOF_BLOCK := CCL_BLOCK

CCL_BLOCK :=
        STATEMENT | (STATEMENT [STATEMENT ...])
STATEMENT :=
        SET | IF | BRANCH | LOOP | REPEAT | BREAK | READ | WRITE  | CALL
	| TRANSLATE | MAP | END

SET :=
        (REG = EXPRESSION)
        | (REG ASSIGNMENT_OPERATOR EXPRESSION)
        | INT-OR-CHAR

EXPRESSION := ARG | (EXPRESSION OPERATOR ARG)

IF := (if EXPRESSION CCL_BLOCK [CCL_BLOCK])
BRANCH := (branch EXPRESSION CCL_BLOCK [CCL_BLOCK ...])
LOOP := (loop STATEMENT [STATEMENT ...])
BREAK := (break)
REPEAT :=
        (repeat)
        | (write-repeat [REG | INT-OR-CHAR | string])
        | (write-read-repeat REG [INT-OR-CHAR | ARRAY])
READ :=
        (read REG ...)
        | (read-if (REG OPERATOR ARG) CCL_BLOCK [CCL_BLOCK])
        | (read-branch REG CCL_BLOCK [CCL_BLOCK ...])
WRITE :=
        (write REG ...)
        | (write EXPRESSION)
        | (write INT-OR-CHAR) | (write string) | (write REG ARRAY)
        | string
CALL := (call ccl-program-name)


TRANSLATE := ;; Not implemented under XEmacs, except mule-to-unicode and
	     ;; unicode-to-mule.
	     (translate-character REG(table) REG(charset) REG(codepoint)) 
	     | (translate-character SYMBOL REG(charset) REG(codepoint)) 
	     | (mule-to-unicode REG(charset) REG(codepoint))
	     | (unicode-to-mule REG(unicode,code) REG(CHARSET))

END := (end)

REG := r0 | r1 | r2 | r3 | r4 | r5 | r6 | r7
ARG := REG | INT-OR-CHAR
OPERATOR :=
        + | - | * | / | % | & | '|' | ^ | << | >> | <8 | >8 | //
        | < | > | == | <= | >= | != | de-sjis | en-sjis
ASSIGNMENT_OPERATOR :=
        += | -= | *= | /= | %= | &= | '|=' | ^= | <<= | >>=
ARRAY := '[' INT-OR-CHAR ... ']'
INT-OR-CHAR := integer | character


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

63.7.2 CCL Statements

The Emacs Code Conversion Language provides the following statement types: set, if, branch, loop, repeat, break, read, write, call, translate and end.

Set statement:

The set statement has three variants, with the syntaxes `(reg = expression)', `(reg assignment_operator expression)', and `integer'. The assignment operator variant works the same way as the corresponding C expression statement. The assignment operators are +=, -=, *=, /=, %=, &=, |=, ^=, <<=, and >>=, and they have the same meanings as in C. A "naked integer" is equivalent to a set statement of the form (r0 = integer).

I/O statements:

The read statement takes one or more registers as arguments. It reads one byte (a C char) from the input into each register in turn.

The write statement takes several forms. In the form `(write reg ...)' it takes one or more registers as arguments and writes each in turn to the output. The integer in a register (interpreted as an Ichar) is encoded to multibyte form (ie, Ibytes) and written to the current output buffer. If it is less than 256, it is written as is. The forms `(write expression)' and `(write integer)' are treated analogously. The form `(write string)' writes the constant string to the output. A "naked string" `string' is equivalent to the statement `(write string)'. The form `(write reg array)' writes the regth element of the array to the output.

Conditional statements:

The if statement takes an expression, a CCL block, and an optional second CCL block as arguments. If the expression evaluates to non-zero, the first CCL block is executed. Otherwise, if there is a second CCL block, it is executed.

The read-if variant of the if statement takes an expression, a CCL block, and an optional second CCL block as arguments. The expression must have the form (reg operator operand) (where operand is a register or an integer). The read-if statement first reads from the input into the first register operand in the expression, then conditionally executes a CCL block just as the if statement does.

The branch statement takes an expression and one or more CCL blocks as arguments. The CCL blocks are treated as a zero-indexed array, and the branch statement uses the expression as the index of the CCL block to execute. Null CCL blocks may be used as no-ops, continuing execution with the statement following the branch statement in the containing CCL block. Out-of-range values for the expression are also treated as no-ops.

The read-branch variant of the branch statement takes a register, a CCL block, and an optional second CCL block as arguments. The read-branch statement first reads from the input into the register, then conditionally executes a CCL block just as the branch statement does.

Loop control statements:

The loop statement creates a block with an implied jump from the end of the block back to its head. The loop is exited on a break statement, and continued without executing the tail by a repeat statement.

The break statement, written `(break)', terminates the current loop and continues with the next statement in the current block.

The repeat statement has three variants, repeat, write-repeat, and write-read-repeat. Each continues the current loop from its head, possibly after performing I/O. repeat takes no arguments and does no I/O before jumping. write-repeat takes a single argument (a register, an integer, or a string), writes it to the output, then jumps. write-read-repeat takes one or two arguments. The first must be a register. The second may be an integer or an array; if absent, it is implicitly set to the first (register) argument. write-read-repeat writes its second argument to the output, then reads from the input into the register, and finally jumps. See the write and read statements for the semantics of the I/O operations for each type of argument.

Other statements:

The call statement, written `(call ccl-program-name)', executes a CCL program as a subroutine. It does not return a value to the caller, but can modify the register status.

The mule-to-unicode statement translates an XEmacs character into a UCS code point, using U+FFFD REPLACEMENT CHARACTER if the given XEmacs character has no known corresponding code point. It takes two arguments. The first is a register that holds the character set ID of the character to be translated, and into which the UCS code is stored. The second is a register which stores the XEmacs code of the character in question; if it is from a multidimensional character set, like most of the East Asian national sets, it's stored as `((c1 << 8) | c2)', where `c1' is the first code, and `c2' the second. (That is, as a single integer whose high-order eight bits encode the first position code and whose low-order eight bits encode the second.)

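The packed form described above can be sketched in Python (an illustration of the bit layout only; the function names are hypothetical and this is not XEmacs code):

```python
# Pack two position codes (each 0-255) of a multidimensional charset into
# one integer: c1 in the high-order eight bits, c2 in the low-order eight.
def pack_position_codes(c1, c2):
    return (c1 << 8) | c2

# Recover the two position codes from the packed integer.
def unpack_position_codes(code):
    return (code >> 8) & 0xFF, code & 0xFF
```

For instance, position codes #x30 and #x21 pack to the single integer #x3021, and unpacking #x3021 yields the pair back.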
The unicode-to-mule statement translates a Unicode code point (an integer) into an XEmacs character. Its first argument is a register containing the UCS code point; the code for the corresponding character will be written into this register, in the same format as for `mule-to-unicode'. The second argument is a register into which the character set ID of the converted character will be written.

The end statement, written `(end)', terminates the CCL program successfully, and returns to caller (which may be a CCL program). It does not alter the status of the registers.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

63.7.3 CCL Expressions

CCL, unlike Lisp, uses infix expressions. The simplest CCL expressions consist of a single operand, either a register (one of r0, ..., r7) or an integer. Complex expressions are lists of the form ( expression operator operand ). Unlike C, assignments are not expressions.

In the following table, X is the target register for a set. In subexpressions, this is implicitly r7. This means that >8, //, de-sjis, and en-sjis cannot be used freely in subexpressions, since they return parts of their values in r7. Y may be an expression, register, or integer, while Z must be a register or an integer.

Name Operator Code C-like Description
CCL_PLUS + 0x00 X = Y + Z
CCL_MINUS - 0x01 X = Y - Z
CCL_MUL * 0x02 X = Y * Z
CCL_DIV / 0x03 X = Y / Z
CCL_MOD % 0x04 X = Y % Z
CCL_AND & 0x05 X = Y & Z
CCL_OR | 0x06 X = Y | Z
CCL_XOR ^ 0x07 X = Y ^ Z
CCL_LSH << 0x08 X = Y << Z
CCL_RSH >> 0x09 X = Y >> Z
CCL_LSH8 <8 0x0A X = (Y << 8) | Z
CCL_RSH8 >8 0x0B X = Y >> 8, r[7] = Y & 0xFF
CCL_DIVMOD // 0x0C X = Y / Z, r[7] = Y % Z
CCL_LS < 0x10 X = (X < Y)
CCL_GT > 0x11 X = (X > Y)
CCL_EQ == 0x12 X = (X == Y)
CCL_LE <= 0x13 X = (X <= Y)
CCL_GE >= 0x14 X = (X >= Y)
CCL_NE != 0x15 X = (X != Y)
CCL_ENCODE_SJIS en-sjis 0x16 X = HIGHER_BYTE (SJIS (Y, Z))
r[7] = LOWER_BYTE (SJIS (Y, Z))
CCL_DECODE_SJIS de-sjis 0x17 X = HIGHER_BYTE (DE-SJIS (Y, Z))
r[7] = LOWER_BYTE (DE-SJIS (Y, Z))

The CCL operators are as in C, with the addition of CCL_LSH8, CCL_RSH8, CCL_DIVMOD, CCL_ENCODE_SJIS, and CCL_DECODE_SJIS. CCL_ENCODE_SJIS and CCL_DECODE_SJIS treat their two operands as the high and low bytes of a two-byte character code. (SJIS stands for Shift JIS, an encoding of Japanese characters used by Microsoft. CCL_ENCODE_SJIS is a complicated transformation of the Japanese standard JIS encoding to Shift JIS. CCL_DECODE_SJIS is its inverse.) It is somewhat odd to represent the SJIS operations in infix form.

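The non-C operators in the table can be modeled as small Python functions; where the table writes a second result into r7, the sketch returns a pair. (This models only the semantics above, not the CCL interpreter itself.)

```python
def ccl_lsh8(y, z):
    # <8 : X = (Y << 8) | Z
    return (y << 8) | z

def ccl_rsh8(y):
    # >8 : X = Y >> 8, with the implicit side result r7 = Y & 0xFF
    return y >> 8, y & 0xFF

def ccl_divmod(y, z):
    # // : X = Y / Z, with the implicit side result r7 = Y % Z
    return y // z, y % z
```

This also illustrates why >8 and // cannot appear freely in subexpressions: each produces a second value that the interpreter must park in r7.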

[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

63.7.4 Calling CCL

CCL programs are called automatically during Emacs buffer I/O when the external representation has a coding system type of shift-jis, big5, or ccl. The program is specified by the coding system (see section 63.5 Coding Systems). You can also call CCL programs from other CCL programs, and from Lisp using these functions:

Function: ccl-execute ccl-program status
Execute ccl-program with registers initialized by status. ccl-program is a vector of compiled CCL code created by ccl-compile. It is an error for the program to try to execute a CCL I/O command. status must be a vector of nine values, specifying the initial value for the R0, R1 .. R7 registers and for the instruction counter IC. A nil value for a register initializer causes the register to be set to 0. A nil value for the IC initializer causes execution to start at the beginning of the program. When the program is done, status is modified (by side-effect) to contain the ending values for the corresponding registers and IC.

Function: ccl-execute-on-string ccl-program status string &optional continue
Execute ccl-program with initial status on string. ccl-program is a vector of compiled CCL code created by ccl-compile. status must be a vector of nine values, specifying the initial value for the R0, R1 .. R7 registers and for the instruction counter IC. A nil value for a register initializer causes the register to be set to 0. A nil value for the IC initializer causes execution to start at the beginning of the program. An optional fourth argument continue, if non-nil, causes the IC to remain on the unsatisfied read operation if the program terminates due to exhaustion of the input buffer. Otherwise the IC is set to the end of the program. When the program is done, status is modified (by side-effect) to contain the ending values for the corresponding registers and IC. Returns the resulting string.

To call a CCL program from another CCL program, it must first be registered:

Function: register-ccl-program name ccl-program
Register name for CCL program ccl-program in ccl-program-table. ccl-program should be the compiled form of a CCL program, or nil. Return index number of the registered CCL program.

Information about the processor time used by the CCL interpreter can be obtained using these functions:

Function: ccl-elapsed-time
Returns the elapsed processor time of the CCL interpreter as a cons of user and system time, as floating point numbers measured in seconds. If only one overall value can be determined, the return value will be a cons of that value and 0.

Function: ccl-reset-elapsed-time
Resets the CCL interpreter's internal elapsed time registers.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

63.7.5 CCL Example

In this section, we describe the implementation of a trivial coding system to transform from the Web's URL encoding to XEmacs' internal coding. Many people will have been first exposed to URL encoding when they saw "%20" where they expected a space in a file's name on their local hard disk; this can happen when a browser saves a file from the web and doesn't encode the name, as passed from the server, properly.

URL encoding itself is underspecified with regard to encodings beyond ASCII. The relevant document, RFC 1738, explicitly doesn't give any information on how to encode non-ASCII characters, and the "obvious" way--use the %xx values for the octets of the eight bit MIME character set in which the page was served--breaks when a user types a character outside that character set. Best practice for web development is to serve all pages as UTF-8 and treat incoming form data as using that coding system. (Oh, and gamble that your clients won't ever want to type anything outside Unicode. But that's not so much of a gamble with today's client operating systems.) We don't treat non-ASCII in this example, as dealing with `(read-multibyte-character ...)' and errors therewith would make it much harder to understand.

Since CCL isn't a very rich language, we move much of the logic that would ordinarily be computed from operations like `(member ...)', `(and ...)' and `(or ...)' into tables, from which register values are read and written, and on which if statements are predicated. Much more of the implementation of this coding system is occupied with constructing these tables--in normal Emacs Lisp--than it is with actual CCL code.

All the defvar statements we deal with in the next few sections are surrounded by an (eval-and-compile ...) form, which means that the logic that initializes these variables executes at compile time, and if XEmacs loads the compiled version of the file, these variables are initialized as constants.

63.7.5.1 Four bits to ASCII  Two tables used for getting hex digits from ASCII.
63.7.5.2 URI Encoding constants  Useful predefined characters.
63.7.5.3 Numeric to ASCII-hexadecimal conversion  Trivial in Lisp, not so in CCL.
63.7.5.4 Characters to be preserved  No transformation needed for these characters.
63.7.5.5 The program to decode to internal format  .
63.7.5.6 The program to encode from internal format  .
63.7.5.7 The actual coding system  .


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

63.7.5.1 Four bits to ASCII

The first defvar is for url-coding-high-order-nybble-as-ascii, a 256-entry table that maps from an octet's value to the ASCII encoding for the hex value of its most significant four bits. That might sound complex, but it isn't; for decimal 65, hex value `#x41', the entry in the table is the ASCII encoding of `4'. For decimal 122, ASCII `z', hex value #x7a, (elt url-coding-high-order-nybble-as-ascii #x7a) after this file is loaded gives the ASCII encoding of `7'.

 
(defvar url-coding-high-order-nybble-as-ascii
  (let ((val (make-vector 256 0))
	(i 0))
    (while (< i (length val))
      (aset val i (char-to-int (aref (format "%02X" i) 0)))
      (setq i (1+ i)))
    val)
  "Table to find an ASCII version of an octet's most significant 4 bits.")

The next table, url-coding-low-order-nybble-as-ascii, is almost the same thing, but this time it maps to the hex encoding of the low-order four bits. So the entry at offset `#x41' is the ASCII encoding of `1', and the entry at offset `#x7a' is the ASCII encoding of `A'.

 
(defvar url-coding-low-order-nybble-as-ascii 
  (let ((val (make-vector 256 0))
	(i 0))
    (while (< i (length val))
      (aset val i (char-to-int (aref (format "%02X" i) 1)))
      (setq i (1+ i)))
    val)
  "Table to find an ASCII version of an octet's least significant 4 bits.")

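A quick Python rendering of the same two tables may help check the worked examples above (a hypothetical equivalent for illustration, not the XEmacs code):

```python
# ASCII codes of the two hex digits of each octet's value, 0-255.
high_nybble_as_ascii = [ord(format(i, "02X")[0]) for i in range(256)]
low_nybble_as_ascii = [ord(format(i, "02X")[1]) for i in range(256)]
```

So for octet #x41 the high-nybble table yields `4' and the low-nybble table `1'; for #x7a they yield `7' and `A'.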

[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

63.7.5.2 URI Encoding constants

Next, we have a couple of variables that make the CCL code more readable. The first is the ASCII encoding of the percentage sign; this character is used as an escape code, to start the encoding of a non-printable character. For historical reasons, URL encoding allows the space character to be encoded as a plus sign--it does make typing URLs like `http://google.com/search?q=XEmacs+home+page' easier--and as such, we have to check when decoding for this value, and map it to the space character. When doing this in CCL, we use the url-coding-escaped-space-code variable.

 
(defvar url-coding-escape-character-code (char-to-int ?%)
  "The code point for the percentage sign, in ASCII.")

(defvar url-coding-escaped-space-code (char-to-int ?+)
  "The URL-encoded value of the space character, that is, +.")


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

63.7.5.3 Numeric to ASCII-hexadecimal conversion

Now, we have a couple of utility tables that wouldn't be necessary in a more expressive programming language than CCL. The first is sixteen entries long, and maps a hexadecimal number to the ASCII encoding of that number; so zero maps to ASCII `0', ten maps to ASCII `A'. The second does the reverse; that is, it maps an ASCII character to its value when interpreted as a hexadecimal digit. ('A' => 10, 'c' => 12, '2' => 2, as a few examples.)

 
(defvar url-coding-hex-digit-table 
  (let ((i 0)
	(val (make-vector 16 0)))
    (while (< i 16)
      (aset val i (char-to-int (aref (format "%X" i) 0)))
      (setq i (1+ i)))
    val)
  "A map from a hexadecimal digit's numeric value to its encoding in ASCII.")

(defvar url-coding-latin-1-as-hex-table
  (let ((val (make-vector 256 0))
	(i 0))
    (while (< i (length val))
      ;; Get a hex val for this ASCII character.
      (aset val i (string-to-int (format "%c" i) 16))
      (setq i (1+ i)))
    val)
  "A map from Latin 1 code points to their values as hexadecimal digits.")

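In Python terms the two maps look like this; note that, like the `string-to-int' call above, characters that aren't hex digits fall back to zero (a sketch for illustration, not the XEmacs code):

```python
# Numeric value 0-15 to the ASCII code of its hexadecimal digit.
hex_digit_table = [ord(format(i, "X")) for i in range(16)]

# Code point to its value as a hexadecimal digit (0 if not a hex digit).
latin_1_as_hex = []
for i in range(256):
    ch = chr(i)
    latin_1_as_hex.append(int(ch, 16) if ch in "0123456789abcdefABCDEF" else 0)
```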

[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

63.7.5.4 Characters to be preserved

And finally, the last of these tables. URL encoding says that alphanumeric characters, the underscore, the hyphen, and the full stop retain their ASCII encoding, and don't undergo transformation. url-coding-should-preserve-table is an array in which the entries are one if the corresponding ASCII character should be left as-is, and zero if they should be transformed. So the entries for all the control and most of the punctuation characters are zero. Lisp programmers will observe that this initialization is particularly inefficient, but they'll also be aware that this is a long way from an inner loop where every nanosecond counts.

 
(defvar url-coding-should-preserve-table 
  (let ((preserve 
	 (list ?- ?_ ?. ?a ?b ?c ?d ?e ?f ?g ?h ?i ?j ?k ?l ?m ?n ?o 
	       ?p ?q ?r ?s ?t ?u ?v ?w ?x ?y ?z ?A ?B ?C ?D ?E ?F ?G
	       ?H ?I ?J ?K ?L ?M ?N ?O ?P ?Q ?R ?S ?T ?U ?V ?W ?X ?Y
	       ?Z ?0 ?1 ?2 ?3 ?4 ?5 ?6 ?7 ?8 ?9))
	(i 0)
	(res (make-vector 256 0)))
    (while (< i 256)
      (when (member (int-char i) preserve)
	(aset res i 1))
      (setq i (1+ i)))
    res)
  "A 256-entry array of flags, indicating whether or not to preserve an
octet as its ASCII encoding.")


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

63.7.5.5 The program to decode to internal format

After the almost interminable tables, we get to the CCL. The first CCL program, ccl-decode-urlcoding decodes from the URL coding to our internal format; since this version of CCL doesn't have support for error checking on the input, we don't do any verification on it.

The buffer magnification--approximate ratio of the size of the output buffer to the size of the input buffer--is declared as one, because fractional values aren't allowed. (Since all those %20's will map to ` ', the length of the output text will be less than that of the input text.)

So, first we read an octet from the input buffer into register `r0', to set up the loop. Next, we start the loop, with a (loop ...) statement, and we check if the value in `r0' is a percentage sign. (Note the comma before url-coding-escape-character-code; since the CCL program is written inside a backquoted list, the comma causes the variable to be evaluated, and so ",url-coding-escape-character-code" will appear to CCL as the literal `37'.)

If it is a percentage sign, we read the next two octets into `r2' and `r3', and convert them into their hexadecimal numeric values, using the url-coding-latin-1-as-hex-table array declared above. (But again, it'll be interpreted as a literal array.) We then left shift the first by four bits, mask the two together, and write the result to the output buffer.

If it isn't a percentage sign, and it is a `+' sign, we write a space--hexadecimal 20--to the output buffer.

If none of those things are true, we pass the octet to the output buffer untransformed. (This could be a place to put error checking, in a more expressive language.) We then read one more octet from the input buffer, and move to the next iteration of the loop.

 
(define-ccl-program ccl-decode-urlcoding
  `(1	
    ((read r0)
     (loop
       (if (r0 == ,url-coding-escape-character-code)
	   ((read r2 r3)
	    ;; Convert the ASCII hex digits in r2 and r3 to their numeric
	    ;; values, using the url-coding-latin-1-as-hex-table array.
	    (r2 = r2 ,url-coding-latin-1-as-hex-table)
	    (r3 = r3 ,url-coding-latin-1-as-hex-table)
	    (r2 <<= 4)
	    (r3 |= r2)
	    (write r3))
	 (if (r0 == ,url-coding-escaped-space-code)
	     (write #x20)
	   (write r0)))
       (read r0)
       (repeat))))
  "CCL program to take URI-encoded ASCII text and transform it to our
internal encoding. ")

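The control flow of ccl-decode-urlcoding may be easier to follow rendered as ordinary Python; this hypothetical sketch mirrors the loop above and, just like the CCL version, does no error checking on its input:

```python
def url_decode(octets: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(octets):
        c = octets[i]
        if c == ord("%"):
            # Escape: the next two octets are hex digits.
            hi = int(chr(octets[i + 1]), 16)
            lo = int(chr(octets[i + 2]), 16)
            out.append((hi << 4) | lo)
            i += 3
        elif c == ord("+"):
            # '+' encodes a space.
            out.append(0x20)
            i += 1
        else:
            # Anything else passes through untransformed.
            out.append(c)
            i += 1
    return bytes(out)
```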

[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

63.7.5.6 The program to encode from internal format

Next, we see the CCL program to encode ASCII text as URL coded text. Here, the buffer magnification is specified as three, to account for ` ' mapping to %20, etc. As before, we read an octet from the input into `r0', and move into the body of the loop. Next, we check if we should preserve the value of this octet, by reading from offset `r0' in the url-coding-should-preserve-table into `r1'. Then we have an `if' statement predicated on the value in `r1'; for the true branch, we write the input octet directly. For the false branch, we write a percentage sign, the ASCII encoding of the high four bits in hex, and then the ASCII encoding of the low four bits in hex.

We then read an octet from the input into `r0', and repeat the loop.

 
(define-ccl-program ccl-encode-urlcoding
  `(3
    ((read r0)
     (loop
       (r1 = r0 ,url-coding-should-preserve-table)
       ;; If we should preserve the value, just write the octet directly.
       (if r1
	   (write r0)
	 ;; else, write a percentage sign, and the hex value of the octet, in
	 ;; an ASCII-friendly format.
	 ((write ,url-coding-escape-character-code)
	  (write r0 ,url-coding-high-order-nybble-as-ascii)
	  (write r0 ,url-coding-low-order-nybble-as-ascii)))
       (read r0)
       (repeat))))
  "CCL program to encode octets (almost) according to RFC 1738")

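And the encoding direction, again as a hypothetical Python sketch of the logic above; the preserve set corresponds to url-coding-should-preserve-table:

```python
import string

# Octets to leave untouched: alphanumerics plus '-', '_' and '.'.
PRESERVE = set((string.ascii_letters + string.digits + "-_.").encode())

def url_encode(octets: bytes) -> bytes:
    out = bytearray()
    for c in octets:
        if c in PRESERVE:
            out.append(c)
        else:
            # A percentage sign, then the two hex digits of the octet.
            out += b"%%%02X" % c
    return bytes(out)
```

The worst case (no octet preserved) triples the length of the input, which is why the CCL program declares a buffer magnification of three.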

[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

63.7.5.7 The actual coding system

To actually create the coding system, we call `make-coding-system'. The first argument is the symbol that is to be the name of the coding system, in our case `url-coding'. The second specifies that the coding system is to be of type `ccl'---there are several other coding system types available; see the documentation for `make-coding-system' for the full list. Then there's a documentation string describing the wherefore and caveats of the coding system, and the final argument is a property list giving information about the CCL programs and the coding system's mnemonic.

 
(make-coding-system 
 'url-coding 'ccl 
 "The coding used by application/x-www-form-urlencoded HTTP applications.
This coding form doesn't specify anything about non-ASCII characters, so
make sure you've transformed to a seven-bit coding system first."
 '(decode ccl-decode-urlcoding
   encode ccl-encode-urlcoding
   mnemonic "URLenc"))

If you're lucky, the `url-coding' coding system described here should be available in the XEmacs package system. Otherwise, downloading it from `http://www.parhasard.net/url-coding.el' should work for the foreseeable future.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

63.8 Category Tables

A category table is a type of char table used for keeping track of categories. Categories are used for classifying characters for use in regexps--you can refer to a category rather than having to use a complicated [] expression (and category lookups are significantly faster).

There are 95 different categories available, one for each printable character (including space) in the ASCII charset. Each category is designated by one such character, called a category designator. They are specified in a regexp using the syntax `\cX', where X is a category designator. (This is not yet implemented.)

A category table specifies, for each character, the categories that the character is in. Note that a character can be in more than one category. More specifically, a category table maps from a character to either the value nil (meaning the character is in no categories) or a 95-element bit vector, specifying for each of the 95 categories whether the character is in that category.

Special Lisp functions are provided that abstract this, so you do not have to directly manipulate bit vectors.

Function: category-table-p object
This function returns t if object is a category table.

Function: category-table &optional buffer
This function returns the current category table. This is the table for the current buffer, or for buffer if it is non-nil.

Function: standard-category-table
This function returns the standard category table. This is the one used for new buffers.

Function: copy-category-table &optional category-table
This function returns a new category table which is a copy of category-table, which defaults to the standard category table.

Function: set-category-table category-table &optional buffer
This function selects category-table as the new category table for buffer. buffer defaults to the current buffer if omitted.

Function: category-designator-p object
This function returns t if object is a category designator (a character in the range ` ' (space) through `~').

Function: category-table-value-p object
This function returns t if object is a category table value. Valid values are nil or a bit vector of size 95.
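The functions above can be combined as follows. This is a sketch assuming an XEmacs/MULE build in which these functions are defined:

 
;; Give the current buffer a private category table -- a copy of the
;; standard one -- so that modifications don't affect new buffers.
(set-category-table (copy-category-table (standard-category-table)))

(category-table-p (category-table))  ; t -- the buffer now has a table
(category-designator-p ?a)           ; t -- printable ASCII character
(category-designator-p ?\n)          ; nil -- outside the range ` '..`~'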


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

63.9 Unicode Support

Unicode support was added by Ben Wing to XEmacs 21.5.6.

Function: set-language-unicode-precedence-list list
Set the language-specific precedence list used for Unicode decoding. This is a list of charsets, which are consulted in order for a translation matching a given Unicode character. If no matches are found, the charsets in the default precedence list (see set-default-unicode-precedence-list) are consulted, and then all remaining charsets, in some arbitrary order.

The language-specific precedence list is meant to be set as part of the language environment initialization; the default precedence list is meant to be set by the user.

Function: language-unicode-precedence-list
Return the language-specific precedence list used for Unicode decoding. See set-language-unicode-precedence-list for more information.

Function: set-default-unicode-precedence-list list
Set the default precedence list used for Unicode decoding. This is meant to be set by the user. See `set-language-unicode-precedence-list' for more information.

Function: default-unicode-precedence-list
Return the default precedence list used for Unicode decoding. See set-language-unicode-precedence-list for more information.

Function: set-unicode-conversion character code
Add conversion information between Unicode codepoints and characters. character is one of the following:

-- A character (in which case code must be a non-negative integer)
-- A vector of characters (in which case code must be a vector of non-negative integers of the same length)

Values of code above 2^20 - 1 are allowed for the purpose of specifying private characters, but will cause errors when converted to UTF-16 or UTF-32. UCS-4 and UTF-8 can handle values to 2^31 - 1, but XEmacs Lisp integers top out at 2^30 - 1.

Function: character-to-unicode character
Convert character to Unicode codepoint. When there is no international support (i.e. MULE is not defined), this function simply does char-to-int.

Function: unicode-to-character code [charsets]
Convert Unicode codepoint code to character. code should be a non-negative integer. If charsets is given, it should be a list of charsets, and only those charsets will be consulted, in the given order, for a translation. Otherwise, the default ordering of all charsets will be used (see set-unicode-charset-precedence).

When there is no international support (i.e. MULE is not defined), this function simply does int-to-char and ignores the charsets argument.
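A minimal sketch of these conversion functions. ASCII characters coincide with their Unicode codepoints, so the round trip is easy to check:

 
(character-to-unicode ?A)   ; => 65 (U+0041)
(unicode-to-character 65)   ; => ?A

;; Registering a conversion for a private-use codepoint; the character
;; and codepoint here are purely illustrative.
;; (set-unicode-conversion some-character #xE000)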

Function: parse-unicode-translation-table filename charset start end offset flags
Parse Unicode translation data in filename for MULE charset. Data is text, in the form of one translation per line -- charset codepoint followed by Unicode codepoint. Numbers are decimal or hex (preceded by `0x'). Comments are marked with a #. Charset codepoints for two-dimensional charsets should have the first octet stored in the high 8 bits of the hex number and the second in the low 8 bits.

If start and end are given, only charset codepoints within the given range will be processed. If offset is given, that value will be added to all charset codepoints in the file to obtain the internal charset codepoint. start and end apply to the codepoints in the file, before offset is applied.

(Note that, as usual, we assume that octets are in the range 32 to 127 or 33 to 126. If you have a table in kuten form, with octets in the range 1 to 94, you will have to use an offset of 8224, i.e. 0x2020.)

flags, if specified, control further how the tables are interpreted and are used to special-case certain known table weirdnesses in the Unicode tables:

`ignore-first-column'
Exactly as it sounds. The JIS X 0208 tables have 3 columns of data instead of 2; the first is the Shift-JIS codepoint.

`big5'
The charset codepoint is a Big Five codepoint; convert it to the proper hacked-up codepoint in `chinese-big5-1' or `chinese-big5-2'.
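For example, a hypothetical call loading a JIS X 0208 mapping table (the filename and file contents are illustrative) might look like this; the extra Shift-JIS column is skipped with the `ignore-first-column' flag:

 
;; Each line of the file looks like (comments after #):
;;   0x8140  0x2121  0x3000  # IDEOGRAPHIC SPACE
;; where the first column is the Shift-JIS codepoint, hence the flag.
;; start, end, and offset are nil: process every codepoint, no offset.
(parse-unicode-translation-table "JIS0208.TXT" 'japanese-jisx0208
                                 nil nil nil '(ignore-first-column))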


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

63.10 Character Set Unification

Mule suffers from a design defect that causes it to consider the ISO Latin character sets to be disjoint. This results in oddities such as files containing both ISO 8859/1 and ISO 8859/15 codes, and using ISO 2022 control sequences to switch between them, as well as more plausible but often unnecessary combinations like ISO 8859/1 with ISO 8859/2. This can be very annoying when sending messages or even in simple editing on a single host. Unification works around the problem by converting as many characters as possible to use a single Latin coded character set before saving the buffer.

This node and its children were ripp'd untimely from `latin-unity.texi', and have been quickly converted for use here. However, as the APIs are likely to diverge, beware of inaccuracies. Please report any you discover with M-x report-xemacs-bug RET, as well as any ambiguities or downright unintelligible passages.

A lot of the stuff here doesn't belong here; it belongs in the section `Top' in XEmacs User's Manual. Report those as bugs, too, preferably with patches.

63.10.1 An Overview of Unification  Unification history and general information.
63.10.2 Operation of Unification  An overview of the operation of Unification.
63.12.1 Configuring Unification for Use  Configuring Unification for use.
63.12.2 Theory of Operation  How Unification works.
63.12.3 What Unification Cannot Do for You  Inherent problems of 8-bit charsets.
63.12.5 Charsets and Coding Systems  Reference lists with annotations.
63.12.4 Internals  Utilities and implementation details.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

63.10.1 An Overview of Unification

Mule suffers from a design defect that causes it to consider the ISO Latin character sets to be disjoint. This manifests itself when a user enters characters using input methods associated with different coded character sets into a single buffer.

A very important example involves email. Many sites, especially in the U.S., default to use of the ISO 8859/1 coded character set (also called "Latin 1," though these are somewhat different concepts). However, ISO 8859/1 provides a generic CURRENCY SIGN character. Now that the Euro has become the official currency of most countries in Europe, this is unsatisfactory (and in practice, useless). So Europeans generally use ISO 8859/15, which is nearly identical to ISO 8859/1 for most languages, except that it substitutes EURO SIGN for CURRENCY SIGN.

Suppose a European user yanks text from a post encoded in ISO 8859/1 into a message composition buffer, and enters some text including the Euro sign. Then Mule will consider the buffer to contain both ISO 8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively programmed) send the message as a multipart mixed MIME body!

This is clearly stupid. What is not as obvious is that, just as any European can include American English in their text because ASCII is a subset of ISO 8859/15, most European languages which use Latin characters (eg, German and Polish) can typically be mixed while using only one Latin coded character set (in this case, ISO 8859/2). However, this often depends on exactly what text is to be encoded.

Unification works around the problem by converting as many characters as possible to use a single Latin coded character set before saving the buffer.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

63.10.2 Operation of Unification

Normally, Unification works in the background by installing unity-sanity-check on write-region-pre-hook. This is done by default for the ISO 8859 Latin family of character sets. The user activates this functionality for other character set families by invoking enable-unification, either interactively or in her init file. See section 57.1.2 The Init File: `.emacs'. Unification can be deactivated by invoking disable-unification.

Unification also provides a few functions for remapping or recoding the buffer by hand. To remap a character means to change the buffer representation of the character by using another coded character set. Remapping never changes the identity of the character, but may involve altering the code point of the character. To recode a character means to simply change the coded character set. Recoding never alters the code point of the character, but may change the identity of the character. See section 63.12.2 Theory of Operation.

There are a few variables which determine which coding systems are always acceptable to Unification: unity-ucs-list, unity-preferred-coding-system-list, and unity-preapproved-coding-system-list. The latter two default to (), and should probably be avoided because they short-circuit the sanity check. If you find you need to use them, consider reporting it as a bug or request for enhancement. Because they seem unsafe, the recommended interface is likely to change.

63.11 Basic Functionality  User interface and customization.
63.12 Interactive Usage  Treating text by hand. Also documents the hook function(s).


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

63.11 Basic Functionality

These functions and user options initialize and configure Unification. In normal use, none of these should be needed.

These APIs are certain to change.

Function: enable-unification
Set up hooks and initialize variables for latin-unity.

There are no arguments.

This function is idempotent. It will reinitialize any hooks or variables that are not in initial state.

Function: disable-unification
Clean up hooks and void variables used by latin-unity.

There are no arguments.

User Option: unity-ucs-list
List of coding systems considered to be universal.

The default value is '(utf-8 iso-2022-7 ctext escape-quoted).

Order matters; coding systems earlier in the list will be preferred when recommending a coding system. These coding systems will not be used without querying the user (unless they are also present in unity-preapproved-coding-system-list), and follow the unity-preferred-coding-system-list in the list of suggested coding systems.

If none of the preferred coding systems are feasible, the first in this list will be the default.

Notes on certain coding systems: escape-quoted is a special coding system used for autosaves and compiled Lisp in Mule. You should never delete this, although it is rare that a user would want to use it directly. Unification does not try to be "smart" about other general ISO 2022 coding systems, such as ISO-2022-JP. (They are not recognized as equivalent to iso-2022-7.) If your preferred coding system is one of these, you may consider adding it to unity-ucs-list. However, this will typically have the side effect that (eg) ISO 8859/1 files will be saved in 7-bit form with ISO 2022 escape sequences.

Coding systems which are not Latin and not in unity-ucs-list are handled by short circuiting checks of coding system against the next two variables.
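Following the note above, a user whose preferred universal encoding is ISO-2022-JP might extend the list in the init file like this (a sketch; mind the side effect on Latin files mentioned above):

 
;; Add iso-2022-jp to the universal list, keeping the defaults.
;; Order matters: earlier entries are preferred in recommendations.
(setq unity-ucs-list '(utf-8 iso-2022-jp iso-2022-7 ctext escape-quoted))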

User Option: unity-preapproved-coding-system-list
List of coding systems used without querying the user if feasible.

The default value is `(buffer-default preferred)'.

The first feasible coding system in this list is used. The special values `preferred' and `buffer-default' may be present:

buffer-default
Use the coding system used by `write-region', if feasible.

preferred
Use the coding system specified by `prefer-coding-system' if feasible.

"Feasible" means that all characters in the buffer can be represented by the coding system. Coding systems in `unity-ucs-list' are always considered feasible. Other feasible coding systems are computed by `unity-representations-feasible-region'.

Note that the first universal coding system in this list shadows all other coding systems. In particular, if your preferred coding system is a universal coding system, and preferred is a member of this list, unification will blithely convert all your files to that coding system. This is considered a feature, but it may surprise most users. Users who don't like this behavior should put preferred in unity-preferred-coding-system-list.

User Option: unity-preferred-coding-system-list
List of coding systems suggested to the user if feasible.

The default value is `(iso-8859-1 iso-8859-15 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-9)'.

If none of the coding systems in unity-preapproved-coding-system-list are feasible, this list will be recommended to the user, followed by the unity-ucs-list. The first coding system in this list is the default. The special values `preferred' and `buffer-default' may be present:

buffer-default
Use the coding system used by `write-region', if feasible.

preferred
Use the coding system specified by `prefer-coding-system' if feasible.

"Feasible" means that all characters in the buffer can be represented by the coding system. Coding systems in `unity-ucs-list' are always considered feasible. Other feasible coding systems are computed by `unity-representations-feasible-region'.

Variable: unity-iso-8859-1-aliases
List of coding systems to be treated as aliases of ISO 8859/1.

The default value is '(iso-8859-1).

This is not a user variable; to customize input of coding systems or charsets, use `unity-coding-system-alias-alist' or `unity-charset-alias-alist'.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

63.12 Interactive Usage

First, the hook function unity-sanity-check is documented. (It is placed here because it is not an interactive function, and there is not yet a programmer's section of the manual.)

These functions provide access to internal functionality (such as the remapping function) and to extra functionality (the recoding functions and the test function).

Function: unity-sanity-check begin end filename append visit lockname &optional coding-system

Check if coding-system can represent all characters between begin and end.

For compatibility with old broken versions of write-region, coding-system defaults to buffer-file-coding-system. filename, append, visit, and lockname are ignored.

Return nil if buffer-file-coding-system is not (ISO-2022-compatible) Latin. If buffer-file-coding-system is safe for the charsets actually present in the buffer, return it. Otherwise, ask the user to choose a coding system, and return that.

This function does not do the safe thing when buffer-file-coding-system is nil (aka no-conversion). It considers that "non-Latin," and passes it on to the Mule detection mechanism.

This function is intended for use as a write-region-pre-hook. It does nothing except return coding-system if write-region handlers are inhibited.

Function: unity-buffer-representations-feasible

There are no arguments.

Apply unity-region-representations-feasible to the current buffer.

Function: unity-region-representations-feasible begin end &optional buf

Return character sets that can represent the text from begin to end in buf.

buf defaults to the current buffer. Called interactively, will be applied to the region. Function assumes begin <= end.

The return value is a cons. The car is the list of character sets that can individually represent all of the non-ASCII portion of the buffer, and the cdr is the list of character sets that can individually represent all of the ASCII portion.

The following is taken from a comment in the source. Please refer to the source to be sure of an accurate description.

The basic algorithm is to map over the region, compute the set of charsets that can represent each character (the "feasible charset"), and take the intersection of those sets.

The current implementation takes advantage of the fact that ASCII characters are common and cannot change asciisets. Then using skip-chars-forward makes motion over ASCII subregions very fast.

This same strategy could be applied generally by precomputing classes of characters equivalent according to their effect on latinsets, and adding a whole class to the skip-chars-forward string once a member is found.

Probably efficiency is a function of the number of characters matched, or maybe the length of the match string? With skip-category-forward over a precomputed category table it should be really fast. In practice for Latin character sets there are only 29 classes.

Function: unity-remap-region begin end character-set &optional coding-system

Remap characters between begin and end to equivalents in character-set. Optional argument coding-system may be a coding system name (a symbol) or nil. Characters with no equivalent are left as-is.

When called interactively, begin and end are set to the beginning and end, respectively, of the active region, and the function prompts for character-set. The function does completion, knows how to guess a character set name from a coding system name, and also provides some common aliases. See unity-guess-charset. There is no way to specify coding-system, as it has no useful function interactively.

Return coding-system if coding-system can encode all characters in the region, t if coding-system is nil and the coding system with G0 = 'ascii and G1 = character-set can encode all characters, and otherwise nil. Note that a non-null return does not mean it is safe to write the file, only the specified region. (This behavior is useful for multipart MIME encoding and the like.)

Note: by default this function is quite fascist about universal coding systems. It only admits `utf-8', `iso-2022-7', and `ctext'. Customize unity-approved-ucs-list to change this.

This function remaps characters that are artificially distinguished by Mule internal code. It may change the code point as well as the character set. To recode characters that were decoded in the wrong coding system, use unity-recode-region.

Function: unity-recode-region begin end wrong-cs right-cs

Recode characters between begin and end from wrong-cs to right-cs.

wrong-cs and right-cs are character sets. Characters retain the same code point but the character set is changed. Only characters from wrong-cs are changed to right-cs. The identity of the character may change. Note that this could be dangerous, if characters whose identities you do not want changed are included in the region. This function cannot guess which characters you want changed, and which should be left alone.

When called interactively, begin and end are set to the beginning and end, respectively, of the active region, and the function prompts for wrong-cs and right-cs. The function does completion, knows how to guess a character set name from a coding system name, and also provides some common aliases. See unity-guess-charset.

Another way to accomplish this, but using coding systems rather than character sets to specify the desired recoding, is `unity-recode-coding-region'. That function may be faster but is somewhat more dangerous, because it may recode more than one character set.

To change from one Mule representation to another without changing identity of any characters, use `unity-remap-region'.

Function: unity-recode-coding-region begin end wrong-cs right-cs

Recode text between begin and end from wrong-cs to right-cs.

wrong-cs and right-cs are coding systems. Characters retain the same code point but the character set is changed. The identity of characters may change. This is an inherently dangerous function; multilingual text may be recoded in unexpected ways. #### It's also dangerous because the coding systems are not sanity-checked in the current implementation.

When called interactively, begin and end are set to the beginning and end, respectively, of the active region, and the function prompts for wrong-cs and right-cs. The function does completion, knows how to guess a coding system name from a character set name, and also provides some common aliases. See unity-guess-coding-system.

Another, safer, way to accomplish this, using character sets rather than coding systems to specify the desired recoding, is to use unity-recode-region.

To change from one Mule representation to another without changing identity of any characters, use unity-remap-region.

Helper functions for input of coding system and character set names.

Function: unity-guess-charset candidate
Guess a charset based on the symbol candidate.

candidate itself is not tried as the value.

Uses the natural mapping in `unity-cset-codesys-alist', and the values in `unity-charset-alias-alist'.

Function: unity-guess-coding-system candidate
Guess a coding system based on the symbol candidate.

candidate itself is not tried as the value.

Uses the natural mapping in `unity-cset-codesys-alist', and the values in `unity-coding-system-alias-alist'.

Function: unity-example

A cheesy example for Unification.

At present it just makes a multilingual buffer. To test, setq buffer-file-coding-system to some value, make the buffer dirty (eg with RET BackSpace), and save.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

63.12.1 Configuring Unification for Use

If you want Unification to be automatically initialized, invoke `enable-unification' with no arguments in your init file. See section 57.1.2 The Init File: `.emacs'. If you are using GNU Emacs or an XEmacs earlier than 21.1, you should also load `auto-autoloads' using the full path (never `require' `auto-autoloads' libraries).
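The corresponding init-file fragment is a one-liner (the autoloads path below is illustrative, not a real location on your system):

 
;; In the init file:
(enable-unification)

;; On GNU Emacs, or XEmacs earlier than 21.1, first load the
;; autoloads by full path:
;; (load "/path/to/latin-unity/auto-autoloads")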

You may wish to define aliases for commonly used character sets and coding systems for convenience in input.

User Option: unity-charset-alias-alist
Alist mapping aliases to Mule charset names (symbols).

The default value is

 
   ((latin-1 . latin-iso8859-1)
    (latin-2 . latin-iso8859-2)
    (latin-3 . latin-iso8859-3)
    (latin-4 . latin-iso8859-4)
    (latin-5 . latin-iso8859-9)
    (latin-9 . latin-iso8859-15)
    (latin-10 . latin-iso8859-16))

If a charset does not exist on your system, it will not complete and you will not be able to enter it in response to prompts. A real charset with the same name as an alias in this list will shadow the alias.
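Additional aliases can be added in the init file; for instance (a hypothetical shorthand):

 
;; Let `l2' complete to the Latin-2 charset at unification prompts.
(add-to-list 'unity-charset-alias-alist '(l2 . latin-iso8859-2))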

User Option: unity-coding-system-alias-alist
Alist mapping aliases to Mule coding system names (symbols).

The default value is `nil'.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

63.12.2 Theory of Operation

Standard encodings suffer from the design defect that they do not provide a reliable way to recognize which coded character sets are in use. See section 63.12.3 What Unification Cannot Do for You. There are scores of character sets which can be represented by a single octet (8-bit byte), whose union contains many hundreds of characters. Obviously this results in great confusion, since you can't tell the players without a scorecard, and there is no scorecard.

There are two ways to solve this problem. The first is to create a universal coded character set. This is the concept behind Unicode. However, there have been satisfactory (nearly) universal character sets for several decades, but even today many Westerners resist using Unicode because they consider its space requirements excessive. On the other hand, Asians dislike Unicode because they consider it to be incomplete. (This is partly, but not entirely, political.)

In any case, Unicode only solves the internal representation problem. Many data sets will contain files in "legacy" encodings, and Unicode does not help distinguish among them.

The second approach is to embed information about the encodings used in a document in its text. This approach is taken by the ISO 2022 standard. This would solve the problem completely from the user's point of view, except that ISO 2022 is basically not implemented at all, in the sense that few applications or systems implement more than a small subset of ISO 2022 functionality. This is because mono-literate users object to the presence of escape sequences in their texts (which they, with some justification, consider data corruption). Programmers are more than willing to cater to these users, since implementing ISO 2022 is a painstaking task.

In fact, Emacs/Mule adopts both of these approaches. Internally it uses a universal character set, Mule code. Externally it uses ISO 2022 techniques both to save files in forms robust to encoding issues, and as hints when attempting to "guess" an unknown encoding. However, Mule suffers from a design defect, namely it embeds the character set information that ISO 2022 attaches to runs of characters by introducing them with a control sequence in each character. That causes Mule to consider the ISO Latin character sets to be disjoint. This manifests itself when a user enters characters using input methods associated with different coded character sets into a single buffer.

There are two problems stemming from this design. First, Mule represents the same character in different ways. Abstractly, `ó' (LATIN SMALL LETTER O WITH ACUTE) can get represented as [latin-iso8859-1 #x73] or as [latin-iso8859-2 #x73]. So what looks like `óó' in the display might actually be represented as [latin-iso8859-1 #x73][latin-iso8859-2 #x73] in the buffer, and saved as [#xF3 ESC - B #xF3 ESC - A] in the file. In some cases this treatment would be appropriate (consider HYPHEN, MINUS SIGN, EN DASH, EM DASH, and U+4E00 (the CJK ideographic character meaning "one")), and although arguably incorrect it is convenient when mixing the CJK scripts. But in the case of the Latin scripts this is wrong.

Worse yet, it is very likely to occur when mixing "different" encodings (such as ISO 8859/1 and ISO 8859/15) that differ only in a few code points that are almost never used. A very important example involves email. Many sites, especially in the U.S., default to use of the ISO 8859/1 coded character set (also called "Latin 1," though these are somewhat different concepts). However, ISO 8859/1 provides a generic CURRENCY SIGN character. Now that the Euro has become the official currency of most countries in Europe, this is unsatisfactory (and in practice, useless). So Europeans generally use ISO 8859/15, which is nearly identical to ISO 8859/1 for most languages, except that it substitutes EURO SIGN for CURRENCY SIGN.

Suppose a European user yanks text from a post encoded in ISO 8859/1 into a message composition buffer, and enters some text including the Euro sign. Then Mule will consider the buffer to contain both ISO 8859/1 and ISO 8859/15 text, and MUAs such as Gnus will (if naively programmed) send the message as a multipart mixed MIME body!

This is clearly stupid. What is not as obvious is that, just as any European can include American English in their text because ASCII is a subset of ISO 8859/15, most European languages which use Latin characters (eg, German and Polish) can typically be mixed while using only one Latin coded character set (in the case of German and Polish, ISO 8859/2). However, this often depends on exactly what text is to be encoded (even for the same pair of languages).

Unification works around the problem by converting as many characters as possible to use a single Latin coded character set before saving the buffer.

Because the problem is rarely noticeable in editing a buffer, but tends to manifest when that buffer is exported to a file or process, the Unification package uses the strategy of examining the buffer prior to export. If use of multiple Latin coded character sets is detected, Unification attempts to unify them by finding a single coded character set which contains all of the Latin characters in the buffer.

The primary purpose of Unification is to fix the problem by giving the user the choice to change the representation of all characters to one character set, with sensible recommendations based on context. In the `ó' example, either ISO 8859/1 or ISO 8859/2 is satisfactory, and both will be suggested. In the EURO SIGN example, only ISO 8859/15 makes sense, and that is what will be recommended. In both cases, the user will be reminded that there are universal encodings available.

I call this remapping (from the universal character set to a particular ISO 8859 coded character set). It is mere accident that this letter has the same code point in both character sets. (Not entirely, but there are many examples of Latin characters that have different code points in different Latin-X sets.)

Note that, in the `ó' example, treating the buffer in this way will result in a representation such as [latin-iso8859-2 #x73][latin-iso8859-2 #x73], and the file will be saved as [#xF3 #xF3]. This is guaranteed to occasionally result in the second problem you observed, to which we now turn.

This problem is that, although the file is intended to be an ISO-8859/2-encoded file, in an ISO 8859/1 locale Mule (and every POSIX compliant program--this is required by the standard, obvious if you think a bit, see section 63.12.3 What Unification Cannot Do for You) will read that file as [latin-iso8859-1 #x73] [latin-iso8859-1 #x73]. Of course this is no problem if all of the characters in the file are contained in ISO 8859/1, but suppose there are some which are not, but are contained in the (intended) ISO 8859/2.

You now want to fix this, but not by finding the same character in another set. Instead, you want to simply change the character set that Mule associates with that buffer position without changing the code. (This is conceptually somewhat distinct from the first problem, and logically ought to be handled in the code that defines coding systems. However, unification is not an unreasonable place for it.) Unification provides two functions (one fast and dangerous, the other slow and careful) to handle this. I call this recoding, because the transformation actually involves encoding the buffer to file representation, then decoding it to buffer representation (in a different character set). This cannot be done automatically because Mule can have no idea what the correct encoding is--after all, it already gave you its best guess. See section 63.12.3 What Unification Cannot Do for You. So these functions must be invoked by the user. See section 63.12 Interactive Usage.



63.12.3 What Unification Cannot Do for You

Unification cannot save you if you insist on exporting data in 8-bit encodings in a multilingual environment. You will eventually corrupt data if you do this. It is not Mule's, or any application's, fault. You will have only yourself to blame; consider yourself warned. (It is true that Mule has bugs, which make Mule somewhat more dangerous and inconvenient than some naive applications. We're working to address those, but no application can remedy the inherent defect of 8-bit encodings.)

Use standard universal encodings, preferably Unicode (UTF-8) unless applicable standards indicate otherwise. The most important such case is Internet messages, where MIME should be used, whether or not the subordinate encoding is a universal encoding. (Note that since one of the important provisions of MIME is the `Content-Type' header, which has the charset parameter, MIME is to be considered a universal encoding for the purposes of this manual. Of course, technically speaking it's neither a coded character set nor a coding extension technique compliant with ISO 2022.)

As mentioned earlier, the problem is that the traditional national-standard encodings suffer from the design defect that they provide no reliable way to recognize which coded character sets are in use. There are scores of character sets which can be represented by a single octet (8-bit byte), whose union contains many hundreds of characters. Thus any 8-bit coded character set must contain characters that share code points used for different characters in other coded character sets.

This means that a given file's intended encoding cannot be identified with 100% reliability unless it contains encoding markers such as those provided by MIME or ISO 2022.
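The ambiguity is easy to verify in any codec library: a typical 8-bit file decodes without error under several ISO 8859 parts, so absent markers there is no mechanical signal to tell a detector which reading was intended. An illustrative Python check:

```python
# Correct Polish text, encoded as intended in ISO 8859/2.
polish = "Za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144".encode("iso8859-2")

# Each of these parts assigns a character to every octet used here, so
# each "wrong" decoding succeeds silently; no error flags the mistake.
readings = {c: polish.decode(c)
            for c in ("iso8859-1", "iso8859-2", "iso8859-4", "iso8859-9")}

assert len(set(readings.values())) > 1   # all succeed, yet they disagree
```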

Unification actually makes it more likely that you will have problems of this kind. Traditionally Mule has been "helpful" by simply using an ISO 2022 universal coding system when the current buffer coding system cannot handle all the characters in the buffer. This has the effect that, because the file contains control sequences, it is not recognized as being in the locale's normal 8-bit encoding. It may be annoying if you are not a Mule expert, but your data is automatically recoverable with a tool you already have: Mule.

However, with unification, Mule converts to a single 8-bit character set when possible. But typically this will not be in your usual locale. That is, the times an ISO 8859/1 user needs unification are exactly when there are ISO 8859/2 characters in the buffer. But then most likely the file will be saved in a pure 8-bit encoding that is not ISO 8859/1, i.e., ISO 8859/2. Mule's autorecognizer (which is probably the most sophisticated yet available) cannot tell the difference between ISO 8859/1 and ISO 8859/2, and in a Western European locale will choose the former even though the latter was intended. Even the extension ("statistical recognition") planned for XEmacs 22 is unlikely to be at all accurate in the case of mixed codes.

So now consider adding some additional ISO 8859/1 text to the buffer. If it includes any ISO 8859/1 codes that are used by different characters in ISO 8859/2, you now have a file that cannot be mechanically disentangled. You need a human being who can recognize that this passage is German or Swedish and stays in Latin-1, while that one is Polish and needs to be recoded to Latin-2.

Moral: switch to a universal coded character set, preferably Unicode using the UTF-8 transformation format. If you really need the space, compress your files.
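In a universal coded character set the ambiguity disappears: every character has exactly one code point, and the UTF-8 byte patterns are self-describing. A Python illustration (the sample text is hypothetical):

```python
text = "Voil\u00e0, Za\u017c\u00f3\u0142\u0107"   # French (Latin-1) plus Polish (Latin-2)

# UTF-8 represents the mixture and round-trips losslessly.
data = text.encode("utf-8")
assert data.decode("utf-8") == text

# Neither 8-bit part alone can even encode the mixture: "à" is absent
# from Latin-2, and "ż" is absent from Latin-1.
for codec in ("iso8859-1", "iso8859-2"):
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        pass
    else:
        raise AssertionError(codec + " unexpectedly sufficed")
```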



63.12.4 Internals

No internals documentation yet.

`unity-utils.el' provides one utility function.

Function: unity-dump-tables

Dump the temporary table created by loading `unity-utils.el' to `unity-tables.el'. Loading the latter file initializes `unity-equivalences'.



63.12.5 Charsets and Coding Systems

This section provides reference lists of Mule charsets and coding systems. Mule charsets are typically named by character set and standard.

ASCII variants

Identification of equivalent characters in these sets is not properly implemented. Unification does not distinguish the two charsets.

`ascii' `latin-jisx0201'

Extended Latin

Characters from the following ISO 2022 conformant charsets are identified with equivalents in other charsets in the group by Unification.

`latin-iso8859-1' `latin-iso8859-15' `latin-iso8859-2' `latin-iso8859-3' `latin-iso8859-4' `latin-iso8859-9' `latin-iso8859-13' `latin-iso8859-16'

The following charsets are Latin variants which are not understood by Unification. In addition, many of the Asian language standards provide at least ASCII, and sometimes other Latin characters. None of these are identified with their ISO 8859 equivalents.

`vietnamese-viscii-lower' `vietnamese-viscii-upper'

Other character sets

`arabic-1-column' `arabic-2-column' `arabic-digit' `arabic-iso8859-6' `chinese-big5-1' `chinese-big5-2' `chinese-cns11643-1' `chinese-cns11643-2' `chinese-cns11643-3' `chinese-cns11643-4' `chinese-cns11643-5' `chinese-cns11643-6' `chinese-cns11643-7' `chinese-gb2312' `chinese-isoir165' `cyrillic-iso8859-5' `ethiopic' `greek-iso8859-7' `hebrew-iso8859-8' `ipa' `japanese-jisx0208' `japanese-jisx0208-1978' `japanese-jisx0212' `katakana-jisx0201' `korean-ksc5601' `sisheng' `thai-tis620' `thai-xtis'

Non-graphic charsets

`control-1'

No conversion

Some of these coding systems may specify EOL conventions. Note that `iso-8859-1' is a no-conversion coding system, not an ISO 2022 coding system. Although unification attempts to compensate for this, it is possible that the `iso-8859-1' coding system will behave differently from other ISO 8859 coding systems.

`binary' `no-conversion' `raw-text' `iso-8859-1'

Latin coding systems

These coding systems are all single-byte, 8-bit ISO 2022 coding systems, combining ASCII in the GL register (bytes with high-bit clear) and an extended Latin character set in the GR register (bytes with high-bit set).
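Under this scheme the charset of any graphic byte can be read directly off its high bit. A hypothetical Python classifier (graphic codes only; the C0/C1 control ranges are ignored for brevity):

```python
def register(byte: int) -> str:
    """Classify a graphic octet of an 8-bit ISO 2022 coding system:
    high bit clear -> GL (ASCII), high bit set -> GR (extended Latin)."""
    return "GR" if byte & 0x80 else "GL"

# "Dziób" in iso-8859-2: the ASCII letters sit in GL, ó (0xF3) in GR.
split = [(hex(b), register(b)) for b in "Dzi\u00f3b".encode("iso8859-2")]
# [('0x44', 'GL'), ('0x7a', 'GL'), ('0x69', 'GL'), ('0xf3', 'GR'), ('0x62', 'GL')]
```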

`iso-8859-15' `iso-8859-2' `iso-8859-3' `iso-8859-4' `iso-8859-9' `iso-8859-13' `iso-8859-14' `iso-8859-16'

These coding systems are single-byte, 8-bit coding systems that do not conform to international standards. They should be avoided in all potentially multilingual contexts, including any text distributed over the Internet and World Wide Web.

`windows-1251'

Multilingual coding systems

The following ISO-2022-based coding systems are useful for multilingual text.

`ctext' `iso-2022-lock' `iso-2022-7' `iso-2022-7bit' `iso-2022-7bit-ss2' `iso-2022-8' `iso-2022-8bit-ss2'

XEmacs also supports Unicode with the Mule-UCS package. These are the preferred coding systems for multilingual use. (There is a possible exception for texts that mix several Asian ideographic character sets.)

`utf-16-be' `utf-16-be-no-signature' `utf-16-le' `utf-16-le-no-signature' `utf-7' `utf-7-safe' `utf-8' `utf-8-ws'

Development versions of XEmacs (the 21.5 series) support Unicode internally, with (at least) the following coding systems implemented:

`utf-16-be' `utf-16-be-bom' `utf-16-le' `utf-16-le-bom' `utf-8' `utf-8-bom'

Asian ideographic languages

The following coding systems are based on ISO 2022, and are more or less suitable for encoding multilingual texts. They all can represent ASCII at least, and sometimes several other foreign character sets, without resort to arbitrary ISO 2022 designations. However, these subsets are not identified with the corresponding national standards in XEmacs Mule.

`chinese-euc' `cn-big5' `cn-gb-2312' `gb2312' `hz' `hz-gb-2312' `old-jis' `japanese-euc' `junet' `euc-japan' `euc-jp' `iso-2022-jp' `iso-2022-jp-1978-irv' `iso-2022-jp-2' `euc-kr' `korean-euc' `iso-2022-kr' `iso-2022-int-1'

The following coding systems cannot be used for general multilingual text and do not cooperate well with other coding systems.

`big5' `shift_jis'

Other languages

The following coding systems are based on ISO 2022. Though none of them provides any Latin characters beyond ASCII, XEmacs Mule allows (and up to 21.4 defaults to) use of ISO 2022 control sequences to designate other character sets for inclusion in the text.

`iso-8859-5' `iso-8859-7' `iso-8859-8' `ctext-hebrew'

The following coding systems use character sets that do not conform to ISO 2022 and thus cannot be safely used in a multilingual context.

`alternativnyj' `koi8-r' `tis-620' `viqr' `viscii' `vscii'

Special coding systems

Mule uses the following coding systems for special purposes.

`automatic-conversion' `undecided' `escape-quoted'

`escape-quoted' is especially important, as it is used internally as the coding system for autosaved data.

The following coding systems are aliases for others, and are used for communication with the host operating system.

`file-name' `keyboard' `terminal'

Mule detection of coding systems is actually limited to detection of classes of coding systems called coding categories. These coding categories are identified by the ISO 2022 control sequences they use, if any, by their conformance to ISO 2022 restrictions on code points that may be used, and by characteristic patterns of use of 8-bit code points.
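The classification can be caricatured as follows. This is a rough Python sketch under stated assumptions, not Mule's actual detector, which weighs many more signals and works incrementally; the category names are taken from the list below.

```python
def guess_coding_category(data: bytes) -> str:
    """Crude coding-category guess (illustrative only)."""
    if 0x1B in data:
        # ESC introduces ISO 2022 designation and shift sequences.
        return "iso-7"
    if all(b < 0x80 for b in data):
        # Pure ASCII fits every category; call it no-conversion.
        return "no-conversion"
    try:
        # UTF-8's multi-byte patterns are self-checking.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Bare 8-bit GR codes, e.g. some ISO 8859 part.
        return "iso-8-1"
```

Note how the last branch embodies the ambiguity discussed above: once a file is classed as plain 8-bit, nothing in the bytes says *which* ISO 8859 part was meant.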

`no-conversion' `utf-8' `ucs-4' `iso-7' `iso-lock-shift' `iso-8-1' `iso-8-2' `iso-8-designate' `shift-jis' `big5'



This document was generated by XEmacs Webmaster on August 3, 2012 using texi2html