26. Multilingual Support

NOTE: There is a great deal of overlapping and redundant information in this chapter. Ben wrote introductions to Mule issues a number of times, each time not realizing that he had already written another introduction previously. Hopefully, in time these will all be integrated.

NOTE: The information at the top of the source file ‘text.c’ is more complete than the following, and there is also a list of all other places to look for text/I18N-related info. Also look in ‘text.h’ for info about the DFC and Eistring APIs.

Recall that there are two primary ways that text is represented in XEmacs. The buffer representation sees the text as a series of bytes (Ibytes), with a variable number of bytes used per character. The character representation sees the text as a series of integers (Ichars), one per character. The character representation is a cleaner representation from a theoretical standpoint, and is thus used in many cases when lots of manipulations on a string need to be done. However, the buffer representation is the standard representation used in both Lisp strings and buffers, and because of this, it is the “default” representation that text comes in. The reason for using this representation is that it’s compact and is compatible with ASCII.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.1 Introduction to Multilingual Issues #1

There is an introduction to these issues in the Lisp Reference manual. See (lispref)Internationalization Terminology section ‘Internationalization Terminology’ in XEmacs Lisp Reference Manual. Among other documentation that may be of interest to internals programmers is ISO-2022 (see (lispref)ISO 2022 section ‘ISO 2022’ in XEmacs Lisp Reference Manual) and CCL (see (lispref)CCL section ‘CCL’ in XEmacs Lisp Reference Manual)

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.2 Introduction to Multilingual Issues #2

Introduction

This document covers a number of design issues, problems and proposals with regards to XEmacs MULE. At first we present some definitions and some aspects of the design that have been agreed upon. Then we present some issues and problems that need to be addressed, and then I include a proposal of mine to address some of these issues. When there are other proposals, for example from Olivier, these will be appended to the end of this document.

Definitions and Design Basics

First, text is defined to be a series of characters which together defines an utterance or partial utterance in some language. Generally, this language is a human language, but it may also be a computer language if the computer language uses a representation close enough to that of human languages for it to also make sense to call its representation text. Text is opposed to binary, which is a sequence of bytes, representing machine-readable but not human-readable data. A byte is merely a number within a predefined range, which nowadays is nearly always zero to 255. A character is a unit of text. What makes one character different from another is not always clear-cut. It is generally related to the appearance of the character, although perhaps not any possible appearance of that character, but some sort of ideal appearance that is assigned to a character. Whether two characters that look very similar are actually the same depends on various factors such as political ones, such as whether the characters are used to mean similar sorts of things, or behave similarly in similar contexts. In any case, it is not always clearly defined whether two characters are actually the same or not. In practice, however, this is more or less agreed upon.

A character set is just that, a set of one or more characters. The set is unique in that there will not be more than one instance of the same character in a character set, and logically is unordered, although an order is often imposed or suggested for the characters in the character set. We can also define an order on a character set, which is a way of assigning a unique number, or possibly a pair of numbers, or a triplet of numbers, or even a set of four or more numbers to each character in the character set. The combination of an order in the character set results in an ordered character set. In an ordered character set, there is an upper limit and a lower limit on the possible values that a character, or that any number within the set of numbers assigned to a character, can take. However, the lower limit does not have to start at zero or one, or anywhere else in particular, nor does the upper limit have to end anywhere particular, and there may be gaps within these ranges such that particular numbers or sets of numbers do not have a corresponding character, even though they are within the upper and lower limits. For example, ASCII defines a very standard ordered character set. It is normally defined to be 94 characters in the range 33 through 126 inclusive on both ends, with every possible character within this range being actually present in the character set.

Sometimes the ASCII character set is extended to include what are called non-printing characters. Non-printing characters are characters which instead of really being displayed in a more or less rectangular block, like all other characters, instead indicate certain functions typically related to either control of the display upon which the characters are being displayed, or have some effect on a communications channel that may be currently open and transmitting characters, or may change the meaning of future characters as they are being decoded, or some other similar function. You might say that non-printing characters are somewhat of a hack because they are a special exception to the standard concept of a character as being a printed glyph that has some direct correspondence in the non-computer world.

With non-printing characters in mind, the 94-character ordered character set called ASCII is often extended into a 96-character ordered character set, also often called ASCII, which includes in addition to the 94 characters already mentioned, two non-printing characters, one called space and assigned the number 32, just below the bottom of the previous range, and another called delete or rubout, which is given number 127 just above the end of the previous range. Thus to reiterate, the result is a 96-character ordered character set, whose characters take the values from 32 to 127 inclusive. Sometimes ASCII is further extended to contain 32 more non-printing characters, which are given the numbers zero through 31 so that the result is a 128-character ordered character set with characters numbered zero through 127, and with many non-printing characters. Another way to look at this, and the way that is normally taken by XEmacs MULE, is that the characters that would be in the range 30 through 31 in the most extended definition of ASCII, instead form their own ordered character set, which is called control zero, and consists of 32 characters in the range zero through 31. A similar ordered character set called control one is also created, and it contains 32 more non-printing characters in the range 128 through 159. Note that none of these three ordered character sets overlaps in any of the numbers they are assigned to their characters, so they can all be used at once. Note further that the same character can occur in more than one character set. This was shown above, for example, in two different ordered character sets we defined, one of which we could have called ASCII, and the other ASCII-extended, to show that it had extended by two non-printable characters. Most of the characters in these two character sets are shared and present in both of them.

Note that there is no restriction on the size of the character set, or on the numbers that are assigned to characters in an ordered character set. It is often extremely useful to represent a sequence of characters as a sequence of bytes, where a byte as defined above is a number in the range zero to 255. An encoding does precisely this. It is simply a mapping from a sequence of characters, possibly augmented with information indicating the character set that each of these characters belongs to, to a sequence of bytes which represents that sequence of characters and no other – which is to say the mapping is reversible.

A coding system is a set of rules for encoding a sequence of characters augmented with character set information into a sequence of bytes, and later performing the reverse operation. It is frequently possible to group coding systems into classes or types based on common features. Typically, for example, a particular coding system class may contain a base coding system which specifies some of the rules, but leaves the rest unspecified. Individual members of the coding system class are formed by starting with the base coding system, and augmenting it with additional rules to produce a particular coding system, what you might think of as a sort of variation within a theme.

XEmacs Specific Definitions

First of all, in XEmacs, the concept of character is a little different from the general definition given above. For one thing, the character set that a character belongs to may or may not be an inherent part of the character itself. In other words, the same character occurring in two different character sets may appear in XEmacs as two different characters. This is generally the case now, but we are attempting to move in the other direction. Different proposals may have different ideas about exactly the extent to which this change will be carried out. The general trend, though, is to represent all information about a character other than the character itself, using text properties attached to the character. That way two instances of the same character will look the same to lisp code that merely retrieves the character, and does not also look at the text properties of that character. Everyone involved is in agreement in doing it this way with all Latin characters, and in fact for all characters other than Chinese, Japanese, and Korean ideographs. For those, there may be a difference of opinion.

A second difference between the general definition of character and the XEmacs usage of character is that each character is assigned a unique number that distinguishes it from all other characters in the world, or at the very least, from all other characters currently existing anywhere inside the current XEmacs invocation. (If there is a case where the weaker statement applies, but not the stronger statement, it would possibly be with composite characters and any other such characters that are created on the sly.)

This unique number is called the character representation of the character, and its particular details are a matter of debate. There is the current standard in use that it is undoubtedly going to change. What has definitely been agreed upon is that it will be an integer, more specifically a positive integer, represented with less than or equal to 31 bits on a 32-bit architecture, and possibly up to 63 bits on a 64-bit architecture, with the proviso that any characters that whose representation would fit in a 64-bit architecture, but not on a 32-bit architecture, would be used only for composite characters, and others that would satisfy the weak uniqueness property mentioned above, but not with the strong uniqueness property.

At this point, it is useful to talk about the different representations that a sequence of characters can take. The simplest representation is simply as a sequence of characters, and this is called the Lisp representation of text, because it is the representation that Lisp programs see. Other representations include the external representation, which refers to any encoding of the sequence of characters, using the definition of encoding mentioned above. Typically, text in the external representation is used outside of XEmacs, for example in files, e-mail messages, web sites, and the like. Another representation for a sequence of characters is what I will call the byte representation, and it represents the way that XEmacs internally represents text in a buffer, or in a string. Potentially, the representation could be different between a buffer and a string, and then the terms buffer byte representation and string byte representation would be used, but in practice I don’t think this will occur. It will be possible, of course, for buffers and strings, or particular buffers and particular strings, to contain different sub-representations of a single representation. For example, Olivier’s 1-2-4 proposal allows for three sub-representations of his internal byte representation, allowing for 1 byte, 2 bytes, and 4 byte width characters respectively. A particular string may be in one sub-representation, and a particular buffer in another sub-representation, but overall both are following the same byte representation. I do not use the term internal representation here, as many people have, because it is potentially ambiguous.

Another representation is called the array of characters representation. This is a representation on the C-level in which the sequence of text is represented, not using the byte representation, but by using an array of characters, each represented using the character representation. This sort of representation is often used by redisplay because it is more convenient to work with than any of the other internal representations.

The term binary representation may also be heard. Binary representation is used to represent binary data. When binary data is represented in the lisp representation, an equivalence is simply set up between bytes zero through 255, and characters zero through 255. These characters come from four character sets, which are from bottom to top, control zero, ASCII, control 1, and Latin 1. Together, they comprise 256 characters, and are a good mapping for the 256 possible bytes in a binary representation. Binary representation could also be used to refer to an external representation of the binary data, which is a simple direct byte-to-byte representation. No internal representation should ever be referred to as a binary representation because of ambiguity. The terms character set/encoding system were defined generally, above. In XEmacs, the equivalent concepts exist, although character set has been shortened to charset, and in fact represents specifically an ordered character set. For each possible charset, and for each possible coding system, there is an associated object in XEmacs. These objects will be of type charset and coding system, respectively. Charsets and coding systems are divided into classes, or types, the normal term under XEmacs, and all possible charsets encoding systems that may be defined must be in one of these types. If you need to create a charset or coding system that is not one of these types, you will have to modify the C code to support this new type. Some of the existing or soon-to-be-created types are, or will be, generic enough so that this shouldn’t be an issue. Note also that the byte encoding for text and the character coding of a character are closely related. You might say that ideally each is the simplest equivalent of the other given the general constraints on each representation.

To be specific, in the current MULE representation,

Characters encode both the character itself and the character set that it comes from. These character sets are always assumed to be representable as an ordered character set of size 96 or of size 96 by 96, or the trivially-related sizes 94 and 94 by 94. The only allowable exceptions are the control zero and control one character sets, which are of size 32. Character sets which do not naturally have a compatible ordering such as this are shoehorned into an ordered character set, or possibly two ordered character sets of a compatible size.
The variable width byte representation was deliberately chosen to allow scanning text forwards and backwards efficiently. This necessitated defining the possible bytes into three ranges which we shall call A, B, and C. Range A is used exclusively for single-byte characters, which is to say characters that are representing using only one contiguous byte. Multi-byte characters are always represented by using one byte from Range B, followed by one or more bytes from Range C. What this means is that bytes that begin a character are unequivocally distinguished from bytes that do not begin a character, and therefore there is never a problem scaling backwards and finding the beginning of a character. Know that UTF8 adopts a proposal that is very similar in spirit in that it uses separate ranges for the first byte of a multi byte sequence, and the following bytes in multi-byte sequence.
Given the fact that all ordered character sets allowed were essentially 96 characters per dimension, it made perfect sense to make Range C comprise 96 bytes. With a little more tweaking, the currently-standard MULE byte representation was created, and was drafted from this.
The MULE byte representation defined four basic representations for characters, which would take up from one to four bytes, respectively. The MULE character representation thus had the following constraints:
1. Character numbers zero through 255 should represent the characters that binary values zero through 255 would be mapped onto. (Note: this was not the case in Kenichi Handa’s version of this representation, but I changed it.)
2. The four sub-classes of representation in the MULE byte representation should correspond to four contiguous non-overlapping ranges of characters.
3. The algorithmic conversion between the single character represented in the byte representation and in the character representation should be as easy as possible.
4. Given the previous constraints, the character representation should be as compact as possible, which is to say it should use the least number of bits possible.

So you see that the entire structure of the byte and character representations stemmed from a very small number of basic choices, which were

the choice to encode character set information in a character
the choice to assume that all character sets would have an order imposed upon them with 96 characters per one or two dimensions. (This is less arbitrary than it seems–it follows ISO-2022)
the choice to use a variable width byte representation.

What this means is that you cannot really separate the byte representation, the character representation, and the assumptions made about characters and whether they represent character sets from each other. All of these are closely intertwined, and for purposes of simplicity, they should be designed together. If you change one representation without changing another, you are in essence creating a completely new design with its own attendant problems–since your new design is likely to be quite complex and not very coherent with regards to the translation between the character and byte representations, you are likely to run into problems.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.3 Introduction to Multilingual Issues #3

In XEmacs, Mule is a code word for the support for input handling and display of multi-lingual text. This section provides an overview of how this support impacts the C and Lisp code in XEmacs. It is important for anyone who works on the C or the Lisp code, especially on the C code, to be aware of these issues, even if they don’t work directly on code that implements multi-lingual features, because there are various general procedures that need to be followed in order to write Mule-compliant code. (The specifics of these procedures are documented elsewhere in this manual.)

There are four primary aspects of Mule support:

internal handling and representation of multi-lingual text.
conversion between the internal representation of text and the various external representations in which multi-lingual text is encoded, such as Unicode representations (including mostly fixed width encodings such as UCS-2/UTF-16 and UCS-4 and variable width ASCII conformant encodings, such as UTF-7 and UTF-8); the various ISO2022 representations, which typically use escape sequences to switch between different character sets (such as Compound Text, used under X Windows; JIS, used specifically for encoding Japanese; and EUC, a non-modal encoding used for Japanese, Korean, and certain other languages); Microsoft’s multi-byte encodings (such as Shift-JIS); various simple encodings for particular 8-bit character sets (such as Latin-1 and Latin-2, and encodings (such as koi8 and Alternativny) for Cyrillic); and others. This conversion needs to happen both for text in files and text sent to or retrieved from system API calls. It even needs to happen for external binary data because the internal representation does not represent binary data simply as a sequence of bytes as it is represented externally.
Proper display of multi-lingual characters.
Input of multi-lingual text using the keyboard.

These four aspects are for the most part independent of each other.

Characters, Character Sets, and Encodings

A character (which is, BTW, a surprisingly complex concept) is, in a written representation of text, the most basic written unit that has a meaning of its own. It’s comparable to a phoneme when analyzing words in spoken speech (for example, the sound of ‘t’ in English, which in fact has different pronunciations in different words – aspirated in ‘time’, unaspirated in ‘stop’, unreleased or even pronounced as a glottal stop in ‘button’, etc. – but logically is a single concept). Like a phoneme, a character is an abstract concept defined by its meaning. The character ‘lowercase f’, for example, can always be used to represent the first letter in the word ‘fill’, regardless of whether it’s drawn upright or italic, whether the ‘fi’ combination is drawn as a single ligature, whether there are serifs on the bottom of the vertical stroke, etc. (These different appearances of a single character are often called graphs or glyphs.) Our concern when representing text is on representing the abstract characters, and not on their exact appearance.

A character set (or charset), as we define it, is a set of characters, each with an associated number (or set of numbers – see below), called a code point. It’s important to understand that a character is not defined by any number attached to it, but by its meaning. For example, ASCII and EBCDIC are two charsets containing exactly the same characters (lowercase and uppercase letters, numbers 0 through 9, particular punctuation marks) but with different numberings. The ‘comma’ character in ASCII and EBCDIC, for instance, is the same character despite having a different numbering. Conversely, when comparing ASCII and JIS-Roman, which look the same except that the latter has a yen sign substituted for the backslash, we would say that the backslash and yen sign are not the same characters, despite having the same number (95) and despite the fact that all other characters are present in both charsets, with the same numbering. ASCII and JIS-Roman, then, do not have exactly the same characters in them (ASCII has a backslash character but no yen-sign character, and vice-versa for JIS-Roman), unlike ASCII and EBCDIC, even though the numberings in ASCII and JIS-Roman are closer.

It’s also important to distinguish between charsets and encodings. For a simple charset like ASCII, there is only one encoding normally used – each character is represented by a single byte, with the same value as its code point. For more complicated charsets, however, things are not so obvious. Unicode version 2, for example, is a large charset with thousands of characters, each indexed by a 16-bit number, often represented in hex, e.g. 0x05D0 for the Hebrew letter “aleph”. One obvious encoding uses two bytes per character (actually two encodings, depending on which of the two possible byte orderings is chosen). This encoding is convenient for internal processing of Unicode text; however, it’s incompatible with ASCII, so a different encoding, e.g. UTF-8, is usually used for external text, for example files or e-mail. UTF-8 represents Unicode characters with one to three bytes (often extended to six bytes to handle characters with up to 31-bit indices). Unicode characters 00 to 7F (identical with ASCII) are directly represented with one byte, and other characters with two or more bytes, each in the range 80 to FF.

In general, a single encoding may be able to represent more than one charset.

Internal Representation of Text

In an ASCII or single-European-character-set world, life is very simple. There are 256 characters, and each character is represented using the numbers 0 through 255, which fit into a single byte. With a few exceptions (such as case-changing operations or syntax classes like whitespace), “text” is simply an array of indices into a font. You can get different languages simply by choosing fonts with different 8-bit character sets (ISO-8859-1, -2, special-symbol fonts, etc.), and everything will “just work” as long as anyone else receiving your text uses a compatible font.

In the multi-lingual world, however, it is much more complicated. There are a great number of different characters which are organized in a complex fashion into various character sets. The representation to use is not obvious because there are issues of size versus speed to consider. In fact, there are in general two kinds of representations to work with: one that represents a single character using an integer (possibly a byte), and the other representing a single character as a sequence of bytes. The former representation is normally called fixed width, and the other variable width. Both representations represent exactly the same characters, and the conversion from one representation to the other is governed by a specific formula (rather than by table lookup) but it may not be simple. Most C code need not, and in fact should not, know the specifics of exactly how the representations work. In fact, the code must not make assumptions about the representations. This means in particular that it must use the proper macros for retrieving the character at a particular memory location, determining how many characters are present in a particular stretch of text, and incrementing a pointer to a particular character to point to the following character, and so on. It must not assume that one character is stored using one byte, or even using any particular number of bytes. It must not assume that the number of characters in a stretch of text bears any particular relation to a number of bytes in that stretch. It must not assume that the character at a particular memory location can be retrieved simply by dereferencing the memory location, even if a character is known to be ASCII or is being compared with an ASCII character, etc. Careful coding is required to be Mule clean. The biggest work of adding Mule support, in fact, is converting all of the existing code to be Mule clean.

Lisp code is mostly unaffected by these concerns. Text in strings and buffers appears simply as a sequence of characters regardless of whether Mule support is present. The biggest difference with older versions of Emacs, as well as current versions of GNU Emacs, is that integers and characters are no longer equivalent, but are separate Lisp Object types.

Conversion Between Internal and External Representations

All text needs to be converted to an external representation before being sent to a function or file, and all text retrieved from a function of file needs to be converted to the internal representation. This conversion needs to happen as close to the source or destination of the text as possible. No operations should ever be performed on text encoded in an external representation other than simple copying, because no assumptions can reliably be made about the format of this text. You cannot assume, for example, that the end of text is terminated by a null byte. (For example, if the text is Unicode, it will have many null bytes in it.) You cannot find the next “slash” character by searching through the bytes until you find a byte that looks like a “slash” character, because it might actually be the second byte of a Kanji character. Furthermore, all text in the internal representation must be converted, even if it is known to be completely ASCII, because the external representation may not be ASCII compatible (for example, if it is Unicode).

The place where C code needs to be the most careful is when calling external API functions. It is easy to forget that all text passed to or retrieved from these functions needs to be converted. This includes text in structures passed to or retrieved from these functions and all text that is passed to a callback function that is called by the system.

Macros are provided to perform conversions to or from external text. These macros are called TO_EXTERNAL_FORMAT and TO_INTERNAL_FORMAT respectively. These macros accept input in various forms, for example, Lisp strings, buffers, lstreams, raw data, and can return data in multiple formats, including both malloc()ed and alloca()ed data. The use of alloca()ed data here is particularly important because, in general, the returned data will not be used after making the API call, and as a result, using alloca()ed data provides a very cheap and easy to use method of allocation.

These macros take a coding system argument which indicates the nature of the external encoding. A coding system is an object that encapsulates the structures of a particular external encoding and the methods required to convert to and from this encoding. A facility exists to create coding system aliases, which in essence gives a single coding system two different names. It is effectively used in XEmacs to provide a layer of abstraction on top of the actual coding systems. For example, the coding system alias “file-name” points to whichever coding system is currently used for encoding and decoding file names as passed to or retrieved from system calls. In general, the actual encoding will differ from system to system, and also on the particular locale that the user is in. The use of the file-name alias effectively hides that implementation detail on top of that abstract interface layer which provides a unified set of coding systems which are consistent across all operating environments.

The choice of which coding system to use in a particular conversion macro requires some thought. In general, you should choose a lower-level actual coding system when the very design of the APIs you are working with call for that particular coding system. In all other cases, you should find the least general abstract coding system (i.e. coding system alias) that applies to your specific situation. Only use the most general coding systems, such as native, when there is simply nothing else that is more appropriate. By doing things this way, you allow the user more control over how the encoding actually works, because the user is free to map the abstracted coding system names onto to different actual coding systems.

Some common coding systems are:

ctext: Compound Text, which is the standard encoding under X Windows, which is used for clipboard data and possibly other data. (ctext is a coding system of type ISO2022.)
mswindows-unicode: this is used for representing text passed to MS Window API calls with arguments that need to be in Unicode format. (mswindows-unicode is a coding system of type UTF-16)
mswindows-multi-byte: this is used for representing text passed to MS Windows API calls with arguments that need to be in multi-byte format. Note that there are very few if any examples of such calls.
mswindows-tstr: this is used for representing text passed to any MS Windows API calls that declare their argument as LPTSTR, or LPCTSTR. This is the vast majority of system calls and automatically translates either to mswindows-unicode or mswindows-multi-byte, depending on the presence or absence of the UNICODE preprocessor constant. (If we compile XEmacs with this preprocessor constant, then all API calls use Unicode for all text passed to or received from these API calls.)
terminal: used for text sent to or read from a text terminal in the absence of a more specific coding system (calls to window-system specific APIs should use the appropriate window-specific coding system if it makes sense to do so.) Like others here, this is a coding system alias.
file-name: used when specifying the names of files in the absence of a more specific encoding, such as ms-windows-tstr. This is a coding system alias – what it’s an alias of is determined at startup.
native: the most general coding system for specifying text passed to system calls. This generally translates to whatever coding system is specified by the current locale. This should only be used when none of the coding systems mentioned above are appropriate. This is a coding system alias – what it’s an alias of is determined at startup.

Proper Display of Multilingual Text

There are two things required to get this working correctly. One is selecting the correct font, and the other is encoding the text according to the encoding used for that specific font, or the window-system specific text display API. Generally each separate character set has a different font associated with it, which is specified by name and each font has an associated encoding into which the characters must be translated. (this is the case on X Windows, at least; on Windows there is a more general mechanism). Both the specific font for a charset and the encoding of that font are system dependent. Currently there is a way of specifying these two properties under X Windows (using the registry and ccl properties of a character set) but not for other window systems. A more general system needs to be implemented to allow these characteristics to be specified for all Windows systems.

Another issue is making sure that the necessary fonts for displaying various character sets are installed on the system. Currently, XEmacs provides, on its web site, X Windows fonts for a number of different character sets that can be installed by users. This isn’t done yet for Windows, but it should be.

Inputting of Multilingual Text

This is a rather complicated issue because there are many paradigms defined for inputting multi-lingual text, some of which are specific to particular languages, and any particular language may have many different paradigms defined for inputting its text. These paradigms are encoded in input methods and there is a standard API for defining an input method in XEmacs called LEIM, or Library of Emacs Input Methods. Some of these input methods are written entirely in Elisp, and thus are system-independent, while others require the aid either of an external process, or of C level support that ties into a particular system-specific input method API, for example, XIM under X Windows, or the active keyboard layout and IME support under Windows. Currently, there is no support for any system-specific input methods under Microsoft Windows, although this will change.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.4 Introduction to Multilingual Issues #4

The rest of the sections in this chapter consist of yet another introduction to multilingual issues, duplicating the information in the previous sections.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.5 Character Sets

A character set (or charset) is an ordered set of characters. A particular character in a charset is indexed using one or more position codes, which are non-negative integers. The number of position codes needed to identify a particular character in a charset is called the dimension of the charset. In XEmacs/Mule, all charsets have dimension 1 or 2, and the size of all charsets (except for a few special cases) is either 94, 96, 94 by 94, or 96 by 96. The range of position codes used to index characters from any of these types of character sets is as follows:

Charset type            Position code 1         Position code 2
------------------------------------------------------------
94                      33 - 126                N/A
96                      32 - 127                N/A
94x94                   33 - 126                33 - 126
96x96                   32 - 127                32 - 127

Note that in the above cases position codes do not start at an expected value such as 0 or 1. The reason for this will become clear later.

For example, Latin-1 is a 96-character charset, and JISX0208 (the Japanese national character set) is a 94x94-character charset.

[Note that, although the ranges above define the valid position codes for a charset, some of the slots in a particular charset may in fact be empty. This is the case for JISX0208, for example, where (e.g.) all the slots whose first position code is in the range 118 - 127 are empty.]

There are three charsets that do not follow the above rules. All of them have one dimension, and have ranges of position codes as follows:

Charset name            Position code 1
------------------------------------
ASCII                   0 - 127
Control-1               0 - 31
Composite               0 - some large number

(The upper bound of the position code for composite characters has not yet been determined, but it will probably be at least 16,383).

ASCII is the union of two subsidiary character sets: Printing-ASCII (the printing ASCII character set, consisting of position codes 33 - 126, like for a standard 94-character charset) and Control-ASCII (the non-printing characters that would appear in a binary file with codes 0 - 32 and 127).

Control-1 contains the non-printing characters that would appear in a binary file with codes 128 - 159.

Composite contains characters that are generated by overstriking one or more characters from other charsets.

Note that some characters in ASCII, and all characters in Control-1, are control (non-printing) characters. These have no printed representation but instead control some other function of the printing (e.g. TAB or 8 moves the current character position to the next tab stop). All other characters in all charsets are graphic (printing) characters.

When a binary file is read in, the bytes in the file are assigned to character sets as follows:

Bytes           Character set           Range
--------------------------------------------------
0 - 127         ASCII                   0 - 127
128 - 159       Control-1               0 - 31
160 - 255       Latin-1                 32 - 127

This is a bit ad-hoc but gets the job done.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.6 Encodings

An encoding is a way of numerically representing characters from one or more character sets. If an encoding only encompasses one character set, then the position codes for the characters in that character set could be used directly. This is not possible, however, if more than one character set is to be used in the encoding.

For example, the conversion detailed above between bytes in a binary file and characters is effectively an encoding that encompasses the three character sets ASCII, Control-1, and Latin-1 in a stream of 8-bit bytes.

Thus, an encoding can be viewed as a way of encoding characters from a specified group of character sets using a stream of bytes, each of which contains a fixed number of bits (but not necessarily 8, as in the common usage of “byte”).

Here are descriptions of a couple of common encodings:

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.6.1 Japanese EUC (Extended Unix Code)

This encompasses the character sets Printing-ASCII, Katakana-JISX0201 (half-width katakana, the right half of JISX0201), Japanese-JISX0208, and Japanese-JISX0212.

Note that Printing-ASCII and Katakana-JISX0201 are 94-character charsets, while Japanese-JISX0208 and Japanese-JISX0212 are 94x94-character charsets.

The encoding is as follows:

Character set            Representation (PC=position-code)
-------------            --------------
Printing-ASCII           PC1
Katakana-JISX0201        0x8E       | PC1 + 0x80
Japanese-JISX0208        PC1 + 0x80 | PC2 + 0x80
Japanese-JISX0212        PC1 + 0x80 | PC2 + 0x80

Note that there are other versions of EUC for other Asian languages. EUC in general is characterized by

row-column encoding,
big-endian (row-first) ordering, and
ASCII compatibility in variable width forms.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.6.2 JIS7

This encompasses the character sets Printing-ASCII, Latin-JISX0201 (the left half of JISX0201; this character set is very similar to Printing-ASCII and is a 94-character charset), Japanese-JISX0208, and Katakana-JISX0201. It uses 7-bit bytes.

Unlike EUC, this is a modal encoding, which means that there are multiple states that the encoding can be in, which affect how the bytes are to be interpreted. Special sequences of bytes (called escape sequences) are used to change states.

The encoding is as follows:

Character set              Representation (PC=position-code)
-------------              --------------
Printing-ASCII             PC1
Latin-JISX0201             PC1
Katakana-JISX0201          PC1
Japanese-JISX0208          PC1 | PC2


Escape sequence   ASCII equivalent   Meaning
---------------   ----------------   -------
0x1B 0x28 0x4A    ESC ( J            invoke Latin-JISX0201
0x1B 0x28 0x49    ESC ( I            invoke Katakana-JISX0201
0x1B 0x24 0x42    ESC $ B            invoke Japanese-JISX0208
0x1B 0x28 0x42    ESC ( B            invoke Printing-ASCII

Initially, Printing-ASCII is invoked.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.7 Internal Mule Encodings

In XEmacs/Mule, each character set is assigned a unique number, called a leading byte. This is used in the encodings of a character. Leading bytes are in the range 0x80 - 0xFF (except for ASCII, which has a leading byte of 0), although some leading bytes are reserved.

Charsets whose leading byte is in the range 0x80 - 0x9F are called official and are used for built-in charsets. Other charsets are called private and have leading bytes in the range 0xA0 - 0xFF; these are user-defined charsets.

More specifically:

Character set                Leading byte
-------------                ------------
ASCII                        0 (0x7F in arrays indexed by leading byte)
Composite                    0x8D
Dimension-1 Official         0x80 - 0x8C/0x8D
                               (0x8E is free)
Control                      0x8F
Dimension-2 Official         0x90 - 0x99
                               (0x9A - 0x9D are free)
Dimension-1 Private Marker   0x9E
Dimension-2 Private Marker   0x9F
Dimension-1 Private          0xA0 - 0xEF
Dimension-2 Private          0xF0 - 0xFF

There are two internal encodings for characters in XEmacs/Mule. One is called string encoding and is an 8-bit encoding that is used for representing characters in a buffer or string. It uses 1 to 4 bytes per character. The other is called character encoding and is a 21-bit encoding that is used for representing characters individually in a variable.

(In the following descriptions, we’ll ignore composite characters for the moment. We also give a general (structural) overview first, followed later by the exact details.)

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.7.1 Internal String Encoding

ASCII characters are encoded using their position code directly. Other characters are encoded using their leading byte followed by their position code(s) with the high bit set. Characters in private character sets have their leading byte prefixed with a leading byte prefix, which is either 0x9E or 0x9F. (No character sets are ever assigned these leading bytes.) Specifically:

Character set           Encoding (PC=position-code, LB=leading-byte)
-------------           --------
ASCII                   PC-1 |
Control-1               LB   |  PC1 + 0xA0 |
Dimension-1 official    LB   |  PC1 + 0x80 |
Dimension-1 private     0x9E |  LB         | PC1 + 0x80 |
Dimension-2 official    LB   |  PC1 + 0x80 | PC2 + 0x80 |
Dimension-2 private     0x9F |  LB         | PC1 + 0x80 | PC2 + 0x80

The basic characteristic of this encoding is that the first byte of all characters is in the range 0x00 - 0x9F, and the second and following bytes of all characters is in the range 0xA0 - 0xFF. This means that it is impossible to get out of sync, or more specifically:

Given any byte position, the beginning of the character it is within can be determined in constant time.
Given any byte position at the beginning of a character, the beginning of the next character can be determined in constant time.
Given any byte position at the beginning of a character, the beginning of the previous character can be determined in constant time.
Textual searches can simply treat encoded strings as if they were encoded in a one-byte-per-character fashion rather than the actual multi-byte encoding.

None of the pre-Unicode standard non-modal encodings meet all of these conditions. For example, EUC satisfies only (2) and (3), while Shift-JIS and Big5 (not yet described) satisfy only (2). (All non-modal encodings must satisfy (2), in order to be unambiguous.) UTF-8, however, meets all three, and we are considering moving to it as an internal encoding.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.7.2 Internal Character Encoding

One 21-bit word represents a single character. The word is separated into three fields:

Bit number:     20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
                <------------------> <------------------> <------------------>
Field:                    1                    2                    3

Note that each field holds 7 bits.

Character set           Field 1         Field 2         Field 3
-------------           -------         -------         -------
ASCII                      0               0              PC1
   range:                                                   (00 - 7F)
Control-1                  0               1              PC1
   range:                                                   (00 - 1F)
Dimension-1 official       0            LB - 0x7F         PC1
   range:                                    (01 - 0D)      (20 - 7F)
Dimension-1 private        0            LB - 0x80         PC1
   range:                                    (20 - 6F)      (20 - 7F)
Dimension-2 official    LB - 0x8F         PC1             PC2
   range:                    (01 - 0A)       (20 - 7F)      (20 - 7F)
Dimension-2 private     LB - 0x80         PC1             PC2
   range:                    (0F - 1E)       (20 - 7F)      (20 - 7F)
Composite                 0x1F             ?               ?

Note also that character codes 0 - 255 are the same as the “binary encoding” described above.

Most of the code in XEmacs knows nothing of the representation of a character other than that values 0 - 255 represent ASCII, Control 1, and Latin 1.

WARNING WARNING WARNING: The Boyer-Moore code in ‘search.c’, and the code in search_buffer() that determines whether that code can be used, knows that “field 3” in a character always corresponds to the last byte in the textual representation of the character. (This is important because the Boyer-Moore algorithm works by looking at the last byte of the search string and &&#### finish this.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.8 Byte/Character Types; Buffer Positions; Other Typedefs

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.8.1 Byte Types

Stuff pointed to by a char * or unsigned char * will nearly always be one of the following types:

- a) [Ibyte] pointer to internally-formatted text
- b) [Extbyte] pointer to text in some external format, which can be defined as all formats other than the internal one
- c) [Ascbyte] pure ASCII text
- d) [Binbyte] binary data that is not meant to be interpreted as text
- e) [Rawbyte] general data in memory, where we don’t care about whether it’s text or binary
- f) [Boolbyte] a zero or a one
- g) [Bitbyte] a byte used for bit fields
- h) [Chbyte] null-semantics char *; used when casting an argument to an external API where the the other types may not be appropriate

Types (b), (c), (f) and (h) are defined as char, while the others are unsigned char. This is for maximum safety (signed characters are dangerous to work with) while maintaining as much compatibility with external APIs and string constants as possible.

We also provide versions of the above types defined with different underlying C types, for API compatibility. These use the following prefixes:

C = plain char, when the base type is unsigned
U = unsigned
S = signed

(Formerly I had a comment saying that type (e) “should be replaced with void *”. However, there are in fact many places where an unsigned char * might be used – e.g. for ease in pointer computation, since void * doesn’t allow this, and for compatibility with external APIs.)

Note that these typedefs are purely for documentation purposes; from the C code’s perspective, they are exactly equivalent to char *, unsigned char *, etc., so you can freely use them with library functions declared as such.

Using these more specific types rather than the general ones helps avoid the confusions that occur when the semantics of a char * or unsigned char * argument being studied are unclear. Furthermore, by requiring that ALL uses of char be replaced with some other type as part of the Mule-ization process, we can use a search for char as a way of finding code that has not been properly Mule-ized yet.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.8.2 Different Ways of Seeing Internal Text

There are various ways of representing internal text. The two primary ways are as an “array” of individual characters; the other is as a “stream” of bytes. In the ASCII world, where there are only 255 characters at most, things are easy because each character fits into a byte. In general, however, this is not true – see the above discussion of characters vs. encodings.

In some cases, it’s also important to distinguish between a stream representation as a series of bytes and as a series of textual units. This is particularly important wrt Unicode. The UTF-16 representation (sometimes referred to, rather sloppily, as simply the “Unicode” format) represents text as a series of 16-bit units. Mostly, each unit corresponds to a single character, but not necessarily, as characters outside of the range 0-65535 (the BMP or “Basic Multilingual Plane” of Unicode) require two 16-bit units, through the mechanism of “surrogates”. When a series of 16-bit units is serialized into a byte stream, there are at least two possible representations, little-endian and big-endian, and which one is used may depend on the native format of 16-bit integers in the CPU of the machine that XEmacs is running on. (Similarly, UTF-32 is logically a representation with 32-bit textual units.)

Specifically:

- UTF-8 has 1-byte (8-bit) units.
- UTF-16 has 2-byte (16-bit) units.
- UTF-32 has 4-byte (32-bit) units.
- XEmacs-internal encoding (the old “Mule” encoding) has 1-byte (8-bit) units.
- UTF-7 technically has 7-bit units that are within the “mail-safe” range (ASCII 32 - 126 plus a few control characters), but normally is encoded in an 8-bit stream. (UTF-7 is also a modal encoding, since it has a normal mode where printable ASCII characters represent themselves and a shifted mode, introduced with a plus sign, where a base-64 encoding is used.)
- UTF-5 technically has 7-bit units (normally encoded in an 8-bit stream, like UTF-7), but only uses uppercase A-V and 0-9, and only encodes 4 bits worth of data per character. UTF-5 is meant for encoding Unicode inside of DNS names.

Thus, we can imagine three levels in the representation of textual data:

series of characters -> series of textual units -> series of bytes
       [Ichar]                 [Itext]                 [Ibyte]

XEmacs has three corresponding typedefs:

- An Ichar is an integer (at least 32-bit), representing a 31-bit character.
- An Itext is an unsigned value, either 8, 16 or 32 bits, depending on the nature of the internal representation, and corresponding to a single textual unit.
- An Ibyte is an unsigned char, representing a single byte in a textual byte stream.

Internal text in stream format can be simultaneously viewed as either Itext * or Ibyte *. The Ibyte * representation is convenient for copying data from one place to another, because such routines usually expect byte counts. However, Itext * is much better for actually working with the data.

From a text-unit perspective, units 0 through 127 will always be ASCII compatible, and data in Lisp strings (and other textual data generated as a whole, e.g. from external conversion) will be followed by a null-unit terminator. From an Ibyte * perspective, however, the encoding is only ASCII-compatible if it uses 1-byte units.

Similarly to the different text representations, three integral count types exist – Charcount, Textcount and Bytecount.

NOTE: Despite the presence of the terminator, internal text itself can have nulls in it! (Null text units, not just the null bytes present in any UTF-16 encoding.) The terminator is present because in many cases internal text is passed to routines that will ultimately pass the text to library functions that cannot handle embedded nulls, e.g. functions manipulating filenames, and it is a real hassle to have to pass the length around constantly. But this can lead to sloppy coding! We need to be careful about watching for nulls in places that are important, e.g. manipulating string objects or passing data to/from the clipboard.

Ibyte

The data in a buffer or string is logically made up of Ibyte objects, where a Ibyte takes up the same amount of space as a char. (It is declared differently, though, to catch invalid usages.) Strings stored using Ibytes are said to be in “internal format”. The important characteristics of internal format are

- ASCII characters are represented as a single Ibyte, in the range 0 - 0x7f.
- All other characters are represented as a Ibyte in the range 0x80 - 0x9f followed by one or more Ibytes in the range 0xa0 to 0xff.

This leads to a number of desirable properties:

- Given the position of the beginning of a character, you can find the beginning of the next or previous character in constant time.
- When searching for a substring or an ASCII character within the string, you need merely use standard searching routines.

Itext

#### Document me.

Ichar

This typedef represents a single Emacs character, which can be ASCII, ISO-8859, or some extended character, as would typically be used for Kanji. Note that the representation of a character as an Ichar is not the same as the representation of that same character in a string; thus, you cannot do the standard C trick of passing a pointer to a character to a function that expects a string.

An Ichar takes up 21 bits of representation and (for code compatibility and such) is compatible with an int. This representation is visible on the Lisp level. The important characteristics of the Ichar representation are

- values 0x00 - 0x7f represent ASCII.
- values 0x80 - 0xff represent the right half of ISO-8859-1.
- values 0x100 and up represent all other characters.

This means that Ichar values are upwardly compatible with the standard 8-bit representation of ASCII/ISO-8859-1.

Extbyte

Strings that go in or out of Emacs are in “external format”, typedef’ed as an array of char or a char *. There is more than one external format (JIS, EUC, etc.) but they all have similar properties. They are modal encodings, which is to say that the meaning of particular bytes is not fixed but depends on what “mode” the string is currently in (e.g. bytes in the range 0 - 0x7f might be interpreted as ASCII, or as Hiragana, or as 2-byte Kanji, depending on the current mode). The mode starts out in ASCII/ISO-8859-1 and is switched using escape sequences – for example, in the JIS encoding, ’ESC $ B’ switches to a mode where pairs of bytes in the range 0 - 0x7f are interpreted as Kanji characters.

External-formatted data is generally desirable for passing data between programs because it is upwardly compatible with standard ASCII/ISO-8859-1 strings and may require less space than internal encodings such as the one described above. In addition, some encodings (e.g. JIS) keep all characters (except the ESC used to switch modes) in the printing ASCII range 0x20 - 0x7e, which results in a much higher probability that the data will avoid being garbled in transmission. Externally-formatted data is generally not very convenient to work with, however, and for this reason is usually converted to internal format before any work is done on the string.

NOTE: filenames need to be in external format so that ISO-8859-1 characters come out correctly.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.8.3 Buffer Positions

There are three possible ways to specify positions in a buffer. All of these are one-based: the beginning of the buffer is position or index 1, and 0 is not a valid position.

As a “buffer position” (typedef Charbpos):

This is an index specifying an offset in characters from the beginning of the buffer. Note that buffer positions are logically between characters, not on a character. The difference between two buffer positions specifies the number of characters between those positions. Buffer positions are the only kind of position externally visible to the user.

As a “byte index” (typedef Bytebpos):

This is an index over the bytes used to represent the characters in the buffer. If there is no Mule support, this is identical to a buffer position, because each character is represented using one byte. However, with Mule support, many characters require two or more bytes for their representation, and so a byte index may be greater than the corresponding buffer position.

As a “memory index” (typedef Membpos):

This is the byte index adjusted for the gap. For positions before the gap, this is identical to the byte index. For positions after the gap, this is the byte index plus the gap size. There are two possible memory indices for the gap position; the memory index at the beginning of the gap should always be used, except in code that deals with manipulating the gap, where both indices may be seen. The address of the character “at” (i.e. following) a particular position can be obtained from the formula

buffer_start_address + memory_index(position) - 1

except in the case of characters at the gap position.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.8.4 Other Typedefs

Charcount: ———- This typedef represents a count of characters, such as a character offset into a string or the number of characters between two positions in a buffer. The difference between two Charbpos’s is a Charcount, and character positions in a string are represented using a Charcount.

Textcount: ———- #### Document me.

Bytecount: ———- Similar to a Charcount but represents a count of bytes. The difference between two Bytebpos’s is a Bytecount.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.8.5 Usage of the Various Representations

Memory indices are used in low-level functions in insdel.c and for extent endpoints and marker positions. The reason for this is that this way, the extents and markers don’t need to be updated for most insertions, which merely shrink the gap and don’t move any characters around in memory.

(The beginning-of-gap memory index simplifies insertions w.r.t. markers, because text usually gets inserted after markers. For extents, it is merely for consistency, because text can get inserted either before or after an extent’s endpoint depending on the open/closedness of the endpoint.)

Byte indices are used in other code that needs to be fast, such as the searching, redisplay, and extent-manipulation code.

Buffer positions are used in all other code. This is because this representation is easiest to work with (especially since Lisp code always uses buffer positions), necessitates the fewest changes to existing code, and is the safest (e.g. if the text gets shifted underneath a buffer position, it will still point to a character; if text is shifted under a byte index, it might point to the middle of a character, which would be bad).

Similarly, Charcounts are used in all code that deals with strings except for code that needs to be fast, which used Bytecounts.

Strings are always passed around internally using internal format. Conversions between external format are performed at the time that the data goes in or out of Emacs.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.8.6 Working With the Various Representations

We write things this way because it’s very important the MAX_BYTEBPOS_GAP_SIZE_3 is a multiple of 3. (As it happens, 65535 is a multiple of 3, but this may not always be the case. #### unfinished

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.9 Internal Text APIs

NOTE: The most current documentation for these APIs is in ‘text.h’. In case of error, assume that file is correct and this one wrong.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.9.1 Basic internal-format APIs

These are simple functions and macros to convert between text representation and characters, move forward and back in text, etc.

#### Finish the rest of this.

Use the following functions/macros on contiguous text in any of the internal formats. Those that take a format arg work on all internal formats; the others work only on the default (variable-width under Mule) format. If the text you’re operating on is known to come from a buffer, use the buffer-level functions in buffer.h, which automatically know the correct format and handle the gap.

Some terminology:

itext" appearing in the macros means "internal-format text" – type Ibyte *. Operations on such pointers themselves, rather than on the text being pointed to, have "itext" instead of "itext" in the macro name. "ichar" in the macro names means an Ichar – the representation of a character as a single integer rather than a series of bytes, as part of "itext". Many of the macros below are for converting between the two representations of characters.

Note also that we try to consistently distinguish between an "Ichar" and a Lisp character. Stuff working with Lisp characters often just says "char", so we consistently use "Ichar" when that’s what we’re working with.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.9.2 The DFC API

This is for conversion between internal and external text. Note that there is also the "new DFC" API, which returns a pointer to the converted text (in alloca space), rather than storing it into a variable.

The macros below are used for converting data between different formats. Generally, the data is textual, and the formats are related to internationalization (e.g. converting between internal-format text and UTF-8) – but the mechanism is general, and could be used for anything, e.g. decoding gzipped data.

In general, conversion involves a source of data, a sink, the existing format of the source data, and the desired format of the sink. The macros below, however, always require that either the source or sink is internal-format text. Therefore, in practice the conversions below involve source, sink, an external format (specified by a coding system), and the direction of conversion (internal->external or vice-versa).

Sources and sinks can be raw data (sized or unsized – when unsized, input data is assumed to be null-terminated [double null-terminated for Unicode-format data], and on output the length is not stored anywhere), Lisp strings, Lisp buffers, lstreams, and opaque data objects. When the output is raw data, the result can be allocated either with alloca() or malloc(). (There is currently no provision for writing into a fixed buffer. If you want this, use alloca() output and then copy the data – but be careful with the size! Unless you are very sure of the encoding being used, upper bounds for the size are not in general computable.) The obvious restrictions on source and sink types apply (e.g. Lisp strings are a source and sink only for internal data).

All raw data outputted will contain an extra null byte (two bytes for Unicode – currently, in fact, all output data, whether internal or external, is double-null-terminated, but you can’t count on this; see below). This means that enough space is allocated to contain the extra nulls; however, these nulls are not reflected in the returned output size.

The most basic macros are TO_EXTERNAL_FORMAT and TO_INTERNAL_FORMAT. These can be used to convert between any kinds of sources or sinks. However, 99% of conversions involve raw data or Lisp strings as both source and sink, and usually data is output as alloca() rather than malloc(). For this reason, convenience macros are defined for many types of conversions involving raw data and/or Lisp strings, especially when the output is an alloca()ed string. (When the destination is a Lisp_String, there are other functions that should be used instead – build_extstring() and make_extstring(), for example.) The convenience macros are of two types – the older kind that store the result into a specified variable, and the newer kind that return the result. The newer kind of macros don’t exist when the output is sized data, because that would have two return values. NOTE: All convenience macros are ultimately defined in terms of TO_EXTERNAL_FORMAT and TO_INTERNAL_FORMAT. Thus, any comments below about the workings of these macros also apply to all convenience macros.

TO_EXTERNAL_FORMAT (source_type, source, sink_type, sink, codesys)
TO_INTERNAL_FORMAT (source_type, source, sink_type, sink, codesys)

Typical use is

   TO_EXTERNAL_FORMAT (LISP_STRING, str, C_STRING_MALLOC, ptr, Qfile_name);

which means that the contents of the lisp string str are written to a malloc’ed memory area which will be pointed to by ptr, after the function returns. The conversion will be done using the file-name coding system (which will be controlled by the user indirectly by setting or binding the variable file-name-coding-system).

Some sources and sinks require two C variables to specify. We use some preprocessor magic to allow different source and sink types, and even different numbers of arguments to specify different types of sources and sinks.

So we can have a call that looks like

   TO_INTERNAL_FORMAT (DATA, (ptr, len),
                       MALLOC, (ptr, len),
                       coding_system);

The parenthesized argument pairs are required to make the preprocessor magic work.

NOTE: GC is inhibited during the entire operation of these macros. This is because frequently the data to be converted comes from strings but gets passed in as just DATA, and GC may move around the string data. If we didn’t inhibit GC, there’d have to be a lot of messy recoding, alloca-copying of strings and other annoying stuff.

The source or sink can be specified in one of these ways:

DATA,   (ptr, len),    // input data is a fixed buffer of size len
ALLOCA, (ptr, len),    // output data is in a ALLOCA()ed buffer of size len
MALLOC, (ptr, len),    // output data is in a malloc()ed buffer of size len
C_STRING_ALLOCA, ptr,  // equivalent to ALLOCA (ptr, len_ignored) on output
C_STRING_MALLOC, ptr,  // equivalent to MALLOC (ptr, len_ignored) on output
C_STRING,     ptr,     // equivalent to DATA, (ptr, strlen/wcslen (ptr))
                       // on input (the Unicode version is used when correct)
LISP_STRING,  string,  // input or output is a Lisp_Object of type string
LISP_BUFFER,  buffer,  // output is written to (point) in lisp buffer
LISP_LSTREAM, lstream, // input or output is a Lisp_Object of type lstream
LISP_OPAQUE,  object,  // input or output is a Lisp_Object of type opaque

When specifying the sink, use lvalues, since the macro will assign to them, except when the sink is an lstream or a lisp buffer.

For the sink types ALLOCA and C_STRING_ALLOCA, the resulting text is stored in a stack-allocated buffer, which is automatically freed on returning from the function. However, the sink types MALLOC and C_STRING_MALLOC return xmalloc()ed memory. The caller is responsible for freeing this memory using xfree().

The macros accept the kinds of sources and sinks appropriate for internal and external data representation. See the type_checking_assert macros below for the actual allowed types.

Since some sources and sinks use one argument (a Lisp_Object) to specify them, while others take a (pointer, length) pair, we use some C preprocessor trickery to allow pair arguments to be specified by parenthesizing them, as in the examples above.

Anything prefixed by dfc_ (‘data format conversion’) is private. They are only used to implement these macros.

[[Using C_STRING* is appropriate for using with external APIs that take null-terminated strings. For internal data, we should try to be ’\0’-clean - i.e. allow arbitrary data to contain embedded ’\0’.

Sometime in the future we might allow output to C_STRING_ALLOCA or C_STRING_MALLOC _only_ with TO_EXTERNAL_FORMAT(), not TO_INTERNAL_FORMAT().]]

The above comments are not true. Frequently (most of the time, in fact), external strings come as zero-terminated entities, where the zero-termination is the only way to find out the length. Even in cases where you can get the length, most of the time the system will still use the null to signal the end of the string, and there will still be no way to either send in or receive a string with embedded nulls. In such situations, it’s pointless to track the length because null bytes can never be in the string. We have a lot of operations that make it easy to operate on zero-terminated strings, and forcing the user the deal with the length everywhere would only make the code uglier and more complicated, for no gain. –ben

There is no problem using the same lvalue for source and sink.

Also, when pointers are required, the code (currently at least) is lax and allows any pointer types, either in the source or the sink. This makes it possible, e.g., to deal with internal format data held in char *’s or external format data held in WCHAR * (i.e. Unicode).

Finally, whenever storage allocation is called for, extra space is allocated for a terminating zero, and such a zero is stored in the appropriate place, regardless of whether the source data was specified using a length or was specified as zero-terminated. This allows you to freely pass the resulting data, no matter how obtained, to a routine that expects zero termination (modulo, of course, that any embedded zeros in the resulting text will cause truncation). In fact, currently two embedded zeros are allocated and stored after the data result. This is to allow for the possibility of storing a Unicode value on output, which needs the two zeros. Currently, however, the two zeros are stored regardless of whether the conversion is internal or external and regardless of whether the external coding system is in fact Unicode. This behavior may change in the future, and you cannot rely on this – the most you can rely on is that sink data in Unicode format will have two terminating nulls, which combine to form one Unicode null character.

NOTE: You might ask, why are these not written as functions that RETURN the converted string, since that would allow them to be used much more conveniently, without having to constantly declare temporary variables? The answer is that in fact I originally did write the routines that way, but that required either

(a) calling alloca() inside of a function call, or
(b) using expressions separated by commas and a global temporary variable, or
(c) using the GCC extension ({ ... }).

Turned out that all of the above had bugs, all caused by GCC (hence the comments about “those GCC wankers” and “ream gcc up the ass”). As for (a), some versions of GCC (especially on Intel platforms), which had buggy implementations of alloca() that couldn’t handle being called inside of a function call – they just decremented the stack right in the middle of pushing args. Oops, crash with stack trashing, very bad. (b) was an attempt to fix (a), and that led to further GCC crashes, esp. when you had two such calls in a single subexpression, because GCC couldn’t be counted upon to follow even a minimally reasonable order of execution. True, you can’t count on one argument being evaluated before another, but GCC would actually interleave them so that the temp var got stomped on by one while the other was accessing it. So I tried (c), which was problematic because that GCC extension has more bugs in it than a termite’s nest.

So reluctantly I converted to the current way. Now, that was awhile ago (c. 1994), and it appears that the bug involving alloca in function calls has long since been fixed. More recently, I defined the new-dfc routines down below, which DO allow exactly such convenience of returning your args rather than store them in temp variables, and I also wrote a configure check to see whether alloca() causes crashes inside of function calls, and if so use the portable alloca() implementation in alloca.c. If you define TEST_NEW_DFC, the old routines get written in terms of the new ones, and I’ve had a beta put out with this on and it appeared to this appears to cause no problems – so we should consider switching, and feel no compunctions about writing further such function- like alloca() routines in lieu of statement-like ones. –ben

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.9.3 The Eistring API

(This API is currently under-used) When doing simple things with internal text, the basic internal-format APIs are enough. But to do things like delete or replace a substring, concatenate various strings, etc. is difficult to do cleanly because of the allocation issues. The Eistring API is designed to deal with this, and provides a clean way of modifying and building up internal text. (Note that the former lack of this API has meant that some code uses Lisp strings to do similar manipulations, resulting in excess garbage and increased garbage collection.)

NOTE: The Eistring API is (or should be) Mule-correct even without an ASCII-compatible internal representation.

#### NOTE: This is a work in progress.  Neither the API nor especially
the implementation is finished.

NOTE: An Eistring is a structure that makes it easy to work with
internally-formatted strings of data.  It provides operations similar
in feel to the standard strcpy(), strcat(), strlen(), etc., but

(a) it is Mule-correct
(b) it does dynamic allocation so you never have to worry about size
    restrictions
(c) it comes in an ALLOCA() variety (all allocation is stack-local,
    so there is no need to explicitly clean up) as well as a malloc()
    variety
(d) it knows its own length, so it does not suffer from standard null
    byte brain-damage -- but it null-terminates the data anyway, so
    it can be passed to standard routines
(e) it provides a much more powerful set of operations and knows about
    all the standard places where string data might reside: Lisp_Objects,
    other Eistrings, Ibyte * data with or without an explicit length,
    ASCII strings, Ichars, etc.
(f) it provides easy operations to convert to/from externally-formatted
    data, and is easier to use than the standard TO_INTERNAL_FORMAT
    and TO_EXTERNAL_FORMAT macros. (An Eistring can store both the internal
    and external version of its data, but the external version is only
    initialized or changed when you call eito_external().)

The idea is to make it as easy to write Mule-correct string manipulation
code as it is to write normal string manipulation code.  We also make
the API sufficiently general that it can handle multiple internal data
formats (e.g. some fixed-width optimizing formats and a default variable
width format) and allows for ANY data format we might choose in the
future for the default format, including UCS2. (In other words, we can't
assume that the internal format is ASCII-compatible and we can't assume
it doesn't have embedded null bytes.  We do assume, however, that any
chosen format will have the concept of null-termination.) All of this is
hidden from the user.

#### It is really too bad that we don't have a real object-oriented
language, or at least a language with polymorphism!


 ********************************************** 
 *                 Declaration                * 
 ********************************************** 

To declare an Eistring, either put one of the following in the local
variable section:

DECLARE_EISTRING (name);
     Declare a new Eistring and initialize it to the empty string.  This
     is a standard local variable declaration and can go anywhere in the
     variable declaration section.  NAME itself is declared as an
     Eistring *, and its storage declared on the stack.

DECLARE_EISTRING_MALLOC (name);
     Declare and initialize a new Eistring, which uses malloc()ed
     instead of ALLOCA()ed data.  This is a standard local variable
     declaration and can go anywhere in the variable declaration
     section.  Once you initialize the Eistring, you will have to free
     it using eifree() to avoid memory leaks.  You will need to use this
     form if you are passing an Eistring to any function that modifies
     it (otherwise, the modified data may be in stack space and get
     overwritten when the function returns).

or use

Eistring ei;
void eiinit (Eistring *ei);
void eiinit_malloc (Eistring *einame);
     If you need to put an Eistring elsewhere than in a local variable
     declaration (e.g. in a structure), declare it as shown and then
     call one of the init macros.

Also note:

void eifree (Eistring *ei);
     If you declared an Eistring to use malloc() to hold its data,
     or converted it to the heap using eito_malloc(), then this
     releases any data in it and afterwards resets the Eistring
     using eiinit_malloc().  Otherwise, it just resets the Eistring
     using eiinit().


 ********************************************** 
 *                 Conventions                * 
 ********************************************** 

 - The names of the functions have been chosen, where possible, to
   match the names of str*() functions in the standard C API.
 - 


 ********************************************** 
 *               Initialization               * 
 ********************************************** 

void eireset (Eistring *eistr);
     Initialize the Eistring to the empty string.

void eicpy_* (Eistring *eistr, ...);
     Initialize the Eistring from somewhere:

void eicpy_ei (Eistring *eistr, Eistring *eistr2);
     ... from another Eistring.
void eicpy_lstr (Eistring *eistr, Lisp_Object lisp_string);
     ... from a Lisp_Object string.
void eicpy_ch (Eistring *eistr, Ichar ch);
     ... from an Ichar (this can be a conventional C character).

void eicpy_lstr_off (Eistring *eistr, Lisp_Object lisp_string,
                     Bytecount off, Charcount charoff,
                     Bytecount len, Charcount charlen);
     ... from a section of a Lisp_Object string.
void eicpy_lbuf (Eistring *eistr, Lisp_Object lisp_buf,
     	    Bytecount off, Charcount charoff,
     	    Bytecount len, Charcount charlen);
     ... from a section of a Lisp_Object buffer.
void eicpy_raw (Eistring *eistr, const Ibyte *data, Bytecount len);
     ... from raw internal-format data in the default internal format.
void eicpy_rawz (Eistring *eistr, const Ibyte *data);
     ... from raw internal-format data in the default internal format
     that is "null-terminated" (the meaning of this depends on the nature
     of the default internal format).
void eicpy_raw_fmt (Eistring *eistr, const Ibyte *data, Bytecount len,
                    Internal_Format intfmt, Lisp_Object object);
     ... from raw internal-format data in the specified format.
void eicpy_rawz_fmt (Eistring *eistr, const Ibyte *data,
                     Internal_Format intfmt, Lisp_Object object);
     ... from raw internal-format data in the specified format that is
     "null-terminated" (the meaning of this depends on the nature of
     the specific format).
void eicpy_c (Eistring *eistr, const Ascbyte *c_string);
     ... from an ASCII null-terminated string.  Non-ASCII characters in
     the string are ILLEGAL (read abort() with error-checking defined).
void eicpy_c_len (Eistring *eistr, const Ascbyte *c_string, len);
     ... from an ASCII string, with length specified.  Non-ASCII characters
     in the string are ILLEGAL (read abort() with error-checking defined).
void eicpy_ext (Eistring *eistr, const Extbyte *extdata,
                Lisp_Object codesys);
     ... from external null-terminated data, with coding system specified.
void eicpy_ext_len (Eistring *eistr, const Extbyte *extdata,
                    Bytecount extlen, Lisp_Object codesys);
     ... from external data, with length and coding system specified.
void eicpy_lstream (Eistring *eistr, Lisp_Object lstream);
     ... from an lstream; reads data till eof.  Data must be in default
     internal format; otherwise, interpose a decoding lstream.


 ********************************************** 
 *    Getting the data out of the Eistring    * 
 ********************************************** 

Ibyte *eidata (Eistring *eistr);
     Return a pointer to the raw data in an Eistring.  This is NOT
     a copy.

Lisp_Object eimake_string (Eistring *eistr);
     Make a Lisp string out of the Eistring.

Lisp_Object eimake_string_off (Eistring *eistr,
                               Bytecount off, Charcount charoff,
     			  Bytecount len, Charcount charlen);
     Make a Lisp string out of a section of the Eistring.

void eicpyout_alloca (Eistring *eistr, LVALUE: Ibyte *ptr_out,
                      LVALUE: Bytecount len_out);
     Make an ALLOCA() copy of the data in the Eistring, using the
     default internal format.  Due to the nature of ALLOCA(), this
     must be a macro, with all lvalues passed in as parameters.
     (More specifically, not all compilers correctly handle using
     ALLOCA() as the argument to a function call -- GCC on x86
     didn't used to, for example.) A pointer to the ALLOCA()ed data
     is stored in PTR_OUT, and the length of the data (not including
     the terminating zero) is stored in LEN_OUT.

void eicpyout_alloca_fmt (Eistring *eistr, LVALUE: Ibyte *ptr_out,
                          LVALUE: Bytecount len_out,
                          Internal_Format intfmt, Lisp_Object object);
     Like eicpyout_alloca(), but converts to the specified internal
     format. (No formats other than FORMAT_DEFAULT are currently
     implemented, and you get an assertion failure if you try.)

Ibyte *eicpyout_malloc (Eistring *eistr, Bytecount *intlen_out);
     Make a malloc() copy of the data in the Eistring, using the
     default internal format.  This is a real function.  No lvalues
     passed in.  Returns the new data, and stores the length (not
     including the terminating zero) using INTLEN_OUT, unless it's
     a NULL pointer.

Ibyte *eicpyout_malloc_fmt (Eistring *eistr, Internal_Format intfmt,
                              Bytecount *intlen_out, Lisp_Object object);
     Like eicpyout_malloc(), but converts to the specified internal
     format. (No formats other than FORMAT_DEFAULT are currently
     implemented, and you get an assertion failure if you try.)


 ********************************************** 
 *             Moving to the heap             * 
 ********************************************** 

void eito_malloc (Eistring *eistr);
     Move this Eistring to the heap.  Its data will be stored in a
     malloc()ed block rather than the stack.  Subsequent changes to
     this Eistring will realloc() the block as necessary.  Use this
     when you want the Eistring to remain in scope past the end of
     this function call.  You will have to manually free the data
     in the Eistring using eifree().

void eito_alloca (Eistring *eistr);
     Move this Eistring back to the stack, if it was moved to the
     heap with eito_malloc().  This will automatically free any
     heap-allocated data.



 ********************************************** 
 *            Retrieving the length           * 
 ********************************************** 

Bytecount eilen (Eistring *eistr);
     Return the length of the internal data, in bytes.  See also
     eiextlen(), below.
Charcount eicharlen (Eistring *eistr);
     Return the length of the internal data, in characters.


 ********************************************** 
 *           Working with positions           * 
 ********************************************** 

Bytecount eicharpos_to_bytepos (Eistring *eistr, Charcount charpos);
     Convert a char offset to a byte offset.
Charcount eibytepos_to_charpos (Eistring *eistr, Bytecount bytepos);
     Convert a byte offset to a char offset.
Bytecount eiincpos (Eistring *eistr, Bytecount bytepos);
     Increment the given position by one character.
Bytecount eiincpos_n (Eistring *eistr, Bytecount bytepos, Charcount n);
     Increment the given position by N characters.
Bytecount eidecpos (Eistring *eistr, Bytecount bytepos);
     Decrement the given position by one character.
Bytecount eidecpos_n (Eistring *eistr, Bytecount bytepos, Charcount n);
     Deccrement the given position by N characters.


 ********************************************** 
 *    Getting the character at a position     * 
 ********************************************** 

Ichar eigetch (Eistring *eistr, Bytecount bytepos);
     Return the character at a particular byte offset.
Ichar eigetch_char (Eistring *eistr, Charcount charpos);
     Return the character at a particular character offset.


 ********************************************** 
 *    Setting the character at a position     * 
 ********************************************** 

Ichar eisetch (Eistring *eistr, Bytecount bytepos, Ichar chr);
     Set the character at a particular byte offset.
Ichar eisetch_char (Eistring *eistr, Charcount charpos, Ichar chr);
     Set the character at a particular character offset.


 ********************************************** 
 *               Concatenation                * 
 ********************************************** 

void eicat_* (Eistring *eistr, ...);
     Concatenate onto the end of the Eistring, with data coming from the
     same places as above:

void eicat_ei (Eistring *eistr, Eistring *eistr2);
     ... from another Eistring.
void eicat_c (Eistring *eistr, Ascbyte *c_string);
     ... from an ASCII null-terminated string.  Non-ASCII characters in
     the string are ILLEGAL (read abort() with error-checking defined).
void eicat_raw (ei, const Ibyte *data, Bytecount len);
     ... from raw internal-format data in the default internal format.
void eicat_rawz (ei, const Ibyte *data);
     ... from raw internal-format data in the default internal format
     that is "null-terminated" (the meaning of this depends on the nature
     of the default internal format).
void eicat_lstr (ei, Lisp_Object lisp_string);
     ... from a Lisp_Object string.
void eicat_ch (ei, Ichar ch);
     ... from an Ichar.

All except the first variety are convenience functions.
n the general case, create another Eistring from the source.)


 ********************************************** 
 *                Replacement                 * 
 ********************************************** 

void eisub_* (Eistring *eistr, Bytecount off, Charcount charoff,
     			  Bytecount len, Charcount charlen, ...);
     Replace a section of the Eistring, specifically:

void eisub_ei (Eistring *eistr, Bytecount off, Charcount charoff,
     	  Bytecount len, Charcount charlen, Eistring *eistr2);
     ... with another Eistring.
void eisub_c (Eistring *eistr, Bytecount off, Charcount charoff,
     	 Bytecount len, Charcount charlen, Ascbyte *c_string);
     ... with an ASCII null-terminated string.  Non-ASCII characters in
     the string are ILLEGAL (read abort() with error-checking defined).
void eisub_ch (Eistring *eistr, Bytecount off, Charcount charoff,
     	  Bytecount len, Charcount charlen, Ichar ch);
     ... with an Ichar.

void eidel (Eistring *eistr, Bytecount off, Charcount charoff,
            Bytecount len, Charcount charlen);
     Delete a section of the Eistring.


 ********************************************** 
 *      Converting to an external format      * 
 ********************************************** 

void eito_external (Eistring *eistr, Lisp_Object codesys);
     Convert the Eistring to an external format and store the result
     in the string.  NOTE: Further changes to the Eistring will NOT
     change the external data stored in the string.  You will have to
     call eito_external() again in such a case if you want the external
     data.

Extbyte *eiextdata (Eistring *eistr);
     Return a pointer to the external data stored in the Eistring as
     a result of a prior call to eito_external().

Bytecount eiextlen (Eistring *eistr);
     Return the length in bytes of the external data stored in the
     Eistring as a result of a prior call to eito_external().


 ********************************************** 
 * Searching in the Eistring for a character  * 
 ********************************************** 

Bytecount eichr (Eistring *eistr, Ichar chr);
Charcount eichr_char (Eistring *eistr, Ichar chr);
Bytecount eichr_off (Eistring *eistr, Ichar chr, Bytecount off,
     		Charcount charoff);
Charcount eichr_off_char (Eistring *eistr, Ichar chr, Bytecount off,
     		     Charcount charoff);
Bytecount eirchr (Eistring *eistr, Ichar chr);
Charcount eirchr_char (Eistring *eistr, Ichar chr);
Bytecount eirchr_off (Eistring *eistr, Ichar chr, Bytecount off,
     		 Charcount charoff);
Charcount eirchr_off_char (Eistring *eistr, Ichar chr, Bytecount off,
     		      Charcount charoff);


 ********************************************** 
 *   Searching in the Eistring for a string   * 
 ********************************************** 

Bytecount eistr_ei (Eistring *eistr, Eistring *eistr2);
Charcount eistr_ei_char (Eistring *eistr, Eistring *eistr2);
Bytecount eistr_ei_off (Eistring *eistr, Eistring *eistr2, Bytecount off,
     		   Charcount charoff);
Charcount eistr_ei_off_char (Eistring *eistr, Eistring *eistr2,
     			Bytecount off, Charcount charoff);
Bytecount eirstr_ei (Eistring *eistr, Eistring *eistr2);
Charcount eirstr_ei_char (Eistring *eistr, Eistring *eistr2);
Bytecount eirstr_ei_off (Eistring *eistr, Eistring *eistr2, Bytecount off,
     		    Charcount charoff);
Charcount eirstr_ei_off_char (Eistring *eistr, Eistring *eistr2,
     			 Bytecount off, Charcount charoff);

Bytecount eistr_c (Eistring *eistr, Ascbyte *c_string);
Charcount eistr_c_char (Eistring *eistr, Ascbyte *c_string);
Bytecount eistr_c_off (Eistring *eistr, Ascbyte *c_string, Bytecount off,
     		   Charcount charoff);
Charcount eistr_c_off_char (Eistring *eistr, Ascbyte *c_string,
     		       Bytecount off, Charcount charoff);
Bytecount eirstr_c (Eistring *eistr, Ascbyte *c_string);
Charcount eirstr_c_char (Eistring *eistr, Ascbyte *c_string);
Bytecount eirstr_c_off (Eistring *eistr, Ascbyte *c_string,
     		   Bytecount off, Charcount charoff);
Charcount eirstr_c_off_char (Eistring *eistr, Ascbyte *c_string,
     			Bytecount off, Charcount charoff);


 ********************************************** 
 *                 Comparison                 * 
 ********************************************** 

int eicmp_* (Eistring *eistr, ...);
int eicmp_off_* (Eistring *eistr, Bytecount off, Charcount charoff,
                 Bytecount len, Charcount charlen, ...);
int eicasecmp_* (Eistring *eistr, ...);
int eicasecmp_off_* (Eistring *eistr, Bytecount off, Charcount charoff,
                     Bytecount len, Charcount charlen, ...);
int eicasecmp_i18n_* (Eistring *eistr, ...);
int eicasecmp_i18n_off_* (Eistring *eistr, Bytecount off, Charcount charoff,
                          Bytecount len, Charcount charlen, ...);

     Compare the Eistring with the other data.  Return value same as
     from strcmp.  The * is either ei for another Eistring (in
     which case ... is an Eistring), or c for a pure-ASCII string
     (in which case ... is a pointer to that string).  For anything
     more complex, first create an Eistring out of the source.
     Comparison is either simple (eicmp_...), ASCII case-folding
     (eicasecmp_...), or multilingual case-folding
     (eicasecmp_i18n_...).


More specifically, the prototypes are:

int eicmp_ei (Eistring *eistr, Eistring *eistr2);
int eicmp_off_ei (Eistring *eistr, Bytecount off, Charcount charoff,
                  Bytecount len, Charcount charlen, Eistring *eistr2);
int eicasecmp_ei (Eistring *eistr, Eistring *eistr2);
int eicasecmp_off_ei (Eistring *eistr, Bytecount off, Charcount charoff,
                      Bytecount len, Charcount charlen, Eistring *eistr2);
int eicasecmp_i18n_ei (Eistring *eistr, Eistring *eistr2);
int eicasecmp_i18n_off_ei (Eistring *eistr, Bytecount off,
     		      Charcount charoff, Bytecount len,
     		      Charcount charlen, Eistring *eistr2);

int eicmp_c (Eistring *eistr, Ascbyte *c_string);
int eicmp_off_c (Eistring *eistr, Bytecount off, Charcount charoff,
                 Bytecount len, Charcount charlen, Ascbyte *c_string);
int eicasecmp_c (Eistring *eistr, Ascbyte *c_string);
int eicasecmp_off_c (Eistring *eistr, Bytecount off, Charcount charoff,
                     Bytecount len, Charcount charlen,
                     Ascbyte *c_string);
int eicasecmp_i18n_c (Eistring *eistr, Ascbyte *c_string);
int eicasecmp_i18n_off_c (Eistring *eistr, Bytecount off, Charcount charoff,
                          Bytecount len, Charcount charlen,
                          Ascbyte *c_string);


 ********************************************** 
 *         Case-changing the Eistring         * 
 ********************************************** 

void eilwr (Eistring *eistr);
     Convert all characters in the Eistring to lowercase.
void eiupr (Eistring *eistr);
     Convert all characters in the Eistring to uppercase.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.10 Coding for Mule

Although Mule support is not compiled by default in XEmacs, many people are using it, and we consider it crucial that new code works correctly with multibyte characters. This is not hard; it is only a matter of following several simple user-interface guidelines. Even if you never compile with Mule, with a little practice you will find it quite easy to code Mule-correctly.

Note that these guidelines are not necessarily tied to the current Mule implementation; they are also a good idea to follow on the grounds of code generalization for future I18N work.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.10.1 Character-Related Data Types

First, let’s review the basic character-related datatypes used by XEmacs. Note that some of the separate typedefs are not mandatory, but they improve clarity of code a great deal, because one glance at the declaration can tell the intended use of the variable.

Ichar

An Ichar holds a single Emacs character.

Obviously, the equality between characters and bytes is lost in the Mule world. Characters can be represented by one or more bytes in the buffer, and Ichar is a C type large enough to hold any character. (This currently isn’t quite true for ISO 10646, which defines a character as a 31-bit non-negative quantity, while XEmacs characters are only 30-bits. This is irrelevant, unless you are considering using the ISO 10646 private groups to support really large private character sets—in particular, the Mule character set!—in a version of XEmacs using Unicode internally.)

Without Mule support, an Ichar is equivalent to an unsigned char. [[This doesn’t seem to be true; ‘lisp.h’ unconditionally ‘typedef’s Ichar to int.]]

Ibyte

The data representing the text in a buffer or string is logically a set of Ibytes.

XEmacs does not work with the same character formats all the time; when reading characters from the outside, it decodes them to an internal format, and likewise encodes them when writing. Ibyte (in fact unsigned char) is the basic unit of XEmacs internal buffers and strings format. An Ibyte * is the type that points at text encoded in the variable-width internal encoding.

One character can correspond to one or more Ibytes. In the current Mule implementation, an ASCII character is represented by the same Ibyte, and other characters are represented by a sequence of two or more Ibytes. (This will also be true of an implementation using UTF-8 as the internal encoding. In fact, only code that implements character code conversions and a very few macros used to implement motion by whole characters will notice the difference between UTF-8 and the Mule encoding.)

Without Mule support, there are exactly 256 characters, implicitly Latin-1, and each character is represented using one Ibyte, and there is a one-to-one correspondence between Ibytes and Ichars.

Charxpos

Charbpos

Charcount

A Charbpos represents a character position in a buffer. A Charcount represents a number (count) of characters. Logically, subtracting two Charbpos values yields a Charcount value. When representing a character position in a string, we just use Charcount directly. The reason for having a separate typedef for buffer positions is that they are 1-based, whereas string positions are 0-based and hence string counts and positions can be freely intermixed (a string position is equivalent to the count of characters from the beginning). When representing a character position that could be either in a buffer or string (for example, in the extent code), Charxpos is used. Although all of these are typedefed to EMACS_INT, we use them in preference to EMACS_INT to make it clear what sort of position is being used.

Charxpos, Charbpos and Charcount values are the only ones that are ever visible to Lisp.

Bytexpos

Bytecount

A Bytebpos represents a byte position in a buffer. A Bytecount represents the distance between two positions, in bytes. Byte positions in strings use Bytecount, and for byte positions that can be either in a buffer or string, Bytexpos is used. The relationship between Bytexpos, Bytebpos and Bytecount is the same as the relationship between Charxpos, Charbpos and Charcount.

Extbyte

When dealing with the outside world, XEmacs works with Extbytes, which are equivalent to char. The distance between two Extbytes is a Bytecount, since external text is a byte-by-byte encoding. Extbytes occur mainly at the transition point between internal text and external functions. XEmacs code should not, if it can possibly avoid it, do any actual manipulation using external text, since its format is completely unpredictable (it might not even be ASCII-compatible).

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.10.2 Working With Character and Byte Positions

Now that we have defined the basic character-related types, we can look at the macros and functions designed for work with them and for conversion between them. Most of these macros are defined in ‘buffer.h’, and we don’t discuss all of them here, but only the most important ones. Examining the existing code is the best way to learn about them.

MAX_ICHAR_LEN

This preprocessor constant is the maximum number of buffer bytes to represent an Emacs character in the variable width internal encoding. It is useful when allocating temporary strings to keep a known number of characters. For instance:

{
  Charcount cclen;
  ...
  {
    /* Allocate place for cclen characters. */
    Ibyte *buf = (Ibyte *) alloca (cclen * MAX_ICHAR_LEN);
...

If you followed the previous section, you can guess that, logically, multiplying a Charcount value with MAX_ICHAR_LEN produces a Bytecount value.

In the current Mule implementation, MAX_ICHAR_LEN equals 4. Without Mule, it is 1. In a mature Unicode-based XEmacs, it will also be 4 (since all Unicode characters can be encoded in UTF-8 in 4 bytes or less), but some versions may use up to 6, in order to use the large private space provided by ISO 10646 to “mirror” the Mule code space.

itext_ichar

set_itext_ichar

The itext_ichar macro takes a Ibyte pointer and returns the Ichar stored at that position. If it were a function, its prototype would be:

Ichar itext_ichar (Ibyte *p);

set_itext_ichar stores an Ichar to the specified byte position. It returns the number of bytes stored:

Bytecount set_itext_ichar (Ibyte *p, Ichar c);

It is important to note that set_itext_ichar is safe only for appending a character at the end of a buffer, not for overwriting a character in the middle. This is because the width of characters varies, and set_itext_ichar cannot resize the string if it writes, say, a two-byte character where a single-byte character used to reside.

A typical use of set_itext_ichar can be demonstrated by this example, which copies characters from buffer buf to a temporary string of Ibytes.

{
  Charbpos pos;
  for (pos = beg; pos < end; pos++)
    {
      Ichar c = BUF_FETCH_CHAR (buf, pos);
      p += set_itext_ichar (buf, c);
    }
}

Note how set_itext_ichar is used to store the Ichar and increment the counter, at the same time.

INC_IBYTEPTR

DEC_IBYTEPTR

These two macros increment and decrement an Ibyte pointer, respectively. They will adjust the pointer by the appropriate number of bytes according to the byte length of the character stored there. Both macros assume that the memory address is located at the beginning of a valid character.

Without Mule support, INC_IBYTEPTR (p) and DEC_IBYTEPTR (p) simply expand to p++ and p--, respectively.

bytecount_to_charcount

Given a pointer to a text string and a length in bytes, return the equivalent length in characters.

Charcount bytecount_to_charcount (Ibyte *p, Bytecount bc);

charcount_to_bytecount

Given a pointer to a text string and a length in characters, return the equivalent length in bytes.

Bytecount charcount_to_bytecount (Ibyte *p, Charcount cc);

itext_n_addr

Return a pointer to the beginning of the character offset cc (in characters) from p.

Ibyte *itext_n_addr (Ibyte *p, Charcount cc);

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.10.3 Conversion to and from External Data

When an external function, such as a C library function, returns a char pointer, you should almost never treat it as Ibyte. This is because these returned strings may contain 8bit characters which can be misinterpreted by XEmacs, and cause a crash. Likewise, when exporting a piece of internal text to the outside world, you should always convert it to an appropriate external encoding, lest the internal stuff (such as the infamous \201 characters) leak out.

The interface to conversion between the internal and external representations of text are the numerous conversion macros defined in ‘buffer.h’. There used to be a fixed set of external formats supported by these macros, but now any coding system can be used with them. The coding system alias mechanism is used to create the following logical coding systems, which replace the fixed external formats. The (dontusethis-set-symbol-value-handler) mechanism was enhanced to make this possible (more work on that is needed).

Often useful coding systems:

Qbinary

This is the simplest format and is what we use in the absence of a more appropriate format. This converts according to the binary coding system:

On input, bytes 0–255 are converted into (implicitly Latin-1) characters 0–255. A non-Mule xemacs doesn’t really know about different character sets and the fonts to display them, so the bytes can be treated as text in different 1-byte encodings by simply setting the appropriate fonts. So in a sense, non-Mule xemacs is a multi-lingual editor if, for example, different fonts are used to display text in different buffers, faces, or windows. The specifier mechanism gives the user complete control over this kind of behavior.
On output, characters 0–255 are converted into bytes 0–255 and other characters are converted into ‘~’.

Qnative

Format used for the external Unix environment—argv[], stuff from getenv(), stuff from the ‘/etc/passwd’ file, etc. This is encoded according to the encoding specified by the current locale. [[This is dangerous; current locale is user preference, and the system is probably going to be something else. Is there anything we can do about it?]]

Qfile_name

Format used for filenames. This is normally the same as Qnative, but the two should be distinguished for clarity and possible future separation – and also because Qfile_name can be changed using either the file-name-coding-system or pathname-coding-system (now obsolete) variables.

Qctext

Compound-text format. This is the standard X11 format used for data stored in properties, selections, and the like. This is an 8-bit no-lock-shift ISO2022 coding system. This is a real coding system, unlike Qfile_name, which is user-definable.

Qmswindows_tstr

Used for external data in all MS Windows functions that are declared to accept data of type LPTSTR or LPCSTR. This maps to either Qmswindows_multibyte (a locale-specific encoding, same as Qnative) or Qmswindows_unicode, depending on whether XEmacs is being run under Windows 9X or Windows NT/2000/XP.

Many other coding systems are provided by default.

There are two fundamental macros to convert between external and internal format, as well as various convenience macros to simplify the most common operations.

TO_INTERNAL_FORMAT converts external data to internal format, and TO_EXTERNAL_FORMAT converts the other way around. The arguments each of these receives are a source type, a source, a sink type, a sink, and a coding system (or a symbol naming a coding system).

A typical call looks like

TO_EXTERNAL_FORMAT (LISP_STRING, str, C_STRING_MALLOC, ptr, Qfile_name);

which means that the contents of the lisp string str are written to a malloc’ed memory area which will be pointed to by ptr, after the function returns. The conversion will be done using the file-name coding system, which will be controlled by the user indirectly by setting or binding the variable file-name-coding-system.

So we can have a call that looks like

TO_INTERNAL_FORMAT (DATA, (ptr, len),
                    MALLOC, (ptr, len),
                    coding_system);

The parenthesized argument pairs are required to make the preprocessor magic work.

Here are the different source and sink types:

DATA, (ptr, len),: input data is a fixed buffer of size len at address ptr
ALLOCA, (ptr, len),: output data is placed in an alloca()ed buffer of size len pointed to by ptr
MALLOC, (ptr, len),: output data is in a malloc()ed buffer of size len pointed to by ptr
C_STRING_ALLOCA, ptr,: equivalent to ALLOCA (ptr, len_ignored) on output.
C_STRING_MALLOC, ptr,: equivalent to MALLOC (ptr, len_ignored) on output
C_STRING, ptr,: equivalent to DATA, (ptr, strlen/wcslen (ptr)) on input
LISP_STRING, string,: input or output is a Lisp_Object of type string
LISP_BUFFER, buffer,: output is written to (point) in lisp buffer buffer
LISP_LSTREAM, lstream,: input or output is a Lisp_Object of type lstream
LISP_OPAQUE, object,: input or output is a Lisp_Object of type opaque

A source type of C_STRING or a sink type of C_STRING_ALLOCA or C_STRING_MALLOC is appropriate where the external API is not ’\0’-byte-clean – i.e. it expects strings to be terminated with a null byte. For external APIs that are in fact ’\0’-byte-clean, we should of course not use these.

The sinks to be specified must be lvalues, unless they are the lisp object types LISP_LSTREAM or LISP_BUFFER.

There is no problem using the same lvalue for source and sink.

Garbage collection is inhibited during these conversion operations, so it is OK to pass in data from Lisp strings using XSTRING_DATA.

Note that it doesn’t make sense for LISP_STRING to be a source for TO_INTERNAL_FORMAT or a sink for TO_EXTERNAL_FORMAT. You’ll get an assertion failure if you try.

99% of conversions involve raw data or Lisp strings as both source and sink, and usually data is output as alloca(), or sometimes xmalloc(). For this reason, convenience macros are defined for many types of conversions involving raw data and/or Lisp strings, especially when the output is an alloca()ed string. (When the destination is a Lisp string, there are other functions that should be used instead – build_extstring() and make_extstring(), for example.) Most convenience macros return the result as the return value. However, when two values need to be returned (that is, the output is sized data), both values are stored into variables that are passed into the macros as parameters. NOTE: All convenience macros are ultimately defined in terms of TO_EXTERNAL_FORMAT and TO_INTERNAL_FORMAT. Thus, any comments above about the workings of these macros also apply to all convenience macros.

A typical convenience macro is

  out = ITEXT_TO_EXTERNAL (in, codesys);

This is equivalent to

  TO_EXTERNAL_FORMAT (C_STRING, in, C_STRING_ALLOCA, out, codesys);

but is easier to write and somewhat clearer, since it clearly identifies the arguments without the clutter of having the preprocessor types mixed in. Furthermore, it returns the converted data (still in alloca() space) rather than storing it, which is far more convenient for most operations as there is no need to declare an extra temporary variable to hold the return value.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.10.4 General Guidelines for Writing Mule-Aware Code

This section contains some general guidance on how to write Mule-aware code, as well as some pitfalls you should avoid.

Never use char and char *.

In XEmacs, the use of char and char * is almost always a mistake. If you want to manipulate an Emacs character from “C”, use Ichar. If you want to examine a specific octet in the internal format, use Ibyte. If you want a Lisp-visible character, use a Lisp_Object and make_char. If you want a pointer to move through the internal text, use Ibyte *. Also note that you almost certainly do not need Ichar *.

All uses of char should be replaced with one of the following:

Ibyte

Pointer to internally-formatted text. The data representing the text in a buffer is logically a set of Ibytes.

CIbyte

Used when you are working with internal data but for whatever reason need to have it declared a char *. Examples are function arguments whose values are most commonly literal strings, or where you have to apply a stdlib string function to internal data.

In general, you should avoid this where possible and use Ascbyte if the text is just ASCII (e.g. string literals) or otherwise Ibyte, for consistency. For example, the new Mule workspace contains Ibyte versions of the stdlib string functions.

Extbyte, UExtbyte

Pointer to text in some external format, which can be defined as all formats other than the internal one. The data representing a string in “external” format (binary or any external encoding) is logically a set of Extbytes. Extbyte is guaranteed to be just a char, so for example strlen (Extbyte *) is OK. Extbyte is only a documentation device for referring to external text.

Ascbyte, UAscbyte

pure ASCII text, consisting of bytesf in a string in entirely US-ASCII format: (Nothing outside the range 00 - 7F).

Binbyte, CBinbyte, SBinbyte

Binary data that is not meant to be interpreted as text.

Rawbyte, CRawbyte

General data in memory, where we don’t care about whether it’s text or binary; often used when computing memory-based/byte-based offsets of pointers. In general, there should be no manipulation of the memory pointed to by these pointers other than just copying it around.

Boolbyte

A byte used to represent a boolean value: 0 or 1. Normally use plain Boolint, and only use Boolbyte to save space.

Bitbyte

A byte composed of bitfields. Hardly ever used.

Chbyte, UChbyte, SChbyte

A no-semantics char. Used (pretty-much) ONLY for casting arguments to functions accepting a char *, unsigned char *, etc. where the other types don’t exactly apply and what you are logically concerned with is the type of the function’s argument and not its semantics.

DO NOT DO NOT DO NOT DO NOT use this as a sloppy replacement for one of the other types. If you’re not using this as part of casting an argument to a function call, and you’re not Ben Wing, you’re using it wrong. Go find another one of the types.

Note the significance of the prefixed versions of the above types:

U: unsigned char
S: signed char
C: plain char

Be careful not to confuse Charcount, Bytecount, Charbpos and Bytebpos.

The whole point of using different types is to avoid confusion about the use of certain variables. Lest this effect be nullified, you need to be careful about using the right types.

Always convert external data

It is extremely important to always convert external data, because XEmacs can crash if unexpected 8-bit sequences are copied to its internal buffers literally.

This means that when a system function, such as readdir, returns a string, you normally need to convert it using one of the conversion macros described in the previous chapter, before passing it further to Lisp.

Actually, most of the basic system functions that accept ’\0’-terminated string arguments, like stat() and open(), have encapsulated equivalents that do the internal to external conversion themselves. The encapsulated equivalents have a qxe_ prefix and have string arguments of type Ibyte *, and you can pass internally encoded data to them, often from a Lisp string using XSTRING_DATA. (A better design might be to provide versions that accept Lisp strings directly.) [[Really? Then they’d either take Lisp_Objects and need to check type, or they’d take Lisp_Strings, and violate the rules about passing any of the specific Lisp types.]]

Also note that many internal functions, such as make_string, accept Ibytes, which removes the need for them to convert the data they receive. This increases efficiency because that way external data needs to be decoded only once, when it is read. After that, it is passed around in internal format.

Do all work in internal format

External-formatted data is completely unpredictable in its format. It may be fixed-width Unicode (not even ASCII compatible); it may be a modal encoding, in which case some occurrences of (e.g.) the slash character may be part of two-byte Asian-language characters, and a naive attempt to split apart a pathname by slashes will fail; etc. Internal-format text should be converted to external format only at the point where an external API is actually called, and the first thing done after receiving external-format text from an external API should be to convert it to internal text.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.10.5 An Example of Mule-Aware Code

As an example of Mule-aware code, we will analyze the string function, which conses up a Lisp string from the character arguments it receives. Here is the definition, pasted from alloc.c:

DEFUN ("string", Fstring, 0, MANY, 0, /*
Concatenate all the argument characters and make the result a string.
*/
       (int nargs, Lisp_Object *args))
{
  Ibyte *storage = alloca_array (Ibyte, nargs * MAX_ICHAR_LEN);
  Ibyte *p = storage;

  for (; nargs; nargs--, args++)
    {
      Lisp_Object lisp_char = *args;
      CHECK_CHAR_COERCE_INT (lisp_char);
      p += set_itext_ichar (p, XCHAR (lisp_char));
    }
  return make_string (storage, p - storage);
}

Now we can analyze the source line by line.

Obviously, string will be as long as there are arguments to the function. This is why we allocate MAX_ICHAR_LEN * nargs bytes on the stack, i.e. the worst-case number of bytes for nargs Ichars to fit in the string.

Then, the loop checks that each element is a character, converting integers in the process. Like many other functions in XEmacs, this function silently accepts integers where characters are expected, for historical and compatibility reasons. Unless you know what you are doing, CHECK_CHAR will also suffice. XCHAR (lisp_char) extracts the Ichar from the Lisp_Object, and set_itext_ichar stores it to storage, increasing p in the process.

Other instructive examples of correct coding under Mule can be found all over the XEmacs code. For starters, I recommend Fnormalize_menu_item_name in ‘menubar.c’. After you have understood this section of the manual and studied the examples, you can proceed writing new Mule-aware code.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.10.6 Mule-izing Code

A lot of code is written without Mule in mind, and needs to be made Mule-correct or “Mule-ized”. There is really no substitute for line-by-line analysis when doing this, but the following checklist can help:

Check all uses of XSTRING_DATA.
Check all uses of build_cistring and make_string.
Check all uses of tolower and toupper.
Check object print methods.
Check for use of functions such as write_cistring, write_fmt_string, stderr_out, stdout_out.
Check all occurrences of char and correct to one of the other typedefs described above.
Check all existing uses of TO_EXTERNAL_FORMAT, TO_INTERNAL_FORMAT, and any convenience macros (grep for ‘EXTERNAL_TO’, ‘TO_EXTERNAL’, and ‘TO_SIZED_EXTERNAL’).
In Windows code, string literals may need to be encapsulated with XETEXT.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.11 CCL

MACHINE CODE:

The machine code consists of a vector of 32-bit words.
The first such word specifies the start of the EOF section of the code;
this is the code executed to handle any stuff that needs to be done
(e.g. designating back to ASCII and left-to-right mode) after all
other encoded/decoded data has been written out.  This is not used for
charset CCL programs.

REGISTER: 0..7  -- referred by RRR or rrr

OPERATOR BIT FIELD (27-bit): XXXXXXXXXXXXXXX RRR TTTTT
        TTTTT (5-bit): operator type
        RRR (3-bit): register number
        XXXXXXXXXXXXXXXX (15-bit):
                CCCCCCCCCCCCCCC: constant or address
                000000000000rrr: register number

AAAA:   00000 +
        00001 -
        00010 *
        00011 /
        00100 %
        00101 &
        00110 |
        00111 ~

        01000 <<
        01001 >>
        01010 <8
        01011 >8
        01100 //
        01101 not used
        01110 not used
        01111 not used

        10000 <
        10001 >
        10010 ==
        10011 <=
        10100 >=
        10101 !=

OPERATORS:      TTTTT RRR XX..

SetCS:          00000 RRR C...C      RRR = C...C
SetCL:          00001 RRR .....      RRR = c...c
                c.............c
SetR:           00010 RRR ..rrr      RRR = rrr
SetA:           00011 RRR ..rrr      RRR = array[rrr]
                C.............C      size of array = C...C
                c.............c      contents = c...c

Jump:           00100 000 c...c      jump to c...c
JumpCond:       00101 RRR c...c      if (!RRR) jump to c...c
WriteJump:      00110 RRR c...c      Write1 RRR, jump to c...c
WriteReadJump:  00111 RRR c...c      Write1, Read1 RRR, jump to c...c
WriteCJump:     01000 000 c...c      Write1 C...C, jump to c...c
                C...C
WriteCReadJump: 01001 RRR c...c      Write1 C...C, Read1 RRR,
                C.............C      and jump to c...c
WriteSJump:     01010 000 c...c      WriteS, jump to c...c
                C.............C
                S.............S
                ...
WriteSReadJump: 01011 RRR c...c      WriteS, Read1 RRR, jump to c...c
                C.............C
                S.............S
                ...
WriteAReadJump: 01100 RRR c...c      WriteA, Read1 RRR, jump to c...c
                C.............C      size of array = C...C
                c.............c      contents = c...c
                ...
Branch:         01101 RRR C...C      if (RRR >= 0 && RRR < C..)
                c.............c      branch to (RRR+1)th address
Read1:          01110 RRR ...        read 1-byte to RRR
Read2:          01111 RRR ..rrr      read 2-byte to RRR and rrr
ReadBranch:     10000 RRR C...C      Read1 and Branch
                c.............c
                ...
Write1:         10001 RRR .....      write 1-byte RRR
Write2:         10010 RRR ..rrr      write 2-byte RRR and rrr
WriteC:         10011 000 .....      write 1-char C...CC
                C.............C
WriteS:         10100 000 .....      write C..-byte of string
                C.............C
                S.............S
                ...
WriteA:         10101 RRR .....      write array[RRR]
                C.............C      size of array = C...C
                c.............c      contents = c...c
                ...
End:            10110 000 .....      terminate the execution

SetSelfCS:      10111 RRR C...C      RRR AAAAA= C...C
                ..........AAAAA
SetSelfCL:      11000 RRR .....      RRR AAAAA= c...c
                c.............c
                ..........AAAAA
SetSelfR:       11001 RRR ..Rrr      RRR AAAAA= rrr
                ..........AAAAA
SetExprCL:      11010 RRR ..Rrr      RRR = rrr AAAAA c...c
                c.............c
                ..........AAAAA
SetExprR:       11011 RRR ..rrr      RRR = rrr AAAAA Rrr
                ............Rrr
                ..........AAAAA
JumpCondC:      11100 RRR c...c      if !(RRR AAAAA C..) jump to c...c
                C.............C
                ..........AAAAA
JumpCondR:      11101 RRR c...c      if !(RRR AAAAA rrr) jump to c...c
                ............rrr
                ..........AAAAA
ReadJumpCondC:  11110 RRR c...c      Read1 and JumpCondC
                C.............C
                ..........AAAAA
ReadJumpCondR:  11111 RRR c...c      Read1 and JumpCondR
                ............rrr
                ..........AAAAA

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.12 Microsoft Windows-Related Multilingual Issues

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.12.1 Microsoft Documentation

Documentation on international support in Windows is scattered throughout MSDN. Here are some good places to look:

C Runtime (CRT) intl support
1. Visual Tools and Languages -> Visual Studio 6.0 Documentation -> Visual C++ Documentation -> Using Visual C++ -> Run-Time Library Reference -> Internationalization
2. Visual Tools and Languages -> Visual Studio 6.0 Documentation -> Visual C++ Documentation -> Using Visual C++ -> Run-Time Library Reference -> Global Constants -> Locale Categories
3. Visual Tools and Languages -> Visual Studio 6.0 Documentation -> Visual C++ Documentation -> Using Visual C++ -> Run-Time Library Reference -> Appendixes -> Language and Country/Region Strings
4. Visual Tools and Languages -> Visual Studio 6.0 Documentation -> Visual C++ Documentation -> Using Visual C++ -> Run-Time Library Reference -> Appendixes -> Generic-Text Mappings
5. Function documentation for various functions: Visual Tools and Languages -> Visual Studio 6.0 Documentation -> Visual C++ Documentation -> Using Visual C++ -> Run-Time Library Reference -> Alphabetic Function Reference e.g. _setmbcp(), setlocale(), strcoll functions
Win32 API intl support
1. Platform SDK Documentation -> Base Services -> International Features
2. Platform SDK Documentation -> User Interface Services -> Windows User Interface -> User Input -> Keyboard Input -> Character Messages -> International Features
3. Backgrounders -> Windows Platform -> Windows 2000 -> International Support in Microsoft Windows 2000
Microsoft Layer for Unicode
Platform SDK Documentation -> Windows API -> Windows 95/98/Me Programming -> Windows 95/98/Me Overviews -> Microsoft Layer for Unicode on Windows 95/98/Me Systems
Look in the CRT sources! They come with VC++. See win32.c.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.12.2 Locales, code pages, and other concepts of “language”

First, make sure you clearly understand the difference between the C runtime library (CRT) and the Win32 API! See win32.c.

There are various different ways of representing the vague concept of “language”, and it can be very confusing. So:

The CRT library has the concept of “locale”, which is a combination of language and country, and which controls the way currency and dates are displayed, the encoding of data, etc.
XEmacs has the concept of “language environment”, more or less like a locale; although currently in most cases it just refers to the language, and no sub-language distinctions are made. (Exceptions are with Chinese, which has different language environments for Taiwan and mainland China, due to the different encodings and writing systems.)
Windows has a number of different language concepts:
1. There are “languages” and “sublanguages”, which correspond to the languages and countries of the C library – e.g. LANG_ENGLISH and SUBLANG_ENGLISH_US. These are identified by 8-bit integers, called the “primary language identifier” and “sublanguage identifier”, respectively. These are combined into a 16-bit integer or “language identifier” by MAKELANGID().
2. The language identifier in turn is combined with a “sort identifier” (and optionally a “sort version”) to yield a 32-bit integer called a “locale identifier” (type LCID), which identifies locales – the primary means of distinguishing language/regional settings and similar to C library locales.
3. A “code page” combines the XEmacs concepts of “charset” and “coding system”. It logically encompasses
  - - a set of supported characters
  - - an enumeration associating each character with a code point, which is a number or number pair; there may be disjoint ranges of numbers supported
  - - a way of encoding a series of characters into a string of bytes
  Note that the first two properties correspond to an XEmacs “charset” and the latter an XEmacs “coding system”.
  
  Traditional encodings are either simple one-byte encodings, or combination one-byte/two-byte encodings (aka MBCS encodings, where MBCS stands for “Multibyte Character Set”) with the following properties:
  - - all characters are encoded as a one-byte or two-byte sequence
  - - the encoding is stateless (non-modal)
  - - the lower 128 bytes are compatible with ASCII
  - - in the higher bytes, the value of the first byte (“lead byte”) determines whether a second byte follows
  - - the values used for second bytes may overlap those used for first bytes, and (in some encodings) include values in the low half; thus, moving backwards is hard, and pure-ASCII algorithms (e.g. finding the next slash) will fail unless rewritten to be MBCS-aware (neither of these problems exist in UTF-8 or in the XEmacs internal string encoding)
  Recent code pages, however, do not necessarily follow these properties – code pages have been expanded to include arbitrary encodings, such as UTF-8 (may have more than two bytes per character) and ISO-2022-JP (complex modal encoding).
4. Every Windows locale has four associated code pages: ANSI (an international standard or some Microsoft-created approximation; the native code page under Windows), OEM (a DOS encoding, still used in the FAT file system), Mac (an encoding used on the Macintosh) and EBCDIC (a non-ASCII-compatible encoding used on IBM mainframes, originally based on the BCD or “binary-coded decimal” encoding of numbers). All code pages associated with a locale follow (as far as I know) the properties listed above for traditional code pages. More than one locale can share a code page – e.g. all the Western European languages, including English, do.
5. Windows also has an “input locale identifier” (aka “keyboard layout id”) or HKL, which is a 32-bit integer composed of the 16-bit language identifier and a 16-bit “device identifier”, which originally specified a particular keyboard layout (e.g. the locale “US English” can have the QWERTY layout, the Dvorak layout, etc.), but has been expanded to include speech-to-text converters and other non-keyboard ways of inputting text. Note that both the HKL and LCID share the language identifier in the lower 16 bits, and in both cases a 0 in the upper 16 bits means “default” (sort order or device), providing a way to convert between HKL’s, LCID’s, and language identifiers (i.e. language/sublanguage pairs). The default keyboard layout for a language is (as far as I can determine) established using the Regional Settings control panel applet, where you can add input locales as combinations of language (actually language/sublanguage) and layout; presumably if you list only one input locale with a particular language, the corresponding layout is the default for that language. But what if you list more than one? You can specify a single default input locale, but there appears to be no way to do so on a per-language basis.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.12.3 More about code pages

Here is what MSDN says about code pages (article “Code Pages”):

A code page is a character set, which can include numbers, punctuation marks, and other glyphs. Different languages and locales may use different code pages. For example, ANSI code page 1252 is used for American English and most European languages; OEM code page 932 is used for Japanese Kanji.

A code page can be represented in a table as a mapping of characters to single-byte values or multibyte values. Many code pages share the ASCII character set for characters in the range 0x00 ?0x7F.

The Microsoft run-time library uses the following types of code pages:

– System-default ANSI code page. By default, at startup the run-time system automatically sets the multibyte code page to the system-default ANSI code page, which is obtained from the operating system. The call

setlocale ( LC_ALL, "" );

also sets the locale to the system-default ANSI code page.

– Locale code page. The behavior of a number of run-time routines is dependent on the current locale setting, which includes the locale code page. (For more information, see Locale-Dependent Routines.) By default, all locale-dependent routines in the Microsoft run-time library use the code page that corresponds to the locale. At run-time you can change or query the locale code page in use with a call to setlocale.

– Multibyte code page. The behavior of most of the multibyte-character routines in the run-time library depends on the current multibyte code page setting. By default, these routines use the system-default ANSI code page. At run-time you can query and change the multibyte code page with _getmbcp and _setmbcp, respectively.

– The "C" locale is defined by ANSI to correspond to the locale in which C programs have traditionally executed. The code page for the "C" locale (code page) corresponds to the ASCII character set. For example, in the "C" locale, islower returns true for the values 0x61 to 0x7A only. In another locale, islower may return true for these as well as other values, as defined by that locale.

Under “Locale-Dependent Routines” we notice the following setlocale dependencies:

atof, atoi, atol (LC_NUMERIC) is Routines (LC_CTYPE) isleadbyte (LC_CTYPE) localeconv (LC_MONETARY, LC_NUMERIC) MB_CUR_MAX (LC_CTYPE) _mbccpy (LC_CTYPE) _mbclen (LC_CTYPE) mblen (LC_CTYPE ) _mbstrlen (LC_CTYPE) mbstowcs (LC_CTYPE) mbtowc (LC_CTYPE) printf (LC_NUMERIC, for radix character output) scanf (LC_NUMERIC, for radix character recognition) setlocale/_wsetlocale (Not applicable) strcoll (LC_COLLATE) _stricoll/_wcsicoll (LC_COLLATE) _strncoll/_wcsncoll (LC_COLLATE) _strnicoll/_wcsnicoll (LC_COLLATE) strftime, wcsftime (LC_TIME) _strlwr (LC_CTYPE) strtod/wcstod/strol/wcstol/strtoul/wcstoul (LC_NUMERIC, for radix character recognition) _strupr (LC_CTYPE) strxfrm/wcsxfrm (LC_COLLATE) tolower/towlower (LC_CTYPE) toupper/towupper (LC_CTYPE) wcstombs (LC_CTYPE) wctomb (LC_CTYPE) _wtoi/_wtol (LC_NUMERIC)

NOTE: The above documentation doesn’t clearly explain the “locale code page” and “multibyte code page”. These are two different values, maintained respectively in the CRT global variables __lc_codepage and __mbcodepage. Calling e.g. setlocale (LC_ALL, "JAPANESE") sets ONLY __lc_codepage to 932 (the code page for Japanese), and leaves __mbcodepage unchanged (usually 1252, i.e. Windows-ANSI). You’d have to call _setmbcp() to change __mbcodepage. Figuring out from the documentation which routines use which code page is not so obvious. But:

from “Interpretation of Multibyte-Character Sequences” it appears that all “multibyte-character routines” use the multibyte code page except for mblen(), _mbstrlen(), mbstowcs(), mbtowc(), wcstombs(), and wctomb().
from “_setmbcp”: “The multibyte code page also affects multibyte-character processing by the following run-time library routines: _exec functions _mktemp _stat _fullpath _spawn functions _tempnam _makepath _splitpath tmpnam. In addition, all run-time library routines that receive multibyte-character argv or envp program arguments as parameters (such as the _exec and _spawn families) process these strings according to the multibyte code page. Hence these routines are also affected by a call to _setmbcp that changes the multibyte code page.”

Summary: from looking at the CRT source (which comes with VC++) and carefully looking through the docs, it appears that:

the “locale code page” is used by all of the routines listed above under “Locale-Dependent Routines” (EXCEPT _mbccpy() and _mbclen()), as well as any other place that converts between multibyte and Unicode strings, e.g. the startup code.
the “multibyte code page” is used in all of the mb*() routines except mblen(), _mbstrlen(), mbstowcs(), mbtowc(), wcstombs(), and wctomb(); also _exec*(), _spawn*(), _mktemp(), _stat(), _fullpath(), _tempnam(), _makepath(), _splitpath(), tmpnam(), and similar functions without the leading underscore.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.12.4 More about locales

In addition to the locale defined by the CRT, Windows (i.e. the Win32 API) defines various locales:

The system-default locale is the locale defined under “Language settings for the system” in the “Regional Options” control panel. This is NOT user-specific, and changing it requires a reboot (at least under Windows 2000). The ANSI code page of the system-default locale is returned by GetACP(), and you can specify this code page in calls e.g. to MultiByteToWideChar with the constant CP_ACP.
The user-default locale is the locale defined under “Settings for the current user” in the “Regional Options” control panel.
There is a thread-local locale set by SetThreadLocale. #### What is this used for?

The Win32 API has a bunch of multibyte functions – all of those that end with ...A(), and on which we spend so much effort in intl-encap-win32.c. These appear to ALWAYS use the ANSI code page of the system-default locale (GetACP(), CP_ACP). Note that this applies also, for example, to the encoding of filenames in all file-handling routines, including the CRT ones such as open(), because they pass their args unchanged to the Win32 API.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.12.5 Unicode support under Windows

Basically, the whole concept of locales and code pages is broken, because it is extremely messy to support and does not allow for documents that use multiple languages simultaneously. Unicode was designed in response to this, the idea being to create a single character set that could be used to encode all the world’s languages. Windows has supported Unicode since the beginning of the Win32 API. Internally, every code page has an associated table to convert the characters of that code page to and from Unicode, and the Win32 API itself probably (perhaps always) uses Unicode internally.

Under Windows there are two different versions of all library routines that accept or return text, those that handle Unicode text and those handling “multibyte” text, i.e. variable-width ASCII-compatible text in some national format such as EUC or Shift-JIS. Because Windows 95 basically doesn’t support Unicode but Windows NT does, and Microsoft doesn’t provide any way of writing a single binary that will work on both systems and still use Unicode when it’s available (although see below, Microsoft Layer for Unicode), we need to provide a way of run-time conditionalizing so you could have one binary for both systems. “Unicode-splitting” refers to writing code that will handle this properly. This means using Qmswindows_tstr as the external conversion format, calling the appropriate qxe...() Unicode-split version of library functions, and doing other things in certain cases, e.g. when a qxe() function is not present.

Unicode support also requires that the various Windows APIs be “Unicode-encapsulated”, so that they automatically call the ANSI or Unicode version of the API call appropriately and handle the size differences in structures. What this means is:

first, note that Windows already provides a sort of encapsulation of all APIs that deal with text. All such APIs are underlyingly provided in two versions, with an A or W suffix (ANSI or “wide” i.e. Unicode), and the compile-time constant UNICODE controls which is selected by the unsuffixed API. Same thing happens with structures, and also with types, where the generic types have names beginning with T – TCHAR, LPTSTR, etc.. Unfortunately, this is compile-time only, not run-time, so not sufficient. (Creating the necessary run-time encoding is not conceptually difficult, but very time-consuming to write. It adds no significant overhead, and the only reason it’s not standard in Windows is conscious marketing attempts by Microsoft to cripple Windows 95. FUCK MICROSOFT! They even describe in a KnowledgeBase article exactly how to create such an API [although we don’t exactly follow their procedure], and point out its usefulness; the procedure is also described more generally in Nadine Kano’s book on Win32 internationalization – written SIX YEARS AGO! Obviously Microsoft has such an API available internally.)
what we do is provide an encapsulation of each standard Windows API call that is split into A and W versions. current theory is to avoid all preprocessor games; so we name the function with a prefix – “qxe” currently – and require callers to use the prefixed name. Callers need to explicitly use the W version of all structures, and convert text themselves using Qmswindows_tstr. the qxe encapsulated version will automatically call the appropriate A or W version depending on whether we’re running on 9x or NT (you can force use of the A calls on NT, e.g. for testing purposes, using the command- line switch -nuni aka -no-unicode-lib-calls), and copy data between W and A versions of the structures as necessary.
We require the caller to handle the actual translation of text to avoid possible overflow when dealing with fixed-size Windows structures. There are no such problems when copying data between the A and W versions because ANSI text is never larger than its equivalent Unicode representation.

NOTE NOTE NOTE: As of August 2001, Microsoft (finally! See my nasty comment above) released their own Unicode-encapsulation library, called Microsoft Layer for Unicode on Windows 95/98/Me Systems. It tries to be more transparent than we are, in that

its routines do ANSI/Unicode string translation, while we don’t, for efficiency (we already have to do internal/external conversion so it’s no extra burden to do the proper conversion directly rather than always converting to Unicode and then doing a second conversion to ANSI as necessary)
rather than requiring separately-named routines (qxeFooBar), they physically override the existing routines at the link level. it also appears that they do this BADLY, in that if you link with the MLU, you get an application that runs ONLY on Win9x!!! (hint – use GetProcAddress()). there’s still no way to create a single binary! fucking losers.
they assume you compile with UNICODE defined, so there’s no need for the application to explicitly use ...W structures, as we require.
they also intercept windows procedures to deal with notify messages as necessary, which we don’t do yet.
they (of course) don’t use Extbyte.

at some point (especially when they fix the single-binary problem!), we should consider switching. for the meantime, we’ll stick with what i’ve already written. perhaps we should think about adopting some of the greater transparency they have; but i opted against transparency on purpose, to make the code easier to follow for someone who’s not familiar with it. until our library is really complete and bug-free, we should think twice before doing this.

According to Microsoft documentation, only the following functions are provided under Windows 9x to support Unicode (see MSDN page “Windows 95/98/Me General Limitations”):

EnumResourceLanguagesW EnumResourceNamesW EnumResourceTypesW ExtTextOutW FindResourceW FindResourceExW GetCharWidthW GetCommandLineW GetTextExtentPointW GetTextExtentPoint32W lstrcatW lstrcpyW lstrlenW MessageBoxW MessageBoxExW MultiByteToWideChar TextOutW WideCharToMultiByte

also maybe GetTextExtentExPoint? (KB Q125671 “Unicode Functions Supported by Windows 95”)

Q210341 says this in addition:

SUMMARY:

Although Windows 95 is an eight-bit ANSI, or for Far East Windows, a Multibyte (MBCS) character set operating system, it implements a few Unicode functions. Windows 98 has added support for a few more functions and there are techniques to implement additional Unicode support.

MORE INFORMATION:

Windows 95 is natively an eight-bit character code operating system. That is, it fundamentally processes all character strings one byte at a time. Far East versions of Windows 95 are called Multibyte Character Set (MBCS) systems because they use a signal or lead byte combined with a second trailing byte to expand the character code range beyond the 256 limitation of a one-byte representation.

The Unicode standard offers application developers an opportunity to work with text without the limitations of character set based systems. For more information on the Unicode standard see the References" section of this article. Windows NT is a fully Unicode capable operating system so it may be desirable to write software that supports Unicode on Windows 95.

Even though Windows 95 and Windows 98 are not Unicode based, they do provide some limited Unicode functionality. Drawing of Unicode text is possible because the TrueType fonts that are used by Windows are encoded using Unicode. Therefore, a small subset of Win32 functions have wide character (Unicode) equivalents that are implemented in Windows 95. To review the list of these functions that was first published for Windows 95 see the white paper listed in the "References" section of this article.

The Quick Info information in the Platform SDK describes the following wide character functions as implemented on Windows 95:

[same list as above minus GetTextExtentExPoint, and minus lstrcpy/lstrcat]

For Windows 98, there have been two more functions implemented:

[lstrcpyW/lstrcatW]

Also available to applications on Windows 95 and later is the CF_UNICODETEXT clipboard format for exchanging/converting Unicode text across the clipboard. See Nadine Kano’s book listed in the "References" section of this article.

With this API subset, an application can read, write, display, and convert Unicode data. However, in some cases an application developer working with Unicode may find a need to work directly with the glyphs in the TrueType font file.

Such a case arises if a software developer would like to use the services of the GetGlyphOutline() function. Unfortunately, there is no wide character implementation of this function on Windows 95. However, this function does work with TrueType glyph indices so the solution is convert the Unicode character code to a glyph index.

A developer might also want to take advantage of the TrueType Open tables of a font to perform ligature or contextual glyph substitution. To do this, the application would need to work with glyph indices. See the "References" section of this article for more information on converting Unicode to glyph indices.

REFERENCES:

For additional information about Unicode and the GetGlyphOutline function, click the article number below to view the article in the Microsoft Knowledge Base:

241358 PRB: The GetGlyphOutlineW Function Fails on Windows 95 and Windows 98

For additional information about converting Unicode character codes, click the article number below to view the article in the Microsoft Knowledge Base:

241020 HOWTO: Translate Unicode Character Codes to TrueType Glyph Indices in Windows 95

For information on writing applications for world wide markets, please see the following book:

Developing International Software for Windows 95 and Windows NT by Nadine Kano. ISBN 1-55615-840-8 Microsoft Press. Also available on MSDN in the Books section.

Background white paper: Differences in Win32 API Implementations Among Windows Operating Systems by Noel Nyman.

Available on MSDN in the Windows Platform Guidelines section

However, the C runtime library provides some additional support (according to the CRT sources, as the docs are not very clear on this):

wmain() is completely supported, and appropriate Unicode-formatted argv and envp will always be passed.
Likewise, wWinMain() is completely supported. (NOTE: The docs are not at all clear on how these various entry points interact, and implies that a windows-subsystem program “must” use WinMain(), while a console- subsystem program “must” use main(), and a program compiled with UNICODE (which we don’t, see above) “must” use the w*() versions, while a program not compiled this way “must” use the plain versions. In fact it appears that the CRT provides four different compiler entry points, namely w?(main|WinMain)CRTStartup, and we simply choose the one we like using the appropriate link flag.
_wenviron, _wputenv

NOTE:

wsetargv.obj uses routines that were buggily left out of MSVCRT; anyway, from looking at the source, it does NOT correctly work under Win 9x as it blindly calls the Unicode version of Unicode-split APIs such as FindFirstFile)
the w*() file routines are NOT supported – or at least, they blindly call the ...W() versions of the Win32 API calls.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.12.6 The golden rules of writing Unicode-safe code

There are no preprocessor games going on.
Do not set the UNICODE constant.
You need to change your code to call the Windows API prefixed with "qxe" functions (when they exist) and use the ...W structs instead of the generic ones. String arguments in the qxe functions are of type Extbyte *.
You code is responsible for conversion of text arguments. We try to handle everything else – the argument differences, the copying back and forth of structures, etc. Use Qmswindows_tstr and macros such as C_STRING_TO_TSTR. You are also responsible for interpreting and specifying string sizes, which have not been changed. Usually these are in characters, meaning you need to divide by XETCHAR_SIZE. (But, some functions want sizes in bytes, even with Unicode strings. Look in the documentation.) Use XETEXT when specifying string constants, so that they show up in Unicode as necessary.
If you need to process external strings (in general you should not do this; do all your manipulations in internal format and convert at the point of entry into or exit from the function), use the xet...() functions.
If you have to declare a fixed array to hold a string coming from Windows (and hence either multibyte or Unicode), declare it of type Extbyte[] and multiply the size by MAX_XETCHAR_SIZE.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.12.7 The format of the locale in setlocale()

It appears that under Unix the standard format for the string in setlocale() involves two-letter language and country abbreviations, e.g. ja or ja_jp or ja_jp.euc for Japanese. Windows (MSDN article "Language Strings" in the run-time reference appendix, see doc list above) speaks of "(primary) language" and "sublanguage" (usually a country, but in the case of Chinese the sublanguage is "simplified" or "traditional"). It is highly flexible in what it takes, and thankfully it canonicalizes the result to a unique form "Language_Country.Encoding". It allows (note that all specifications can be in any case):

the full "language_country.encoding" specification or just language_country", in which case the default encoding will be chosen.
a three-letter acronym, consisting of the ISO-standard two-letter language abbreviation followed by a third letter indicating the sublanguage.
just a language name, e.g. "dutch", standing for the combination of the language with "default" as sublanguage, referring to the default (often "prototypical") country for that language (in this case the Netherlands). You can abbreviate the name by removing any number of letters from the end. Ambiguity is not a problem: Even specifying just a single letter is valid providing any language starting with that letter exists, but the result may not be what you want (e.g. "c" maps to "catalan", not "chinese", "czech", etc.). The way of resolving ambiguity appears fairly random – it’s not alphabetical ("a" maps to "arabic" not "albanian").
a combination of language and sublanguage separated by a hyphen, e.g. "dutch-belgian"; note that the sublanguage designator in this case is NOT necessarily the same as the country, e.g. "belgian" vs. "belgium". "dutch-belgium" (or even "dutch-belg") does NOT get you the right result, but returns "Dutch_Netherlands.1252" instead! This is because, although you may not abbreviate the result, Windows accepts any unknown value in the sublanguage field and treats it as equivalent to "default". Note also that the if the sublanguage name has underscores in it, you need to change them to spaces, e.g. "spanish-dominican republic".
sometimes, just a sublanguage name, e.g. "belgian", standing for the combination of one of the languages spoken in that region and the sublanguage of the region – in this case Dutch. Note that there is no guarantee of "protypicality" in this case in choice of language! You could hardly say that Dutch (aka Flemish) is more prototypical of Belgium than French. You cannot abbreviate this form, if it’s allowed at all.

In addition:

note further that you are not limited to the language/sublanguage combinations predefined by Windows. You can set weird combinations like "Chinese_Kenya.1255" (Chinese spoken in Kenya, represented by Windows-1255, i.e. Hebrew!) and Windows don’t complain, despite the language-encoding inconsistency. You can also make up a weird combination and leave out the encoding, e.g. "Chinese_Qatar", which maps to "Chinese_Qatar.1256", where Windows-1256 is Arabic – i.e. it appears to be choosing the encoding based on a default for the country.
note also that the names for countries are often not what you expect. "urdu_pakistan" fails, and just "urdu" shows why, as it maps to "Urdu_Islamic Republic of Pakistan.1256". That is, some countries exist in their full name, and the canonicalized form with underscore is not very forgiving in its handling of country specifications. Similarly, Uzbekistan is "Republic of Uzbekistan", and "China" is "People’s Republic of China" – but in this latter case, unlike the other two, just "China" works as an alias, e.g. "uzbek_china" maps to "Uzbek_People’s Republic of China.936".
note that just the two-letter ISO language code is NOT allowed. Sometimes you’ll get lucky (e.g. "fr" does map to "france"), but sometimes you’ll get no match (e.g. "pl"), and sometimes you’ll get really unlucky in that the call will succeed but with the wrong language (e.g. "es" maps to "estonian", not "spanish").

As an example, MSDN article "Language Strings" indicates that German (default) can be specified using "deu" or "german"; German (Austrian) with "dea" or "german-austrian"; German (Swiss) with "des", "german-swiss", or "swiss"; French (Swiss) with "french-swiss" or "frs"; and English (USA) with "american", "american english", "american-english", "english-american", "english-us", "english-usa", "enu", "us", or "usa". This is not, of course, an exhaustive list even for just the given locales – just "english" works in practice because English (Default) maps to English (USA). (#### Is this always the case?)

Given the canonicalization, we don’t have to worry too much about the different kinds of inputs to setlocale() – unlike for Unix, where no canonicalization is usually performed, the particular locales that exist vary tremendously from OS to OS, and we need to parse the uncanonicalized locale spec, directly from the user, to figure out the encoding to use, making various guesses if not enough information is present. Yuck! The tricky thing under Windows is figuring how to deal with the sublang. It appears that the trick of simply passing the text of the manifest constant itself of the sublang, with appropriate hacking (e.g. of underscore to space), works most of the time.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.12.8 Random other Windows I18N docs

Introduction to Internationalization Issues in the Win32 API

Abstract: This page provides an overview of the aspects of the Win32 internationalization API that are relevant to XEmacs, including the basic distinction between multibyte and Unicode encodings. Also included are pointers to how XEmacs should make use of this API.

The Win32 API is quite well-designed in its handling of strings encoded for various character sets. The API is geared around the idea that two different methods of encoding strings should be supported. These methods are called multibyte and Unicode, respectively. The multibyte encoding is compatible with ASCII strings and is a more efficient representation when dealing with strings containing primarily ASCII characters, but it has a great number of serious deficiencies and limitations, including that it is very difficult and error-prone to work with strings in this encoding, and any particular string in a multibyte encoding can only contain characters from a very limited number of character sets. The Unicode encoding rectifies all of these deficiencies, but it is not compatible with ASCII strings (in other words, an existing program will not be able to handle the encoded strings unless it is explicitly modified to do so), and it takes up twice as much memory space as multibyte encodings when encoding a purely ASCII string.

Multibyte encodings use a variable number of bytes (either one or two) to represent characters. ASCII characters are also represented by a single byte with its high bit not set, and non-ASCII characters are represented by one or two bytes, the first of which always has its high bit set. (The second byte, when it exists, may or may not have its high bit set.) There is no single multibyte encoding. Instead, there is generally one encoding per non-ASCII character set. Such an encoding is capable of representing (besides ASCII characters, of course) only characters from one (or possibly two) particular character sets.

Multibyte encoding makes processing of strings very difficult. For example, given a pointer to the beginning of a character within a string, finding the pointer to the beginning of the previous character may require backing up all the way to the beginning of the string, and then moving forward. Also, an operation such as separating out the components of a path by searching for backslashes will fail if it’s implemented in the simplest (but not multibyte-aware) fashion, because it may find what appears to be a backslash, but which is actually the second byte of a two-byte character. Also, the limited number of character sets that any particular multibyte encoding can represent means that loss of data is likely if a string is converted from the XEmacs internal format into a multibyte format.

For these reasons, the C code in XEmacs should never do any sort of work with multibyte encoded strings (or with strings in any external encoding for that matter). Strings should always be maintained in the internal encoding, which is predictable, and converted to an external encoding only at the point where the string moves from the XEmacs C code and enters a system library function. Similarly, when a string is returned from a system library function, it should be immediately converted into the internal coding before any operations are done on it.

Unicode, unlike multibyte encodings, is a fixed-width encoding where every character is represented using 16 bits. It is also capable of encoding all the characters from all the character sets in common use in the world. The predictability and completeness of the Unicode encoding makes it a very good encoding for strings that may contain characters from many character sets mixed up with each other. At the same time, of course, it is incompatible with routines that expect ASCII characters and also incompatible with general string manipulation routines, which will encounter a great number of what would appear to be embedded nulls in the string. It also takes twice as much room to encode strings containing primarily ASCII characters. This is why XEmacs does not use Unicode or similar encoding internally for buffers.

The Win32 API cleverly deals with the issue of 8 bit vs. 16 bit characters by declaring a type called TCHAR which specifies a generic character, either 8 bits or 16 bits. Generally TCHAR is defined to be the same as the simple C type char, unless the preprocessor constant UNICODE is defined, in which case TCHAR is defined to be WCHAR, which is a 16 bit type. Nearly all functions in the Win32 API that take strings are defined to take strings that are actually arrays of TCHARs. There is a type LPTSTR which is defined to be a string of TCHARs and another type LPCTSTR which is a const string of TCHARs. The theory is that any program that uses TCHARs exclusively to represent characters and does not make assumptions about the size of a TCHAR or the way that the characters are encoded should work transparently regardless of whether the UNICODE preprocessor constant is defined, which is to say, regardless of whether 8 bit multibyte or 16 bit Unicode characters are being used. The way that this is actually implemented is that every Win32 API function that takes a string as an argument actually maps to one of two functions which are suffixed with an A (which stands for ANSI, and means multibyte strings) or W (which stands for wide, and means Unicode strings). The mapping is, of course, controlled by the same UNICODE preprocessor constant. Generally all structures containing strings in them actually map to one of two different kinds of structures, with either an A or a W suffix after the structure name.

Unfortunately, not all of the implementations of the Win32 API implement all of the functionality described above. In particular, Windows 95 does not implement very much Unicode functionality. It does implement functions to convert multibyte-encoded strings to and from Unicode strings, and provides Unicode versions of certain low-level functions like ExtTextOut(). In fact, all of the rest of the Unicode versions of API functions are just stubs that return an error. Conversely, all versions of Windows NT completely implement all the Unicode functionality, but some versions (especially versions before Windows NT 4.0) don’t implement much of the multibyte functionality. For this reason, as well as for general code cleanliness, XEmacs needs to be written in such a way that it works with or without the UNICODE preprocessor constant being defined.

Getting XEmacs to run when all strings are Unicode primarily involves removing any assumptions made about the size of characters. Remember what I said earlier about how the point of conversion between internally and externally encoded strings should occur at the point of entry or exit into or out of a library function. With this in mind, an externally encoded string in XEmacs can be treated simply as an arbitrary sequence of bytes of some length which has no particular relationship to the length of the string in the internal encoding.

Use Qnative for Unix conversion, Qmswindows_tstr for Windows ...

String constants that are to be passed directly to Win32 API functions, such as the names of window classes, need to be bracketed in their definition with a call to the macro XETEXT. This appropriately makes a string of either regular or wide chars, which is to say this string may be prepended with an L (causing it to be a wide string) depending on XEUNICODE_P.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.13 Modules for Internationalization

‘mule-canna.c’
‘mule-ccl.c’
‘mule-charset.c’
‘mule-charset.h’
‘file-coding.c’
‘file-coding.h’
‘mule-coding.c’
‘mule-mcpath.c’
‘mule-mcpath.h’
‘mule-wnnfns.c’
‘mule.c’

These files implement the MULE (Asian-language) support. Note that MULE actually provides a general interface for all sorts of languages, not just Asian languages (although they are generally the most complicated to support). This code is still in beta.

‘mule-charset.*’ and ‘file-coding.*’ provide the heart of the XEmacs MULE support. ‘mule-charset.*’ implements the charset Lisp object type, which encapsulates a character set (an ordered one- or two-dimensional set of characters, such as US ASCII or JISX0208 Japanese Kanji).

‘file-coding.*’ implements the coding-system Lisp object type, which encapsulates a method of converting between different encodings. An encoding is a representation of a stream of characters, possibly from multiple character sets, using a stream of bytes or words, and defines (e.g.) which escape sequences are used to specify particular character sets, how the indices for a character are converted into bytes (sometimes this involves setting the high bit; sometimes complicated rearranging of the values takes place, as in the Shift-JIS encoding), etc. It also contains some generic coding system implementations, such as the binary (no-conversion) coding system and a sample gzip coding system.

‘mule-coding.c’ contains the implementations of text coding systems.

‘mule-ccl.c’ provides the CCL (Code Conversion Language) interpreter. CCL is similar in spirit to Lisp byte code and is used to implement converters for custom encodings.

‘mule-canna.c’ and ‘mule-wnnfns.c’ implement interfaces to external programs used to implement the Canna and WNN input methods, respectively. This is currently in beta.

‘mule-mcpath.c’ provides some functions to allow for pathnames containing extended characters. This code is fragmentary, obsolete, and completely non-working. Instead, pathname-coding-system is used to specify conversions of names of files and directories. The standard C I/O functions like ‘open()’ are wrapped so that conversion occurs automatically.

‘mule.c’ contains a few miscellaneous things. It currently seems to be unused and probably should be removed.

‘intl.c’

This provides some miscellaneous internationalization code for implementing message translation and interfacing to the Ximp input method. None of this code is currently working.

‘iso-wide.h’

This contains leftover code from an earlier implementation of Asian-language support, and is not currently used.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.14 The Great Mule Merge of March 2002

In March 2002, just after the release of XEmacs 21.5 beta 5, Ben Wing merged what was nominally a very large refactoring of the “Mule” multilingual support code into the mainline. This merge added robust support for Unicode on all platforms, and by providing support for Win32 Unicode APIs made the Mule support on the Windows platform a reality. This merge also included a large number of other changes and improvements, not necessarily related to internationalization.

This node basically amounts to the ChangeLog for 2002-03-12.

Some effort has been put into proper markup for code and file names, and some reorganization according to themes of revision. However, much remains to be done.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.14.1 List of changed files in new Mule workspace

This node lists the files that were touched in the Great Mule Merge.

Deleted files

src/iso-wide.h
src/mule-charset.h
src/mule.c
src/ntheap.h
src/syscommctrl.h
lisp/files-nomule.el
lisp/help-nomule.el
lisp/mule/mule-help.el
lisp/mule/mule-init.el
lisp/mule/mule-misc.el
nt/config.h

Other deleted files

These files were all zero-width and accidentally present.

src/events-mod.h
tests/Dnd/README.OffiX
tests/Dnd/dragtest.el
netinstall/README.xemacs
lib-src/srcdir-symlink.stamp

New files

CHANGES-ben-mule
README.ben-mule-21-5
README.ben-separate-stderr
TODO.ben-mule-21-5
etc/TUTORIAL.{cs,es,nl,sk,sl}
etc/unicode/*
lib-src/make-mswin-unicode.pl
lisp/code-init.el
lisp/resize-minibuffer.el
lisp/unicode.el
lisp/mule/china-util.el
lisp/mule/cyril-util.el
lisp/mule/devan-util.el
lisp/mule/devanagari.el
lisp/mule/ethio-util.el
lisp/mule/indian.el
lisp/mule/japan-util.el
lisp/mule/korea-util.el
lisp/mule/lao-util.el
lisp/mule/lao.el
lisp/mule/mule-locale.txt
lisp/mule/mule-msw-init.el
lisp/mule/thai-util.el
lisp/mule/thai.el
lisp/mule/tibet-util.el
lisp/mule/tibetan.el
lisp/mule/viet-util.el
src/charset.h
src/intl-auto-encap-win32.c
src/intl-auto-encap-win32.h
src/intl-encap-win32.c
src/intl-win32.c
src/intl-x.c
src/mule-coding.c
src/text.c
src/text.h
src/unicode.c
src/s/win32-common.h
src/s/win32-native.h

Changed files

“Too numerous to mention.” (Ben didn’t write that, I did, but it’s a good guess that’s the intent....)

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.14.2 Changes to the MULE subsystems

configure changes

file-coding always compiled in. eol detection is off by default on unix, non-mule, but can be enabled with configure option --with-default-eol-detection or command-line flag -eol.
code that selects which files are compiled is mostly moved to ‘Makefile.in.in’. see comment in ‘Makefile.in.in’.
vestigial i18n3 code deleted.
new cygwin mswin libs imm32 (input methods), mpr (user name enumeration).
check for link, symlink.
vfork-related code deleted.
fix ‘configure.usage’. (delete --with-file-coding, --no-doc-file, add --with-default-eol-detection, --quick-build).
‘nt/config.h’ has been eliminated and everything in it merged into ‘config.h.in’ and ‘s/windowsnt.h’. see ‘config.h.in’ for more info.
massive rewrite of ‘s/windowsnt.h’, ‘m/windowsnt.h’, ‘s/cygwin32.h’, ‘s/mingw32.h’. common code moved into ‘s/win32-common.h’, ‘s/win32-native.h’.
in ‘nt/xemacs.mak’, ‘nt/config.inc.samp’, variable is called MULE, not HAVE_MULE, for consistency with sources.
define TABDLY, TAB3 in ‘freebsd.h’ (#### from where?)

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.14.3 Pervasive changes throughout XEmacs sources

all #ifdef FILE_CODING statements removed from code.

Changes to string processing

new ‘qxe()’ string functions that accept Intbyte * as arguments. These work exactly like the standard strcmp(), strcpy(), sprintf(), etc. except for the argument declaration differences. We use these whenever we have Intbyte * strings, which is quite often.
new fun build_istring() takes an Intbyte *. also new funs build_msg_intstring (like build_istring()) and build_msg_string (like build_cistring()) to do a GETTEXT() before building the string. (elimination of old build_translated_string(), replaced by build_msg_string()).
function intern_istring() for Intbyte * arguments, like intern().
numerous places throughout code where char * replaced with something else, e.g. Char_ASCII *, Intbyte *, Char_Binary *, etc. same with unsigned char *, going to UChar_Binary *, etc.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.14.4 Changes to specific subsystems

Changes to the init code

lots of init code rewritten to be mule-correct.

Changes to processes

always call egetenv(), never getenv(), for mule correctness.

command line (‘`startup.el`’, ‘`emacs.c`’)

new option -eol to enable auto EOL detection under non-mule unix.
new option -nuni (--no-unicode-lib-calls) to force use of non-Unicode API’s under Windows NT, mostly for debugging purposes.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.14.5 Mule changes by theme

the code that handles the details of processing multilingual text has been consolidated to make it easier to extend it. it has been yanked out of various files (‘buffer.h’, ‘mule-charset.h’, ‘lisp.h’, ‘insdel.c’, ‘fns.c’, ‘file-coding.c’, etc.) and put into ‘text.c’ and ‘text.h’. ‘mule-charset.h’ has also been renamed ‘charset.h’. all long comments concerning the representations and their processing have been consolidated into ‘text.c’.
major rewriting of file-coding. it’s mostly abstracted into coding systems that are defined by methods (similar to devices and specifiers), with the ultimate aim being to allow non-i18n coding systems such as gzip. there is a “chain” coding system that allows multiple coding systems to be chained together. (it doesn’t yet have the concept that either end of a coding system can be bytes or chars; this needs to be added.)
large amounts of code throughout the code base have been Mule-ized, not just Windows code.
total rewriting of OS locale code. it notices your locale at startup and sets the language environment accordingly, and calls setlocale() and sets LANG when you change the language environment. new language environment properties locale, mswindows-locale, cygwin-locale, native-coding-system, to determine langenv from locale and vice-versa; fix all language environments (lots of language files). langenv startup code rewritten. many new functions to convert between locales, language environments, etc.
major overhaul of the way default values for the various coding system variables are handled. all default values are collected into one location, a new file ‘code-init.el’, which provides a unified mechanism for setting and querying what i call “basic coding system variables” (which may be aliases, parts of conses, etc.) and a mechanism of different configurations (Windows w/Mule, Windows w/o Mule, Unix w/Mule, Unix w/o Mule, unix w/o Mule but w/auto EOL), each of which specifies a set of default values. we determine the configuration at startup and set all the values in one place. (‘code-init.el’, ‘code-files.el’, ‘coding.el’, ...)
i copied the remaining language-specific files from fsf. i made some minor changes in certain cases but for the most part the stuff was just copied and may not work.
ms windows mule support, with full unicode support. required font, redisplay, event, other changes. ime support from ikeyama.

Lisp-Visible Changes:

ensure that escape-quoted works correctly even without Mule support and use it for all auto-saves. (‘auto-save.el’, ‘fileio.c’, ‘coding.el’, ‘files.el’)
new var buffer-file-coding-system-when-loaded specifies the actual coding system used when the file was loaded (buffer-file-coding-system is usually the same, but may be changed because it controls how the file is written out). use it in revert-buffer (‘files.el’, ‘code-files.el’) and in new submenu File->Revert Buffer with Specified Encoding (‘menubar-items.el’).
improve docs on how the coding system is determined when a file is read in; improved docs are in both find-file and insert-file-contents and a reference to where to find them is in buffer-file-coding-system-for-read. (‘files.el’, ‘code-files.el’)
new (brain-damaged) FSF way of calling post-read-conversion (only one arg, not two) is supported, along with our two-argument way, as best we can. (‘code-files.el’)
add inexplicably missing var default-process-coding-system. use it. get rid of former hacked-up way of setting these defaults using comint-exec-hook. also fun set-buffer-process-coding-system. (‘code-process.el’, ‘code-cmds.el’, ‘process.c’)
remove function set-default-coding-systems; replace with set-default-output-coding-systems, which affects only the output defaults (buffer-file-coding-system, output half of default-process-coding-system). the input defaults should not be set by this because they should always remain undecided in normal circumstances. fix prefer-coding-system to use the new function and correct its docs.
fix bug in coding-system-change-eol-conversion (‘code-cmds.el’)
recognize all eol types in prefer-coding-system (‘code-cmds.el’)
rewrite coding-system-category to be correct (‘coding.el’)

Internal Changes

major improvements to eistring code, fleshing out of missing funs.

Separate encoding and decoding lstreams have been combined into a single coding lstream. Functions‘ make_encoding_*_stream’ and ‘make_decoding_*_stream’ have been combined into ‘make_coding_*_stream’, which takes an argument specifying whether encode or decode is wanted.
remove last vestiges of I18N3, I18N4 code.
ascii optimization for strings: we keep track of the number of ascii chars at the beginning and use this to optimize byte<->char conversion on strings.
‘mule-misc.el’, ‘mule-init.el’ deleted; code in there either deleted, rewritten, or moved to another file.
‘mule.c’ deleted.
move non-Mule-specific code out of ‘mule-cmds.el’ into ‘code-cmds.el’. (coding-system-change-text-conversion; remove duplicate coding-system-change-eol-conversion)
remove duplicate set-buffer-process-coding-system (‘code-cmds.el’)
add some commented-out code from FSF ‘mule-cmds.el’ (find-coding-systems-region-subset-p, find-coding-systems-region, find-coding-systems-string, find-coding-systems-for-charsets, find-multibyte-characters, last-coding-system-specified, select-safe-coding-system, select-message-coding-system) (‘code-cmds.el’)
remove obsolete alias pathname-coding-system, function set-pathname-coding-system (‘coding.el’)
remove coding-system property doc-string; split into description (short, for menu items) and documentation (long); correct coding system defns (‘coding.el’, ‘file-coding.c’, lots of language files)
move coding-system-base into C and make use of internal info (‘coding.el’, ‘file-coding.c’)
move undecided defn into C (‘coding.el’, ‘file-coding.c’)
use define-coding-system-alias, not copy-coding-system (‘coding.el’)
new coding system iso-8859-6 for arabic
delete windows-1251 support from ‘cyrillic.el’; we do it automatically
remove ‘setup-*-environment’ as per FSF 21
rewrite ‘european.el’ with lang envs for each language, so we can specify the locale
fix corruption in ‘greek.el’
sync ‘japanese.el’ with FSF 20.6
fix warnings in ‘mule-ccl.el’
move FSF compat Mule fns from ‘obsolete.el’ to ‘mule-charset.el’
eliminate unused ‘truncate-string{-to-width}’
make-coding-system accepts (but ignores) the additional properties present in the fsf version, for compatibility.
i fixed the iso2022 handling so it will correctly read in files containing unknown charsets, creating a “temporary” charset which can later be overwritten by the real charset when it’s defined. this allows iso2022 elisp files with literals in strange languages to compile correctly under mule. i also added a hack that will correctly read in and write out the emacs-specific “composition” escape sequences, i.e. ‘ESC 0’ through ‘ESC 4’. this means that my workspace correctly compiles the new file ‘devanagari.el’ that i added.
elimination of string-to-char-list (use string-to-list)
elimination of junky define-charset

Selection

fix msw selection code for Mule. proper encoding for RegisterClipboardFormat. store selection as CF_UNICODETEXT, which will get converted to the other formats. don’t respond to destroy messages from EmptyClipboard().

Menubar

new items ‘Open With Specified Encoding’, ‘Revert Buffer with Specified Encoding’
split Mule menu into ‘Encoding’ (non-Mule-specific; includes new item to control EOL auto-detection) and ‘International’ submenus on ‘Options’, ‘International’ on ‘Help’

Unicode support:

translation tables added in ‘etc/unicode’
new files ‘unicode.c’, ‘unicode.el’ containing unicode coding systems and support; old code ripped out of ‘file-coding.c’
translation tables read in at startup (NEEDS WORK TO MAKE IT MORE EFFICIENT)
support CF_TEXT, CF_UNICODETEXT in ‘select.el’
encapsulation code added so that we can support both Windows 9x and NT in a single executable, determining at runtime whether to call the Unicode or non-Unicode API. encapsulated routines in ‘intl-encap-win32.c’ (non-auto-generated) and ‘intl-auto-encap-win32.[ch]’ (auto-generated). code generator in ‘lib-src/make-mswin-unicode.pl’. changes throughout the code to use the wide structures (W suffix) and call the encapsulated Win32 API routines (‘qxe’ prefix). calling code needs to do proper conversion of text using new coding systems Qmswindows_tstr, Qmswindows_unicode, or Qmswindows_multibyte. (the first points to one of the other two.)

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.14.6 File-coding rewrite

The coding system code has been majorly rewritten. It’s abstracted into coding systems that are defined by methods (similar to devices and specifiers). The types of conversions have also been generalized. Formerly, decoding always converted bytes to characters and encoding the reverse (these are now called “text file converters”), but conversion can now happen either to or from bytes or characters. This allows coding systems such as gzip and base64 to be written. When specifying such a coding system to an operation that expects a text file converter (such as reading in or writing out a file), the appropriate coding systems to convert between bytes and characters are automatically inserted into the conversion chain as necessary. To facilitate creating such chains, a special coding system called “chain” has been created, which chains together two or more coding systems.

Encoding detection has also been abstracted. Detectors are logically separate from coding systems, and each detector defines one or more categories. (For example, the detector for Unicode defines categories such as UTF-8, UTF-16, UCS-4, and UTF-7.) When a particular detector is given a piece of text to detect, it determines likeliness values (seven of them, from 3 [most likely] to -3 [least likely]; specific criteria are defined for each possible value). All detectors are run in parallel on a particular piece of text, and the results tabulated together to determine the actual encoding of the text.

Encoding and decoding are now completely parallel operations, and the former “encoding” and “decoding” lstreams have been combined into a single “coding” lstream. Coding system methods that were formerly split in such a fashion have also been combined.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.14.7 General User-Visible Changes

Search

make regex routines reentrant, since they’re sometimes called reentrantly. (see ‘regex.c’ for a description of how.) all global variables used by the regex routines get pushed onto a stack by the callers before being set, and are restored when finished. redo the preprocessor flags controlling REL_ALLOC in conjunction with this.

Menubar

move menu-splitting code (menu-split-long-menu, etc.) from ‘font-menu.el’ to ‘menubar-items.el’ and redo its algorithm; use in various items with long generated menus; rename to remove ‘font-’ from beginning of functions but keep old names as aliases
new fn menu-sort-menu
redo items ‘Grep All Files in Current Directory {and Below}’ using stuff from sample ‘init.el’
‘Debug on Error’ and friends now affect current session only; not saved
maybe-add-init-button -> init-menubar-at-startup and call explicitly from ‘startup.el’
don’t use charset-registry in ‘msw-font-menu.el’; it’s only for X

Changes to key bindings

These changes are primarily found in ‘keymap.c’, ‘keydefs.el’, and ‘help.el’, but are found in many other files.

M-home, M-end now move forward and backward in buffers; with <Shift>, stay within current group (e.g. all C files; same grouping as the gutter tabs). (bindings ‘switch-to-{next/previous}-buffer[-in-group]’ in ‘files.el’)
needed to move code from ‘gutter-items.el’ to ‘buff-menu.el’ that’s used by these bindings, since ‘gutter-items.el’ is loaded only when the gutter is active and these bindings (and hence the code) is not (any more) gutter specific.
new global vars global-tty-map and global-window-system-map specify key bindings for use only on TTY’s or window systems, respectively. this is used to make ESC ESC be keyboard-quit on window systems, but ESC ESC ESC on TTY’s, where <Meta + arrow> keys may appear as ESC ESC O A or whatever. C-z on window systems is now zap-up-to-char, and iconify-frame is moved to C-Z. ESC ESC is isearch-quit. (‘isearch-mode.el’)
document ‘global-{tty,window-system}-map’ in various places; display them when you do C-h b.
fix up function documentation in general for keyboard primitives. e.g. key-bindings now contains a detailed section on the steps prior to looking up in keymaps, i.e. function-key-map, keyboard-translate-table. etc. define-key and other obvious starting points indicate where to look for more info.
eliminate use and mention of grody advertised-undo and deprecated-help. (‘simple.el’, ‘startup.el’, ‘picture.el’, ‘menubar-items.el’)

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.14.8 General Lisp-Visible Changes

gzip support

The gzip protocol is now partially supported as a coding system.

new coding system gzip (bytes -> bytes); unfortunately, not quite working yet because it handles only the raw zlib format and not the higher-level gzip format (the zlib library is brain-damaged in that it provides low-level, stream-oriented API’s only for raw zlib, and for gzip you have only high-level API’s, which aren’t useful for xemacs).
configure support (--with-zlib).

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.14.9 User documentation

Tutorial

massive rewrite; sync to FSF 21.0.106, switch focus to window systems, new sections on terminology and multiple frames, lots of fixes for current xemacs idioms.
german version from Adrian mostly matching my changes.
copy new tutorials from FSF (Spanish, Dutch, Slovak, Slovenian, Czech); not updated yet though.
eliminate ‘help-nomule.el’ and ‘mule-help.el’; merge into one single tutorial function, fix lots of problems, put back in ‘help.el’ where it belongs. (there was some random junk in ‘help-nomule.el’, string-width and make-char. string-width is now in ‘subr.el’ with a single definition, and make-char in ‘text.c’.)

Sample init file

remove forward/backward buffer code, since it’s now standard.
when disabling C-x C-c, make it display a message saying how to exit, not just beep and complain “undefined”.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.14.10 General internal changes

Changes to gnuclient and gnuserv

clean up headers a bit.
use proper ms win idiom for checking for temp directory (TEMP or TMP, not TMPDIR).

Process changes

Move setenv from packages; synch setenv/getenv with 21.0.105

Changes to I/O internals

use PATH_MAX consistently instead of MAXPATHLEN, MAX_PATH, etc.
all code that does preprocessor games with C lib I/O functions (open, read) has been removed. The code has been changed to call the correct function directly. Functions that accept Intbyte * arguments for filenames and such and do automatic conversion to or from external format will be prefixed ‘qxe...()’. Functions that are retrying in case of EINTR are prefixed ‘retry_...()’. DONT_ENCAPSULATE is long-gone.
never call getcwd() any more. use our shadowed value always.

Changes to string processing

the ‘doprnt.c’ external entry points have been completely rewritten to be more useful and have more sensible names. We now have, for example, versions that work exactly like sprintf() but return a malloc()ed string.
code in ‘print.c’ that handles stdout, stderr rewritten.
places that print to stderr directly replaced with stderr_out().
new convenience functions write_fmt_string(), write_fmt_string_lisp(), stderr_out_lisp(), write_string().

Changes to Allocation, Objects, and the Lisp Interpreter

automatically use “managed lcrecord” code when allocating. any lcrecord can be put on a free list with free_lcrecord().
record_unwind_protect() returns the old spec depth.
unbind_to() now takes only one arg. use unbind_to_1() if you want the 2-arg version, with GC protection of second arg.
new funs to easily inhibit GC. ({begin,end}_gc_forbidden()) use them in places where gc is currently being inhibited in a more ugly fashion. also, we disable GC in certain strategic places where string data is often passed in, e.g. ‘dfc’ functions, ‘print’ functions.
make_buffer() -> wrap_buffer() for consistency with other objects; same for make_frame() -> wrap_frame() and make_console() -> wrap_console().
better documentation in condition-case.
new convenience funs record_unwind_protect_freeing() and record_unwind_protect_freeing_dynarr() for conveniently setting up an unwind-protect to xfree() or Dynarr_free() a pointer.

s/m files:

removal of unused DATA_END, TEXT_END, SYSTEM_PURESIZE_EXTRA, HAVE_ALLOCA (automatically determined)
removal of vfork references (we no longer use vfork)

‘`make-docfile`’:

clean up headers a bit.
allow ‘.obj’ to mean equivalent ‘.c’, just like for ‘.o’.
allow specification of a “response file” (a command-line argument beginning with @, specifying a file containing further command-line arguments) – a standard mswin idiom to avoid potential command-line limits and to simplify makefiles. use this in ‘xemacs.mak’.

debug support

(‘cmdloop.el’) new var breakpoint-on-error, which breaks into the C debugger when an unhandled error occurs noninteractively. useful when debugging errors coming out of complicated make scripts, e.g. package compilation, since you can set this through an env var.
(‘startup.el’) new env var XEMACSDEBUG, specifying a Lisp form executed early in the startup process; meant to be used for turning on debug flags such as breakpoint-on-error or stack-trace-on-error, to track down noninteractive errors.
(‘cmdloop.el’) removed non-working code in command-error to display a backtrace on debug-on-error. use stack-trace-on-error instead to get this.
(‘process.c’) new var debug-process-io displays data sent to and received from a process.
(‘alloc.c’) staticpros have name stored with them for easier debugging.
(‘emacs.c’) code that handles fatal errors consolidated and rewritten. much more robust and correctly handles all fatal exits on mswin (e.g. aborts, not previously handled right).

‘`startup.el`’

move init routines from before-init-hook or after-init-hook; just call them directly (init-menubar-at-startup, init-mule-at-startup).
help message fixed up (divided into sections), existing problem causing incomplete output fixed, undocumented options documented.

‘`frame.el`’

delete old commented-out code.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.14.11 Ben’s TODO list (probably obsolete)

These notes substantially overlap those in Ben’s README (probably obsolete). They should probably be combined.

April 11, 2002

Priority:

Finish checking in current mule ws.
Start working on bugs reported by others and noticed by me:
- problems cutting and pasting binary data, e.g. from byte-compiler instructions
- test suite failures
- process i/o problems w.r.t. eol: |uniq (e.g.) leaves ^M’s at end of line; running "bash" as shell-file-name doesn’t work because it doesn’t like the extra ^M’s.

March 20, 2002

bugs:

TTY-mode problem. When you start up in TTY mode, XEmacs goes through the loadup process and appears to be working – you see the startup screen pulsing through the different screens, and it appears to be listening (hitting a key stops the screen motion), but it’s frozen – the screen won’t get off the startup, key commands don’t cause anything to happen. STATUS: In progress.
Memory ballooning in some cases. Not yet understood.
other test suite failures?
need to review the handling of sounds. seems that not everything is documented, not everything is consistently used where it’s supposed to, some sounds are ugly, etc. add sounds to ‘completer’ as well.
redo with-trapping-errors so that the backtrace is stored away and only outputted when an error actually occurs (i.e. in the condition-case handler). test. (use ding of various sorts as a helpful way of checking out what’s going on.)
problems with process input: |uniq (for example) leaves ^M’s at end of line.
carefully review looking up of fonts by charset, esp. wrt the last element of a font spec.
add package support to ignore certain files – *-util.el for languages.
review use of escape-quoted in auto_save_1() vs. the buffer’s own coding system.
figure out how to get the total amount of data memory (i.e. everything but the code, or even including the code if can’t distinguish) used by the process on each different OS, and use it in a new algorithm for triggering GC: trigger only when a certain % of the data size has been consed up; in addition, have a minimum.
fixed bugs???
- Occasional crash when freeing display structures. The problem seems to be this: A window has a "display line dynarr"; each display line has a "display block dynarr". Sometimes this display block dynarr is getting freed twice. It appears from looking at the code that sometimes a display line from somewhere in the dynarr gets added to the end – hence two pointers to the same display block dynarr. need to review this code.

August 29, 2001

This is the most current list of priorities in ‘ben-mule-21-5’. Updated often.

high-priority:

[input]

support for WM_IME_CHAR. IME input can work under -nuni if we use WM_IME_CHAR. probably we should always be using this, instead of snarfing input using WM_COMPOSITION. i’ll check this out.
Russian C-x problem. see above.

[clean-up]

make sure it compiles and runs under non-mule. remember that some code needs the unicode support, or at least a simple version of it.
make sure it compiles and runs under pdump. see below.
make sure it compiles and runs under cygwin. see below.
clean up mswindows-multibyte, TSTR_TO_C_STRING. expand dfc optimizations to work across chain.
eliminate last vestiges of codepage<->charset conversion and similar stuff.

[other]

test the "file-coding is binary only on Unix, no-Mule" stuff.
test that things work correctly in -nuni if the system environment is set to e.g. japanese – i should get japanese menus, japanese file names, etc. same for russian, hebrew ...
cut and paste. see below.
misc issues with handling lang environments. see also August 25, "finally: working on the C-x in ...".
- when switching lang env, needs to set keyboard layout.
- user var to control whether, when moving into text of a particular language, we set the appropriate keyboard layout. we would need to have a lisp api for retrieving and setting the keyboard layout, set text properties to indicate the layout of text, and have a way of dealing with text with no property on it. (e.g. saved text has no text properties on it.) basically, we need to get a keyboard layout from a charset; getting a language would do. Perhaps we need a table that maps charsets to language environments.
- test that the lang env is properly set at startup. test that switching the lang env properly sets the C locale (call setlocale(), set LANG, etc.) – a spawned subprogram should have the new locale in its environment.
look through everything below and see if anything is missed in this priority list, and if so add it. create a separate file for the priority list, so it can be updated as appropriate.

mid-priority:

clean up the chain coding system. its list should specify decode order, not encode; i now think this way is more logical. it should check the endpoints to make sure they make sense. it should also allow for the specification of "reverse-direction coding systems": use the specified coding system, but invert the sense of decode and encode.
along with that, places that take an arbitrary coding system and expect the ends to be anything specific need to check this, and add the appropriate conversions from byte->char or char->byte.
get some support for arabic, thai, vietnamese, japanese jisx 0212: at least get the unicode information in place and make sure we have things tied together so that we can display them. worry about r2l some other time.
check the handling of C-c. can XEmacs itself be interrupted with C-c? is that impossible now that we are a window, not a console, app? at least we should work something out with ‘i’ so that if it receives a C-c or C-break, it interrupts XEmacs, too. check out how process groups work and if they apply only to console apps. also redo the way that XEmacs sends C-c to other apps. the business of injecting code should be last resort. we should try C-c first, and if that doesn’t work, then the next time we try to interrupt the same process, use the injection method.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

26.14.12 Ben’s README (probably obsolete)

These notes substantially overlap those in Ben’s TODO list (probably obsolete). They should probably be combined.

This may be of some historical interest as a record of Ben at work. There may also be some useful suggestions as yet unimplemented.

oct 27, 2001

——– proposal for better buffer-switching commands:

implement what VC++ currently has. you have a single "switch" command like CTRL-TAB, which as long as you hold the <CTRL> button down, brings successive buffers that are "next in line" into the current position, bumping the rest forward. once you release the <CTRL> key, the chain is broken, and further CTRL-TABs will start from the beginning again. this way, frequently used buffers naturally move toward the front of the chain, and you can switch back and forth between two buffers using CTRL-TAB. the only thing about CTRL-TAB is it’s a bit awkward. the way to implement is to have modifier-up strokes fire off a hook, like modifier-up-hook. this is driven by event dispatch, so there are no synchronization issues. when C-tab is pressed, the binding function does something like set a one-shot handler on the modifier-up-hook (perhaps separate hooks for separate modifiers?).

to do this, we’d also want to change the buffer tabs so that they maintain their own order. in particular, they start out synched to the regular order, but as you make changes, you don’t want the tabs to change order. (in fact, they may already do this.) selecting a particular buffer from the buffer tabs DOES make the buffer go to the head of the line. the invariant is that if the tabs are displaying X items, those X items are the first X items in the standard buffer list, but may be in a different order. (it looks like the tabs may already implement all of this.)

oct 26, 2001

necessary testing/changes:

test all eol detection stuff under windows w/ and w/o mule, unix w/ and w/o mule. (test configure flag, command-line flag, menu option) may need a way of pretending to be unix under cygwin.
test under windows w/ and w/o mule, cygwin w/ and w/o mule, cygwin x windows w/ and w/o mule.
test undecided-dos/unix/mac.
check ESC ESC works as isearch-quit under TTY’s.
test coding-system-base and all its uses (grep for them).
menu item to revert to most recent auto save.
consider renaming build_cistring -> build_istring and build_c_string to build_cistring. (consistent with build_msg_string et al; many more build_c_string than build_cistring)

oct 20, 2001

fixed problem causing crash due to invalid internal-format data, fixed an existing bug in valid_char_p, and added checks to more quickly catch when invalid chars are generated. still need to investigate why mswindows-multibyte is being detected.

i now see why – we only process 65536 bytes due to a constant MAX_BYTES_PROCESSED_FOR_DETECTION. instead, we should have no limit as long as we have a seekable stream. we also need to write stderr_out_lisp(), used in the debug info routines i wrote.

check once more about DEBUG_XEMACS. i think debugging info should be ON by default. make sure it is. check that nothing untoward will result in a production system, e.g. presumably assert()s should not really abort(). (!! Actually, this should be runtime settable! Use a variable for this, and it can be set using the same XEMACSDEBUG method. In fact, now that I think of it, I’m sure that debugging info should be on always, with runtime ways of turning on or off any funny behavior.)

oct 19, 2001

fixed various bugs preventing packages from being able to be built. still another bug, with ‘psgml/etc/cdtd/docbook’, which contains some strange characters starting around char pos 110,000. It gets detected as mswindows-multibyte (wrong! why?) and then invalid internal-format data is generated. need to fix mswindows-multibyte (and possibly add something that signals an error as well; need to work on this error-signalling mechanism) and figure out why it’s getting detected as such. what i should do is add a debug var that outputs blow-by-blow info of the detection process.

oct 9, 2001

the stuff with global-window-system-map doesn’t appear to work. in any case it needs better documentation. [DONE]

M-home, M-end do work, but cause cl-macs to get loaded. why?

oct 8, 2001

finished the coding system changes and they finally work!

need to implement undecided-unix/dos/mac. they should be easy to do; it should be enough to specify an eol-type but not do-eol, but check this.

consider making the standard naming be foo-lf/crlf/cr, with unix/dos/mac as aliases.

print methods for coding systems should include some of the generic properties. (also then fix print_..._within_print_method). [DONE]

in a little while, go back and delete the text-file-wrapper-coding-system code. (it’ll be in CVS if necessary to get at it.) [DONE]

need to verify at some point that non-text-file coding systems work properly when specified. when gzip is working, this would be a good test case. (and consider creating base64 as well!)

remove extra crap from coding-system-category that checks for chain coding systems. [DONE]

perhaps make a primitive that gets at coding-system-canonical. [DONE]

need to test cygwin, compiling the mule packages, get unix-eol stuff working. frank from germany says he doesn’t see a lisp backtrace when he gets an error during temacs? verify that this actually gets outputted.

consider putting the current language on the modeline, mousable so it can be switched. also consider making the coding system be mousable and the line number (pick a line) and the percentage (pick a percentage).

oct 6, 2001

added code so that debug_print() will output a newline to the mswindows debugging output, not just the console. need to test. [DONE]

working on problem where all files are being detected as binary. the problem may be that the undecided coding system is getting wrapped with an auto-eol coding system, which it shouldn’t be – but even in this situation, we should get the right results! check the canonicalize-after-coding methods. also, determine_real_coding_system appears to be getting called even when we’re not detecting encoding. also, undecided needs a print method to show its params, and chain needs to be updated to show canonicalize_after_coding. check others as well. [DONE]

oct 5, 2001

finished up coding system changes, testing.

errors byte-compiling files in iso-2022-7-bit. perhaps it’s not correctly detecting the encoding?

noticed a problem in the dfc macros: we call get_coding_system_for_text_file with eol_wrap == 1, to allow for auto-detection of the eol type; but this defeats the check and short-circuit for unicode.

still need to implement calling determine_real_coding_system() for non-seekable streams. to implement correctly, we need to do our own buffering. [DONE, BUT WITHOUT BUFFERING]

oct 4, 2001

implemented most stuff below.

need to finish up changes to make_coding_system_1. (i changed the way internal coding systems were handled; i need to create subsidiaries for all types of coding systems, not just text ones.) there’s a nasty xfree() crash i was hitting; perhaps it’ll go away once all stuff has been rewritten.

check under cygwin to make sure that when an error occurs during loadup, a backtrace is output.

as soon as andy releases his new setup, we should put it onto various standard windows software repositories.

oct 3, 2001

added global-tty-map and global-window-system-map. add some stuff to the maps, e.g. C-x ESC for repeat vs. C-x ESC ESC on TTY’s, and of course ESC ESC on window systems vs. ESC ESC ESC on TTY’s. [TEST]

was working on integrating the two help-for-tutorial versions (mule, non-mule). [DONE, but test under non-Mule]

was working on the file-coding changes. need to think more about text-file-wrapper. conclusion i think is that get_coding_system_for_text_file should wrap using a special coding system type called a text-file-wrapper, which inherits from chain, and implements canonicalize-after-decoding to just return the unwrapped coding system. We need to implement inheritance of coding systems, which will certainly come in extremely useful when coding systems get implemented in Lisp, which should happen at some point. (see existing docs about this.) essentially, we have a way of declaring that we inherit from some system, and the appropriate data structures get created, perhaps just an extra inheritance pointer. but when we create the coding system, the extra data needs to be a stretchy array of offsets, pointing to the type-specific data for the coding system type and all its parents. that means that in the methods structure for a coding system (which perhaps should be expanded beyond method, it’s just a "class structure") is the index in these arrays of offsets. CODING_SYSTEM_DATA() can take any of the coding system classes (rename type to class!) that make up this class. similarly, a coding system class inherits its methods from the class above unless specifying its own method, and can call the superclass method at any point by either just invoking its name, or conceivably by some macro like

‘CALL_SUPER (method, (args))’

similar mods would have to be made to coding stream structures.

perhaps for the immediate we can just sort of fake things like we currently do with undecided calling some stuff from chain.

oct 2, 2001

need to implement support for iso-8859-15, i.e. iso-8859-1 + euro symbol. figure out how to fall back to iso-8859-1 as necessary.

leave the current bindings the way they are for the moment, but bump off M-home and M-end (hardly used), and substitute my buffer movement stuff there. [DONE, but test]

there’s something to be said for combining block of 6 and paragraph, esp. if we make the definition of "paragraph" be so that it skips by 6 when within code. hmm.

eliminate advertised-undo crap, and similar hacks. [DONE]

think about obsolete stuff to be eliminated. think about eliminating or dimming obsolete items from hyper-apropos and something similar in completion buffers.

sep 30, 2001

synched up the tutorials with FSF 21.0.105. was rewriting them to favor the cursor keys over the older C-p, etc. keys.

Got thinking about key bindings again.

I think that M-up/down and M-C-up/down should be reversed. I use scroll-up/down much more often than motion by paragraph.
Should we eliminate move by block (of 6) and subsitute it for paragraph? This would have the advantage that I could make bindings for buffer change (forward/back buffer, perhaps M-C-up/down. with shift, M-C-S-up/down only goes within the same type (C files, etc.). alternatively, just bump off beginning-of-defun from C-M-home, since it’s on C-M-a already.

need someone to go over the other tutorials (five new ones, from FSF 21.0.105) and fix them up to correspond to the english one.

shouldn’t shift-motion work with C-a and such as well as arrows?

sep 29, 2001

charcount_to_bytecount can also be made to scream – as can scan_buffer, buffer_mule_signal_inserted_region, others? we should start profiling though before going too far down this line.

Debug code that causes no slowdown should in general remain in the executable even in the release version because it may be useful (e.g. for people to see the event output). so DEBUG_XEMACS should be rethought. things like use of ‘msvcrtd.dll’ should be controlled by error_checking on. maybe DEBUG_XEMACS controls general debug code (e.g. use of ‘msvcrtd.dll’, asserts abort, error checking), and the actual debugging code should remain always, or be conditonalized on something else (e.g. ‘DEBUGGING_FUNS_PRESENT’).

doc strings in dumped files are displayed with an extra blank line between each line. presumably this is recent? i assume either the change to detect-coding-region or the double-wrapping mentioned below.

error with coding-system-property on iso-2022-jp-dos. problem is that that coding system is wrapped, so its type shows up as chain, not iso-2022. this is a general problem, and i think the way to fix it is to in essence do late canonicalization – similar in spirit to what was done long ago, canonicalize_when_code, except that the new coding system (the wrapper) is created only once, either when the original cs is created or when first needed. this way, operations on the coding system work like expected, and you get the same results as currently when decoding/encoding. the only thing tricky is handling canonicalize-after-coding and the ever-tricky double-wrapping problem mentioned below. i think the proper solution is to move the autodetection of eol into the main autodetect type. it can be asked to autodetect eol, coding, or both. for just coding, it does like it currently does. for just eol, it does similar to what it currently does but runs the detection code that convert-eol currently does, and selects the appropriate convert-eol system. when it does both eol and coding, it does something on the order of creating two more autodetect coding systems, one for eol only and one for coding only, and chains them together. when each has detected the appropriate value, the results are combined. this automatically eliminates the double-wrapping problem, removes the need for complicated canonicalize-after-coding stuff in chain, and fixes the problem of autodetect not having a seekable stream because hidden inside of a chain. (we presume that in the both-eol-and-coding case, the various autodetect coding streams can communicate with each other appropriately.)

also, we should solve the problem of internal coding systems floating around and clogging up the list simply by having an "internal" property on cs’s and an internal param to coding-system-list (optional; if not given, you don’t get the internal ones). [DONE]

we should try to reduce the size of the from-unicode tables (the dominant memory hog in the tables). one obvious thing is to not store a whole emchar as the mapped-to value, but a short that encodes the octets. [DONE]

sep 28, 2001

need to merge up to latest in trunk.

add unicode charsets for all non-translatable unicode chars; probably want to extend the concept of charsets to allow for dimension 3 and dimension 4 charsets. for the moment we should stick with just dimension 3 charsets; otherwise we run past the current maximum of 4 bytes per emchar. (most code would work automatically since it uses MAX_EMCHAR_LEN; the trickiness is in certain code that has intimate knowledge of the representation. e.g. bufpos_to_bytind() has to multiply or divide by 1, 2, 3, or 4, and has special ways of handling each number. with 5 or 6 bytes per char, we’d have to change that code in various ways.) 96x96x96 = 884,000 or so, so with two 96x96x96 charsets, we could tackle all Unicode values representable by UTF-16 and then some – and only these codepoints will ever have assigned chars, as far as we know.

need an easy way of showing the current language environment. some menus need to have the current one checked or whatever. [DONE]

implement unicode surrogates.

implement buffer-file-coding-system-when-loaded – make sure find-file, revert-file, etc. set the coding system [DONE]

verify all the menu stuff [DONE]

implemented the entirely-ascii check in buffers. not sure how much gain it’ll get us as we already have a known range inside of which is constant time, and with pure-ascii files the known range spans the whole buffer. improved the comment about how bufpos-to-bytind and vice-versa work. [DONE]

fix double-wrapping of convert-eol: when undecided converts itself to something with a non-autodetect eol, it needs to tell the adjacent convert-eol to reduce itself to nothing.

need menu item for find file with specified encoding. [DONE]

renamed coding systems mswindows-### to windows-### to follow the standard in rfc1345. [DONE]

implemented coding-system-subsidiary-parent [DONE] HAVE_MULE -> MULE in files in ‘nt/’ so that depend checking works [DONE]

need to take the smarter search-all-files-in-dir stuff from my sample init file and put it on the grep menu [DONE]

added item for revert w/specified encoding; mostly works, but needs fixes. in particular, you get the correct results, but buffer-file-coding-system does not reflect things right. also, there are too many entries. need to split into submenus. there is already split code out there; see if it’s generalized and if not make it so. it should only split when there’s more than a specified number, and when splitting, split into groups of a specified size, not into a specified number of groups. [DONE]

too many entries in the langenv menus; need to split. [DONE]

sep 27, 2001

NOTE: M-x grep for make-string causes crash now. something definitely to do with string changes. check very carefully the diffs and put in those sledgehammer checks. [DONE]

fix font-lock bug i introduced. [DONE]

added optimization to strings (keeps track of # of bytes of ascii at the beginning of a string). perhaps should also keep an all-ascii flag to deal with really large (> 2 MB) strings. rewrite code to count ascii-begin to use the 4-or-8-at-a-time stuff in bytecount_to_charcount.

Error: M-q is causing Invalid Regexp error on the above paragraph. It’s not in working. I assume it’s a side effect of the string stuff. VERIFY! Write sledgehammer checks for strings. [DONE]

revamped the locale/init stuff so that it tries much harder to get things right. should test a bit more. in particular, test out Describe Language on the various created environments and make sure everything looks right.

should change the menus: move the submenus on ‘Edit->Mule’ directly under ‘Edit’. add a menu entry on ‘File’ to say "Reload with specified encoding ->". [DONE]

Also ‘Find File’ with specified encoding -> Also entry to change the EOL settings for Unix, and implement it.

decode-coding-region isn’t working because it needs to insert a binary (char->byte) converter. [DONE]

chain should be rearranged to be in decoding order; similar for source/sink-type, other things?

the detector should check for a magic cookie even without a seekable input. (currently its input is not seekable, because it’s hidden within a chain. #### See what we can do about this.)

provide a way to display various settings, e.g. the current category mappings and priority (see mule-diag; get this working so it’s in the path); also a way to print out the likeliness results from a detection, perhaps a debug flag.

problem with ‘env’, which causes path issues due to ‘env’ in packages. move env code to process, sync with fsf 21.0.105, check that the autoloads in ‘env’ don’t cause problems. [DONE]

8-bit iso2022 detection appears broken; or at least, mule-canna.c is not so detected.

sep 25, 2001

something else to do is review the font selection and fix it so that (e.g.) JISX-0212 can be displayed.

also, text in widgets needs to be drawn by us so that the correct fonts will be displayed even in multi-lingual text.

sep 24, 2001

the detection system is now properly abstracted. the detectors have been rewritten to include multiple levels of abstraction. now we just need detectors for ascii, binary, and latin-x, as well as more sophisticated detectors in general and further review of the general algorithm for doing detection. (#### Is this written up anywhere?) after that, consider adding error-checking to decoding (VERY IMPORTANT) and verifying the binary correctness of things under unix no-mule.

sep 23, 2001

began to fix the detection system – adding multiple levels of likelihood and properly abstracting the detectors. the system is in place except for the abstraction of the detector-specific data out of the struct detection_state. we should get things working first before tackling that (which should not be too hard). i’m rewriting algorithms here rather than just converting code, so it’s harder. mostly done with everything, but i need to review all detectors except iso2022 and make them properly follow the new way. also write a no-conversion detector. also need to look into the ‘recode’ package and see how (if?) they handle detection, and maybe copy some of the algorithms. also look at recent FSF 21.0 and see if their algorithms have improved.

sep 22, 2001

fixed gc bugs from yesterday.
fixed truename bug.
close/finalize stuff works.
eliminated notyet stuff in syswindows.h.
eliminated special code in tstr_to_c_string.
fixed pdump problems. (many of them, mostly latent bugs, ugh)
fixed cygwin sscanf problems in parse-unicode-translation-table. (NOT a sscanf bug, but subtly different behavior w.r.t. whitespace in the format string, combined with a debugger that sucks ROCKS!! and consistently outputs garbage for variable values.)

main stuff to test is the handling of EOF recognition vs. binary (i.e. check what the default settings are under Unix). then we may have something that WORKS on all platforms!!! (Also need to test Windows non-Mule)

sep 21, 2001

finished redoing the close/finalize stuff in the lstream code. but i encountered again the nasty bug mentioned on sep 15 that disappeared on its own then. the problem seems to be that the finalize method of some of the lstreams is calling Lstream_delete(), which calls free_managed_lcrecord(), which is a no-no when we’re inside of garbage-collection and the object passed to free_managed_lcrecord() is unmarked, and about to be released by the gc mechanism – the free lists will end up with xfree()d objects on them, which is very bad. we need to modify free_managed_lcrecord() to check if we’re in gc and the object is unmarked, and ignore it rather than move it to the free list. [DONE]

(#### What we really need to do is do what Java and C# do w.r.t. their finalize methods: For objects with finalizers, when they’re about to be freed, leave them marked, run the finalizer, and set another bit on them indicating that the finalizer has run. Next GC cycle, the objects will again come up for freeing, and this time the sweeper notices that the finalize method has already been called, and frees them for good (provided that a finalize method didn’t do something to make the object alive again).)

sep 20, 2001

redid the lstream code so there is only one coding stream. combined the various doubled coding stream methods into one; i’m a little bit unsure of this last part, though, as the results of combining the two together seem unclean. got it to compile, but it crashes in loadup. need to go through and rehash the close vs. finalize stuff, as the problem was stuff getting freed too quickly, before the canonicalize-after-decoding was run. should eliminate entirely CODING_STATE_END and use a different method (close coding stream). rewrite to use these two. make sure they’re called in the right places. Lstream_close on a stream should *NOT* do finalizing. finalize only on delete. [DONE]

in general i’d like to see the flags eliminated and converted to bit-fields. also, rewriting the methods to take advantage of rejecting should make it possible to eliminate much of the state in the various methods, esp. including the flags. need to test this is working, though – reduce the buffer size down very low and try files with only CRLF’s in them, with one offset by a byte from the other, and see if we correctly handle rejection.

still have the problem with incorrectly truenaming files.

sep 19, 2001

bug reported: crash while closing lstreams.

the lstream/coding system close code needs revamping. we need to document that order of closing lstreams is very important, and make sure we’re consistent. furthermore, chain and undecided lstreams need to close their underneath lstreams when they receive the EOF signal (there may be data in the underneath streams waiting to come out), not when they themselves are closed. [DONE]

(if only we had proper inheritance. i think in any case we should simulate it for the chain coding stream – write things in such a way that undecided can use the chain coding stream and not have to duplicate anything itself.)

in general we need to carefully think through the closing process to make sure everything always works correctly and in the right order. also check very carefully to make sure there are no dangling pointers to deleted objects floating around.

move the docs for the lstream functions to the functions themselves, not the header files. document more carefully what exactly Lstream_delete() means and how it’s used, what the connections are between Lstream_close(), Lstream_delete(), Lstream_flush(), lstream_finalize, etc. [DONE]

additional error-checking: consider deadbeefing the memory in objects stored in lcrecord free lists; furthermore, consider whether lifo or fifo is correct; under error-checking, we should perhaps be doing fifo, and setting a minimum number of objects on the lists that’s quite large so that it’s highly likely that any erroneous accesses to freed objects will go into such deadbeefed memory and cause crashes. also, at the earliest available opportunity, go through all freed memory and check for any consistency failures (overwrites of the deadbeef), crashing if so. perhaps we could have some sort of id for each block, to easier trace where the offending block came from. (all of these ideas are present in the debug system malloc from VC++, plus more stuff.) there’s similar code i wrote sitting somewhere (in ‘free-hook.c’? doesn’t appear so. we need to delete the blocking stuff out of there!). also look into using the debug system malloc from VC++, which has lots of cool stuff in it. we even have the sources. that means compiling under pdump, which would be a good idea anyway. set it as the default. (but then, we need to remove the requirement that Xpm be a DLL, which is extremely annoying. look into this.)

test the windows code page coding systems recently created.

problems reading my mail files – 1personal appears to hang, others come up with lots of ^M’s. investigate.

test the enum functions i just wrote, and finish them.

still pdump problems.

sep 18, 2001

critical-quit broken sometime after aug 25.

fixed critical quit.
fixed process problems.
print routines work. (no routine for ccl, though)
can read and write unicode files, and they can still be read by some other program
defaults should come up correctly – mswindows-multibyte is general.

still need to test matej’s stuff. seems ok with multibyte stuff but needs more testing.

sep 17, 2001

!!!!! something broken with processes !!!!! cannot send mail anymore. must investigate.

sep 17, 2001

on mon/wed nights, stop *BEFORE* 11pm. Otherwise i just start getting woozy and can’t concentrate.

just finished getting assorted fixups to the main branch committed, so it will compile under C++ (Andy committed some code that broke C++ builds). cup’d the code into the fixtypes workspace, updated the tags appropriately. i’ve created the appropriate log message, sitting in fixtypes.txt in /src/xemacs; perhaps it should go into a README. now i just have to build on everything (it’s currently building), verify it’s ok, run patcher-mail, commit, send.

my mule ws is also very close. need to:

test the new print routines.
test it can read and write unicode files, and they can still be read by some other program.
try to see if unicode can be auto-detected properly.
test it can read and write multibyte files in a few different formats. currently can’t recognize them, but if you set the cs right, it should work.
examine the test files sent by matej and see if we can handle them.

sep 15, 2001

more eol fixing. this stuff is utter crap.

currently we wrap coding systems with convert-eol-autodetect when we create them in make_coding_system_1. i had a feeling that this would be a problem, and indeed it is – when autodetecting with ‘undecided’, for example, we end up with multiple layers of eol conversion. to avoid this, we need to do the eol wrapping *ONLY* when we actually retrieve a coding system in places such as insert-file-contents. these places are insert-file-contents, load, process input, call-process-internal, ‘encode/decode/detect-coding-region’, database input, ...

(later) it’s fixed, and things basically work. NOTE: for some reason, adding code to wrap coding systems with convert-eol-lf when eol-type == lf results in crashing during garbage collection in some pretty obscure place – an lstream is free when it shouldn’t be. this is a bad sign. i guess something might be getting initialized too early?

we still need to fix the canonicalization-after-decoding code to avoid problems with coding systems like ‘internal-7’ showing up. basically, when eol==lf is detected, nil should be returned, and the callers should handle it appropriately, eliding when necessary. chain needs to recognize when it’s got only one (or even 0) items in the chain, and elide out the chain.

sep 11, 2001: the day that will live in infamy

rewrite of sep 9 entry about formats:

when calling ‘make-coding-system’, the name can be a cons of ‘(format1 . format2)’, specifying that it decodes ‘format1->format2’ and encodes the other way. if only one name is given, that is assumed to be ‘format1’, and the other is either ‘external’ or ‘internal’ depending on the end type. normally the user when decoding gives the decoding order in formats, but can leave off the last one, ‘internal’, which is assumed. a multichain might look like gzip|multibyte|unicode, using the coding systems named ‘gzip’, ‘(unicode . multibyte)’ and ‘unicode’. the way this actually works is by searching for gzip->multibyte; if not found, look for gzip->external or gzip->internal. (In general we automatically do conversion between internal and external as necessary: thus gzip|crlf does the expected, and maps to gzip->external, external->internal, crlf->internal, which when fully specified would be gzip|external:external|internal:crlf|internal – see below.) To forcibly fit together two converters that have explicitly specified and incompatible names (say you have unicode->multibyte and iso8859-1->ebcdic and you know that the multibyte and iso8859-1 in this case are compatible), you can force-cast using :, like this: ebcdic|iso8859-1:multibyte|unicode. (again, if you force-cast between internal and external formats, the conversion happens automatically.)

sep 10, 2001

moved the autodetection stuff (both codesys and eol) into particular coding systems – ‘undecided’ and ‘convert-eol’ (type == ‘autodetect’). needs lots of work. still need to search through the rest of the code and find any remaining auto-detect code and move it into the undecided coding system. need to modify make-coding-system so that it spits out auto-detecting versions of all text-file coding systems unless we say not to. need eliminate entirely the EOF flag from both the stream info and the coding system; have only the original-eof flag. in coding_system_from_mask, need to check that the returned value is not of type ‘undecided’, falling back to no-conversion if so. also need to make sure we wrap everything appropriate for text-files – i removed the wrapping on set-coding-category-list or whatever (need to check all those files to make sure all wrapping is removed). need to review carefully the new code in ‘undecided’ to make sure it works are preserves the same logic as previously. need to review the closing and rewinding behavior of chain and undecided (same – should really consolidate into helper routines, so that any coding system can embed a chain in it) – make sure the dynarr’s are getting their data flushed out as necessary, rewound/closed in the right order, no missing steps, etc.

also split out mule stuff into ‘mule-coding.c’. work done on ‘configure’/‘xemacs.mak’/‘Makefile’s not done yet. work on ‘emacs.c’/‘symsinit.h’ to interface with the new init functions not done yet.

also put in a few declarations of the way i think the abstracted detection stuff ought to go. DON’T WORK ON THIS MORE UNTIL THE REST IS DEALT WITH AND WE HAVE A WORKING XEMACS AGAIN WITH ALL EOL ISSUES NAILED.

really need a version of ‘cvs-mods’ that reports only the current directory. WRITE THIS! use it to implement a better ‘cvs-checkin’.

sep 9, 2001

implemented a gzip coding system. unfortunately, doesn’t quite work right because it doesn’t handle the gzip headers – it just reads and writes raw zlib data. there’s no function in the library to skip past the header, but we do have some code out of the library that we can snarf that implements header parsing. we need to snarf that, store it, and output it again at the beginning when encoding. in the process, we should create a "get next byte" macro that bails out when there are no more. using this, we set up a nice way of doing most stuff statelessly – if we have to bail, we reject everything back to the sync point. also need to fix up the autodetection of zlib in configure.in.

BIG problems with eol. finished up everything i thought i would need to get eol stuff working, but no – when you have mswindows-unicode, with its eol set to autodetect, the detection routines themselves do the autodetect (first), and fail (they report CR on CRLF because of the NULL byte between the CR and the LF) since they’re not looking at ascii data. with a chain it’s similarly bad. for mswindows-multibyte, for example, which is a chain unicode->unicode-to-multibyte, autodetection happens inside of the chain, both when unicode and unicode-to-multibyte are active. we could twiddle around with the eol flags to try to deal with this, but it’s gonna be a big mess, which is exactly what we’re trying to avoid. what we basically want is to entirely rip out all EOL settings from either the coding system or the stream (yes, there are two! one might saw autodetect, and then the stream contains the actual detected value). instead, we simply create an eol-autodetect coding system – or rather, it’s part of the convert-eol coding system. convert-eol, type = autodetect, does autodetection the first time it gets data sent to it to decode, and thereafter sets a stream parameter indicating the actual eol type for this stream. this means that all autodetect coding systems, as created by make-coding-system, really are chains with a convert-eol at the beginning. only subsidiary xxx-unix has no wrapping at all. this should allow eof detection of gzip, unicode, etc. for that matter, general autodetection should be entirely encapsulated inside of the ‘autodetect’ coding system, with no eol-autodetection – the chain becomes convert-eol (autodetect) -> autodetect or perhaps backwards. the generic autodetect similarly has a coding-system in its stream methods, and needs somehow or other to insert the detected coding-system into the chain. either it contains a chain inside of it (perhaps it *IS* a chain), or there’s some magic involving canonicalization-type switcherooing in the middle of a decode. either way, once everything is good and done and we want to save the coding system so it can be used later, we need to do another sort of canonicalization – converting auto-detect-type coding systems into the detected systems. again, a coding-system method, with some magic currently so that subsidiaries get properly used rather than something that’s new but equivalent to subsidiaries. (#### perhaps we could use a hash table to avoid recreating coding systems when not necessary. but that would require that coding systems be immutable from external, and i’m not sure that’s the case.)

i really think, after all, that i should reverse the naming of everything in chain and source-sink-type – they should be decoding-centric. later on, if/when we come up with the proper way to make it totally symmetrical, we’ll be fine whether before then we were encoding or decoding centric.

sep 9, 2001

investigated eol parameter.

implemented handling in make-coding-system of eol-cr and eol-crlf. fixed calls everywhere to Fget_coding_system / Ffind_coding_system to reject non-char->byte coding systems.

still need to handle "query eol type using coding-system-property" so it magically returns the right type by parsing the chain.

no work done on formats, as mentioned below. we should consider using : instead of || to indicate casting.

early sep 9, 2001

renamed some codesys properties: ‘list’ in chain -> chain; ‘subtype’ in unicode -> type. everything compiles again and sort of works; some CRLF problems that may resolve themselves when i finish the convert-eol stuff. the stuff to create subsidiaries has been rewritten to use chains; but i still need to investigate how the EOL type parameter is used. also, still need to implement this: when a coding system is created, and its eol type is not autodetect or lf, a chain needs to be created and returned. i think that what needs to happen is that the eol type can only be set to autodetect or lf; later on this should be changed to simply be either autodetect or not (but that would require ripping out the eol converting stuff in the various coding systems), and eventually we will do the work on the detection mechanism so it can do chain detection; then we won’t need an eol autodetect setting at all. i think there’s a way to query the eol type of a coding system; this should check to see if the coding system is a chain and there’s a convert-eol at the front; if so, the eol type comes from the type of the convert-eol.

also check out everywhere that Fget_coding_system or Ffind_coding_system is called, and see whether anything but a char->byte system can be tolerated. create a new function for all the places that only want char->byte, something like ‘get_coding_system_char_to_byte_only’.

think about specifying formats in make-coding-system. perhaps the name can be a cons of (format1, format2), specifying that it encodes format1->format2 and decodes the other way. if only one name is given, that is assumed to be format2, and the other is either ‘byte’ or ‘char’ depending on the end type. normally the user when decoding gives the decoding order in formats, but can leave off the last one, ‘char’, which is assumed. perhaps we should say ‘internal’ instead of ‘char’ and ‘external’ instead of byte. a multichain might look like gzip|multibyte|unicode, using the coding systems named ‘gzip’, ‘(unicode . multibyte)’ and ‘unicode’. we would have to allow something where one format is given only as generic byte/char or internal/external to fit with any of the same byte/char type. when forcibly fitting together two converters that have explicitly specified and incompatible names (say you have unicode->multibyte and iso8859-1->ebcdic and you know that the multibyte and iso8859-1 in this case are compatible), you can force-cast using ||, like this: ebcdic|iso8859-1||multibyte|unicode. this will also force external->internal translation as necessary: unicode|multibyte||crlf|internal does unicode->multibyte, external->internal, crlf->internal. perhaps you’d need to put in the internal translation, like this: unicode|multibyte|internal||crlf|internal, which means unicode->multibyte, external->internal (multibyte is compatible with external); force-cast to crlf format and convert crlf->internal.

even later: Sep 8, 2001

chain doesn’t need to set character mode, that happens automatically when the coding systems are created. fixed chain to return correct source/sink type for itself and to check the compatibility of source/sink types in its chain. fixed decode/encode-coding-region to check the source and sink types of the coding system performing the conversion and insert appropriate byte->char/char->byte converters (aka "binary" coding system). fixed set-coding-category-system to only accept the traditional encode-char-to-byte types of coding systems.

still need to extend chain to specify the parameters mentioned below, esp. "reverse". also need to extend the print mechanism for chain so it prints out the chain. probably this should be general: have a new method to return all properties, and output those properties. you could also implement a read syntax for coding systems this way.

still need to implement convert-eol and finish up the rest of the eol stuff mentioned below.

later September 7, 2001 (more like Sep 8)

moved many Lisp_Coding_System * params to Lisp_Object. In general this is the way to go, and if we ever implement a copying GC, we will never want to be passing direct pointers around. With no error-checking, we lose no cycles using Lisp_Objects in place of pointers – the Lisp_Object itself is nothing but a pointer, and so all the casts and "dereferences" boil down to nothing.

Clarified and cleaned up the "character mode" on streams, and documented who (caller or object itself) has the right to be setting character mode on a stream, depending on whether it’s a read or write stream. changed conversion_end_type method and enum source_sink_type to return encoding-centric values, rather than decoding-centric. for the moment, we’re going to be entirely encoding-centric in everything; we can rethink later. fixed coding systems so that the decode and encode methods are guaranteed to receive only full characters, if that’s the source type of the data, as per conversion_end_type.

still need to fix the chain method so that it correctly sets the character mode on all the lstreams in it and checks the source/sink types to be compatible. also fix decode-coding-string and friends to put the appropriate byte->character (i.e. no-conversion) coding systems on the ends as necessary so that the final ends are both character. also add to chain a parameter giving the ability to switch the direction of conversion of any particular item in the chain (i.e. swap encoding and decoding). i think what we really want to do is allow for arbitrary parameters to be put onto a particular coding system in the chain, of which the only one so far is swap-encode-decode. don’t need too much codage here for that, but make the design extendable.

September 7, 2001

just added a return value from the decode and encode methods of a coding system, so that some of the data can get rejected. fixed the calling routines to handle this. need to investigate when and whether the coding lstream is set to character mode, so that the decode/encode methods only get whole characters. if not, we should do so, according to the source type of these methods. also need to implement the convert_eol coding system, and fix the subsidiary coding systems (and in general, any coding system where the eol type is specified and is not LF) to be chains involving convert_eol.

after everything is working, need to remove eol handling from encode/decode methods and eventually consider rewriting (simplifying) them given the reject ability.

September 5, 2001

need to organize this. get everything below into the TODO list. CVS the TODO list frequently so i can delete old stuff. prioritize it!!!!!!!!!
move ‘README.ben-mule...’ to ‘STATUS.ben-mule...’; use ‘README’ for intro, overview of what’s new, what’s broken, how to use the features, etc.
need a global and local ‘coding-category-precedence’ list, which get merged.
finished the BOM support. also finished something not listed below, expansion to the auto-generator of Unicode-encapsulation to support bracketing code with ‘#if ... #endif’, for Cygwin and MINGW problems, e.g. This is tested; appears to work.
need to add more multibyte coding systems now that we have various properties to specify them. need to add DEFUN’s for mac-code-page and ebcdic-code-page for completeness. need to rethink the whole way that the priority list works. it will continue to be total junk until multiple levels of likeliness get implemented.
need to finish up the stuff about the various defaults. [need to investigate more generally where all the different default values are that control encoding. (there are six places or so.) need to list them in make-coding-system docs and put pointers elsewhere. [[[[#### what interface to specify that this default should be unicode? a "Unicode" language environment seems too drastic, as the language environment controls much more.]]]] even skipping the Unicode stuff here, we need to survey and list the variables that control coding page behavior and determine how they need to be set for various possible scenarios:
- total binary: no detection at all.
- raw-text only: wants only autodetection of line endings, nothing else.
- "standard Windows environment": tries for Unicode, falls back on code page encoding.
- some sort of East European environment, and Russian.
- some sort of standard Japanese Windows environment.
- standard Chinese Windows environments (traditional and simplified)
- various Unix environments (European, Japanese, Russian, etc.)
- Unicode support in all of these when it’s reasonable

These really require multiple likelihood levels to be fully implementable. We should see what can be done ("gracefully fall back") with single likelihood level. need lots of testing.

need to fix the truename problem.
lots of testing: need to test all of the stuff above and below that’s recently been implemented.

September 4, 2001

mostly everything compiles. currently there is a crash in parse-unicode-translation-table, and Cygwin/Mule won’t run. it may well be a bug in the sscanf() in Cygwin.

working on today:

adding BOM support for Unicode coding systems. mostly there, but need to finish adding BOM support to the detection routines. then test.
adding properties to unicode-to-multibyte to specify the coding system in various flexible ways, e.g. directly specified code page or ansi or oem code page of specified locale, current locale, user-default or system-default locale. need to test.
creating a ‘multibyte’ coding system, with the same parameters as unicode-to-multibyte and which resolves at coding-system-creation time to the appropriate chain. creating the underlying mechanism to allow such under-the-scenes switcheroo. need to test.
set default-value of buffer-file-coding-system to mswindows-multibyte, as Matej said it should be. need to test. need to investigate more generally where all the different default values are that control encoding. (there are six places or so.) need to list them in make-coding-system docs and put pointers elsewhere. #### what interface to specify that this default should be unicode? a "Unicode" language environment seems too drastic, as the language environment controls much more.
thinking about adding multiple levels of certainty to the detection schemes, instead of just a mask. eventually, we need to totally abstract things, but that can easier be done in many steps. (we need multiple levels of likelihood to more reasonably support a Windows environment with code-page type files. currently, in order to get them detected, we have to put them first, because they can look like lots of other things; but then, other encodings don’t get detected. with multiple levels of likelihood, we still put the code-page categories first, but they will return low levels of likelihood. Lower-down encodings may be able to return higher levels of likelihood, and will get taken preferentially.)
making it so you cannot disable file-coding, but you get an equivalent default on Unix non-Mule systems where all defaults are ‘binary’. need to test!!!!!!!!!

Matej (mostly, + some others) notes the following problems, and here are possible solutions:

he wants the defaults to work right. [figure out what those defaults are. i presume they are auto-detection of data in current code page and in unicode, and new files have current code page set as their output encoding.]
too easy to lose data with incorrect encodings. [need to set up an error system for encoding/decoding. extremely important but a little tricky to implement so let’s deal with other issues now.]
EOL isn’t always detected correctly. [#### ?? need examples]
truename isn’t working: ‘c:\t.txt’ and ‘c:\tmp.txt’ have the same truename. [should be easy to fix]
unicode files lose the BOM mark. [working on this]
command-line utilities use OEM. [actually it seems more complicated. it seems they use the codepage of the console. we may be able to set that, e.g. to UTF8, before we invoke a command. need to investigate.]
no way to handle unicode characters not recognized as charsets. [we need to create something like 8 private 2-dimensional charsets to handle all BMP Unicode chars. Obviously this is a stopgap solution. Switching to Unicode internal will ultimately make life far easier and remove the BMP limitation. but for now it will work. we translate all characters where we have charsets into chars in those charsets, and the remainder in a unicode charset. that way we can save them out again and guarantee no data loss with unicode. this creates font problems, though ...]
problems with xemacs font handling. [xemacs font handling is not sophisticated enough. it goes on a charset granularity basis and only looks for a font whose name contains the corresponding windows charset in it. with unicode this fails in various ways. for one the granularity needs to be single character, so that those unicode charsets mentioned above work; and it needs to query the font to see what unicode ranges it supports, rather than just looking at the charset ending.]

August 28, 2001

working on getting everything to compile again: Cygwin, non-MULE, pdump. not there yet.

mswindows-multibyte is now defined using chain, and works. removed most vestiges of the mswindows-multibyte coding system type.

file-coding is on by default; should default to binary only on Unix. Need to test. (Needs to compile first :-)

August 26, 2001

I’ve fixed the issue of inputting non-ASCII text under -nuni, and done some of the work on the Russian <C-x> problem – we now compute the other possibilities. We still need to fix the key-lookup code, though, and that code is unfortunately a bit ugly. the best way, it seems, is to expand the command-builder structure so you can specify different interpretations for keys. (if we do find an alternative binding, though, we need to mess with both the command builder and this-command-keys, as does the function-key stuff. probably need to abstract that munging code.)

high-priority:

[currently doing]

support for WM_IME_CHAR. IME input can work under -nuni if we use WM_IME_CHAR. probably we should always be using this, instead of snarfing input using WM_COMPOSITION. i’ll check this out.
Russian <C-x> problem. see above.

[clean-up]

make sure it compiles and runs under non-mule. remember that some code needs the unicode support, or at least a simple version of it.
make sure it compiles and runs under pdump. see below.
clean up mswindows-multibyte, TSTR_TO_C_STRING. see below. [DONE]
eliminate last vestiges of codepage<->charset conversion and similar stuff.

[other]

cut and paste. see below.
misc issues with handling lang environments. see also August 25, "finally: working on the C-x in ...".
- when switching lang env, needs to set keyboard layout.
- user var to control whether, when moving into text of a particular language, we set the appropriate keyboard layout. we would need to have a lisp api for retrieving and setting the keyboard layout, set text properties to indicate the layout of text, and have a way of dealing with text with no property on it. (e.g. saved text has no text properties on it.) basically, we need to get a keyboard layout from a charset; getting a language would do. Perhaps we need a table that maps charsets to language environments.
- test that the lang env is properly set at startup. test that switching the lang env properly sets the C locale (call setlocale(), set LANG, etc.) – a spawned subprogram should have the new locale in its environment.
look through everything below and see if anything is missed in this priority list, and if so add it. create a separate file for the priority list, so it can be updated as appropriate.

mid-priority:

clean up the chain coding system. its list should specify decode order, not encode; i now think this way is more logical. it should check the endpoints to make sure they make sense. it should also allow for the specification of "reverse-direction coding systems": use the specified coding system, but invert the sense of decode and encode.
along with that, places that take an arbitrary coding system and expect the ends to be anything specific need to check this, and add the appropriate conversions from byte->char or char->byte.
get some support for arabic, thai, vietnamese, japanese jisx 0212: at least get the unicode information in place and make sure we have things tied together so that we can display them. worry about r2l some other time.

August 25, 2001

There is actually more non-Unicode-ized stuff, but it’s basically inconsequential. (See previous note.) You can check using the file nmkun.txt (#### RENAME), which is just a list of all the routines that have been split. (It was generated from the output of ‘nmake unicode-encapsulate’, after removing everything from the output but the function names.) Use something like

grep -F -f ../nmkun.txt -w [a-hj-z]*.[ch]  |m

in the source directory, which does a word match and skips ‘intl-unicode-win32.[ch]’ and ‘intl-win32.[ch]’, which have a whole lot of references to these, unavoidably. It effectively detects what needs to be changed because changed versions either begin ‘qxe...’ or end with A or W, and in each case there’s no whole-word match.

The nasty bug has been fixed below. The -nuni option now works – all specially-written code to handle the encapsulation has been tested by some operation (fonts by loadup and checking the output of (list-fonts ""); devmode by printing; dragdrop tests other stuff).

NOTE: for -nuni (Win 95), areas need work:

cut and paste. we should be able to receive Unicode text if it’s there, and we should be able to receive it even in Win 95 or -nuni. we should just check in all circumstances. also, under 95, when we put some text in the clipboard, it may or may not also be automatically enumerated as unicode. we need to test this out and/or just go ahead and manually do the unicode enumeration.
receiving keyboard input. we get only a single byte, but we should be able to correlate the language of the keyboard layout to a particular code page, so we can then decode it correctly.
mswindows-multibyte. still implemented as its own thing. should be done as a chain of (encoding) unicode | unicode-to-multibyte. need to turn this on, get it working, and look into optimizations in the dfc stuff. (#### perhaps there’s a general way to do these optimizations??? something like having a method on a coding system that can specify whether a pure-ASCII string gets rendered as pure-ASCII bytes and vice-versa.)

ALSO:

we have special macros TSTR_TO_C_STRING and such because formerly the ‘DFC’ macros didn’t know about external stuff that was Unicode encoded and would call strlen() on them. this is fixed, so now we should undo the special macros, make em normal, removal the comments about this, and make sure it works. [DONE]
finally: working on the C-x in Russian key layout problem. in the process will probably end up doing work on cleaning up the handling of keyboard layouts, integrating or deleting the FSF stuff, adding code to change the keyboard layout as we move in and out of text in different languages (implemented as a post-command-hook; we need something like internal-post-command-hook if not already there, for internal stuff that doesn’t want to get mixed up with the regular post-command-hook; similar for pre-command-hook). also, when langenv changes, ways to set the keyboard layout appropriately.
i think the stuff above is higher priority than the other stuff mentioned below. what i’m aiming for is to be able to input and work with multiple languages without weird glitches, both under 95 and NT. the problems above are all basic impediments to such work. we assume for the moment that the user can make use of the existing file i/o conversion stuff, and put that lower in priority, after the basic input is working.
i should get my modem connected and write up what’s going on and send it to the lists; also cvs commit my workspaces and get more testers.

August 24, 2001:

All code has been Unicode-ized except for some stuff in console-msw.c that deals with console output. Much of the Unicode-encapsulation stuff, particularly the hand-written stuff, really needs testing. I added a new command-line option, -nuni, to force use of all ANSI calls – XE_UNICODEP evaluates to false in this case.

There is a nasty bug that appeared recently, probably when the event code got Unicode-ized – bad interactions with OS sticky modifiers. Hold the shift key down and release it, then instead of affecting the next char only, it gets permanently stuck on (until you do a regular shift+char stroke). This needs to be debugged.

Other things on agenda:

go through and prioritize what’s listed below.
make sure the pdump code can compile and work. for the moment we just don’t try to dump any Unicode tables and load them up each time. this is certainly fast but ...
there’s the problem that XEmacs can’t be run in a directory with non-ASCII/Latin-1 chars in it, since it will be doing Unicode processing before we’ve had a chance to load the tables. In fact, even finding the tables in such a situation is problematic using the normal commands. my idea is to eventually load the stuff extremely extremely early, at the same time as the pdump data gets loaded. in fact, the unicode table data (stored in an efficient binary format) can even be stuck into the pdump file (which would mean as a resource to the executable, for windows). we’d need to extend pdump a bit: to allow for attaching extra data to the pdump file. (something like pdump_attach_extra_data (addr, length) returns a number of some sort, an index into the file, which you can then retrieve with pdump_load_extra_data(), which returns an addr (mmap()ed or loaded), and later you pdump_unload_extra_data() when finished. we’d probably also need pdump_attach_extra_data_append(), which appends data to the data just written out with pdump_attach_extra_data(). this way, multiple tables in memory can be written out into one contiguous table. (we’d use the tar-like trick of allowing new blocks to be written without going back to change the old blocks – we just rely on the end of file/end of memory.) this same mechanism could be extracted out of pdump and used to handle the non-pdump situation (or alternatively, we could just dump either the memory image of the tables themselves or the compressed binary version). in the case of extra unicode tables not known about at compile time that get loaded before dumping, we either just dump them into the image (pdump and all) or extract them into the compressed binary format, free the original tables, and treat them like all other tables.
C-x b when using a Russian keyboard layout. XEmacs currently tries to interpret ‘C+cyrillic char’, which causes an error. We want C-x b to still work even when the keyboard normally generates Cyrillic. What we should do is expand the keyboard event structure so that it contains not only the actual char, but what the char would have been in various other keyboard layouts, and in contexts where only certain keystrokes make sense (creating control chars, and looking up in keymaps), we proceed in order, processing each of them until we get something. order should be something like: current keyboard layout; layout of the current language environment; layout of the user’s default language; layout of the system default language; layout of US English.
reading and writing Unicode files. multiple problems:
- EOL’s aren’t handled right. for the moment, just fix the Unicode coding systems; later on, create EOL-only coding systems:
  1. - they would be character->character and operate next to the internal data; this means that coding systems need to be able to handle ends of lines that are either CR, LF, or CRLF. usually this isn’t a problem, as they are just characters like any other and get encoded appropriately. however, coding systems that are line-oriented need to recognize any of the three as line endings.
    - we’d also have to complete the stuff that handles coding systems where either end can be byte or char (four possibilities total; use a single enum such as ENCODES_CHAR_TO_BYTE, ENCODES_BYTE_TO_BYTE, etc.).
    - we’d need ways of specifying the chaining of coding systems. e.g. when reading a coding system, a user can specify more than one with a | symbol between them. when a context calls for a coding system and a chain is needed, the ‘chain’ coding system is useful; but we should really expand the contexts where a list of coding systems can be given, and whenever possible try to inline the chain instead of using a surrounding chain coding system.
    - the chain needs some work so that it passes all sorts of lstream commands down to the chain inside it – it should be entirely transparent and the fact that there’s actually a surrounding coding system should be invisible. more general coding system methods might need to be created.
    - important: we need a way of specifying how detecting works when we have more than one coding system. we might need more than a single priority list. need to think about this.
- Unicode files beginning with the BOM are not recognized as such. we need to fix this; but to make things sensible, we really need to add the idea of different levels of confidence regarding what’s detected. otherwise, Unicode says "yes this is me" but others higher up do too. in the process we should probably finish abstracting the detection system and fix up some stupidities in it.
- When writing a file, we need error detection; otherwise somebody will create a Unicode file without realizing the coding system of the buffer is Raw, and then lose all the non-ASCII/Latin-1 text when it’s written out. We need two levels
  1. - first, a "safe-charset" level that checks before any actual encoding to see if all characters in the document can safely be represented using the given coding system. FSF has a "safe-charset" property of coding systems, but it’s stupid because this information can be automatically derived from the coding system, at least the vast majority of the time. What we need is some sort of alternative-coding-system-precedence-list, langenv-specific, where everything on it can be checked for safe charsets and then the user given a list of possibilities. When the user does "save with specified encoding", they should see the same precedence list. Again like with other precedence lists, there’s also a global one, and presumably all coding systems not on other list get appended to the end (and perhaps not checked at all when doing safe-checking?). safe-checking should work something like this: compile a list of all charsets used in the buffer, along with a count of chars used. that way, "slightly unsafe" charsets can perhaps be presented at the end, which will lose only a few characters and are perhaps what the users were looking for.
    - when actually writing out, we need error checking in case an individual char in a charset can’t be written even though the charsets are safe. again, the user gets the choice of other reasonable coding systems.
    - same thing (error checking, list of alternatives, etc.) needs to happen when reading! all of this will be a lot of work!

Announcement, August 20, 2001:

I’m looking for testers. There is a complete and fast implementation in C of Unicode conversion, translations for almost all of the standardly-defined charsets that load up automatically and instantaneously at runtime, coding systems supporting the common external representations of Unicode [utf-16, ucs-4, utf-8, little-endian versions of utf-16 and ucs-4; utf-7 is sitting there with abort[]s where the coding routines should go, just waiting for somebody to implement], and a nice set of primitives for translating characters<->codepoints and setting the priority lists used to control codepoint->char lookup.

It’s so far hooked into one place: the Windows IME. Currently I can select the Japanese IME from the thing on my tray pad in the lower right corner of the screen, and type Japanese into XEmacs, and you get Japanese in XEmacs – regardless of whether you set either your current or global system locale to Japanese,and regardless of whether you set your XEmacs lang env as Japanese. This should work for many other languages, too – Cyrillic, Chinese either Traditional or Simplified, and many others, but YMMV. There may be some lurking bugs (hardly surprising for something so raw).

To get at this, checkout using ‘ben-mule-21-5’, NOT the simpler *‘mule-21-5’. For example

cvs -d :pserver:xemacs@cvs.xemacs.org:/usr/CVSroot checkout -r ben-mule-21-5 xemacs

or you get the idea. the ‘-r ben-mule-21-5’ is important.

I keep track of my progress in a file called README.ben-mule-21-5 in the root directory of the source tree.

WARNING: Pdump might not work. Will be fixed rsn.

August 20, 2001

still need to sort out demand loading, binary format, etc. figure out what the goals are and how we’re going to achieve them. for the moment let’s just say that running XEmacs in a directory with Japanese or other weird characters in the name is likely to cause problems under MS Windows, but once XEmacs is initialized (and before processing init files), all Unicode support is there.
wrote the size computation routines, although not yet tested.
lots more abstraction of coding systems; almost done.
UNICODE WORKS!!!!!

August 19, 2001

Still needed on the Unicode support:

demand loading: load the Unicode table data the first time a conversion needs to be done.
maybe: table size computation: figure out how big the in-memory tables actually are.
maybe: create a space-efficient binary format for the data, and a way to dump out an existing charset’s data into this binary format. it should allow for many such groups of data to be appended together in one file, such that you can just append the new data onto the end and not have to go back and modify anything previously. (like how tar archives work, and how the UFS? for CD-R’s and CD-RW’s works.)
maybe: figure out how to be able to access the Unicode tables at init_intl() time, before we know how to get at data-directory; that way we can handle the need for unicode conversions that come up very early, for example if XEmacs is run from a directory containing Japanese in it. Presumably we’d want to generalize the stuff in ‘pdump.c’ that deals with the dumper file, so that it can handle other files – putting the file either in the directory of the executable or in a resource, maybe actually attached to the pdump file itself – or maybe we just dump the data into the actual executable. With pdump we could extend pdump to allow for data that’s in the pdump file but not actually mapped at startup, separate from the data that does get mapped – and then at runtime the pointer gets restored not with a real pointer but an offset into the file; another pdump call and we get some way to access the data. (tricky because it might be in a resource, not a file. we might have to just tell pdump to mmap or whatever the data in, and then tell pdump to release it.)
fix multibyte to use unicode. at first, just reverse mswindows-multibyte-to-unicode to be unicode-to-multibyte; later implement something in chain to allow for reversal, for declaring the ends of the coding systems, etc.
actually make sure that the IME stuff is working!!!

Other things before announcing:

change so that the Unicode tables are not pdumped. This means we need to free any table data out there. Make sure that pdump compiles and try to finish the pretty-much-already-done stuff already with XD_STRUCT_ARRAY and dynamic size computation; just need to see what’s going on with LO_LINK.

August 14, 2001

To do a diff between this workspace and the mainline, use the most recent sync tags, currently:

cvs diff -r main-branch-ben-mule-21-5-aug-11-2001-sync -r ben-mule-21-5-post-aug-11-2001-sync

Unicode support:

Unicode support is important for supporting many languages under Windows, such as Cyrillic, without resorting to translation tables for particular Windows-specific code pages. Internally, all characters in Windows can be represented in two encodings: code pages and Unicode. With Unicode support, we can seamlessly support all Windows characters. Currently, the test in the drive to support Unicode is if IME input works properly, since it is being converted from Unicode.

Unicode support also requires that the various Windows API’s be "Unicode-encapsulated", so that they automatically call the ANSI or Unicode version of the API call appropriately and handle the size differences in structures. What this means is:

first, note that Windows already provides a sort of encapsulation of all API’s that deal with text. All such API’s are underlyingly provided in two versions, with an A or W suffix (ANSI or "wide" i.e. Unicode), and the compile-time constant UNICODE controls which is selected by the unsuffixed API. Same thing happens with structures. Unfortunately, this is compile-time only, not run-time, so not sufficient. (Creating the necessary run-time encoding is not conceptually difficult, but very time-consuming to write. It adds no significant overhead, and the only reason it’s not standard in Windows is conscious marketing attempts by Microsoft to cripple Windows 95. FUCK MICROSOFT! They even describe in a KnowledgeBase article exactly how to create such an API [although we don’t exactly follow their procedure], and point out its usefulness; the procedure is also described more generally in Nadine Kano’s book on Win32 internationalization – written SIX YEARS AGO! Obviously Microsoft has such an API available internally.)
what we do is provide an encapsulation of each standard Windows API call that is split into A and W versions. current theory is to avoid all preprocessor games; so we name the function with a prefix – "qxe" currently – and require callers to use the prefixed name. Callers need to explicitly use the W version of all structures, and convert text themselves using Qmswindows_tstr. the qxe encapsulated version will automatically call the appropriate A or W version depending on whether we’re running on 9x or NT, and copy data between W and A versions of the structures as necessary.
We require the caller to handle the actual translation of text to avoid possible overflow when dealing with fixed-size Windows structures. There are no such problems when copying data between the A and W versions because ANSI text is never larger than its equivalent Unicode representation.
We allow for incremental creation of the encapsulated routines by using the coding system Qmswindows_tstr_notyet. This is an alias for Qmswindows_multibyte, i.e. it always converts to ANSI; but it indicates that it will be changed to Qmswindows_tstr when we have a qxe version of the API call that the data is being passed to and change the code to use the new function.

Besides creating the encapsulation, the following needs to be done for Unicode support:

No actual translation tables are fed into XEmacs. We need to provide glue code to read the tables in ‘etc/unicode’. See ‘etc/unicode/README’ for the interface to implement.
Fix pdump. The translation tables for Unicode characters function as unions of structures with different numbers of indirection levels, in order to be efficient. pdump doesn’t yet support such unions. ‘charset.h’ has a general description of how the translation tables work, and the pdump code has constants added for the new required data types, and descriptions of how these should work.
ultimately, there’s no end to additional work (composition, bidi reordering, glyph shaping/ordering, etc.), but the above is enough to get basic translation working.

Merging this workspace into the trunk requires some work. ChangeLogs have not yet been created. Also, there is a lot of additional code in this workspace other than just Windows and Unicode stuff. Some of the changes have been somewhat disruptive to the code base, in particular:

the code that handles the details of processing multilingual text has been consolidated to make it easier to extend it. it has been yanked out of various files (‘buffer.h’, ‘mule-charset.h’, ‘lisp.h’, ‘insdel.c’, ‘fns.c’, ‘file-coding.c’, etc.) and put into ‘text.c’ and ‘text.h’. ‘mule-charset.h’ has also been renamed ‘charset.h’. all long comments concerning the representations and their processing have been consolidated into ‘text.c’.
‘nt/config.h’ has been eliminated and everything in it merged into ‘config.h.in’ and ‘s/windowsnt.h’. see ‘config.h.in’ for more info.
‘s/windowsnt.h’ has been completely rewritten, and ‘s/cygwin32.h’ and ‘s/mingw32.h’ have been largely rewritten. tons of dead weight has been removed, and stuff common to more than one file has been isolated into ‘s/win32-common.h’ and ‘s/win32-native.h’, similar to what’s already done for usg variants.
large amounts of code throughout the code base have been Mule-ized, not just Windows code.
‘file-coding.c/.h’ have been largely rewritten (although still mostly syncable); see below.

June 26, 2001

ben-mule-21-5

this contains all the mule work i’ve been doing. this includes mostly work done to get mule working under ms windows, but in the process i’ve [of course] fixed a whole lot of other things as well, mostly mule issues. the specifics:

it compiles and runs under windows and should basically work. the stuff remaining to do is (a) improved unicode support (see below) and (b) smarter handling of keyboard layouts. in particular, it should (1) set the right keyboard layout when you change your language environment; (2) optionally (a user var) set the appropriate keyboard layout as you move the cursor into text in a particular language.
i added a bunch of code to better support OS locales. it tries to notice your locale at startup and set the language environment accordingly (this more or less works), and call setlocale() and set LANG when you change the language environment (may or may not work).
major rewriting of file-coding. it’s mostly abstracted into coding systems that are defined by methods (similar to devices and specifiers), with the ultimate aim being to allow non-i18n coding systems such as gzip. there is a "chain" coding system that allows multiple coding systems to be chained together. (it doesn’t yet have the concept that either end of a coding system can be bytes or chars; this needs to be added.)
unicode support. very raw. a few days ago i wrote a complete and efficient implementation of unicode translation. it should be very fast, and fairly memory-efficient in its tables. it allows for charset priority lists, which should be language-environment specific (but i haven’t yet written the glue code). it works in preliminary testing, but obviously needs more testing and work. as of yet there is no translation data added for the standard charsets. the tables are in etc/unicode, and all we need is a bit of glue code to process them. see etc/unicode/README for the interface to implement.
support for unicode in windows is partly there. this will work even on windows 95. the basic model is implemented but it needs finishing up.
there is a preliminary implementation of windows ime support courtesy of ikeyama.
if you want to get cyrillic working under windows (it appears to "work" but the wrong chars currently appear), the best way is to add unicode support for iso-8859-5 and use it in redisplay-msw.c. we are already passing unicode codepoints to the text-draw routine (ExtTextOutW). (ExtTextOutW and GetTextExtentPoint32W are implemented on both 95 and NT.)
i fixed the iso2022 handling so it will correctly read in files containing unknown charsets, creating a "temporary" charset which can later be overwritten by the real charset when it’s defined. this allows iso2022 elisp files with literals in strange languages to compile correctly under mule. i also added a hack that will correctly read in and write out the emacs-specific "composition" escape sequences, i.e. ‘ESC 0’ through ‘ESC 4’. this means that my workspace correctly compiles the new file ‘devanagari.el’ that i added (see below).
i copied the remaining language-specific files from fsf. i made some minor changes in certain cases but for the most part the stuff was just copied and may not work.
i fixed post-read-conversion in coding systems to follow fsf conventions. (i also support our convention, for the moment. a kludge, of course.)
make-coding-system accepts (but ignores) the additional properties present in the fsf version, for compatibility.

[ << ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

This document was generated by Aidan Kehoe on December 27, 2016 using texi2html 1.82.

26. Multilingual Support

26.1 Introduction to Multilingual Issues #1

26.2 Introduction to Multilingual Issues #2

Introduction

Definitions and Design Basics

XEmacs Specific Definitions

26.3 Introduction to Multilingual Issues #3

Characters, Character Sets, and Encodings

Internal Representation of Text

Conversion Between Internal and External Representations

Proper Display of Multilingual Text

Inputting of Multilingual Text

26.4 Introduction to Multilingual Issues #4

26.5 Character Sets

26.6 Encodings

26.6.1 Japanese EUC (Extended Unix Code)

26.6.2 JIS7

26.7 Internal Mule Encodings

26.7.1 Internal String Encoding

26.7.2 Internal Character Encoding

26.8 Byte/Character Types; Buffer Positions; Other Typedefs

26.8.1 Byte Types

26.8.2 Different Ways of Seeing Internal Text

26.8.3 Buffer Positions

26.8.4 Other Typedefs

26.8.5 Usage of the Various Representations

26.8.6 Working With the Various Representations

26.9 Internal Text APIs

26.9.1 Basic internal-format APIs

26.9.2 The DFC API

26.9.3 The Eistring API

26.10 Coding for Mule

26.10.1 Character-Related Data Types

26.10.2 Working With Character and Byte Positions

26.10.3 Conversion to and from External Data

26.10.4 General Guidelines for Writing Mule-Aware Code

26.10.5 An Example of Mule-Aware Code

26.10.6 Mule-izing Code

26.11 CCL

26.12 Microsoft Windows-Related Multilingual Issues

26.12.1 Microsoft Documentation

26.12.2 Locales, code pages, and other concepts of “language”

26.12.3 More about code pages

26.12.4 More about locales

26.12.5 Unicode support under Windows

26.12.6 The golden rules of writing Unicode-safe code

26.12.7 The format of the locale in setlocale()

26.12.8 Random other Windows I18N docs

26.13 Modules for Internationalization

26.14 The Great Mule Merge of March 2002

26.14.1 List of changed files in new Mule workspace

Deleted files

Other deleted files

New files

Changed files

26.14.2 Changes to the MULE subsystems

configure changes

26.14.3 Pervasive changes throughout XEmacs sources

Changes to string processing

26.14.4 Changes to specific subsystems

Changes to the init code

Changes to processes

command line (‘startup.el’, ‘emacs.c’)

26.14.5 Mule changes by theme

Lisp-Visible Changes:

Internal Changes

Selection

Menubar

Unicode support:

26.14.6 File-coding rewrite

26.14.7 General User-Visible Changes

Search

Menubar

Changes to key bindings

26.14.8 General Lisp-Visible Changes

gzip support

26.14.9 User documentation

Tutorial

Sample init file

26.14.10 General internal changes

command line (‘`startup.el`’, ‘`emacs.c`’)

‘`make-docfile`’:

‘`startup.el`’

‘`frame.el`’