
25. Text



25.1 The Text in a Buffer

The text in a buffer consists of a sequence of zero or more characters. A character is an integer that logically represents a letter, number, space, or other unit of text. Most of the characters you will encounter belong to the ASCII set, but there are also characters for various sorts of accented letters, special symbols, Chinese and Japanese characters (Kanji, Katakana, etc.), Cyrillic and Greek letters, and so on. The actual number of possible characters is quite large.

For now, we can view a character as some non-negative integer that has some shape that defines how it typically appears (e.g. as an uppercase A). (The exact way in which a character appears depends on the font used to display the character.) The internal type of characters in the C code is an Ichar; this is just an int, but using a symbolic type makes the code clearer.

Between every character in a buffer is a buffer position or character position. We can speak of the character before or after a particular buffer position, and when you insert a character at a particular position, all characters after that position end up at new positions. When we speak of the character at a position, we really mean the character after the position. (This schizophrenia between a buffer position being “between” two characters and “on” a character is rampant in Emacs.)

Buffer positions are numbered starting at 1. This means that position 1 is before the first character, and position 0 is not valid. If there are N characters in a buffer, then buffer position N+1 is after the last one, and position N+2 is not valid.

The internal makeup of the Ichar integer varies depending on whether we have compiled with MULE support. If not, the Ichar integer is an 8-bit integer with possible values from 0 - 255. 0 - 127 are the standard ASCII characters, while 128 - 255 are the characters from the ISO-8859-1 character set. If we have compiled with MULE support, an Ichar is a 21-bit integer, with the various bits having meanings according to a complex scheme that will be detailed later. The characters numbered 0 - 255 still have the same meanings as for the non-MULE case, though.
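
As a purely illustrative sketch of this scheme, a range check might look as follows; the names ICHAR_MAX and valid_ichar_p are hypothetical, and since the 21-bit MULE code space is not fully populated, the real check is more involved:

typedef int Ichar;          /* symbolic type for a character */

#ifdef MULE
#define ICHAR_MAX ((1 << 21) - 1)   /* 21-bit code space */
#else
#define ICHAR_MAX 0xFF              /* 0 - 127 ASCII, 128 - 255 ISO-8859-1 */
#endif

/* hypothetical validity check; the real MULE scheme assigns
   meanings to the various bits, so not every 21-bit value is a
   valid character */
static int
valid_ichar_p (Ichar ch)
{
  return ch >= 0 && ch <= ICHAR_MAX;
}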

Internally, the text in a buffer is represented in a fairly simple fashion: as a contiguous array of bytes, with a gap of some size in the middle. Although the gap is of some substantial size in bytes, there is no text contained within it: From the perspective of the text in the buffer, it does not exist. The gap logically sits at some buffer position, between two characters (or possibly at the beginning or end of the buffer). Insertion of text in a buffer at a particular position is always accomplished by first moving the gap to that position (i.e. through some block moving of text), then writing the text into the beginning of the gap, thereby shrinking the gap. If the gap shrinks down to nothing, a new gap is created. (What actually happens is that a new gap is “created” at the end of the buffer’s text, which requires nothing more than changing a couple of indices; then the gap is “moved” to the position where the insertion needs to take place by moving up in memory all the text after that position.) Similarly, deletion occurs by moving the gap to the place where the text is to be deleted, and then simply expanding the gap to include the deleted text. (Expanding and shrinking the gap as just described means just that the internal indices that keep track of where the gap is located are changed.)
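
Here is a minimal sketch of that gap-moving logic, under simplifying assumptions: the struct and function names are hypothetical, the sketch uses 0-based byte offsets (the real indices, described below, are 1-based), and gap enlargement is omitted:

#include <string.h>

struct text
{
  unsigned char *beg;   /* start of the allocated block */
  long gap_start;       /* offset at which the gap begins */
  long gap_size;        /* size of the gap, in bytes */
};

/* move the gap so that it begins at offset POS, by block-moving
   the bytes that lie between the old and new gap locations */
static void
move_gap (struct text *t, long pos)
{
  if (pos < t->gap_start)
    memmove (t->beg + pos + t->gap_size, t->beg + pos,
             t->gap_start - pos);
  else if (pos > t->gap_start)
    memmove (t->beg + t->gap_start,
             t->beg + t->gap_start + t->gap_size,
             pos - t->gap_start);
  t->gap_start = pos;
}

/* insertion writes into the beginning of the gap and shrinks it;
   we assume LEN <= gap_size (the real code creates a new gap when
   the old one fills up, as described above) */
static void
insert_bytes (struct text *t, long pos, const unsigned char *str,
              long len)
{
  move_gap (t, pos);
  memcpy (t->beg + t->gap_start, str, len);
  t->gap_start += len;
  t->gap_size -= len;
}

/* deletion is just gap expansion: no bytes move at all */
static void
delete_bytes (struct text *t, long pos, long len)
{
  move_gap (t, pos);
  t->gap_size += len;
}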

Note that the total amount of memory allocated for a buffer text never decreases while the buffer is live. Therefore, if you load up a 20-megabyte file and then delete all but one character, there will be a 20-megabyte gap, which won’t get any smaller (except by inserting characters back again). Once the buffer is killed, the memory allocated for the buffer text will be freed, but it will still be sitting on the heap, taking up virtual memory, and will not be released back to the operating system. (However, if you have compiled XEmacs with rel-alloc, the situation is different. In this case, the space will be released back to the operating system. However, this tends to result in a noticeable speed penalty.)

Astute readers may notice that the text in a buffer is represented as an array of bytes, while (at least in the MULE case) an Ichar is a 21-bit integer, which clearly cannot fit in a byte. This means (of course) that the text in a buffer uses a different representation from an Ichar: specifically, the 21-bit Ichar becomes a series of one to four bytes. The conversion between these two representations is complex and will be described later.

In the non-MULE case, everything is very simple: An Ichar is an 8-bit value, which fits neatly into one byte.

If we are given a buffer position and want to retrieve the character at that position, we need to follow these steps:

  1. Pretend there’s no gap, and convert the buffer position into a byte index that indexes to the appropriate byte in the buffer’s stream of textual bytes. By convention, byte indices begin at 1, just like buffer positions. In the non-MULE case, byte indices and buffer positions are identical, since one character equals one byte.
  2. Convert the byte index into a memory index, which takes the gap into account. The memory index is a direct index into the block of memory that stores the text of a buffer. This basically just involves checking to see if the byte index is past the gap, and if so, adding the size of the gap to it. By convention, memory indices begin at 1, just like buffer positions and byte indices, and when referring to the position that is at the gap, we always use the memory position at the beginning, not at the end, of the gap.
  3. Fetch the appropriate bytes at the determined memory position.
  4. Convert these bytes into an Ichar.

In the non-MULE case, steps (3) and (4) boil down to a simple one-byte memory access.
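
To make the recipe concrete, here is a hedged sketch of steps (1)-(4); the global variable names are hypothetical stand-ins for the buffer-text fields, and the typedefs are the ones introduced just below:

typedef int Charbpos;   /* buffer (character) position */
typedef int Bytebpos;   /* byte index */
typedef int Membpos;    /* memory index */
typedef int Ichar;
typedef unsigned char Ibyte;

extern Ibyte *text_beg;   /* the allocated block, gap included */
extern Bytebpos gap_bi;   /* byte index at which the gap sits */
extern long gap_size;     /* size of the gap, in bytes */

/* step 2, for positions: indices past the gap are offset by the
   gap size; an index exactly at the gap maps to the beginning of
   the gap, per the convention above */
static Membpos
bytebpos_to_membpos (Bytebpos bi)
{
  return (Membpos) (bi <= gap_bi ? bi : bi + gap_size);
}

/* the whole recipe, in the non-MULE case where one character is
   one byte; under MULE, step 1 needs the char<->byte conversion
   machinery of section 25.3, and step 4 decodes one to four bytes */
static Ichar
char_at (Charbpos pos)
{
  Bytebpos bi = (Bytebpos) pos;    /* step 1: identity without MULE */
  /* steps 2 and 3: here we must use >=, not >, because the byte
     after a position lying exactly at the gap physically follows
     the gap */
  Membpos mi = (Membpos) (bi >= gap_bi ? bi + gap_size : bi);
  Ibyte b = text_beg[mi - 1];      /* all three indices are 1-based */
  return (Ichar) b;                /* step 4: trivial without MULE */
}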

Note that we have defined three types of positions in a buffer:

  1. buffer positions or character positions, typedef Charbpos
  2. byte indices, typedef Bytebpos
  3. memory indices, typedef Membpos

All three typedefs are just ints, but defining them this way makes things a lot clearer.

Most code works with buffer positions. In particular, all Lisp code that refers to text in a buffer uses buffer positions. Lisp code does not know that byte indices or memory indices exist.

Finally, we have a typedef for the bytes in a buffer. This is an Ibyte, which is an unsigned char. Referring to them as Ibytes underscores the fact that we are working with a string of bytes in the internal Emacs buffer representation rather than in one of a number of possible alternative representations (e.g. EUC-encoded text).



25.2 Ibytes and Ichars

This is documented under the internationalization support: see section Byte/Character Types; Buffer Positions; Other Typedefs.



25.3 Byte-Char Position Conversion

Oct 2004:

This is what I wrote when describing the previous algorithm:

The basic algorithm we use is to keep track of a known region of characters in each buffer, all of which are of the same width. We keep track of the boundaries of the region in both Charbpos and Bytebpos coordinates and also keep track of the char width, which is 1 - 4 bytes. If the position we’re translating is not in the known region, then we invoke a function to update the known region to surround the position in question. This assumes locality of reference, which is usually the case.

Note that the function to update the known region can be simple or complicated depending on how much information we cache. In addition to the known region, we always cache the correct conversions for point, BEGV, and ZV, and in addition to this we cache 16 positions where the conversion is known. We only look in the cache or update it when we need to move the known region more than a certain amount (currently 50 chars), and then we throw away a “random” value and replace it with the newly calculated value.
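
A hedged sketch of the fast path this describes (names are illustrative; the region-update function is the complicated, cache-consulting part and is omitted):

static Charbpos known_char_beg, known_char_end;
static Bytebpos known_byte_beg;
static int known_shifter;  /* 0, 1, or 2, for widths 1, 2, and 4;
                              width 3 cannot use a plain shift (see
                              the multiply/divide tricks below) */

extern Bytebpos update_known_region_and_retry (Charbpos pos);

/* inside the known region the conversion is an offset plus a
   shift; outside it, move the known region (consulting the cached
   positions) and try again */
static Bytebpos
charbpos_to_bytebpos_sketch (Charbpos pos)
{
  if (pos >= known_char_beg && pos <= known_char_end)
    return known_byte_beg
      + ((pos - known_char_beg) << known_shifter);
  return update_known_region_and_retry (pos);
}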

Finally, we maintain an extra flag that tracks whether the buffer is entirely ASCII, to speed up the conversions even more. This flag is actually of dubious value because in an entirely-ASCII buffer the known region will always span the entire buffer (in fact, we update the flag based on this fact), and so all we’re saving is a few machine cycles.

A potentially smarter method than what we do with known regions and cached positions would be to keep some sort of pseudo-extent layer over the buffer; maybe keep track of the charbpos/bytebpos correspondence at the beginning of each line, which would allow us to do a binary search over the pseudo-extents to narrow things down to the correct line, at which point you could use a linear movement method. This would also mesh well with efficiently implementing a line-numbering scheme. However, you have to weigh the amount of time spent updating the cache vs. the savings that result from it. In reality, we modify the buffer far less often than we access it, so a cache of this sort that provides guaranteed O(log N) performance (or perhaps O(N log N), if we set a maximum on the cache size) would indeed be a win, particularly in very large buffers. If we ever implement this, we should probably set a reasonably high minimum below which we use the old method, because the time spent updating the fancy cache would likely become dominant when making buffer modifications in smaller buffers.

Note also that we have to multiply or divide by the char width in order to convert the positions. We do some tricks to avoid ever actually having to do a multiply or divide, because that is typically an expensive operation (esp. divide). Multiplying or dividing by 1, 2, or 4 can be implemented simply as a shift left or shift right, and we keep track of a shifter value (0, 1, or 2) indicating how much to shift. Multiplying by 3 can be implemented by doubling and then adding the original value. Dividing by 3, alas, cannot be implemented in any simple shift/subtract method, as far as I know; so we just do a table lookup. For simplicity, we use a table of size 128K, which indexes the “divide-by-3” values for the first 64K non-negative numbers. (Note that we can increase the size up to 384K, i.e. indexing the first 192K non-negative numbers, while still using shorts in the array.) This also means that the size of the known region can be at most 64K for width-three characters.
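
A sketch of those tricks (sizes follow the description above; the names are illustrative):

/* multiplying by a width of 1, 2, or 4 is a shift; multiplying by
   3 is double-and-add: 3 * n == (n << 1) + n */
static long
chars_to_bytes (long chars, int width, int shifter)
{
  return width == 3 ? (chars << 1) + chars : chars << shifter;
}

/* dividing by 3 uses a 64K-entry table of shorts (128K of memory),
   which is why a width-three known region is capped at 64K */
static unsigned short div3[65536];

static void
init_div3 (void)
{
  long i;
  for (i = 0; i < 65536; i++)
    div3[i] = (unsigned short) (i / 3);
}

static long
bytes_to_chars (long bytes, int width, int shifter)
{
  /* assumes bytes < 65536 for width 3, i.e. the region size cap */
  return width == 3 ? (long) div3[bytes] : bytes >> shifter;
}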

Unfortunately, it turned out that the implementation had serious problems which had never been corrected. In particular, the known region had a large tendency to become zero-length and stay that way.

So I decided to port the algorithm from FSF 21.3, in markers.c.

This algorithm is fairly simple. Instead of using markers, I kept the cache array of known positions from the previous implementation.

Basically, we keep a number of positions cached:

For each position, we CONSIDER() it. This means:

Otherwise, once we have an enclosing range, see which side is closer, and iterate until we find the desired value. As an optimization, I replaced the simple loop in FSF with the use of bytecount_to_charcount(), charcount_to_bytecount(), bytecount_to_charcount_down(), or charcount_to_bytecount_down(). (The latter two I added for this purpose.) These scan 4 or 8 bytes at a time through purely single-byte characters.
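
The speedup in those functions comes from testing a whole machine word for high bits at once; a rough sketch of the idea (not the actual implementation; multibyte_length is a hypothetical helper returning the byte length of the character at a given pointer):

#include <stdint.h>

extern int multibyte_length (const unsigned char *p); /* hypothetical */

static long
bytecount_to_charcount_sketch (const unsigned char *ptr, long len)
{
  long chars = 0;
  const unsigned char *end = ptr + len;

  while (ptr < end)
    {
      /* fast path: an aligned word with no high bit set anywhere
         is 4 (or 8) single-byte characters at once */
      if ((uintptr_t) ptr % sizeof (uintptr_t) == 0
          && ptr + sizeof (uintptr_t) <= end)
        {
          uintptr_t word = *(const uintptr_t *) ptr;
          if (!(word & (uintptr_t) 0x8080808080808080ULL))
            {
              ptr += sizeof (uintptr_t);
              chars += sizeof (uintptr_t);
              continue;
            }
        }
      /* slow path: step over one possibly multi-byte character */
      ptr += multibyte_length (ptr);
      chars++;
    }
  return chars;
}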

If the amount we had to scan was more than our “far away” distance (5000 characters, see above), then cache the new position.

#### Things to do:

Note that FSF’s algorithm checked ALL markers, not just the ones cached by this algorithm. This includes markers created by the user as well as both ends of any overlays. We could do likewise, and our extents could keep both byte and character positions rather than just the former. (But this would probably be overkill. We should just use our cache instead. Any place an extent was set was surely already visited by the char<->byte conversion routines.)



25.4 Searching and Matching

Very incomplete, limited to a brief introduction.

People find the searching and matching code difficult to understand. And indeed, the details are hard. However, the basic structures are not so complex. First, there’s a hard question with a simple answer. What about Mule? The answer here is that it turns out that Mule characters can be matched byte by byte, so neither the search code nor the regular expression code need take much notice of it at all! Of course, we add some special features (such as regular expressions that match only certain charsets), but these do not require new concepts. The main exception is that wild-card matches in Mule have to be careful to swallow whole characters. This is handled using the same basic macros that are used for buffer and string movements.

This will also be true if a UTF-8 representation is used for the internal encoding.
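
“Swallowing whole characters” just means advancing by one character rather than one byte. A hedged sketch, assuming a UTF-8-like representation where continuation bytes have the form 10xxxxxx (the actual internal format differs in detail, but the macro shape is the same):

/* true if B is a continuation byte rather than the first byte of
   a character */
#define CONTINUATION_BYTE_P(b) (((b) & 0xC0) == 0x80)

/* advance P by exactly one character; assumes well-formed text
   with a terminator or sentinel after the last character */
static const unsigned char *
next_char (const unsigned char *p)
{
  p++;
  while (CONTINUATION_BYTE_P (*p))
    p++;
  return p;
}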

The complex search algorithms are the ones for simple string searches. In particular, the algorithm used for fast string searching is Boyer-Moore. This algorithm is based on the idea that if you have a mismatch at a given position, you can precompute where to restart the search. This means that you can often make far fewer than N character comparisons, where N is the position at which the match is found, or the size of the text if it contains no match. That’s fast! But it’s not easy. You must “compile” the search string into a jump table. See the source, ‘search.c’, for more information.
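
As a rough illustration of the jump-table idea (not the actual code in ‘search.c’, which handles case translation and much else), here is a sketch of the simplified Boyer-Moore-Horspool variant, which precomputes, for each byte value, how far the search window can safely skip on a mismatch:

#include <stddef.h>

static const char *
bmh_search (const char *text, size_t tlen,
            const char *pat, size_t plen)
{
  size_t skip[256];
  size_t i;

  if (plen == 0)
    return text;
  if (plen > tlen)
    return NULL;

  /* a byte that occurs nowhere in the pattern lets us skip the
     whole pattern length */
  for (i = 0; i < 256; i++)
    skip[i] = plen;
  /* bytes that do occur (except in the last slot) allow smaller,
     alignment-preserving skips */
  for (i = 0; i + 1 < plen; i++)
    skip[(unsigned char) pat[i]] = plen - 1 - i;

  for (i = 0; i + plen <= tlen; )
    {
      size_t j = plen;
      /* compare backwards from the end of the window */
      while (j > 0 && text[i + j - 1] == pat[j - 1])
        j--;
      if (j == 0)
        return text + i;   /* match found */
      i += skip[(unsigned char) text[i + plen - 1]];
    }
  return NULL;
}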

Emacs changes the basic algorithms somewhat in order to handle case-insensitive searches without a full-blown regular expression.

Regular expressions, on the other hand, have a trivial search implementation: try a match at each position. (Under POSIX rules, it’s a bit more complex, because POSIX requires that you find the longest match in the text. This means you keep a record of the best match so far, and find all the matches.)
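
In terms of the pattern structure and match() sketch developed below, the trivial (non-POSIX) search driver amounts to this (illustrative; assumes BUF is NUL-terminated at LEN):

static long
re_search_sketch (struct pattern *pat, char *buf, long len)
{
  long pos;

  /* try a match at each successive position; the first (leftmost)
     match wins under the traditional, non-POSIX rule */
  for (pos = 0; pos <= len; pos++)
    if (match (pat, buf + pos) == matched)
      return pos;
  return -1;   /* no match anywhere */
}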

The matching code for regular expressions is quite complex. First, the regular expression itself is compiled. There are two basic approaches that could be taken. The first is to compile the expression into tables to drive a generic finite automaton emulator. This is the approach given in many textbooks (Sedgewick’s Algorithms and Aho, Sethi, and Ullman’s Compilers: Principles, Techniques, and Tools, aka “The Dragon Book”) as well as being used by the ‘lex’ family of lexical analysis engines.

Emacs uses a somewhat different technique. The expression is compiled into a form of bytecode, which is interpreted by a special interpreter. The interpreter itself basically amounts to an inline implementation of the finite automaton emulator. The advantage of this technique is that it’s easier to add special features, such as control of case-sensitivity via a global variable.

The compiler is not treated here. See the source, ‘regex.c’. The interpreter, although it is divided into several functions and looks fearsomely complex, is actually quite simple in concept. Basically, what it does is a strcmp on steroids:

 
int
strcmp (const char *p,      /* pattern pointer */
        const char *b)      /* buffer pointer  */
{
  /* stop at the first mismatch or at the terminating NUL */
  while (*p && *p == *b)
    p++, b++;
  return *p - *b;           /* the differing bytes decide the result */
}

Really, it’s no harder than that. (A bit of a white lie, OK?)

How does the regexp code generalize this?

  1. Depending on the pattern, *b may have a general relationship to *p. I.e., direct comparison against *p is generalized to include checks for set membership and context-dependent properties. Such properties depend on the position in the buffer, i.e. on the pointer b itself, not merely on the byte *b.
  2. Although b must advance step by step (to ensure that the algorithm terminates), p can branch and jump.
  3. The information returned is much greater, including information about subexpressions.

We’ll ignore (3). (2) is mostly interesting when compiling the regular expression. Now we have

 
enum operator_t {
  accept = 0,
  exact,
  any,
  range,
  group,       /* actually, these are probably */
  repeat,      /* turned into conditional code */
  /* etc */
};

enum status_t {
  working = 0,
  matched,
  mismatch,
  end_of_buffer,
  error
};

struct pattern {
  enum operator_t operator;
  char char_value;
  unsigned char range_table[256];  /* boolean flags, one per byte */
  /* etc, etc */
};

/* declared here for completeness; their behavior is described in
   the text below */
enum status_t match_1_operator (struct pattern *p, char *b);
struct pattern *next_p (struct pattern *p, char *b);
char *next_b (struct pattern *p, char *b);

enum status_t
match (struct pattern *p,   /* pattern pointer */
       char *b)             /* buffer pointer  */
{
  enum status_t done = working;

  /* working == 0, so the loop runs until some operator settles
     the match one way or the other */
  while (!(done = match_1_operator (p, b)))
    {
      struct pattern *p1 = p;
      p = next_p (p, b);
      b = next_b (p1, b);
    }
  return done;
}

This format exposes the underlying finite automaton.

The match_1_operator, next_p, and next_b functions all have the following structure, except that the ‘next_*’ functions decide where to jump (for ‘p’) and whether or not to increment (for ‘b’), rather than checking for satisfaction of a matching condition.

 
enum status_t
match_1_operator (struct pattern *p, char *b)
{
  if (! *b) return end_of_buffer;
  switch (p->operator)
    {
    case accept:
      return matched;
    case exact:
      if (*b != p->char_value) return mismatch; else break;
    case any:
      break;
    case range:
      /* range_table is computed in the regexp_compile function */
      if (! p->range_table[(unsigned char) *b]) return mismatch;
      break;
    /* etc, etc */
    }
  return working;
}

Grouping, repetition, and alternation are handled by compiling the subexpression and calling match (p->subpattern, b) recursively.

In terms of reading the actual code, there are five optimizations (obfuscations, if you like) that have been done.

  1. An explicit “failure stack” has been substituted for recursion.
  2. The match_1_operator, next_p, and next_b functions are actually inlined into the match function for efficiency. Then the pointer movement is interspersed with the matching operations.
  3. If the operator uses buffer context, the buffer pointer movement is sometimes implicit in the operations retrieving the context.
  4. Some cases are combined: a short preparation specific to an individual case “falls through” into code shared by several cases.
  5. The pattern type is not an explicit ‘struct’. Instead, the data (including, e.g., ‘range_table’) is inlined into the compiled bytecode. This leads to bizarre code in the interpreter like
     
    case range:
      p += *(p + 1); break;
    

    in next_p, because the compiled pattern is laid out

     
    ..., 'range', count, first_8_flags, second_8_flags, ..., next_op, ...
    

But if you keep your eye on the “switch in a loop” structure, you should be able to understand the parts you need.

