Demystifying “Ispell and its process have different character maps”

Summary

The Emacs package ispell.el can produce the error message “Ispell and its process have different character maps”. This error can be confusing. It actually means that the ispell package and the underlying program disagree about where word boundaries fall. That is, what the ispell package thinks is one word, the underlying program thinks is more than one word. One possible cause is a character set mismatch (say, UTF-8 vs ISO 8859-1), which disrupts how byte sequences are turned into characters. There are others. For example, you may want to check the definition of WORDCHARS in your dictionary’s .aff file and compare it to the OTHERCHARS entry in ispell-local-dictionary-alist, because a mismatch there disrupts how character sequences are turned into words.

War story

I recently spent a very frustrating day trying to drag my Windows HTML text editing set-up into the twenty-first century. That means full Unicode support, smart quotes, and (after the third time I mistyped “lifetime” as “lieftime” in the space of five minutes) on-the-fly spell checking.

Much of this was fairly easy.

I’ve been using Emacs as my preferred editor for decades so the easiest way to get my editor up to Unicode support was to move from the ancient XEmacs 21.4.22 I’d been using to Emacs 25.3. Although it doesn’t come with a full Windows installer, the zip file can be unpacked and copied into C:\Program Files\Emacs without too much trouble.

On-the-fly spell checking is easy. Emacs comes with flyspell mode which handles this just fine once basic spell checking is working.

Smart quotes aren’t a standard part of Emacs but there’s a smart-quotes package that seems to work well.

The problem was initially getting the basic spell checking working at all on Windows and, specifically, getting it working with smart quotes. Notably, I wanted to follow Unicode recommendations and use RIGHT SINGLE QUOTATION MARK (U+2019) for apostrophes rather than APOSTROPHE (U+0027). That is, I wanted “isn’t” rather than “isn't”.

There were a number of problems to solve and two things led me astray. One was the misleading error message that led me to write this page. The other was a Windows command line vs code page vs iconv problem or something like that.

In Emacs, basic spell checking is handled in two parts: a package and a program. Emacs has an ispell package that interfaces to a range of spelling programs which do the heavy lifting.

After some research I ended up selecting hunspell mostly because there was a pre-compiled Windows binary of version 1.3.2 available for download. Again, like Emacs, it’s not a full Windows installer but it does tolerate unzipping and copying into C:\Program Files (x86)\HunSpell. This is the latest version for which I could find pre-compiled Windows binaries. The latest version of the source code is 1.6.2 which may well solve some of these problems (don’t ask about my attempts to compile from sources, you’ll get an earful).

Then, following various instructions, I got Emacs pointed at the right binary path, tripped over and fixed the LANG problem, redefined the dictionaries in Emacs, selected a dictionary of choice (British) and got to the point of most of the spell checking working. I’d even followed the smart-quotes instructions and told it that U+2019 is part of a word. That had given me an Emacs initialisation file that looked something like:

(if (string= (getenv "LANG") "ENG")
    (setenv "LANG" "en_US"))
(add-to-list 'exec-path "C:/Program Files (x86)/HunSpell/bin/")
(setq ispell-program-name "hunspell")
(setq ispell-personal-dictionary "~/.ispell")
(setq ispell-local-dictionary-alist
 (quote
  (
   ("english" "[[:alpha:]]" "[^[:alpha:]]" "['’]" t ("-d" "en_US") nil utf-8)
   ("british" "[[:alpha:]]" "[^[:alpha:]]" "['’]" t ("-d" "en_GB") nil utf-8)
   )))
(setq ispell-local-dictionary "british")

(add-hook 'text-mode-hook 'turn-on-flyspell)
(add-hook 'html-mode-hook 'turn-on-flyspell)

(load-library "smart-quotes")

(add-hook 'html-mode-hook 'turn-on-smart-quotes)

However, at this point the problems really started. If I asked Emacs to check the spelling of the individual word “we’ve” with a U+2019 apostrophe then I got the error “Ispell and its process have different character maps”.

At other times I would avoid that error but any word with a U+2019 would be flagged as an error.

It didn’t help that when I was testing things at the Windows command line everything appeared to work fine. It was only much, much later that I realised why: although I was using EnableHexNumpad to allow me to type U+2019, and although it was rendering differently from U+0027 on the screen, something was translating it to U+0027 before it reached hunspell. Therefore, I thought I had hunspell correctly configured to accept words with U+2019 and that the problem had to be upstream, but in fact it wasn’t.
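One way to avoid this trap is to look at code points and bytes rather than glyphs. The following is a minimal sketch in Python (chosen only for illustration; the helper name describe_chars is hypothetical and not part of any tool mentioned here). It lists each character of a string with its code point and UTF-8 bytes, which makes U+2019 and U+0027 hard to confuse.

```python
def describe_chars(text):
    """Return one 'U+XXXX <UTF-8 bytes in hex>' string per character."""
    described = []
    for ch in text:
        utf8_hex = " ".join("{:02X}".format(b) for b in ch.encode("utf-8"))
        described.append("U+{:04X} {}".format(ord(ch), utf8_hex))
    return described


# "we’ve" typed with RIGHT SINGLE QUOTATION MARK (U+2019).
for line in describe_chars("we\u2019ve"):
    print(line)
```

If the third line of output reads U+0027 27 instead of U+2019 E2 80 99, something between the keyboard and the program has already swapped the apostrophe.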

Running hunspell from a Windows Subsystem for Linux Bash window also appeared to show everything was working, but that was a false negative: I hadn’t realised I’d switched my test word to “we’re”, and there was a tiny, subtle clue to what was going on that I’d missed.

It was only after I inserted debugging into ispell.el that I was able to make some progress.

The message “Ispell and its process have different character maps” is generated, in the middle of a long function, by this code:

      (if (and ispell-filter (listp ispell-filter))
          (if (> (length ispell-filter) 1)
              (error "Ispell and its process have different character maps")
            (setq poss (ispell-parse-output (car ispell-filter)))))

There’s nothing obvious in this code about character maps or coding schemes so it looks like something complex is going on. It isn’t. It’s just non-obvious.

At this point, ispell-filter holds the output of the ispell process for the word you’re trying to spell-check. It’s a list with one element per line of output. So the key expression (> (length ispell-filter) 1) is saying “if ispell returned more than one line of output.”

The trick is that ispell (and other programs that are compatible with ispell) produce one line for each word on their input. Since Emacs has passed just one word to the ispell program it expects one line of output.

If it gets more than one line of output then there’s a difference between what Emacs thinks constitutes a word and what ispell thinks constitutes a word. In this case the message is trying to say that Emacs thinks “we’ve” is one word but ispell thinks it’s two words, “we” and “ve”.
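The disagreement can be modelled in a few lines. This is a deliberately simplified sketch in Python, not the logic of ispell.el or hunspell: split_words and check_one_word are hypothetical helpers, and "letters plus a set of extra word characters" is an assumed stand-in for the real word-boundary rules on each side.

```python
ERROR = "Ispell and its process have different character maps"


def split_words(text, extra_wordchars=""):
    """Split text into runs of letters plus any extra word characters."""
    words, current = [], []
    for ch in text:
        if ch.isalpha() or ch in extra_wordchars:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def check_one_word(word, program_wordchars):
    """Mimic the length check: one word goes in, so one result is expected."""
    results = split_words(word, program_wordchars)
    if len(results) > 1:
        raise RuntimeError(ERROR)
    return results


word = "we\u2019ve"  # "we’ve" with U+2019
print(len(split_words(word, "'\u2019")))  # editor's view; prints 1
print(len(split_words(word, "'")))        # program's view; prints 2
```

Calling check_one_word(word, "'") raises the error, because the program's side of the model does not treat U+2019 as part of a word.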

It’s true that a possible cause of this is a mismatch in character encodings. For example, if Emacs is sending an “é” as UTF-8 it will send bytes C3 A9. However, if the program interprets the bytes as Latin-1 it will think that they are “Ô and “©”. For a word with an “é” in the middle, this will split it into two words.
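The effect is easy to reproduce. The sketch below is Python, used only for illustration; the two helper names are hypothetical, and "a run of letters" is an assumed approximation of what a spelling program treats as a word.

```python
import re


def reinterpret_as_latin1(text):
    """Encode as UTF-8, then misread the resulting bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def letter_runs(text):
    """Return the maximal runs of letters (no digits, no underscore)."""
    return re.findall(r"[^\W\d_]+", text)


original = "fianc\u00e9e"  # "fiancée"
garbled = reinterpret_as_latin1(original)  # é becomes U+00C3 U+00A9

# One word before the misreading, two after: U+00A9 (the copyright sign)
# is not a letter, so it acts as a word boundary.
print(len(letter_runs(original)), len(letter_runs(garbled)))  # prints 1 2
```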

When I was accidentally sending “we’re” to hunspell from Bash, I was getting two lines with stars. That’s because “we” and “re” are both words. However, “ve” is not a word. The two stars were the tiny subtle clue I missed. Now I’ve done more investigation, I understand what they were telling me. I’d read the stars as indicating no problems. If I’d used “we’ve” then I would have seen a report of a misspelt word. Reports of misspellings contain the word that the program thinks is wrong. This would have told me that it was treating “we” and “ve” as separate words. However, in that case I get a mass of iconv errors, which reinforces the idea that there’s a character map error, so I’m pretty sure I did that several times too. Here’s the output for the two cases:

$ echo "we’re" | '/mnt/c/Program Files (x86)/HunSpell/bin/hunspell.exe' -d en_GB -i utf-8
Hunspell 1.3.2
*
*

$ echo "we’ve" | '/mnt/c/Program Files (x86)/HunSpell/bin/hunspell.exe' -d en_GB -i utf-8
Hunspell 1.3.2
*
& ve 15 3: be, v, e, eve, vie, ave, vet, veg, vex, re, le, de, vu, me, he

error - iconv_open: CP65001 -> UTF-8
error - iconv_open: CP65001 -> ISO8859-1
error - iconv_open: CP65001 -> ISO8859-1
error - iconv_open: CP65001 -> ISO8859-1
error - iconv_open: CP65001 -> ISO8859-1
error - iconv_open: CP65001 -> ISO8859-1
error - iconv_open: CP65001 -> ISO8859-1
error - iconv_open: CP65001 -> ISO8859-1
error - iconv_open: CP65001 -> ISO8859-1
error - iconv_open: CP65001 -> ISO8859-1
error - iconv_open: CP65001 -> ISO8859-1
error - iconv_open: CP65001 -> ISO8859-1
error - iconv_open: CP65001 -> ISO8859-1
error - iconv_open: CP65001 -> ISO8859-1
error - iconv_open: CP65001 -> ISO8859-1
error - iconv_open: CP65001 -> ISO8859-1
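The result lines above can also be read mechanically. This is a minimal sketch in Python that handles only the two line shapes shown here: “*” for an accepted word and “& word count offset: suggestions” for a miss. The function name parse_result is hypothetical, and every other line (the banner, the iconv errors) is simply classed as “other”.

```python
def parse_result(line):
    """Return (kind, word, offset, suggestions) for one line of output."""
    line = line.strip()
    if line == "*":
        return ("ok", None, None, [])
    if line.startswith("& "):
        # Shape: "& <word> <count> <offset>: <s1>, <s2>, ..."
        head, _, tail = line.partition(": ")
        _, word, _count, offset = head.split(" ")
        return ("miss", word, int(offset), tail.split(", "))
    return ("other", None, None, [])


output = [
    "Hunspell 1.3.2",
    "*",
    "& ve 15 3: be, v, e, eve, vie, ave, vet, veg, vex, re, le, de, vu, me, he",
]
results = [parse_result(line) for line in output]
words_seen = [r for r in results if r[0] != "other"]
print(len(words_seen))   # prints 2
print(words_seen[1][1])  # prints ve
```

Two results for one input word is the giveaway, and the reported word “ve” shows exactly where the program split it.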

Emacs gets its definition of a word from the definition of ispell-local-dictionary-alist. Specifically, the fourth value in each dictionary’s definition—“['’]”—tells Emacs that both types of quotes are allowed in the middle of words.
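For reference, the positional fields of a dictionary entry can be given names. The sketch below uses a Python NamedTuple purely as labelled notation: the DictionaryEntry class is hypothetical, and the field names follow the ispell-dictionary-alist docstring as I understand it, so check them against your own copy of ispell.el.

```python
from typing import NamedTuple, Optional, Sequence


class DictionaryEntry(NamedTuple):
    """Named view of one entry in ispell-local-dictionary-alist."""
    dictionary_name: str
    casechars: str                # regexp: characters that make up words
    not_casechars: str            # regexp: the complement of casechars
    otherchars: str               # regexp: characters allowed inside words
    many_otherchars_p: bool       # may otherchars occur more than once?
    ispell_args: Sequence[str]    # extra arguments for the program
    extended_character_mode: Optional[str]
    character_set: str


british = DictionaryEntry(
    "british", "[[:alpha:]]", "[^[:alpha:]]", "['\u2019]",
    True, ("-d", "en_GB"), None, "utf-8",
)

# The fourth value (index 3) is the one that admits both apostrophes.
print(british[3] == british.otherchars)  # prints True
```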

In contrast, hunspell was getting its definition of which characters were allowed in the middle of words from the following line in the .aff file (it was missing from the British dictionary, but I’d added it because without it even U+0027 was not accepted):

WORDCHARS 0123456789'

Adding U+2019 alongside U+0027 on that line aligned hunspell’s definition of word boundaries with what Emacs had been told:

WORDCHARS 0123456789'’

Of course, that file declares itself to be using the ISO8859-1 character coding, which doesn’t support U+2019. Emacs rather nicely spots this for you (because it worked out the coding of the file automatically). So, I needed to change the SET line to SET UTF-8 and then use set-buffer-file-coding-system to tell Emacs to encode the file using UTF-8 when saving. This then gave me a dictionary that avoided the misleading character map error.
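The same two changes can be sketched outside Emacs. This is an illustrative Python sketch only: update_aff is a hypothetical helper, it works on the text of the .aff file already read into a string, and it assumes the SET and WORDCHARS lines look like the ones described above.

```python
def update_aff(aff_text, extra="\u2019"):
    """Declare UTF-8 and add one extra character to WORDCHARS."""
    updated_lines = []
    for line in aff_text.splitlines():
        if line.startswith("SET "):
            line = "SET UTF-8"
        elif line.startswith("WORDCHARS ") and extra not in line:
            line = line + extra
        updated_lines.append(line)
    return "\n".join(updated_lines) + "\n"


sample = "SET ISO8859-1\nTRY abc\nWORDCHARS 0123456789'\n"
updated = update_aff(sample)

# U+2019 has no ISO 8859-1 encoding, which is why the SET line and the
# file's on-disk encoding both have to change.
try:
    updated.encode("latin-1")
except UnicodeEncodeError:
    print("cannot be saved as ISO 8859-1; save as UTF-8 instead")
```

A real file would need to be read as ISO 8859-1 and written back as UTF-8 so that any other non-ASCII characters in it are re-encoded too.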

As it happened, I had tried this several times already and it hadn’t worked. This is because hunspell (at least version 1.3.2) still doesn’t know that words can be spelt with U+2019. This is why, although I’d taken the correct step with WORDCHARS several times, I kept backing it out as unhelpful. Even adding the words to my personal dictionary didn’t seem to work. I suspect later versions of hunspell avoid this issue but they’re not available as pre-compiled Windows binaries.

At this point, having decoded the error message, I knew what was going on and brute force triumphed. I forked the British dictionary I was using and went through the .dic and .aff files duplicating words and affix rules that contained apostrophes. I did this with query-replace-regexp (C-M-%), replacing “^\(.*\)'\(.*\)” with “\1'\2^J\1’\2” (where ^J is typed as C-q C-j and the U+2019 can be typed with C-x 8 ]). I then fixed up the word and suffix counts manually, pointed Emacs at the new dictionaries and finally things worked.
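For the .dic file, the duplication and the count fix-up can be sketched together. This is a hedged Python sketch rather than what I actually ran: duplicate_apostrophe_entries is a hypothetical helper, it assumes the first line of the .dic file is the entry count, and it mirrors the regexp above, so only the last apostrophe on a line is swapped in the copy. The counts in .aff rule headers are not handled.

```python
import re

# Same shape as the Emacs regexp: greedy text, an apostrophe, the rest.
APOSTROPHE_LINE = re.compile(r"^(.*)'(.*)$")


def duplicate_apostrophe_entries(dic_text):
    """Add a U+2019 copy of each entry containing U+0027; fix the count."""
    lines = dic_text.splitlines()
    entries = lines[1:]  # lines[0] is the old count, recomputed below
    out = []
    for entry in entries:
        out.append(entry)
        if APOSTROPHE_LINE.match(entry):
            out.append(APOSTROPHE_LINE.sub("\\1\u2019\\2", entry))
    return "\n".join([str(len(out))] + out) + "\n"


sample = "3\nisn't/M\nhello\nwon't\n"
result = duplicate_apostrophe_entries(sample)
print(result.splitlines()[0])  # prints 5
```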


Copyright © 2018 Steven Singer.