https://prog21.dadgum.com/29.html
A Spellchecker Used to Be a Major Feat of Software Engineering
Here's the situation: it's 1984, and you're assigned to write the
spellchecker for a new MS-DOS word processor. Some users, but not
many, will have 640K of memory in their PCs. You need to support
systems with as little as 256K. That's a quarter megabyte to contain
the word processor, the document being edited, and the memory needed
by the operating system. Oh, and the spellchecker.
For reference, on my MacBook, the standard dictionary in
/usr/share/dict/words is 2,486,813 bytes and contains 234,936 words.
An enticing first option is a data format that's more compressed than
raw text. The UNIX dictionary contains stop and stopped and stopping,
so there's a lot of repetition. A clever trie implementation might do
the trick...but we'll need a big decrease to go from 2+ megabytes to
a hundred K or so.
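To make that concrete, here's a minimal trie sketch in Python. It's
purely illustrative (nothing like a memory layout a 1984 machine could
afford), but it shows how shared prefixes collapse into shared nodes:

    class TrieNode:
        def __init__(self):
            self.children = {}    # character -> TrieNode
            self.is_word = False  # True if a word ends at this node

    def insert(root, word):
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(root, word):
        node = root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

    # stop, stopped, and stopping share the prefix "stop", so its
    # letters are stored once instead of three times.
    root = TrieNode()
    for w in ("stop", "stopped", "stopping"):
        insert(root, w)
    assert contains(root, "stopping") and not contains(root, "stoppy")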
In fact, even if we could represent each word in the spellchecker
dictionary as a single byte, we'd need almost all of the full 256K
just for that (234,936 words at one byte apiece is roughly 230K), and
of course a single-byte representation isn't going to work. So not
only does keeping the whole dictionary in RAM look hopeless, but so
does keeping the actual dictionary on disk with only an index in RAM.
Now it gets messy. We could try taking a subset of the dictionary,
one containing the most common words, and heavily compressing that so
it fits in memory. Then we come up with a slower, disk-based
mechanism for looking up the rest of the words. Or maybe we jump
directly to a completely disk-based solution using a custom database
of sorts (remembering, too, that we can't assume the user has a hard
disk, so the dictionary still needs to be crunched onto a 360K floppy
disk).
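As a rough sketch of that hybrid idea (the file names and the split
between the two tiers are assumptions for illustration, not taken from
any real product): keep a small set of common words in memory, and
fall back to a slow lookup against the full dictionary on disk.

    def load_common(path="common.txt"):
        # A few thousand frequent words -- small enough to hold in RAM
        # even on a 256K machine once compressed.
        with open(path) as f:
            return {line.strip().lower() for line in f}

    def lookup_on_disk(word, path="dictionary.txt"):
        # Slow path: search the full word list on the floppy. A real
        # implementation would binary-search a sorted file or use a
        # custom index rather than scanning line by line.
        with open(path) as f:
            return any(line.strip().lower() == word for line in f)

    def is_correct(word, common_words):
        w = word.lower()
        return w in common_words or lookup_on_disk(w)

    # Usage:
    #   common = load_common()
    #   is_correct("stopping", common)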
On top of this, we need to handle some other features, such as the
user adding new words to the dictionary.
Writing a spellchecker in the mid-1980s was a hard problem.
Programmers came up with some impressive data compression methods in
response to the spellchecker challenge. Likewise there were some very
clever data structures for quickly finding words in a compressed
dictionary. This was a problem that could take months of focused
effort to solve. (And, for the record, reducing the dictionary from
200,000+ words to 50,000 or even 20,000 was a reasonable option, but
even that doesn't leave the door open for a naive approach.)
Fast forward to today. A program to load /usr/share/dict/words into a
hash table is 3-5 lines of Perl or Python, depending on how terse you
don't mind being. Looking up a word in this hash table dictionary is a
trivial expression, one built into the language. And that's it. Sure,
you could come up with some ways to decrease the load time or reduce
the memory footprint, but that's icing and likely won't be needed.
The basic implementation is so mindlessly trivial that it could be an
exercise for the reader in an early chapter of any Python tutorial.
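Something along these lines, for instance, here in Python, with a set
serving as the hash table:

    # Load /usr/share/dict/words into a hash table (a Python set).
    with open("/usr/share/dict/words") as f:
        words = {line.strip().lower() for line in f}

    # Looking up a word is a built-in expression.
    print("stopping" in words)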
That's progress.
June 8, 2008