[HN Gopher] Douglas McIlroy responds to Unix spell article with ...
___________________________________________________________________
Douglas McIlroy responds to Unix spell article with new
implementation details
I originally wrote this article explaining the data model as
implemented in the original Unix "spell":
https://blog.codingconfessions.com/p/how-unix-spell-ran-in-6....
This reached Doug, who responded with details of the follow-up work
he did, which was never published anywhere.
Author : abhi9u
Score : 52 points
Date : 2025-02-06 14:03 UTC (3 days ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| dang wrote:
| Recent and related:
|
| _How Unix spell ran in 64kb RAM_ -
| https://news.ycombinator.com/item?id=42752604 - Jan 2025 (51
| comments)
| re wrote:
| The text of the email (courtesy of Firefox/macOS Text
| recognition):
|
| ___________________
|
| Subject: yes, it's me, the author of "spell"
|
| You did a nice job of gently describing "spell"'s data
| representation. But the formula x = 2^{log_2(m)}-m is a
| complicated way to say x=0. Did you mean that?
|
| When memory got bigger, I kept the literal dictionary in memory,
| but compressed it on secondary memory for fast transfer. The
| compression was trivial: store a suffix preceded by one byte that
| contained the length of the prefix that the word shared with its
| predecessor in dictionary order.
|
| We also added one byte per entry for affixing info, e.g. whether
| in- (or im- or ir-) is preferred over un-, whether a final
| consonant is doubled before adding a suffix that begins with a
| vowel, part of speech, etc. Although there were lots of such
| attributes, the number of distinct combinations of attributes
| that actually occurred was fewer than 256, so could be coded into
| one byte, and decoded by lookup in a 256-entry table. I automated
| the construction of that table, which would change if a
| dictionary revision created a new combination of attributes.
|
| With affixing info we could at the same time be more aggressive
| about affixing, and make fewer mistakes like allowing -s on a
| word that cannot be used as a noun or a verb.
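|
| A minimal C sketch of the front-coding scheme the email describes
| (one byte for the shared-prefix length, then the differing
| suffix). Buffer sizes, the NUL terminator between entries, and
| all names are assumptions for illustration, not the original
| implementation:
|
|     #include <stdio.h>
|     #include <string.h>
|
|     /* Write words[0..n-1] (sorted) as: one byte holding the
|      * length of the prefix shared with the previous word, then
|      * the remaining suffix, NUL-terminated. */
|     static void encode(FILE *out, char **words, int n)
|     {
|         char prev[512] = "";
|         for (int i = 0; i < n; i++) {
|             unsigned char shared = 0;
|             while (shared < 255 && prev[shared] &&
|                    prev[shared] == words[i][shared])
|                 shared++;
|             fputc(shared, out);             /* prefix length */
|             fputs(words[i] + shared, out);  /* suffix        */
|             fputc('\0', out);
|             strncpy(prev, words[i], sizeof prev - 1);
|             prev[sizeof prev - 1] = '\0';
|         }
|     }
|
|     /* Stream the dictionary back out, rebuilding each word from
|      * the previous one plus its stored suffix. */
|     static void decode(FILE *in)
|     {
|         char word[512] = "";
|         int c;
|         while ((c = fgetc(in)) != EOF) {
|             int len = c;                    /* reuse prefix  */
|             while ((c = fgetc(in)) != EOF && c != '\0')
|                 word[len++] = (char)c;
|             word[len] = '\0';
|             puts(word);
|         }
|     }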
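|
| And a sketch of the one-byte attribute coding: only the attribute
| combinations that actually occur get a code, so an index into a
| table of at most 256 entries fits in one byte. The flag names are
| invented here for illustration; the real attribute set differed:
|
|     #include <stdint.h>
|     #include <stdio.h>
|
|     enum {                        /* illustrative attributes    */
|         A_IN_NOT_UN    = 1u << 0, /* in-/im-/ir- rather than un- */
|         A_DOUBLE_FINAL = 1u << 1, /* double final consonant     */
|         A_NOUN         = 1u << 2,
|         A_VERB         = 1u << 3
|     };
|
|     static uint32_t combo[256];   /* decode table: code -> set  */
|     static int ncombo;
|
|     /* Return the one-byte code for an attribute set, extending
|      * the table when a new combination appears (built while
|      * constructing the dictionary, regenerated on revisions). */
|     static int attr_code(uint32_t attrs)
|     {
|         for (int i = 0; i < ncombo; i++)
|             if (combo[i] == attrs)
|                 return i;
|         if (ncombo == 256) {
|             fprintf(stderr, "too many attribute combinations\n");
|             return -1;
|         }
|         combo[ncombo] = attrs;
|         return ncombo++;
|     }
|
|     /* Decoding is a single table lookup. */
|     static uint32_t attr_decode(unsigned char code)
|     {
|         return combo[code];
|     }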
| cb321 wrote:
| > The compression was trivial: store a suffix preceded by one
| byte that contained the length of the prefix that the word shared
| with its predecessor in dictionary order
|
| This is more or less what was speculated to be compatible with
| his stemming ideas [1], but I (and partly McIlroy 1982) doubt
| you ever really needed the fancy hashing-compression in the
| first place (though I'm sure it was fun to figure out!). I got
| better perf than v7spell on v7 (3x faster on a 5500-word
| document with an 82,000-word "unabridged" dictionary that
| prefix-compresses to only 278K, which would fit on a 360K
| floppy of the era) with just a merge algorithm somewhat more
| finely tuned against that format. Of course, on a shared
| computer, getting either the streaming disk access (without
| seeks spoiling the party) or all that RAM to yourself (without
| swapping spoiling the party) is a guess. And, yeah, probably
| the background Doug mentions is all in the
| combined-unix-git-history by now { and it would not have been
| as attractive an academic paper :-) }.
|
| [1] https://news.ycombinator.com/item?id=42931145
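|
| A rough sketch of spell-checking by merging against that
| prefix-compressed format in one sequential pass, with the
| document's words sorted and deduplicated first; the format
| details and names are assumptions, not v7 spell or the tuned
| version described above:
|
|     #include <stdio.h>
|     #include <string.h>
|
|     /* Read the next dictionary word from a front-coded stream
|      * into buf, which must still hold the previous word.
|      * Returns 0 at end of stream. */
|     static int next_dict_word(FILE *dict, char *buf)
|     {
|         int c = fgetc(dict);
|         if (c == EOF)
|             return 0;
|         int len = c;                     /* shared-prefix length */
|         while ((c = fgetc(dict)) != EOF && c != '\0')
|             buf[len++] = (char)c;
|         buf[len] = '\0';
|         return 1;
|     }
|
|     /* words[] sorted and unique; print those absent from dict. */
|     static void report_unknown(char **words, int n, FILE *dict)
|     {
|         char dword[512] = "";
|         int have = next_dict_word(dict, dword);
|         for (int i = 0; i < n; i++) {
|             while (have && strcmp(dword, words[i]) < 0)
|                 have = next_dict_word(dict, dword);
|             if (!have || strcmp(dword, words[i]) != 0)
|                 puts(words[i]);
|         }
|     }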
___________________________________________________________________
(page generated 2025-02-09 23:00 UTC)