[HN Gopher] Douglas McIlroy responds to Unix spell article with ...
       ___________________________________________________________________
        
       Douglas McIlroy responds to Unix spell article with new
       implementation details
        
       I originally wrote this article explaining the data model as
       implemented in the original Unix "spell":
       https://blog.codingconfessions.com/p/how-unix-spell-ran-in-6....
       This reached Doug, who then responded to it with the follow up work
       he did which wasn't published anywhere.
        
       Author : abhi9u
       Score  : 52 points
       Date   : 2025-02-06 14:03 UTC (3 days ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | dang wrote:
       | Recent and related:
       | 
       |  _How Unix spell ran in 64kb RAM_ -
       | https://news.ycombinator.com/item?id=42752604 - Jan 2025 (51
       | comments)
        
       | re wrote:
       | The text of the email (courtesy of Firefox/macOS Text
       | recognition):
       | 
       | ___________________
       | 
       | Subject: yes, it's me, the author of "spell"
       | 
       | You did a nice job of gently describing "spell"s data
       | representation. But the formula x = 2^{log_2(m)}-m is a
       | complicated way to say x=0. Did you mean that?
       | 
       | When memory got bigger, I kept the literal dictionary in memory,
       | but compressed it on secondary memory for fast transfer. The
       | compression was trivial: store a suffix preceded by one byte that
       | contained the length of the prefix that the word shared with its
       | predecessor in dictionary order.
       | 
       | We also added one byte per enry for affixing info, e.g. whether
       | in- (or im- or ir-) is preferred over un-, whether a final
       | consonant is doubled before adding a suffix that begins with a
       | vowel, part of speech, etc. Although there were lots of such
       | attributes, the number of distinct combinations of attributes
       | that actually occurred was fewer than 256, so could be coded into
       | one byte, and decoded by lookup in a 256-entry table. I automated
       | the construction of that table, which would change if a
       | dictionary revision created a new combination of attributes.
       | 
       | With affixing info we could at the same time be more aggressive
       | about affixing, and make fewer mistakes like allowing -s on a
       | word that cannot be used as a noun or a verb.
        
       | cb321 wrote:
       | > The compression was trivial: store a suffix preceded by one
       | byte that contained the length of the prefix that the word shared
       | with its predecessor in dictionary order
       | 
       | This is more or less what was speculated upon as compatible with
       | his stemming ideas [1], but I (and partly McIlroy1982) doubt you
       | ever really needed the fancy hashing-compression in the first
       | place (though I'm sure it was fun to figure out!). I got better
       | perf (3x faster on a 5500 word document with an 82,000 word
       | "unabridged" dictionary that prefix-compresses to only 278K which
       | would fit on a 360K floppy of the era) than v7spell on v7 with
       | just a merge algorithm somewhat more finely tuned against that
       | format. Of course, on a shared computer getting either the
       | streaming disk access (without seeks spoiling the party) or all
       | that RAM to yourself (without swapping spoiling the party) is a
       | guess. And, yeah, probably the background Doug mentions is all in
       | the combined-unix-git-history by now { and it would not have been
       | as attractive an academic paper :-) }.
       | 
       | [1] https://news.ycombinator.com/item?id=42931145
        
       ___________________________________________________________________
       (page generated 2025-02-09 23:00 UTC)