[HN Gopher] How to Write a Spelling Corrector (2007)
       ___________________________________________________________________
        
       How to Write a Spelling Corrector (2007)
        
       Author : fybs
       Score  : 64 points
       Date   : 2021-09-16 13:25 UTC (1 day ago)
        
 (HTM) web link (norvig.com)
 (TXT) w3m dump (norvig.com)
        
       | pandatigox wrote:
       | I've had a thought and am curious how people would solve it.
       | Sometimes, if you copy words off a PDF lecture slide, all the
       | words are mashed together (e.g. Hello Foo bar -> HelloFoobar).
       | Is this an AI domain, or can it be solved by simple programming?
        
         | ghusbands wrote:
         | It's worth noting that different PDF-to-text tools (and
         | different PDF-display tools that support text copying) give
         | different results, so it can be worth trying a few. For the
         | most part, the vector parts of a PDF and their association
         | with the text are sufficient to recover spacing information.
        
         | eutectic wrote:
         | I got nerd-sniped and wrote a simple Python solver for this
         | problem. You can find the ngram files at Norvig's site
         | (https://norvig.com/ngrams/).
         | 
         |     import collections
         |     import math
         |     import heapq
         | 
         |     # Unigram counts.
         |     with open('count_1w.txt', 'r') as f:
         |         unigrams = [l.split() for l in f]
         |     unigram_map = collections.defaultdict(lambda: 0)
         |     for word, count in unigrams:
         |         unigram_map[word] = int(count)
         | 
         |     # Bigram counts.
         |     with open('count_2w.txt', 'r') as f:
         |         bigrams = [l.split() for l in f]
         |     bigram_map = collections.defaultdict(dict)
         |     for word0, word1, count in bigrams:
         |         bigram_map[word0][word1] = int(count)
         | 
         |     # Convert counts to log-probabilities, with smoothing for
         |     # unseen continuations of each word0.
         |     log_p_unseen = collections.defaultdict(lambda: 0.)
         |     for word0, counts in bigram_map.items():
         |         for word1, count in counts.items():
         |             unigram_map[word1] += count
         |         total = sum(counts.values())
         |         # Smoothing for unseen words.
         |         mn, mx = min(counts.values()), max(counts.values())
         |         if mn == mx:
         |             p_unseen = 0.5
         |         else:
         |             # Geometric series approximation.
         |             r = (mn / mx) ** (1. / (len(counts) - 1))
         |             n = mn * r / (1. - r)
         |             p_unseen = n / (n + total)
         |         log_p_unseen[word0] = math.log(p_unseen)
         |         c = (1. - p_unseen) / total
         |         bigram_map[word0] = {word1: math.log(c * count)
         |                              for word1, count in counts.items()}
         | 
         |     c = 1. / sum(unigram_map.values())
         |     unigram_map = {word: math.log(c * count)
         |                    for word, count in unigram_map.items()}
         |     max_len = max(map(len, unigram_map))
         | 
         |     def optimal_parse(text):
         |         # word_spans[i] lists the end positions j such that
         |         # text[i:j] is a known word.
         |         word_spans = {j: [] for j in range(len(text) + 1)}
         |         for i in range(len(text)):
         |             for j in range(i + 1, min(i + max_len, len(text)) + 1):
         |                 if text[i:j] in unigram_map:
         |                     word_spans[i].append(j)
         |         # Dijkstra-style search over (word start, word end) states.
         |         min_cost = collections.defaultdict(lambda: float('inf'))
         |         parent = {}
         |         queue = [(0., 0, 0)]
         |         while queue:
         |             cost, i, j = heapq.heappop(queue)
         |             if cost > min_cost[(i, j)]:
         |                 continue
         |             if j == len(text):
         |                 break
         |             word0 = '<s>' if j == 0 else text[i:j]
         |             for k in word_spans[j]:
         |                 word1 = text[j:k]
         |                 if word1 in bigram_map[word0]:
         |                     word1_cost = -bigram_map[word0][word1]
         |                 else:
         |                     # It would technically be more correct to
         |                     # normalize the unigram probability only over
         |                     # unseen words.
         |                     word1_cost = -(log_p_unseen[word0] +
         |                                    unigram_map[word1])
         |                 cost1 = cost + word1_cost
         |                 if cost1 < min_cost[(j, k)]:
         |                     min_cost[(j, k)] = cost1
         |                     parent[(j, k)] = i
         |                     heapq.heappush(queue, (cost1, j, k))
         |         # Walk back through the parent pointers to recover the words.
         |         words = []
         |         while j != 0:
         |             words.append(text[i:j])
         |             i, j = parent.get((i, j)), i
         |         return words[::-1]
         | 
         |     print(optimal_parse(
         |         'ivehadathoughtandamcurioushowpeoplewouldsolveit'))
        
         | gattilorenz wrote:
         | "AI" and "simple programming" are not mutually exclusive :)
         | 
         | Look at the Speech and Language Processing book, particularly
         | chapter 3, about language models:
         | https://web.stanford.edu/~jurafsky/slp3/
         | 
         | You can implement a language model based on character n-grams
         | to calculate whether a sequence is more likely with or without
         | a space. Of course, you would need a way of estimating the
         | probability of each sequence, which means you need a corpus to
         | train your language model on.
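         | 
         | A minimal sketch of that idea, trained on a tiny made-up
         | corpus (a real model would use a large corpus and proper
         | smoothing):
         | 
         |     import math
         |     from collections import Counter
         | 
         |     corpus = ("hello world hello there world peace "
         |               "say hello to the world ")
         |     unigrams = Counter(corpus)
         |     bigrams = Counter(zip(corpus, corpus[1:]))
         |     V = len(unigrams)
         | 
         |     def avg_logp(text):
         |         # Add-one-smoothed character-bigram log-probability,
         |         # averaged per bigram so that strings of different
         |         # lengths are comparable.
         |         pairs = list(zip(text, text[1:]))
         |         return sum(math.log((bigrams[p] + 1) / (unigrams[p[0]] + V))
         |                    for p in pairs) / len(pairs)
         | 
         |     for candidate in ('helloworld', 'hello world'):
         |         print(candidate, avg_logp(candidate))
         |     # The spaced variant should score higher on this corpus.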
        
         | eutectic wrote:
         | I would try an n-gram model with dynamic programming.
         | 
         | e.g.
         | 
         | logp(_, "") = 0
         | 
         | logp(word0, text) = max(logp_bigram(word0, word1) + logp(word1,
         | rest) for word1, rest in prefix_words(text))
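         | 
         | A minimal memoized sketch of that recurrence (the dictionary
         | and the uniform logp_bigram below are toy placeholders, not a
         | real model):
         | 
         |     from functools import lru_cache
         | 
         |     WORDS = {'how', 'people', 'would', 'solve', 'it'}
         | 
         |     def logp_bigram(w0, w1):
         |         return -1.0  # stand-in for a trained bigram model
         | 
         |     def prefix_words(text):
         |         # All (word, rest) splits whose prefix is a known word.
         |         return [(text[:i], text[i:])
         |                 for i in range(1, len(text) + 1)
         |                 if text[:i] in WORDS]
         | 
         |     @lru_cache(maxsize=None)
         |     def best(word0, text):
         |         # (log-probability, words) of the best segmentation of
         |         # `text`, given that the previous word was `word0`.
         |         if text == '':
         |             return 0.0, ()
         |         options = [(logp_bigram(word0, w1) + best(w1, rest)[0],
         |                     (w1,) + best(w1, rest)[1])
         |                    for w1, rest in prefix_words(text)]
         |         return max(options) if options else (float('-inf'), ())
         | 
         |     print(best('<s>', 'howpeoplewouldsolveit')[1])
         |     # -> ('how', 'people', 'would', 'solve', 'it')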
        
         | wodenokoto wrote:
         | This is a common problem in Japanese NLP, and while the state
         | of the art uses deep learning, almost everyone uses dynamic
         | programming or conditional random fields together with a
         | dictionary to solve it.
         | 
         | There is also research on solving this problem unsupervised,
         | which basically invents new word boundaries for a language
         | (remember that spoken language doesn't have word boundaries -
         | they were invented for writing, and strictly speaking, current
         | spelling isn't the only way to place word boundaries for a
         | given language).
        
         | tester34 wrote:
         | Have you tried using an ~~XML~~ PDF parser instead? /s
         | 
         | and tried to find which sentence, with its spaces removed,
         | matches your sentence?
        
         | abecedarius wrote:
         | Short answer in Python:
         | https://stackoverflow.com/questions/195010/how-can-i-split-m...
         | 
         | This version assumes non-dictionary "words" have probability 0.
         | But it's the same basic idea as a fancier answer, and it's
         | quick.
        
         | epilys wrote:
         | It's called word segmentation. There are libraries for it:
         | 
         | https://github.com/grantjenks/python-wordsegment
         | https://github.com/InstantDomain/instant-segment (rust)
         | 
         | I used the latter to process PDF plain-text output quite
         | successfully.
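         | 
         | A quick usage sketch for python-wordsegment (per its README;
         | install with pip install wordsegment):
         | 
         |     import wordsegment
         | 
         |     # load() reads the bundled unigram/bigram data once.
         |     wordsegment.load()
         |     print(wordsegment.segment('thisisatest'))
         |     # -> ['this', 'is', 'a', 'test']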
        
         | [deleted]
        
       | killingtime74 wrote:
       | Those interested in toy implementations in this area might also
       | enjoy this blog post on FSMs:
       | https://blog.burntsushi.net/transducers/
       | 
       | Also, on the data side, the NLP book:
       | https://web.stanford.edu/~jurafsky/slp3/
        
       | the-smug-one wrote:
       | You could use a Hidden Markov Model; here's how to implement
       | one: https://www.cs.sjsu.edu/~stamp/RUA/HMM.pdf
       | 
       | Here's an implementation of some kind:
       | https://github.com/crisbal/hmm-spellcheck
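       | 
       | A generic Viterbi sketch (not the paper's code): for spell
       | checking, the hidden states would be the intended characters,
       | the observations the typed ones, transitions would come from a
       | character language model and emissions from a typo/keyboard
       | model. All numbers below are made up for illustration.
       | 
       |     def viterbi(obs, states, start_p, trans_p, emit_p):
       |         # best[t][s]: probability of the most likely path that
       |         # ends in state s after the first t+1 observations.
       |         best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
       |         back = [{}]
       |         for t in range(1, len(obs)):
       |             best.append({})
       |             back.append({})
       |             for s in states:
       |                 p, prev = max((best[t - 1][q] * trans_p[q][s], q)
       |                               for q in states)
       |                 best[t][s] = p * emit_p[s][obs[t]]
       |                 back[t][s] = prev
       |         # Trace the best path backwards.
       |         last = max(best[-1], key=best[-1].get)
       |         path = [last]
       |         for t in range(len(obs) - 1, 0, -1):
       |             path.append(back[t][path[-1]])
       |         return path[::-1]
       | 
       |     # Toy example: "qs" was typed; was "as" intended?
       |     states = ['a', 's', 'z']
       |     start_p = {'a': 0.6, 's': 0.3, 'z': 0.1}
       |     trans_p = {'a': {'a': 0.1, 's': 0.7, 'z': 0.2},
       |                's': {'a': 0.5, 's': 0.3, 'z': 0.2},
       |                'z': {'a': 0.5, 's': 0.4, 'z': 0.1}}
       |     emit_p = {'a': {'q': 0.3, 'a': 0.6, 's': 0.1},  # q is next to a
       |               's': {'q': 0.1, 'a': 0.2, 's': 0.7},
       |               'z': {'q': 0.1, 'a': 0.2, 's': 0.7}}
       |     print(viterbi('qs', states, start_p, trans_p, emit_p))
       |     # -> ['a', 's']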
        
       | eigenhombre wrote:
       | One of the more interesting parts of the post, for me, is the
       | list of implementations in other languages, including: one for
       | Clojure written by that language's author, Rich Hickey; an
       | interesting one in R that clocks in at 2 lines (with a longer,
       | more readable version further down in the linked post); and one
       | written in functional Java. The first one in Awk is also
       | interesting.
        
       | blondin wrote:
       | Just wait for old HN to show up and tell new HN that Norvig's
       | famous spelling corrector is inefficient and doesn't teach
       | people how to do it right.
        
       | mtreis86 wrote:
       | One of my most common spelling mistakes is a physical mistype on
       | the keyboard, yet no spell checker seems to account for keyboard
       | layout and the locality of keys, or for something like my hand
       | being one position off on the board, so that every key is
       | correct relative to the others, just positionally shifted.
        
         | brian_cloutier wrote:
         | Spell checkers which account for keyboard layout absolutely
         | exist. [1] is one example, but in general it's hard to believe
         | that any trained model would fail to notice that certain
         | mistakes are more likely than others, and it's hard to believe
         | that any spell checker Google releases these days would not be
         | based on a trained model.
         | 
         | [1] https://ai.googleblog.com/2017/05/the-machine-
         | intelligence-b...
        
         | tyingq wrote:
         | I would guess that the iPhone autocorrect does this implicitly
         | since it's ML trained.
        
         | z3t4 wrote:
         | The problem is that if you take an English word and change one
         | letter, it will likely become another valid word.
        
         | wootest wrote:
         | Many designs do this implicitly. At least one, the original
         | iPhone keyboard autocomplete/suggestion algorithm, did it
         | explicitly.
         | 
         | Ken Kocienda's book Creative Selection has a very good chapter
         | on the algorithm being built piece by piece; checking for
         | words formed by the surrounding keys was part of collecting
         | all the candidates.
         | 
         | They even used this in marketing. One of the pre-original-
         | iPhone-launch videos was focused just on the on-screen keyboard
         | (probably because almost everyone thought it was a really kooky
         | idea at the time), and used the example of explicitly pressing
         | "O-U-Z-Z-A" but still getting "pizza" as the autocomplete
         | because it was the closest recommendation.
         | 
         | One of the iOS versions a few years ago became incredibly fond
         | of including the space bar: for key presses slightly off near
         | the space bar, it would consider alternatives split into two
         | or more words. When you're using a language with a lot of
         | compound words, like Swedish, this yielded some almost
         | postmodern corrections, with one or more words often
         | completely different (but close on the keyboard, of course). I
         | don't know if this was a tweak to the manual algorithm going
         | off the rails or an AI version that wasn't quite tuned well
         | yet.
        
         | m463 wrote:
         | I think the problem is that there are multiple layers of
         | correction - both the physical and the spelling/semantic.
         | 
         | On iOS I just had to turn off autocorrect, because reminders I
         | jotted down quickly missed the physical correction and were
         | irretrievably corrected semantically into garbage.
         | 
         | Now I find a note with "spekk" and I can figure out I meant
         | "spell" (need a better example).
        
       | Joker_vD wrote:
       | And after you've read that, here's a related blogpost: "A
       | Spellchecker Used to Be a Major Feat of Software Engineering"
       | [0], because Python being "fast enough" and having enough memory
       | for large dictionaries hasn't always been the case.
       | 
       | [0] https://prog21.dadgum.com/29.html
        
         | tgv wrote:
         | I'll repeat my feat in this area: a spelling corrector that had
         | 32 kB of memory (because it was tied to a keyboard trap, or
         | something like that) and could do four 8 kB reads from CD-ROM
         | before answering the user. It used not only orthographic
         | similarity (the actual letters) but also phonetic similarity.
         | You rarely see that in older English spell checkers, because
         | English is such an irregular language, but for other languages
         | it's quite informative, although in extremely regular
         | languages, such as Spanish, you could also put it in the
         | weighting/probability model (e.g. v <-> b is a pretty common
         | confusion).
         | 
         | The CD-ROM spelling corrector was not really great, BTW, but at
         | least it replied within 1 s on a typical end-user PC.
         | 
         | Edit: this was in the late 1980s.
        
       | da39a3ee wrote:
       | From a pedagogical point of view, presenting that as a Bayesian
       | model and then using the error model he does is a bit
       | questionable. But as always, his Python style is inspiring.
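       | 
       | For reference, the article's Bayesian (noisy-channel) framing is
       | to pick the candidate c maximizing P(c) * P(w|c) for the typed
       | word w; the toy numbers below are made up just to illustrate the
       | decision rule, not his error model.
       | 
       |     # Toy language model P(c) and error model P(w|c).
       |     P_c = {'the': 0.07, 'thaw': 0.0001}
       |     P_w_given_c = {('teh', 'the'): 0.01, ('teh', 'thaw'): 0.0001}
       | 
       |     def correct(w, candidates):
       |         return max(candidates,
       |                    key=lambda c: P_c.get(c, 0) *
       |                                  P_w_given_c.get((w, c), 0))
       | 
       |     print(correct('teh', ['the', 'thaw']))  # -> 'the'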
        
       | graycat wrote:
       | My favorite, long-standard spell checker is Aspell, long a part
       | of TeX distributions.
        
       ___________________________________________________________________
       (page generated 2021-09-17 23:02 UTC)