[HN Gopher] How to Write a Spelling Corrector (2007)
___________________________________________________________________
How to Write a Spelling Corrector (2007)
Author : fybs
Score : 64 points
Date : 2021-09-16 13:25 UTC (1 day ago)
(HTM) web link (norvig.com)
(TXT) w3m dump (norvig.com)
| pandatigox wrote:
| I've had a thought and am curious how people would solve it.
| Sometimes, if you copy words off a PDF lecture slide, all the
| words are mashed together (e.g. "Hello Foo bar" becomes
| "HelloFoobar"). Is this an AI domain, or can it be solved by
| simple programming?
| ghusbands wrote:
| It's worth noting that different PDF-to-text tools (and
| different PDF-display tools that support text copying) give
| different results, so it can be worth trying a few. For the
| most part, the vector content of a PDF, together with its
| association with the text, is sufficient to recover the
| spacing information.
| eutectic wrote:
| I got nerd-sniped and wrote a simple Python solver for this
| problem. You can find the ngram files at Norvig's site
| (https://norvig.com/ngrams/).
|
|     import collections
|     import heapq
|     import math
|
|     # Word unigram counts (Norvig's count_1w.txt).
|     with open('count_1w.txt', 'r') as f:
|         unigrams = [l.split() for l in f]
|     unigram_map = collections.defaultdict(lambda: 0)
|     for word, count in unigrams:
|         unigram_map[word] = int(count)
|
|     # Word bigram counts (Norvig's count_2w.txt).
|     with open('count_2w.txt', 'r') as f:
|         bigrams = [l.split() for l in f]
|     bigram_map = collections.defaultdict(
|         lambda: collections.defaultdict(lambda: {}))
|     for word0, word1, count in bigrams:
|         bigram_map[word0][word1] = int(count)
|
|     # Convert counts to log-probabilities, reserving some mass
|     # for bigrams that never occurred in the counts file.
|     log_p_unseen = collections.defaultdict(lambda: 0.)
|     for word0, counts in bigram_map.items():
|         for word1, count in counts.items():
|             unigram_map[word1] += count
|         total = sum(counts.values())
|         # smoothing for unseen words
|         mn, mx = min(counts.values()), max(counts.values())
|         if mn == mx:
|             p_unseen = 0.5
|         else:
|             # geometric series approximation
|             r = (mn / mx) ** (1. / (len(counts) - 1))
|             n = mn * r / (1. - r)
|             p_unseen = n / (n + total)
|         log_p_unseen[word0] = math.log(p_unseen)
|         c = (1. - p_unseen) / total
|         bigram_map[word0] = {word1: math.log(c * count)
|                              for word1, count in counts.items()}
|
|     c = 1. / sum(unigram_map.values())
|     unigram_map = {word: math.log(c * count)
|                    for word, count in unigram_map.items()}
|     max_len = max(map(len, unigram_map))
|
|     def optimal_parse(text):
|         # word_spans[i] lists every j such that text[i:j] is a
|         # known word.
|         word_spans = {j: [] for j in range(len(text) + 1)}
|         for i in range(len(text)):
|             for j in range(i + 1, min(i + max_len, len(text)) + 1):
|                 if text[i:j] in unigram_map:
|                     word_spans[i].append(j)
|
|         # Best-first (Dijkstra-style) search over
|         # (word_start, word_end) states; cost is negative
|         # log-probability.
|         min_cost = collections.defaultdict(lambda: float('inf'))
|         parent = {}
|         queue = [(0., 0, 0)]
|         while queue:
|             cost, i, j = heapq.heappop(queue)
|             if cost > min_cost[(i, j)]:
|                 continue
|             if j == len(text):
|                 break
|             word0 = '<s>' if j == 0 else text[i:j]
|             for k in word_spans[j]:
|                 word1 = text[j:k]
|                 if word1 in bigram_map[word0]:
|                     word1_cost = -bigram_map[word0][word1]
|                 else:
|                     # It would technically be more correct to
|                     # normalize the unigram probability only over
|                     # unseen words.
|                     word1_cost = -(log_p_unseen[word0] + unigram_map[word1])
|                 cost1 = cost + word1_cost
|                 if cost1 < min_cost[(j, k)]:
|                     min_cost[(j, k)] = cost1
|                     parent[(j, k)] = i
|                     heapq.heappush(queue, (cost1, j, k))
|
|         # Recover the segmentation by walking the parent pointers.
|         words = []
|         while j != 0:
|             words.append(text[i:j])
|             i, j = parent.get((i, j)), i
|         return words[::-1]
|
|     print(optimal_parse(
|         'ivehadathoughtandamcurioushowpeoplewouldsolveit'))
| gattilorenz wrote:
| "AI" and "simple programming" are not mutually exclusive :)
|
| Look at the Speech and Language Processing book, particularly
| chapter 3 on language models:
| https://web.stanford.edu/~jurafsky/slp3/
|
| You can implement a language model based on character n-grams
| to calculate whether a sequence is more likely with or without
| a space. Of course, you would need a way of estimating the
| probability of each sequence, which means you need a corpus to
| train your language model on.
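|
| A minimal sketch of that idea, with a tiny made-up corpus and
| add-one smoothing (a real system would train on far more text;
| the names char_bigram_model and space_likely are just for this
| illustration):
|
|     import math
|     from collections import Counter
|
|     def char_bigram_model(corpus):
|         """Character-bigram log-probs with add-one smoothing."""
|         pairs = Counter(zip(corpus, corpus[1:]))
|         firsts = Counter(corpus[:-1])
|         vocab = len(set(corpus))
|         def logp(a, b):
|             return math.log((pairs[(a, b)] + 1) / (firsts[a] + vocab))
|         return logp
|
|     def space_likely(left, right, logp):
|         """Is 'left right' locally more probable than 'leftright'?"""
|         a, b = left[-1], right[0]
|         return logp(a, ' ') + logp(' ', b) > logp(a, b)
|
|     logp = char_bigram_model("hello world foo bar " * 25)
|     print(space_likely("hello", "foo", logp),  # True: space fits
|           space_likely("fo", "o", logp))       # False: keep "foo"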
| eutectic wrote:
| I would try an n-gram model with dynamic programming.
|
| e.g.
|
| logp(_, "") = 0
|
| logp(word0, text) = max(logp_bigram(word0, word1) + logp(word1,
| rest) for word1, rest in prefix_words(text))
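|
| A direct memoized translation of that recursion, with toy
| stand-ins for the dictionary and the bigram score (a real
| version would plug in Norvig's count_1w.txt / count_2w.txt, as
| in the solver above):
|
|     from functools import lru_cache
|     import math
|
|     # Toy stand-ins: replace with a real dictionary and trained
|     # bigram log-probabilities.
|     DICTIONARY = {"i", "ve", "had", "a", "thought", "and", "am",
|                   "curious", "how", "people", "would", "solve", "it"}
|
|     def logp_bigram(word0, word1):
|         return -math.log(len(DICTIONARY))  # uniform placeholder
|
|     def prefix_words(text):
|         """Yield (word, rest) for every dictionary word prefixing text."""
|         for n in range(1, len(text) + 1):
|             if text[:n] in DICTIONARY:
|                 yield text[:n], text[n:]
|
|     @lru_cache(maxsize=None)
|     def logp(word0, text):
|         """Best log-probability of any segmentation of text, given
|         the previous word. Recording the argmax word at each step
|         would recover the actual parse."""
|         if not text:
|             return 0.0
|         return max((logp_bigram(word0, w) + logp(w, rest)
|                     for w, rest in prefix_words(text)),
|                    default=float('-inf'))
|
|     print(logp('<s>', 'ivehadathought'))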
| wodenokoto wrote:
| This is a common problem in Japanese NLP, and while the state
| of the art uses deep learning, almost everyone uses dynamic
| programming or conditional random fields together with a
| dictionary to solve it.
|
| There is also research on solving this problem in an
| unsupervised way, which basically invents new word boundaries
| for a language (remember that spoken language doesn't have word
| boundaries; they were invented for writing, and strictly
| speaking the current spelling convention isn't the only way to
| place word boundaries for a given language).
| tester34 wrote:
| Have you tried using an ~~XML~~ PDF parser instead? /s
|
| and then trying to find which sentence, with its spaces
| removed, matches your text?
| abecedarius wrote:
| Short answer in Python:
| https://stackoverflow.com/questions/195010/how-can-i-split-m...
|
| This version assumes non-dictionary "words" have probability 0.
| But it's the same basic idea as a fancier answer, and it's
| quick.
| epilys wrote:
| It's called word segmentation. There are libraries for it:
|
| https://github.com/grantjenks/python-wordsegment
| https://github.com/InstantDomain/instant-segment (rust)
|
| I used the latter to process PDF plain-text output quite
| successfully.
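|
| For reference, the python-wordsegment API is roughly this (from
| memory, so check the project README); if I remember right it
| ships with unigram/bigram counts derived from the same Google
| n-gram data Norvig describes:
|
|     # pip install wordsegment
|     from wordsegment import load, segment
|
|     load()   # loads the bundled unigram/bigram counts
|     print(segment('ivehadathoughtandamcurioushowpeoplewouldsolveit'))
|     # prints the segmentation as a list of words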
| [deleted]
| killingtime74 wrote:
| Those interested in toy implementations in this area might also
| enjoy this blog post on FSMs:
| https://blog.burntsushi.net/transducers/
|
| Also, on the data side, the NLP book:
| https://web.stanford.edu/~jurafsky/slp3/
| the-smug-one wrote:
| You could use a Hidden Markov Model; here's how to implement
| one: https://www.cs.sjsu.edu/~stamp/RUA/HMM.pdf
|
| Here's an implementation of some kind:
| https://github.com/crisbal/hmm-spellcheck
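|
| A minimal sketch of the idea, assuming hidden states are the
| intended letters and observations are the typed letters. The
| tiny word list, the smoothing constant K and the crude
| "intended key is typed 90% of the time" emission model are all
| made up for illustration (a keyboard-aware emission model would
| do better):
|
|     import math
|     from collections import Counter
|
|     def viterbi(obs, states, log_start, log_trans, log_emit):
|         """Most likely hidden state sequence for obs."""
|         # trellis[t][s] = (best log-prob ending in state s, path)
|         trellis = [{s: (log_start(s) + log_emit(s, obs[0]), [s])
|                     for s in states}]
|         for o in obs[1:]:
|             prev = trellis[-1]
|             trellis.append({
|                 s: max((prev[r][0] + log_trans(r, s) + log_emit(s, o),
|                         prev[r][1] + [s]) for r in states)
|                 for s in states})
|         return max(trellis[-1].values())[1]
|
|     # Toy model: letter transitions counted from a tiny word list.
|     WORDS = ["the", "then", "they", "there", "hello", "when"]
|     ALPHABET = sorted(set("".join(WORDS)))
|     starts = Counter(w[0] for w in WORDS)
|     bigrams = Counter(p for w in WORDS for p in zip(w, w[1:]))
|     row_totals = Counter()
|     for (a, _), c in bigrams.items():
|         row_totals[a] += c
|     K = 0.01  # light smoothing so unseen transitions stay possible
|
|     def log_start(s):
|         return math.log((starts[s] + K) /
|                         (sum(starts.values()) + K * len(ALPHABET)))
|
|     def log_trans(r, s):
|         return math.log((bigrams[(r, s)] + K) /
|                         (row_totals[r] + K * len(ALPHABET)))
|
|     def log_emit(s, o):
|         return math.log(0.9 if s == o else 0.1 / (len(ALPHABET) - 1))
|
|     print("".join(viterbi("thw", ALPHABET, log_start, log_trans,
|                           log_emit)))  # -> 'the'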
| eigenhombre wrote:
| One of the more interesting parts of the post, for me, is the
| list of implementations in other languages, including: one for
| Clojure written by that language's author, Rich Hickey; an
| interesting one in R that clocks in at 2 lines (with a longer,
| more readable version further down in the linked post); and one
| written in functional Java. The first one in Awk is also
| interesting.
| blondin wrote:
| Just wait for old HN to show up and tell new HN that Norvig's
| famous spelling corrector is not efficient and doesn't teach
| people how to do it right.
| mtreis86 wrote:
| One of my most common spelling mistakes is a physical mistype
| on the keyboard, yet no spell checker seems to account for
| keyboard layout and the locality of keys, or for something like
| my whole hand being one position off on the keyboard but typing
| all the keys correctly relative to each other, just
| positionally shifted.
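|
| (A minimal sketch of keyboard-aware matching: a weighted edit
| distance where substituting a physically adjacent key is
| cheaper than an arbitrary substitution. The QWERTY adjacency
| map below is partial and only illustrative.)
|
|     # Partial QWERTY adjacency map; extend for real use.
|     NEIGHBORS = {
|         'o': set('ipkl09'), 'p': set('ol0-'), 'e': set('wrds34'),
|         'l': set('kop;.'), 's': set('awedxz'), 'k': set('jiolm,'),
|     }
|
|     def keyboard_distance(typed, target):
|         """Levenshtein distance with cheap adjacent-key substitutions."""
|         m, n = len(typed), len(target)
|         d = [[0.0] * (n + 1) for _ in range(m + 1)]
|         for i in range(m + 1):
|             d[i][0] = float(i)
|         for j in range(n + 1):
|             d[0][j] = float(j)
|         for i in range(1, m + 1):
|             for j in range(1, n + 1):
|                 if typed[i - 1] == target[j - 1]:
|                     sub = 0.0
|                 elif target[j - 1] in NEIGHBORS.get(typed[i - 1], set()):
|                     sub = 0.5  # adjacent-key slip: cheaper substitution
|                 else:
|                     sub = 1.0
|                 d[i][j] = min(d[i - 1][j] + 1.0,      # deletion
|                               d[i][j - 1] + 1.0,      # insertion
|                               d[i - 1][j - 1] + sub)  # substitution
|         return d[m][n]
|
|     # 'soell' is more likely a slip for 'spell' (o is next to p)
|     # than for 'shell'.
|     print(keyboard_distance('soell', 'spell'),
|           keyboard_distance('soell', 'shell'))  # -> 0.5 1.0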
| brian_cloutier wrote:
| Spell checkers which account for keyboard layout absolutely
| exist. [1] is one example, but in general it's hard to believe
| that any trained model would fail to notice that certain
| mistakes are more likely than others, and it's hard to believe
| that any spell checker Google releases these days would not be
| based on a trained model.
|
| [1] https://ai.googleblog.com/2017/05/the-machine-
| intelligence-b...
| tyingq wrote:
| I would guess that the iPhone autocorrect does this implicitly
| since it's ML trained.
| z3t4 wrote:
| The problem is that if you take an English word and change one
| letter, it will likely become another valid word.
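|
| A quick toy illustration of that, with a made-up
| mini-dictionary:
|
|     import string
|
|     WORDS = {"cat", "bat", "hat", "car", "cot", "can", "cap", "cut"}
|
|     def one_letter_variants(word):
|         """All single-letter substitutions of word (excluding word)."""
|         return {word[:i] + c + word[i + 1:]
|                 for i in range(len(word))
|                 for c in string.ascii_lowercase if c != word[i]}
|
|     print(sorted(one_letter_variants("cat") & WORDS))
|     # -> ['bat', 'can', 'cap', 'car', 'cot', 'cut', 'hat']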
| wootest wrote:
| Many designs do this implicitly. At least one, the original
| iPhone keyboard autocomplete/suggestion algorithm, did it
| explicitly.
|
| Ken Kocienda's book Creative Selection has a very good chapter
| on the algorithm being built piece by piece; finding the words
| that could have been produced by hitting surrounding keys was
| part of collecting all the candidates.
|
| They even used this in marketing. One of the pre-original-
| iPhone-launch videos was focused just on the on-screen keyboard
| (probably because almost everyone thought it was a really kooky
| idea at the time), and used the example of explicitly pressing
| "O-U-Z-Z-A" but still getting "pizza" as the autocomplete
| because it was the closest recommendation.
|
| One of the iOS versions a few years ago became incredibly fond
| of including the space bar in this: for key presses slightly
| off near the space bar, it would consider alternatives split
| into two or more words. When you're using a language with a lot
| of compound words, like Swedish, this yielded some almost
| postmodern corrections, with one or more words often completely
| different
| (but close on the keyboard, of course). I don't know if this
| was a tweak to the manual algorithm going off the rails or an
| AI version that wasn't quite tuned well yet.
| m463 wrote:
| I think the problem is that there are multiple layers of
| correction - both the physical and the spelling/semantic.
|
| On iOS I just had to turn off autocorrect, because I had
| reminders I jotted down quickly which missed the physical
| correction and were irretrievably corrected semantically into
| garbage.
|
| Now I find a note with "spekk" and I can figure out I meant
| "spell" (need a better example)
| Joker_vD wrote:
| And after you've read that, here's a related blogpost: "A
| Spellchecker Used to Be a Major Feat of Software Engineering"
| [0], because Python being "fast enough" and having enough memory
| for large dictionaries hasn't always been the case.
|
| [0] https://prog21.dadgum.com/29.html
| tgv wrote:
| I'll repeat my feat in this area: a spelling corrector that had
| 32kB of memory (because it was tied to a keyboard trap, or
| something like that) and could do four 8kB reads from CD-ROM
| before answering the user. It used not only orthographic
| similarity (the actual letters) but also phonetic similarity.
| You rarely see that in older English spell checkers, because
| it's such an irregular language, but for other languages it's
| quite informative, although in extremely regular languages,
| such as Spanish, you could also put it in the
| weighting/probability model (e.g. v<->b is a pretty common
| confusion).
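|
| (A rough illustration of phonetic similarity: a simplified
| Soundex-style code maps similar-sounding words to the same key.
| This is only a sketch, not the full standard algorithm.)
|
|     # Map consonant groups to digits, Soundex-style.
|     CODES = {c: str(d)
|              for d, letters in enumerate(
|                  ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1)
|              for c in letters}
|
|     def phonetic_code(word, length=4):
|         """Simplified Soundex-like code: first letter + consonant
|         digits, with runs of the same digit collapsed."""
|         word = word.lower()
|         digits = [CODES[c] for c in word[1:] if c in CODES]
|         collapsed = [d for i, d in enumerate(digits)
|                      if i == 0 or d != digits[i - 1]]
|         return (word[0].upper() + "".join(collapsed) + "000")[:length]
|
|     print(phonetic_code("spelling"), phonetic_code("speling"))  # both S145
|     print(phonetic_code("Robert"), phonetic_code("Rupert"))     # both R163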
|
| The CD-ROM spelling corrector was not really great, BTW, but at
| least it replied in 1s on a typical end-user PC.
|
| Edit: this was late 1980s.
| da39a3ee wrote:
| From a pedagogical point of view, presenting that as a Bayesian
| model and then using the error model he does is a bit
| questionable. But as always, his Python style is inspiring.
| graycat wrote:
| My favorite, long-standard spell checker is Aspell, which has
| long been part of TeX distributions.
___________________________________________________________________
(page generated 2021-09-17 23:02 UTC)