[HN Gopher] Rebuilding the spellchecker
___________________________________________________________________
Rebuilding the spellchecker
Author : zverok
Score : 134 points
Date : 2021-01-15 10:19 UTC (12 hours ago)
(HTM) web link (zverok.github.io)
(TXT) w3m dump (zverok.github.io)
| bjarneh wrote:
| Never thought about the problem of compound words like that
| before. Every time I read about this problem I'm impressed by
| the solutions; but I'm also reminded of a comment that I think I
| saw on Reddit.
|
| > Learn how to spell, or use a spell-checker!
|
| Why would I use a spell-checker, am I Gandalf?
| taneq wrote:
| Really, what we want isn't a spell-checker at all, but an
| intention-checker. "Does what you wrote locally seem to be
| consistent with the intention of your overall document?"
|
| But of course, something that actually worked like that would
| be indistinguishable from magic.
| bjarneh wrote:
| That certainly seems like a complex problem to solve :-)
| Anthony-G wrote:
| I'm so used to autocompletion in Bash that sometimes I hit
| the Tab key after typing the first few characters of the
| argument to the `mkdir` command. The computer responds by
| beeping at me to remind me that it can't read my mind.
| rijoja wrote:
| Maybe a bit off-topic, but are there good spellcheckers written
| in JavaScript?
| zverok wrote:
| As far as I know (I've mentioned it in the first part[0]),
| nspell[1] is the closest to a "port of (some of) Hunspell", and
| typo.js[2] ports even less (but might be enough for some; we
| used it at my previous company: it uses dictionaries for
| lookup, but has its own simplistic suggest, which I needed to
| tweak a lot).
|
| The SymSpell algorithm (which is quite different, I'll go into it
| in the next part to some extent) is much easier to port, so
| there is a JS SymSpell port[3] (which seems abandoned though).
|
| 0: https://zverok.github.io/blog/2021-01-05-spellchecker-1.html
|
| 1: https://github.com/wooorm/nspell
|
| 2: https://github.com/cfinke/Typo.js/
|
| 3: https://github.com/IceCreamYou/SymSpell
| rijoja wrote:
| Thanks for the information!
| nicoburns wrote:
| https://github.com/wolfgarbe/SymSpell lists 5 JS
| implementations (+ a Rust one that compiles to web assembly)
| zverok wrote:
| Ah, indeed :) I just googled the first one for
| "symspell+js"
| bababooeyxo wrote:
| Really great series of articles about spellcheckers. I wish there
| was a similar project written in Ruby. There is
| https://github.com/omohokcoj/ruby-spellchecker but it serves a
| bit different purpose - to do safe autocorrections.
|
| Are you considering https://github.com/wolfgarbe/SymSpell algo to
| do suggestions? If I recall native hunspell suggestions are quite
| slow - the same algo must be even slower on python.
| zverok wrote:
| > I wish there was a similar project written in Ruby.
|
| Hehe... Actually, Ruby is my primary language, but I have
| chosen Python for this project for a complicated set of reasons
| I tried to explain[0] in the first article.
|
| > If I recall native hunspell suggestions are quite slow - the
| same algo must be even slower on python.
|
| Pretty slow, yes. But the current project's goal is to
| "uncover" how Hunspell works, so I implement it the Hunspell
| way. The next (several) parts of the series will explain a lot
| about suggest, including "why is it hard", and "why SymSpell
| might not be enough" ;)
|
| 0:
| https://zverok.github.io/blog/2021-01-05-spellchecker-1.html...
| saadalem wrote:
| Grammarly matches patterns, the differences among Grammarly and
| other grammar checkers are essentially in the sizes of their
| lists and their cosmetics: how many writing problems can they
| find and correct, and (far less important) how friendly is their
| interface to the writer?
|
| I think Grammarly isn't even top 5 in checking grammar tools or
| top 10 in helping improve style.
|
| I think the problem for grammar checkers is that helping
| nonnative English speakers is a vastly larger problem than
| helping native English speakers, and no grammar checker I am
| aware of does much to help ESL speakers who are not already
| fluent in English.
|
| Also Rahul (founder of Superhuman) said there is not currently an
| automatic spell/grammar check library that developers could use
| to integrate in apps and software they create. "When you type I
| would love to be able to autocorrect errors in your typing in the
| same way the MacOS does natively", so there is maybe an
| opportunity out there! Don't forget me if you make it through x)
| and I'm spazzed to check your upcoming writings!
| Ygg2 wrote:
| > I think Grammarly isn't even top 5 in checking grammar tools
| or top 10 in helping improve style.
|
| What are the top 5 tools for checking grammar/improving style?
| edoceo wrote:
| In my top 5 is one called Hemingway - I put my marketing
| materials in there and work them down to an 8th grade level
| :/
| andy_ppp wrote:
| Isn't this a problem deep learning could solve without huge
| amounts of difficulty in implementation? Or am I just imagining
| that getting a list of misspellings of words and phonetics for
| them is not intractable?
| zverok wrote:
| That's a huge topic, which I am planning to cover towards the
| end of the article series <s>please like and subscribe</s>, but
| in short: yes, my opinion is that spellchecking is actually a
| "machine learning problem in disguise", and most existing
| dictionaries are more a roundabout way of storing
| something-not-unlike-models than analytical data.
|
| But the ML approach raises the question of data availability.
| What good will your "deep learning OSS spellchecker" do if
| there aren't good (and open) models for it which cover as many
| languages as existing Hunspell dictionaries do? And what if
| adding a bunch of new words requires laborious model retraining?
| It is not unsolvable, but non-trivial.
|
| I believe all the giants have something like this inside (I
| don't think spelling correction in Google search bar is handled
| with Hunspell, right?), but it is much harder to do as an open
| tool, ready for embedding into other software.
|
| There are notable attempts, though: JamSpell for one
| (https://github.com/bakwc/JamSpell), which has open "free"
| models, and more precise commercial ones; source code is open
| (maybe also only for using "simplistic" models, haven't dug
| deeper).
| ilovefood wrote:
| Lovely work, I was looking for something like this the other day
| and I'd like to thank you for sharing it! Especially since it's
| in Python I can understand it without too much hassle. What a
| good job on the documentation as well!!
| rijoja wrote:
| Since I'm not a native English speaker I find the Grammarly
| extension quite good. This creates a problem in professional
| settings though, since basically all the text that you write
| gets shared with a 3rd party.
|
| Which open source projects, if any, implement the corresponding
| functionality? Also, does anybody have experience with them?
| The_Colonel wrote:
| https://github.com/languagetool-org/languagetool works very
| well (although I use the online version).
| zverok wrote:
| I believe that LanguageTool[0] is the closest open-source
| counterpart to Grammarly. Though, in my experience, it is not
| half as useful... But multilingual and open-source.
|
| I have a distant dream of doing to it what I did to Hunspell
| (write code and a series of articles explaining how it works and
| why it is so hard), but we'll see.
|
| As far as I know, LanguageTool is based just on a huge set of
| rules (you can see them in the repo[1]); and Grammarly is a mix
| of rule-based and machine-learning suggestions (I heard a rumor
| that it is 99% rule-based, and talks about ML are mostly
| marketing, but I don't know how reliable this rumor was).
|
| 0: https://languagetool.org
|
| 1: https://github.com/languagetool-org/languagetool/tree/master...
| lqet wrote:
| Indeed, I have an installation of LanguageTool on my private
| server to avoid the privacy issues mentioned above. I have
| plugins for Thunderbird, Chromium and vim running. The
| browser plugin is by far the best.
| atum47 wrote:
| Great tool, it was my first contact with spellcheckers. Back then
| I was working for a company that does translations powered by
| machine learning. I was a student and, as the article mentions,
| I was one of the naive ones who thought that a spellchecker is
| an easy thing to build.
|
| https://github.com/victorqribeiro/goSpellcheck
|
| I wrote this originally in Python, then I ported it to Go. Back
| then I had plans to improve it. I believed that most errors
| would be due to mispressed keys. I was sketching an algorithm
| to find similar words given a dictionary. Soon I had to deal with
| other projects (from college) and I left spellchecking to the
| smart people.
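
The idea in the comment above (typos as slips onto a neighbouring key, resolved against a dictionary) might be sketched roughly as follows. This is not the code from goSpellcheck: the QWERTY layout, the neighbour approximation, the tiny word list, and the names `build_neighbors` and `suggest` are all assumptions made for illustration.

```python
# Hypothetical sketch: suggest dictionary words reachable by replacing
# one character with a neighbouring key on a QWERTY keyboard.

QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_neighbors(rows):
    """Map each key to the keys at the same or an adjacent column index
    in its own row and in the rows directly above and below. This only
    roughly approximates the staggered physical layout."""
    neighbors = {}
    for r, row in enumerate(rows):
        for c, key in enumerate(row):
            near = set()
            for dr in (-1, 0, 1):
                rr = r + dr
                if not 0 <= rr < len(rows):
                    continue
                for dc in (-1, 0, 1):
                    cc = c + dc
                    if 0 <= cc < len(rows[rr]):
                        near.add(rows[rr][cc])
            near.discard(key)  # a key is not its own neighbour
            neighbors[key] = near
    return neighbors


NEIGHBORS = build_neighbors(QWERTY_ROWS)


def suggest(word, dictionary):
    """Return dictionary words that differ from `word` by exactly one
    adjacent-key substitution, sorted alphabetically."""
    found = set()
    for i, ch in enumerate(word):
        for alt in NEIGHBORS.get(ch, ()):
            candidate = word[:i] + alt + word[i + 1:]
            if candidate in dictionary:
                found.add(candidate)
    return sorted(found)


# Assumed toy dictionary, for illustration only.
WORDS = {"spell", "speak", "check", "checker", "word"}
print(suggest("soell", WORDS))
print(suggest("wprd", WORDS))
```

A sketch like this only covers single substitutions; insertions, deletions, transpositions and ranking of several candidates would need more work, which is part of why the problem turns out harder than it first looks.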
| steve_g wrote:
| This is great. I read all three articles.
|
| It's amazing how difficult it is to encode rules for dealing with
| natural language, considering how easy it is for a person to
| resolve ambiguities, misspellings, and the like. Of course, we
| forget how much knowledge we have encoded in our own brains.
|
| I'm trying to do named entity recognition for chemicals and
| materials in quasi-natural language texts that we've developed
| over many years at work. It's brutal.
| zverok wrote:
| Thanks!
___________________________________________________________________
(page generated 2021-01-15 23:02 UTC)