[HN Gopher] Rebuilding the spellchecker
       ___________________________________________________________________
        
       Rebuilding the spellchecker
        
       Author : zverok
       Score  : 134 points
       Date   : 2021-01-15 10:19 UTC (12 hours ago)
        
 (HTM) web link (zverok.github.io)
 (TXT) w3m dump (zverok.github.io)
        
       | bjarneh wrote:
       | Never thought about the problem of compound words like that
       | before. Every time I read about this problem I'm impressed by
       | the solutions; but I'm also reminded of a comment that I think I
       | saw on Reddit.
       | 
       | > Learn how to spell, or use a spell-checker!
       | 
       | Why would I use a spell-checker, am I Gandalf?
        
         | taneq wrote:
         | Really, what we want isn't a spell-checker at all, but an
         | intention-checker. "Does what you wrote locally seem to be
         | consistent with the intention of your overall document?"
         | 
         | But of course, something that actually worked like that would
         | be indistinguishable from magic.
        
           | bjarneh wrote:
           | That certainly seems like a complex problem to solve :-)
        
           | Anthony-G wrote:
           | I'm so used to autocompletion in Bash that sometimes I hit
           | the Tab key after typing the first few characters of the
           | argument to the `mkdir` command. The computer responds by
           | beeping at me to remind me that it can't read my mind.
        
       | rijoja wrote:
       | Maybe a bit off-topic, but are there good spellcheckers written
       | in JavaScript?
        
         | zverok wrote:
         | As far as I know (I mentioned it in the first part[0]),
         | nspell[1] is the closest to a "port of (some of) Hunspell", and
         | typo.js[2] ports even less (but might be enough for some; we
         | used it at my previous company: it uses dictionaries for
         | lookup, but has its own simplistic suggest, which I needed to
         | tweak a lot).
         | 
         | The SymSpell algorithm (which is quite different; I'll go into
         | it in the next part to some extent) is much easier to port, so
         | there is a JS SymSpell port[3] (which seems abandoned, though).
         | 
         | 0: https://zverok.github.io/blog/2021-01-05-spellchecker-1.html
         | 
         | 1: https://github.com/wooorm/nspell
         | 
         | 2: https://github.com/cfinke/Typo.js/
         | 
         | 3: https://github.com/IceCreamYou/SymSpell
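
For readers unfamiliar with why SymSpell is considered easy to port: its core idea is a "symmetric delete" lookup, where only character deletions of dictionary words are precomputed and the same deletions are applied to the input at query time. The following is a minimal sketch of that candidate-generation step only, not code from any of the linked ports. The function names are hypothetical, and a real implementation would still verify each candidate with a true edit distance and rank the results by word frequency.

```python
from collections import defaultdict
from itertools import combinations


def deletes(word, max_distance=1):
    """Return the word plus every string made by deleting up to max_distance characters."""
    results = {word}
    for d in range(1, min(max_distance, len(word)) + 1):
        for positions in combinations(range(len(word)), d):
            results.add("".join(c for i, c in enumerate(word) if i not in positions))
    return results


def build_index(dictionary, max_distance=1):
    """Map each delete-variant to the dictionary words it came from."""
    index = defaultdict(set)
    for word in dictionary:
        for variant in deletes(word, max_distance):
            index[variant].add(word)
    return index


def candidates(word, index, max_distance=1):
    """Collect dictionary words sharing a delete-variant with the input.

    These are only candidates: they still need an edit-distance check.
    """
    found = set()
    for variant in deletes(word, max_distance):
        found |= index.get(variant, set())
    return found


index = build_index(["spell", "spelling", "shell"])
print(candidates("spel", index))
print(candidates("sbell", index))
```

The appeal for porting is that the whole lookup reduces to set and hash-map operations, with no affix rules to interpret.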
        
           | rijoja wrote:
           | Thanks for the information!
        
           | nicoburns wrote:
           | https://github.com/wolfgarbe/SymSpell lists 5 JS
           | implementations (+ a Rust one that compiles to web assembly)
        
             | zverok wrote:
             | Ah, indeed :) I just googled the first one for
             | "symspell+js"
        
       | bababooeyxo wrote:
       | Really great series of articles about spellcheckers. I wish there
       | was a similar project written in Ruby. There is
       | https://github.com/omohokcoj/ruby-spellchecker but it serves a
       | slightly different purpose - to do safe autocorrections.
       | 
       | Are you considering the https://github.com/wolfgarbe/SymSpell
       | algo to do suggestions? If I recall, native Hunspell suggestions
       | are quite slow - the same algo must be even slower in Python.
        
         | zverok wrote:
         | > I wish there was a similar project written in Ruby.
         | 
         | Hehe... Actually, Ruby is my primary language, but I have
         | chosen Python for this project for a complicated set of reasons
         | I tried to explain[0] in the first article.
         | 
         | > If I recall, native Hunspell suggestions are quite slow - the
         | same algo must be even slower in Python.
         | 
         | Pretty slow, yes. But the current project's goal is to
         | "uncover" how Hunspell works--so I implement it the Hunspell
         | way. The next (several) parts of the series will explain a lot
         | about suggest, including "why is it hard", and "why SymSpell
         | might not be enough" ;)
         | 
         | 0:
         | https://zverok.github.io/blog/2021-01-05-spellchecker-1.html...
        
       | saadalem wrote:
       | Grammarly matches patterns; the differences between Grammarly and
       | other grammar checkers are essentially in the sizes of their
       | lists and their cosmetics: how many writing problems can they
       | find and correct, and (far less important) how friendly is their
       | interface to the writer?
       | 
       | I think Grammarly isn't even top 5 among grammar-checking tools
       | or top 10 in helping improve style.
       | 
       | I think the problem for grammar checkers is that helping
       | nonnative English speakers is a vastly larger problem than
       | helping native English speakers, and no grammar checker I am
       | aware of does much to help ESL speakers who are not already
       | fluent in English.
       | 
       | Also, Rahul (founder of Superhuman) said there is currently no
       | automatic spell/grammar check library that developers could
       | integrate into the apps and software they create. "When you type
       | I would love to be able to autocorrect errors in your typing in
       | the same way the MacOS does natively", so there is maybe an
       | opportunity out there! Don't forget me if you make it through x)
       | and I'm spazzed to check your upcoming writings!
        
         | Ygg2 wrote:
         | > I think Grammarly isn't even top 5 among grammar-checking
         | tools or top 10 in helping improve style.
         | 
         | What are the top 5 tools for checking grammar/improving style?
        
           | edoceo wrote:
           | In my top 5 is one called Hemingway - I put my marketing
           | materials in there and work them down to an 8th grade level
           | :/
        
       | andy_ppp wrote:
       | Isn't this a problem deep learning could solve without huge
       | amounts of difficulty in implementation? Or am I just imagining
       | that getting a list of misspellings of words and phonetics for
       | them is not intractable?
        
         | zverok wrote:
         | That's a huge topic, which I am planning to cover towards the
         | end of the article series <s>please like and subscribe</s>, but
         | in short: yes, my opinion is that spellchecking is actually a
         | "machine learning problem in disguise", and most existing
         | dictionaries are more a roundabout way of storing something-
         | not-unlike-models than analytical data.
         | 
         | But an ML approach raises the question of data availability.
         | What good will your "deep learning OSS spellchecker" do if
         | there aren't good (and open) models for it which cover as many
         | languages as existing Hunspell dictionaries do? And what if
         | adding a bunch of new words requires laborious model
         | retraining? It is not unsolvable, but non-trivial.
         | 
         | I believe all the giants have something like this inside (I
         | don't think spelling correction in Google search bar is handled
         | with Hunspell, right?), but it is much harder to do as an open
         | tool, ready for embedding into other software.
         | 
         | There are notable attempts, though: JamSpell for one
         | (https://github.com/bakwc/JamSpell), which has open "free"
         | models and more precise commercial ones; the source code is
         | open (maybe also only for using "simplistic" models, haven't
         | dug deeper).
        
       | ilovefood wrote:
       | Lovely work, I was looking for something like this the other day
       | and I'd like to thank you for sharing it! Especially since it's
       | in Python I can understand it without too much hassle. What a
       | good job on the documentation as well!!
        
       | rijoja wrote:
       | Since I'm not a native English speaker I find the Grammarly
       | extension quite good. This creates a problem in professional
       | settings though, since basically all the text that you write is
       | shared with a 3rd party.
       | 
       | Which open source projects, if any, implement the corresponding
       | functionality? Also, does anybody have experience with them?
        
         | The_Colonel wrote:
         | https://github.com/languagetool-org/languagetool works very
         | well (although I use the online version).
        
         | zverok wrote:
         | I believe that LanguageTool[0] is the closest open-source
         | counterpart to Grammarly. Though, in my experience, it is not
         | half as useful... But multilingual and open-source.
         | 
         | I have a distant dream of doing to it what I did to Hunspell
         | (write code and a series of articles explaining how it works
         | and why it is so hard), but we'll see.
         | 
         | As far as I know, LanguageTool is based just on a huge set of
         | rules (you can see them in the repo[1]); and Grammarly is a mix
         | of rule-based and machine-learning suggestions (I heard a rumor
         | that it is 99% rule-based, and talks about ML are mostly
         | marketing, but I don't know how reliable this rumor was).
         | 
         | 0: https://languagetool.org
         | 
         | 1: https://github.com/languagetool-org/languagetool/tree/master...
        
           | lqet wrote:
           | Indeed, I have an installation of LanguageTool on my private
           | server to avoid the privacy issues mentioned above. I have
           | plugins for Thunderbird, Chromium and vim running. The
           | browser plugin is by far the best.
        
       | atum47 wrote:
       | Great tool. It was my first contact with spellcheckers, back when
       | I was working for a company that does translations powered by
       | machine learning. Back then I was a student and, as the article
       | mentioned, I was one of the naive ones who think that a
       | spellchecker is an easy thing to build.
       | 
       | https://github.com/victorqribeiro/goSpellcheck
       | 
       | I wrote this originally in Python, then I ported it to Go. Back
       | then I had plans to improve it. I believe that most errors would
       | be due to mispressed keys. I was sketching an algorithm to find
       | similar words given a dictionary. Soon I had to deal with other
       | projects (from college) and I left the spellchecker to the smart
       | people.
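
The key-slip idea above could be sketched roughly as follows. This is not the code from goSpellcheck. It is a hypothetical illustration that assumes a QWERTY layout, approximates adjacency by row and column position (ignoring the physical stagger between rows), and only considers a single mistyped letter. All names here are made up for the example.

```python
# Letter rows of a QWERTY keyboard; the layout is an assumption of this sketch.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_adjacency(rows=ROWS):
    """Approximate which keys sit next to each other, by row/column index."""
    adjacent = {}
    for r, row in enumerate(rows):
        for c, key in enumerate(row):
            near = set()
            for rr in (r - 1, r, r + 1):
                if 0 <= rr < len(rows):
                    for cc in (c - 1, c, c + 1):
                        if 0 <= cc < len(rows[rr]):
                            near.add(rows[rr][cc])
            near.discard(key)  # a key is not its own neighbour
            adjacent[key] = near
    return adjacent


ADJACENT = build_adjacency()


def slip_candidates(word, dictionary):
    """Dictionary words reachable by swapping one letter for a neighbouring key."""
    found = set()
    for i, ch in enumerate(word):
        for near in ADJACENT.get(ch, ()):
            candidate = word[:i] + near + word[i + 1:]
            if candidate in dictionary:
                found.add(candidate)
    return found


print(slip_candidates("wprd", {"word", "ward", "cord"}))
```

In this sketch "wprd" yields "word" because "p" and "o" are neighbours, while "ward" is not suggested since "a" is nowhere near "p". Missing, extra, or swapped letters would need separate handling.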
        
       | steve_g wrote:
       | This is great. I read all three articles.
       | 
       | It's amazing how difficult it is to encode rules for dealing with
       | natural language, considering how easy it is for a person to
       | resolve ambiguities, misspellings, and the like. Of course, we
       | forget how much knowledge we have encoded in our own brains.
       | 
       | I'm trying to do named entity recognition for chemicals and
       | materials in quasi-natural language texts that we've developed
       | over many years at work. It's brutal.
        
         | zverok wrote:
         | Thanks!
        
       ___________________________________________________________________
       (page generated 2021-01-15 23:02 UTC)