[HN Gopher] Computer Based Spellchecking Techniques
___________________________________________________________________
Computer Based Spellchecking Techniques
Author : pcr910303
Score : 14 points
  Date   : 2022-07-24 12:27 UTC (1 day ago)
(HTM) web link (web.archive.org)
(TXT) w3m dump (web.archive.org)
| homodeus wrote:
  | In 2022, "state of the art" means throwing a deep net at it. It
  | will likely pick up on all of these findings (and better ones,
  | incomprehensible to us) by itself, given the right architecture
  | and enough data, but I can't help but feel a bit saddened by
  | this - seeing the ingenuity and mastery of all these cited names
  | be obscured and superseded so easily, in a way.
|
  | I love advancement in the field and what machine learning will
  | enable us to do, but I don't know what to make of this. One
  | argument is that the ingenuity now lies with the engineers who
  | design the machine learning models, but it is still depressing
  | to me, for some reason. Never knew I would feel like this; am I
  | the only one?
|
  | P.S.: I'm commenting purely on this topic, which is an ideal big
  | data case - of course, we still have a long way to go with
  | machine learning, one where human minds will especially have to
  | shine.
| marcodiego wrote:
| > The hashing function described above is too simple to do the
| job properly - dcd, hdb and various other non-words would all
| hash to 223 and be accepted - but it's possible to devise more
| complicated hashing functions so that hardly any non-words will
| be accepted. You may use more than one hashing function; you
| could derive, say, six numbers from the same word and check them
| all in the bit map (or in six separate bit maps), accepting the
| word only if all six bits were set.
|
| Just described how a Bloom filter works.
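  |
  | A minimal sketch of that idea in Python - the bit-map size, the
  | choice of six hashes, and the salted-MD5 construction below are
  | illustrative assumptions, not details from the article:
  |
  |     import hashlib
  |
  |     NUM_BITS = 1 << 20   # size of the bit map (illustrative)
  |     NUM_HASHES = 6       # six derived numbers, as in the quote
  |     bits = bytearray(NUM_BITS // 8)
  |
  |     def positions(word):
  |         # Derive several indices from one word by hashing it
  |         # with a different salt for each hash function.
  |         for i in range(NUM_HASHES):
  |             h = hashlib.md5(f"{i}:{word}".encode()).digest()
  |             yield int.from_bytes(h[:8], "big") % NUM_BITS
  |
  |     def add(word):
  |         for p in positions(word):
  |             bits[p // 8] |= 1 << (p % 8)
  |
  |     def probably_a_word(word):
  |         # All bits set -> accept (perhaps a false positive);
  |         # any bit clear -> definitely not in the dictionary.
  |         return all(bits[p // 8] & (1 << (p % 8))
  |                    for p in positions(word))
  |
  |     for w in ("the", "quick", "brown", "fox"):
  |         add(w)
  |     print(probably_a_word("quick"))  # True
  |     print(probably_a_word("dcd"))    # almost surely False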
| jwstarr wrote:
| A more quantitative approach can be found in a pair of papers
  | from John C. Nesbit, who analyzed ten algorithms in 1985/86
| (https://archive.org/details/sim_journal-of-computer-based-in...
| ; https://archive.org/details/sim_journal-of-computer-based-
| in...). Generalized edit distance performed best, but also took
| the most time. The PLATO algorithm, which used a feature vector-
| esque approach, came in third in quality and was also efficient.
| Phonetic approaches came in third. Since the charts are hard to
  | read and summarize, I converted the results into F1 scores
| (https://ztoz.blog/posts/nesbit/).
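  |
  | For context, "generalized edit distance" is Levenshtein distance
  | with per-operation costs. A minimal Python sketch - the flat
  | costs below stand in for the weighted cost tables the papers
  | actually evaluate:
  |
  |     def edit_distance(a, b, ins=1.0, dele=1.0, sub=1.0):
  |         # Dynamic programming over prefixes of a and b.
  |         m, n = len(a), len(b)
  |         d = [[0.0] * (n + 1) for _ in range(m + 1)]
  |         for i in range(1, m + 1):
  |             d[i][0] = i * dele
  |         for j in range(1, n + 1):
  |             d[0][j] = j * ins
  |         for i in range(1, m + 1):
  |             for j in range(1, n + 1):
  |                 cost = 0.0 if a[i-1] == b[j-1] else sub
  |                 d[i][j] = min(d[i-1][j] + dele,    # deletion
  |                               d[i][j-1] + ins,     # insertion
  |                               d[i-1][j-1] + cost)  # sub/match
  |         return d[m][n]
  |
  |     # Rank candidate corrections for a misspelling:
  |     words = ["separate", "desperate", "disparate"]
  |     print(sorted(words,
  |                  key=lambda w: edit_distance("seperate", w)))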
___________________________________________________________________
(page generated 2022-07-25 23:01 UTC)