[HN Gopher] Whichlang - Fast, OSS for Language Detection in Rust
___________________________________________________________________
Whichlang - Fast, OSS for Language Detection in Rust
Author : yujian
Score : 23 points
Date : 2023-05-19 15:12 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| pipo234 wrote:
| Nice!
|
| Did a similar thing in C++ way back when: routing short sentence
| fragments (questions in a bank website search box) to the correct
| FAQs. One of the things we quickly discovered was how hard it is
| to separate Spanish (Castilian) from Catalan (and, to a lesser
| extent, Basque). Similarly for Mandarin versus Cantonese.
|
| At first we suspected that the low precision could be attributed
| to the fact that users would typically only type short 2-5 word
| sentences (think Google searches). We later discovered, talking to
| linguists from Barcelona (and native Chinese speakers), that
| another major factor is that so many words, idioms and even
| grammar are borrowed that it's hard to find a pure,
| "uncontaminated" training / testing corpus to start with.
| alexott wrote:
| Yes, in my previous experience there were commonly confused
| language pairs, especially on short texts: Spanish / Catalan,
| often German / Dutch, Bulgarian / Ukrainian, Dutch / Afrikaans,
| and a few more...
| allan_s wrote:
| shameless plug
|
| For Tatoeba.org I once wrote "tatodetect" to do this:
| https://github.com/allan-simon/Tatodetect
|
| It was my first realization that with simple math and stats you
| can achieve pretty good results without resorting to "machine
| learning".
|
| basically the idea was
|
| 1. use Tatoeba to generate the trigram counts
| 2. remove low-frequency trigrams
| 3. apply a bonus score for "unique" trigrams that do not appear in
|    other languages (it helped to differentiate very similar
|    languages like Spanish/French or Chinese dialects)
|
| then you simply sum the trigrams weighted by their frequency,
| and voilà
|
| Tatodetect should differentiate 200+ languages and dialects,
| regardless of their script (alphabet, sinograms, etc.)
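
The three steps above can be sketched roughly as follows. This is an illustrative reading of the comment, not Tatodetect's actual code: the function names (`trigrams`, `train`, `detect`), the `min_count` threshold, and the `unique_bonus` multiplier are all assumptions made for the example.

```python
from collections import Counter


def trigrams(text):
    """Return all overlapping character trigrams of the lowercased text."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train(corpus, min_count=2, unique_bonus=2.0):
    """Build a {lang: {trigram: weight}} model.

    corpus maps a language code to a list of example sentences.
    min_count and unique_bonus are hypothetical tuning knobs.
    """
    # Step 1: count trigrams per language.
    counts = {
        lang: Counter(t for s in sentences for t in trigrams(s))
        for lang, sentences in corpus.items()
    }
    # Step 2: drop low-frequency trigrams.
    counts = {
        lang: {t: c for t, c in cnt.items() if c >= min_count}
        for lang, cnt in counts.items()
    }
    # Number of languages in which each surviving trigram occurs.
    seen_in = Counter(t for cnt in counts.values() for t in cnt)

    model = {}
    for lang, cnt in counts.items():
        total = sum(cnt.values()) or 1
        # Step 3: relative frequency, boosted when the trigram is
        # found in this language only.
        model[lang] = {
            t: (c / total) * (unique_bonus if seen_in[t] == 1 else 1.0)
            for t, c in cnt.items()
        }
    return model


def detect(model, text):
    """Sum the weights of the text's trigrams per language; highest wins."""
    grams = trigrams(text)
    scores = {
        lang: sum(weights.get(t, 0.0) for t in grams)
        for lang, weights in model.items()
    }
    return max(scores, key=scores.get)
```

With a real corpus such as Tatoeba's sentence lists, `train` would be run once and `detect` called per query; on a toy corpus, `min_count=1` is needed so that anything survives the filter.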
| jszymborski wrote:
| I should probably bench this against fasttext, which has a pretty
| simple/fast/accurate lang detection model ootb.
| francoismassot wrote:
| This blog post introduces the algorithm behind it:
| https://quickwit.io/blog/whichlang-language-detection-librar...
| alexott wrote:
| Looking at the generated n-grams, I'm not sure it's a good idea
| to get rid of the spaces between words and not have markers for
| the beginning/end of a word.
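
To make the concern concrete, here is a small sketch comparing the two ways of extracting character n-grams. It does not reproduce whichlang's feature extraction; `char_ngrams`, the `keep_boundaries` flag, and the `_` marker are assumptions chosen for illustration.

```python
def char_ngrams(text, n=3, keep_boundaries=True, marker="_"):
    """Extract character n-grams from whitespace-separated words.

    With keep_boundaries=True, a marker is placed before, between and
    after the words, so n-grams record where words start and end.
    With keep_boundaries=False, words are simply concatenated.
    """
    words = text.lower().split()
    if keep_boundaries:
        joined = marker + marker.join(words) + marker
    else:
        joined = "".join(words)
    return [joined[i:i + n] for i in range(len(joined) - n + 1)]


# Without markers, "to be" becomes "tobe": its trigrams straddle the
# word boundary and correspond to no real word fragment.
without = char_ngrams("to be", keep_boundaries=False)

# With markers, "to be" becomes "_to_be_": trigrams such as "_to" and
# "be_" encode word-initial and word-final position.
with_markers = char_ngrams("to be", keep_boundaries=True)
```

Prefixes and suffixes tend to be strong language cues, which is one reason boundary-aware n-grams are a common choice in language identification.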
|
| It would be interesting to check on the dataset linked from this
| blog post from 6 years ago, which evaluates the fasttext model
| for language detection:
| https://alexott.blogspot.com/2017/10/evaluating-fasttexts-mo...
___________________________________________________________________
(page generated 2023-05-19 23:02 UTC)