[HN Gopher] Whichlang - Fast, OSS for Language Detection in Rust
       ___________________________________________________________________
        
       Whichlang - Fast, OSS for Language Detection in Rust
        
       Author : yujian
       Score  : 23 points
       Date   : 2023-05-19 15:12 UTC (7 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | pipo234 wrote:
       | Nice!
       | 
        | Did a similar thing in C++ way back when: routing short sentence
        | fragments (questions typed into a bank website's search box) to
        | the correct FAQs. One of the things we quickly discovered was how
        | hard it is to separate Spanish (Castilian) from Catalan (and, to
        | a lesser extent, Basque). Similarly for Mandarin versus Cantonese.
       | 
        | At first we suspected that the low precision could be attributed
        | to the fact that users would typically only type short 2-5 word
        | sentences (think Google searches). We later discovered, talking
        | to linguists in Barcelona (and to native Chinese speakers), that
        | another major factor is that so many words, idioms, and even
        | grammatical constructions are borrowed that it's hard to find a
        | pure, "uncontaminated" training / testing corpus to start with.
        
         | alexott wrote:
          | Yes, in my previous experience there were commonly confused
          | language pairs too, especially on short texts: Spanish /
          | Catalan, and often German / Dutch, Bulgarian / Ukrainian,
          | Dutch / Afrikaans, and a few more...
        
       | allan_s wrote:
       | shameless plug
       | 
        | for Tatoeba.org I once wrote "Tatodetect" to do this:
        | https://github.com/allan-simon/Tatodetect
       | 
        | it was my first realization that with simple math and stats you
        | can achieve pretty good results without resorting to "machine
        | learning"
       | 
        | basically the idea was:
        | 
        | 1. use Tatoeba to generate trigram counts
        | 
        | 2. remove low-frequency trigrams
        | 
        | 3. apply a bonus score to "unique" trigrams that don't appear in
        | other languages (it helped to differentiate very similar
        | languages like Spanish / French or Chinese dialects)
       | 
        | then you simply sum the trigrams, weighted by their frequencies,
        | and voila (a rough sketch of that scoring is at the end of this
        | comment)
       | 
        | Tatodetect can differentiate 200+ languages and dialects,
        | regardless of their script (alphabets, sinograms, etc.)
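        | 
        | A minimal sketch of that scoring idea in Rust (hypothetical code,
        | not the actual Tatodetect implementation; the per-language
        | trigram weights would come from the Tatoeba-derived counts and
        | the uniqueness bonus described above):
        | 
        |     use std::collections::HashMap;
        |     
        |     // Hypothetical per-language model: trigram -> weight.
        |     struct LangModel {
        |         name: &'static str,
        |         trigram_weights: HashMap<String, f64>,
        |     }
        |     
        |     // Sum the weights of the text's character trigrams.
        |     fn score(text: &str, model: &LangModel) -> f64 {
        |         let chars: Vec<char> = text.chars().collect();
        |         chars
        |             .windows(3)
        |             .map(|w| w.iter().collect::<String>())
        |             .map(|t| model.trigram_weights.get(&t).copied().unwrap_or(0.0))
        |             .sum()
        |     }
        |     
        |     // Pick the language whose model gives the highest total score.
        |     fn detect(text: &str, models: &[LangModel]) -> Option<&'static str> {
        |         models
        |             .iter()
        |             .map(|m| (m.name, score(text, m)))
        |             .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        |             .map(|(name, _)| name)
        |     }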
        
       | jszymborski wrote:
       | I should probably bench this against fasttext, which has a pretty
       | simple/fast/accurate lang detection model ootb.
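        | 
        | For the whichlang side, a rough timing harness could look like
        | this (a sketch assuming whichlang's detect_language(&str) entry
        | point from its README; the fasttext side needs its own bindings,
        | and a real comparison should use something like criterion):
        | 
        |     use std::time::Instant;
        |     use whichlang::detect_language;
        |     
        |     fn main() {
        |         let samples = [
        |             "Ceci est une phrase en francais.",
        |             "This is a short English sentence.",
        |             "Dies ist ein kurzer deutscher Satz.",
        |         ];
        |         let iterations = 100_000;
        |     
        |         let start = Instant::now();
        |         let mut detections = 0usize;
        |         for _ in 0..iterations {
        |             for s in &samples {
        |                 // detect_language returns the most likely language.
        |                 std::hint::black_box(detect_language(s));
        |                 detections += 1;
        |             }
        |         }
        |         let elapsed = start.elapsed();
        |         println!(
        |             "{} detections in {:?} ({:.0} ns each)",
        |             detections,
        |             elapsed,
        |             elapsed.as_nanos() as f64 / detections as f64
        |         );
        |     }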
        
       | francoismassot wrote:
        | This blog post introduces the algorithm behind it:
        | https://quickwit.io/blog/whichlang-language-detection-librar...
        
         | alexott wrote:
          | Looking at the generated n-grams, I'm not sure it's a good idea
          | to get rid of the spaces between words and to have no markers
          | for the beginning/end of words.
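          | 
          | To make the difference concrete, here is a tiny illustration (a
          | hypothetical helper, not whichlang's actual feature extraction)
          | of character trigrams with and without word-boundary markers:
          | 
          |     fn trigrams(text: &str, mark_boundaries: bool) -> Vec<String> {
          |         // '_' marks the beginning/end of each word when kept.
          |         let prepared: String = if mark_boundaries {
          |             // "good day" -> "_good_day_"
          |             format!(
          |                 "_{}_",
          |                 text.split_whitespace().collect::<Vec<_>>().join("_")
          |             )
          |         } else {
          |             // "good day" -> "goodday": word boundaries are lost
          |             text.split_whitespace().collect::<Vec<_>>().join("")
          |         };
          |         let chars: Vec<char> = prepared.chars().collect();
          |         chars.windows(3).map(|w| w.iter().collect()).collect()
          |     }
          |     
          |     fn main() {
          |         // With markers, trigrams like "_go" and "ay_" record which
          |         // letters start or end words, a useful signal for closely
          |         // related languages.
          |         println!("{:?}", trigrams("good day", true));
          |         // Without markers, "dda" spans the word boundary and
          |         // "_go" / "ay_" disappear.
          |         println!("{:?}", trigrams("good day", false));
          |     }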
         | 
          | It would be interesting to check it on the dataset linked from
          | this blog post from 6 years ago, which evaluates the fasttext
          | model for language detection:
          | https://alexott.blogspot.com/2017/10/evaluating-fasttexts-mo...
        
       ___________________________________________________________________
       (page generated 2023-05-19 23:02 UTC)