hngopher.com

       [HN Gopher] Meta's Omnilingual MT for 1,600 Languages
       ___________________________________________________________________
        
       Meta's Omnilingual MT for 1,600 Languages
        
       Author : j0e1
       Score  : 110 points
       Date   : 2026-03-18 05:00 UTC (3 days ago)
        
 (HTM) web link (ai.meta.com)
 (TXT) w3m dump (ai.meta.com)
        
       | stingraycharles wrote:
       | I find that Meta's translations are very poor compared to others,
       | at least for relatively obscure languages, which I figured was
       | relevant considering the article.
       | 
       | Google Translate is a good default, but LLMs are _really_ good at
       | translations, as they're better at understanding context and
       | providing culturally appropriate translations.
       | 
       | (I live in Cambodia where they speak Khmer)
        
         | smallerize wrote:
         | *they're
         | 
         | (Sorry I had to)
        
           | stingraycharles wrote:
           | I could have sworn I edited it! I did notice myself as well,
           | but thanks for the correction.
        
           | tomrod wrote:
           | *buakgegyyjaa
        
         | djsamseng wrote:
         | Hello from Siem Reap, Cambodia! Awesome to see a fellow tech
         | enthusiast from Cambodia.
         | 
         | I actually found Facebook's translations pretty good (better
         | than Google Translate for things longer than a sentence). From
         | my understanding of Khmer, Khmer is a bit more verbose and
         | context dependent, hence LLMs in Khmer would be a big help in
         | understanding those nuances.
         | 
         | In the inverse case (LLMs generating Khmer from English) I
         | heard from locals that it sounds formal and "robotic" which I
         | found quite interesting.
        
         | pseudocomposer wrote:
         | Kagi Translate is fantastic. Multilingual support is honestly
         | one of the best things about LLMs, imo.
        
         | yellow_lead wrote:
         | It's not even good for Chinese
        
         | ks2048 wrote:
         | So, LLMs are noticeably better in Khmer than Google Translate?
         | I wonder why Google Translate doesn't use Gemini under the
         | hood. Perhaps it's more prone to hallucinations.
         | 
         | I'm interested in finding some thorough testing of translations
         | on different LLMs vs. translation APIs.
        
           | pattilupone wrote:
           | There's a dropdown on Google Translate that lets you choose
           | "Advanced" mode or "Classic" mode. Advanced mode uses Gemini
           | but it's only available for select languages.
        
       | psychoslave wrote:
       | That's a high count, but still a bit away from "Omni". Usual
       | count is between 4k and 8k depending on the source. But the first
       | 1k might be the hardest, certainly.
        
         | simultsop wrote:
         | when you market, you use frontier and edge terms, so it sounds
         | pro max
        
       | croes wrote:
       | Off topic: since the AI craze, MS's documentation translations
       | have ridiculous errors, like translating the try/catch keywords
       | to "versuchen" and "fangen" on German pages.
        
         | Tarq0n wrote:
         | Yes, their translations offer negative value, which is annoying
         | because at work you can't usually choose your locale settings.
         | 
         | And the errors are really basic, like translating "shortly" to
         | "short", not the same thing at all!
        
       | bikeshaving wrote:
       | I'm very wary of celebrating Meta's language work when the
       | company was credibly found to have contributed to the genocide
       | against the Rohingya in Myanmar, and separately, to human rights
       | abuses against Tigrayans during the conflict in northern
       | Ethiopia. Be careful whose sins you're laundering.
       | 
       | https://www.amnesty.org/en/latest/news/2025/02/meta-new-poli...
       | https://www.amnesty.org/en/latest/news/2023/10/meta-failure-...
        
         | 0x3f wrote:
         | Do you also boycott Toyota for the Hilux?
        
           | bikeshaving wrote:
           | I don't own a car :)
        
       | garyclarke27 wrote:
       | They can translate 1600 languages, but they cannot do basic text
       | formatting. Where are the paragraphs?
        
         | canjobear wrote:
         | It's an abstract for a paper, so it's officially supposed to be
         | one paragraph.
        
           | BalinKing wrote:
           | In the paper itself, the abstract actually does have a
           | paragraph break, so it's probably just an autoformatting
           | issue or something.
        
       | djoldman wrote:
       | Just spent a long time trying to find where you can download any
       | of these weights.
       | 
       | Is it open weight? If so, why isn't there just a straight link to
       | the models?
        
         | ks2048 wrote:
         | I haven't seen anywhere claiming they are open weight (although
         | their last similar model, NLLB, was).
         | 
         | They say their leaderboard and evaluation datasets are freely
         | available. Closest statement I've seen in the paper, "Our
         | translation models are built on top of freely available
         | models."
        
       | ks2048 wrote:
       | I'll be looking at this in detail. I've started a company to do
       | similar things, https://6k.ai
       | 
       | I'm currently concentrating on better data gathering for low-
       | resource languages.
       | 
       | When you look in detail at data like Common Crawl, finepdfs, and
       | fineweb, (1) they are missing quality data sources that you can
       | find if you know where to look, and (2) the sources they have are
       | not processed "finely" enough (e.g. finepdfs classifies each page
       | of a PDF as having a specific language, whereas many language
       | learning sources have language pairs, etc.).
        
         | intended wrote:
         | There are many nation states working on this. Have you looked
         | into the availability of those data sets?
         | 
         | What languages are you prioritizing?
        
           | ks2048 wrote:
           | Yes, there are government datasets, language "academies" (or
           | "regulators") - organizations focused on preserving /
           | teaching the language, and often smaller, local publishers
           | that publish material in their local language.
           | 
           | I'm living in Guatemala, so have been focusing on the Mayan
           | languages here (22 languages, millions of speakers).
        
             | dhosek wrote:
             | As an aside, I remember visiting Guatemala (in the border
             | area near Chiapas) in the early 90s and discovering that
             | "Mayan" was not the monolith that I had been led to believe
             | by my culturally narrow American education, but was a
             | diverse collection of related cultures with multiple
             | languages.
             | 
             | In one of the villages we visited, there was a language
             | school where foreigners were learning Jacalteco. One
             | student was from Israel and where most of the students had
             | vocabulary lists in three columns (Jacalteco - Spanish -
             | English), his had four columns where he did one more step
             | of translation to Hebrew.
        
         | ccgreg wrote:
         | Common Crawl has been running a low-resource language project
         | for 1.5 years now -- it's a hard problem.
        
         | omneity wrote:
         | Hey, this is super cool! I've been working on a similar
         | problem, focusing on low-resource and underserved languages
         | including the Mayan family, and have published some research
         | and open resources around that [0, 1].
         | 
         | On the data side, I've found that the biggest bottleneck isn't
         | collecting text (it's out there!) but reliable language
         | identification. It's often difficult or ambiguous to separate
         | languages cleanly in datasets like Common Crawl, Fineweb, or
         | others. I worked on improving this a bit for Fineweb 2 for my
         | native language, which might inspire you [2].
         | 
         | Many of the challenges you mention seem to recur across regions
         | and language families, so I'd love to connect and compare notes
         | sometime. Feel free to reach me at omar [at] the labs site
         | below.
         | 
         | 0: https://wikilangs.org
         | 
         | 1: https://omneitylabs.com
         | 
         | 2: https://huggingface.co/blog/omarkamali/gherbal-multilingual-...
        
       | ks2048 wrote:
       | Meta released No Language Left Behind (NLLB) [1], I think in
       | 2022. I wonder why this is not "NLLB 2.0"? These companies love
       | introducing new names to confuse things.
       | 
       | [1] https://ai.meta.com/research/no-language-left-behind/
        
       | ks2048 wrote:
       | Another interesting thing mentioned here is: BOUQuET: Benchmark
       | and Open-initiative for Universal Quality Evaluation in
       | Translation.
       | 
       | https://huggingface.co/spaces/facebook/bouquet
        
       | intended wrote:
       | Didn't research show that models get worse at translation the
       | more languages get added in? The curse of multilinguality?
       | Lauscher 2020?
       | 
       | It looks like Meta found a way forward.
       | 
       | Reading Meta's abstract, it seems that they have found ways to
       | improve the quality of the training data, and also new evaluation
       | tools?
       | 
       | They are also saying that OMT-LLaMA does a better job at text
       | generation than other baseline models.
        
       | gojomo wrote:
       | Can translate between 1600 languages.
       | 
       | Can't achieve subject-verb agreement in 1st sentence of their
       | English abstract.
       | 
       |  _Advances made through No Language Left Behind (NLLB) have
       | demonstrated that high-quality machine translation (MT) scale to
       | 200 languages._
        
       ___________________________________________________________________
       (page generated 2026-03-21 23:00 UTC)