[HN Gopher] Meta's Omnilingual MT for 1,600 Languages
___________________________________________________________________
Meta's Omnilingual MT for 1,600 Languages
Author : j0e1
Score : 110 points
Date : 2026-03-18 05:00 UTC (3 days ago)
(HTM) web link (ai.meta.com)
(TXT) w3m dump (ai.meta.com)
| stingraycharles wrote:
| I find that Meta's translations are very poor compared to others,
| at least for relatively obscure languages, which I figured was
| relevant considering the article.
|
| Google Translate is a good default, but LLMs are _really_ good at
| translations, as they're more capable of understanding context
| and providing culturally appropriate translations.
|
| (I live in Cambodia where they speak Khmer)
| smallerize wrote:
| *they're
|
| (Sorry I had to)
| stingraycharles wrote:
| I could have sworn I edited it! I did notice myself as well,
| but thanks for the correction.
| tomrod wrote:
| *buakgegyyjaa
| djsamseng wrote:
| Hello from Siem Reap, Cambodia! Awesome to see a fellow tech
| enthusiast from Cambodia.
|
| I actually found Facebook's translations pretty good (better
| than Google Translate for things longer than a sentence). From
| my understanding of Khmer, Khmer is a bit more verbose and
| context dependent, hence LLMs in Khmer would be a big help in
| understanding those nuances.
|
| In the inverse case (LLMs generating Khmer from English) I
| heard from locals that it sounds formal and "robotic" which I
| found quite interesting.
| pseudocomposer wrote:
| Kagi Translate is fantastic. Multilingual support is honestly
| one of the best things about LLMs, imo.
| yellow_lead wrote:
| It's not even good for Chinese
| ks2048 wrote:
| So, LLMs are noticeably better in Khmer than Google Translate?
| I wonder why Google Translate doesn't use Gemini under-the-
| hood. Perhaps it's more prone to hallucinations.
|
| I'm interested in finding some thorough testing of translations on
| different LLMs vs Translation APIs.
| pattilupone wrote:
| There's a dropdown on Google Translate that lets you choose
| "Advanced" mode or "Classic" mode. Advanced mode uses Gemini
| but it's only available for select languages.
| psychoslave wrote:
| That's a high count, but still a bit away from "Omni". Usual
| count is between 4k and 8k depending on the source. But the first 1k
| might be the hardest, certainly.
| simultsop wrote:
| when you market, you use frontier and edge terms, so it sounds
| pro max
| croes wrote:
| Off topic: since the AI craze, MS's documentation translations have
| ridiculous errors, like translating the try/catch keywords to
| "versuchen" and "fangen" on German pages.
| Tarq0n wrote:
| Yes their translations offer negative value, which is annoying
| because at work you can't usually choose your locale settings.
|
| And the errors are really basic, like translating shortly to
| short, not the same thing at all!
| bikeshaving wrote:
| I'm very wary of celebrating Meta's language work when the
| company was credibly found to have contributed to the genocide
| against the Rohingya in Myanmar, and separately, to human rights
| abuses against Tigrayans during the conflict in northern
| Ethiopia. Be careful whose sins you're laundering.
|
| https://www.amnesty.org/en/latest/news/2025/02/meta-new-poli...
| https://www.amnesty.org/en/latest/news/2023/10/meta-failure-...
| 0x3f wrote:
| Do you also boycott Toyota for the Hilux?
| bikeshaving wrote:
| I don't own a car :)
| garyclarke27 wrote:
| They can translate 1600 languages, but they cannot do basic text
| formatting, where are the paragraphs?
| canjobear wrote:
| It's an abstract for a paper, so it's officially supposed to be
| one paragraph.
| BalinKing wrote:
| In the paper itself, the abstract actually does have a
| paragraph break, so it's probably just an autoformatting
| issue or something.
| djoldman wrote:
| Just spent a long time trying to find where you can download any
| of these weights.
|
| Is it open weight? If so, why isn't there just a straight link to
| the models?
| ks2048 wrote:
| I haven't seen anywhere claiming they are open weight (although
| their last similar model, NLLB, was).
|
| They say their leaderboard and evaluation datasets are freely
| available. Closest statement I've seen in the paper, "Our
| translation models are built on top of freely available
| models."
| ks2048 wrote:
| I'll be looking at this in detail. I've started a company to do
| similar things, https://6k.ai
|
| I'm currently concentrating on better data gathering for low-
| resource languages.
|
| When you look in detail at data like Common Crawl, finepdfs, and
| fineweb, (1) they are missing quality data sources that you can
| find if you know where to look, and (2) the sources they have are
| not processed "finely" enough (e.g. finepdfs classifies each page
| of a PDF as having one specific language, whereas many language
| learning sources have language pairs, etc.).
| intended wrote:
| There are many nation states working on this; have you looked
| into availability of those data sets?
|
| What languages are you prioritizing?
| ks2048 wrote:
| Yes, there are government datasets, language "academies" (or
| "regulators") - organizations focused on preserving /
| teaching the language, and often smaller, local publishers
| that publish material in their local language.
|
| I'm living in Guatemala, so have been focusing on the Mayan
| languages here (22 languages, millions of speakers).
| dhosek wrote:
| As an aside, I remember visiting Guatemala (in the border
| area near Chiapas) in the early 90s and discovering that
| "Mayan" was not the monolith that I had been led to believe
| by my culturally narrow American education, but was a
| diverse collection of related cultures with multiple
| languages.
|
| In one of the villages we visited, there was a language
| school where foreigners were learning Jacalteco. One
| student was from Israel and where most of the students had
| vocabulary lists in three columns (Jacalteco - Spanish -
| English), his had four columns where he did one more step
| of translation to Hebrew.
| ccgreg wrote:
| Common Crawl has been running a low-resource language project
| for 1.5 years now -- it's a hard problem.
| omneity wrote:
| Hey, this is super cool! I've been working on a similar
| problem, focusing on low-resource and underserved languages
| including the Mayan family, and have published some research
| and open resources around that [0, 1].
|
| On the data side, I've found that the biggest bottleneck isn't
| collecting text (it's out there!) but reliable language
| identification. It's often difficult or ambiguous to separate
| languages cleanly in datasets like Common Crawl, Fineweb, or
| others. I worked on improving this a bit for Fineweb 2 for my
| native language, which might inspire you [2].
|
| Many of the challenges you mention seem to recur across regions
| and language families, so I'd love to connect and compare notes
| sometime. Feel free to reach me at omar [at] the labs site
| below.
|
| 0: https://wikilangs.org
|
| 1: https://omneitylabs.com
|
| 2: https://huggingface.co/blog/omarkamali/gherbal-
| multilingual-...
| ks2048 wrote:
| Meta released No Language Left Behind (NLLB) [1], I think in
| 2022. I wonder why this is not "NLLB 2.0"? These companies love
| introducing new names to confuse things.
|
| [1] https://ai.meta.com/research/no-language-left-behind/
| ks2048 wrote:
| Another interesting thing mentioned here is: BOUQuET: Benchmark
| and Open-initiative for Universal Quality Evaluation in
| Translation.
|
| https://huggingface.co/spaces/facebook/bouquet
| intended wrote:
| Didn't research show that models get worse at translation the
| more languages get added in? The curse of multilinguality?
| Lauscher 2020?
|
| It looks like Meta found a way forward.
|
| Reading Meta's abstract, it seems that they have found ways to
| improve the quality of the training data, and also new evaluation
| tools?
|
| They are also saying that OMT-LLaMA does a better job at text
| generation than other baseline models.
| gojomo wrote:
| Can translate between 1600 languages.
|
| Can't achieve subject-verb agreement in 1st sentence of their
| English abstract.
|
| _Advances made through No Language Left Behind (NLLB) have
| demonstrated that high-quality machine translation (MT) scale to
| 200 languages._
___________________________________________________________________
(page generated 2026-03-21 23:00 UTC)