[HN Gopher] No Language Left Behind
___________________________________________________________________
No Language Left Behind
Author : pesenti
Score : 86 points
Date : 2022-07-06 19:52 UTC (3 hours ago)
(HTM) web link (ai.facebook.com)
(TXT) w3m dump (ai.facebook.com)
| TaupeRanger wrote:
| So they have a system that can translate to languages for which
| there isn't as much data as English, Spanish, etc. Waiting for a
| Twitter thread from a native speaker of one of these "low
| resource languages" to let us know how good the actual
| translations are. Cynically, I'd venture that they hired some
| native speakers to cherry pick their best translations for the
| story books. But mostly this just seems like a nice bit of PR
| (calling it a "breakthrough", etc.). I can't imagine this is
| going to help anyone who actually speaks a random, e.g., Nilo-
| Saharan language.
| hello_im_angela wrote:
| If you're curious to try the system yourself, it's actually
| being used to help Wikipedia editors write articles for low-
| resource language Wikipedias:
| https://twitter.com/Wikimedia/status/1544699850960281601
| onurcel wrote:
| In this work we tried to rely not only on automated evaluation
| scores but also on human evaluation, for exactly this reason:
| we wanted a better understanding of how our model actually
| performs and how human judgments correlate with the automated
| scores.
| alexott wrote:
| Twitter may not be representative, imho, because the texts are
| so short. The first hurdle is reliable language detection, and
| Twitter quite often gets that wrong.
| microtherion wrote:
| As a native Swiss German speaker, my native language is not only
| low resource in general, but has the additional difficulty of not
| having a standardized orthography (many native speakers will
| exclusively write in Standard German, and use Swiss German only
| for spoken communication).
|
| So you have a language with some economic opportunity (a few
| million speakers in a fairly wealthy country) but no clearly
| defined written interface, and an ambivalent attitude of many
| speakers towards the very idea of writing the language.
| rmbyrro wrote:
| This only makes the problem behind the NLLB project even more
| interesting to solve.
| hello_im_angela wrote:
| sooo real. Many low-resource languages have several natural
| variants, can be written in multiple scripts, are less
| standardized in writing, or are mainly oral. As
| part of the creation of our benchmark, FLORES-200, we tried to
| support languages in multiple scripts (if they are naturally
| written like that) and explored translating regional variants
| (such as Moroccan Arabic, not just Arabic).
|
| As an aside, the question of how to think about language
| standardization is really complex. We wrote some thoughts in
| Appendix A of our paper:
| https://research.facebook.com/publications/no-language-left-...
| Etheryte wrote:
| I'll believe it when I actually see it. I'm a native of a
| reasonably small language spoken by about a million people and
| never have I ever seen a good automatic translation for it. The
| only translations that are good are the ones that have been
| manually entered, and those that match the structure of the
| manually entered ones. I think the sentiment is laudable and wish
| godspeed to the people working on this, but for the time being I
| don't see it becoming a reality yet. When Google Translate
| regularly struggles even with big pairs such as German-English-
| German, I have reservations about someone making it work for
| languages where datasets are orders of magnitude smaller.
| bobsmooth wrote:
| There's a section where you can try reading translated
| children's books. See if your language is supported and how
| good the translation is.
| hello_im_angela wrote:
| It's an extremely difficult problem indeed. A lot of people on
| the team speak low-resource languages too (my native language
| as well!), so definitely resonate with what you're saying. My
| overall feeling is: yeah it's hard, and after decades we can't
| even do German translation perfectly. But if we don't work on
| it, it's not gonna happen. I really hope that people who are
| excited about technology for more languages can use what we've
| open sourced.
| azinman2 wrote:
| > But if we don't work on it, it's not gonna happen.
|
| That's exactly right. There's too much bias in society that
| if something isn't perfect, then why bother? Nothing is
| perfect, so with that attitude there can be no progress.
| Thank you for doing important work!
| Tabular-Iceberg wrote:
| My concern with this is that in low resource languages the
| unavoidable biases of the ML models might overpower their own
| organic development.
|
| We shrug off all the little quirks of machine translated text
| because it usually gets the point across, and we recognize them
| as quirks because most of what we read was written by real people
| with no such quirks. But when most of what you read contains those
| quirks, I fear those will quickly become the standard way of
| writing and even speaking in those languages.
| texaslonghorn5 wrote:
| In a worst case you can end up with the Scots Wikipedia
| situation, where some power editor created a bunch of pages
| using an entirely fabricated, overly stereotypical language and
| that influenced what people thought Scots actually was.
| onurcel wrote:
| This is one of the examples we keep in mind, and it's also
| why we can't 100% trust public dataset labels. This motivated
| us to train a language identification (LID) system for all the
| languages we wanted to handle, in order to build the
| monolingual dataset. More details in the paper ;) Or here, if
| you have questions.
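For readers curious what a language identification system looks like under the hood: production LID models are trained classifiers over character n-grams (the team describe theirs in the paper), but the core idea can be sketched with a toy nearest-profile classifier. Everything below, class and method names and the two miniature "corpora", is illustrative, not the project's actual code.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-grams: the classic feature set for language ID."""
    text = f" {text.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class TinyLID:
    """Toy nearest-profile language identifier (Cavnar-Trenkle style)."""
    def __init__(self):
        self.profiles = {}

    def train(self, lang, corpus):
        # Store an n-gram frequency profile per language.
        self.profiles[lang] = Counter(char_ngrams(corpus))

    def predict(self, text):
        grams = char_ngrams(text)
        def score(profile):
            total = sum(profile.values()) or 1
            return sum(profile[g] / total for g in grams)
        return max(self.profiles, key=lambda lang: score(self.profiles[lang]))

lid = TinyLID()
lid.train("eng", "the quick brown fox jumps over the lazy dog the and of to")
lid.train("fra", "le renard brun rapide saute par dessus le chien paresseux et de la")
print(lid.predict("the dog and the fox"))    # -> eng
print(lid.predict("le chien et le renard"))  # -> fra
```

A real system like the one described in the paper faces the much harder version of this problem: hundreds of languages, closely related pairs, and noisy web text.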
| protomyth wrote:
| I think it will be interesting when it runs into a language
| (e.g. Dakota) where women and men speak differently. Should be
| an interesting test.
| zen_1 wrote:
| Doesn't seem to be a big issue for Arabic, where verbs are
| gendered (so in the sentence "I am going to the store", the
| verb "to go" will be either masculine or feminine, reflecting
| the speaker's gender).
| nemothekid wrote:
| Arabic is the 5th or 6th most spoken language. I think the
| concern for low resource languages is that nuances like
| that won't get picked up.
| pesenti wrote:
| Blog post: https://ai.facebook.com/blog/nllb-200-high-quality-machine-t...
|
| Paper: https://research.facebook.com/publications/no-language-left-...
|
| Github: https://github.com/facebookresearch/fairseq/tree/nllb/
| robocat wrote:
| Also note comments from _hello_im_angela_ (= Angela Fan) and
| _jw4ng_ (= Jeff Wang). Those are the HN accounts for Angela and
| Jeff from No Language Left Behind.
| albertzeyer wrote:
| Note that very recently Google has done something very similar:
| "Building Machine Translation Systems for the Next Thousand
| Languages": https://arxiv.org/abs/2205.03983
| https://ai.googleblog.com/2022/05/24-new-languages-google-tr...
|
| The Facebook paper has some direct comparison to that work.
| jkw wrote:
| Evaluation was important to us, and we really wanted to have a
| benchmark that covers all 200 languages.
| enos_feedler wrote:
| I was two sentences in before I realized the headline wasn't "No
| Luggage Left Behind"
| onurcel wrote:
| this is actually our recurring joke for our team meeting
| offsites!
| mikewarot wrote:
| The analogy I like the most is that they've found the "shape" of
| languages in high dimensions, and if you rotate the shape for
| English the right way, you get an unreasonably good fit for the
| shape of Spanish, and likewise for all the other languages.
|
| We're at a point where it's now possible to determine the shape
| of every language, provided there are enough speakers of the
| language left who are both able and willing to help.
|
| <Snark> Once done, Facebook can then commodify their dissent, and
| sell it back to them in their native language. </Snark>
| goldemerald wrote:
| The shape analogy doesn't really apply with modern language
| models. Each word gets its own context dependent high
| dimensional point. With everything being context dependent,
| simple transformations like rotations are impossible. A more
| accurate perception is that any concept expressible in language
| now has its own high dimensional representation, which can then
| be decoded into any other language.
| labrador wrote:
| I'll know AI translators are any good when the United Nations
| starts using them
|
| _" Skills required: United Nations translators are required to
| have a perfect command of their main language and an excellent
| knowledge of, in most cases, two other official languages"_
|
| https://www.un.org/dgacm/en/content/translation
| kwhitefoot wrote:
| What is a "low resource language"?
| pesenti wrote:
| https://datascience.stackexchange.com/questions/62868/high-l...
| jw4ng wrote:
| hey there, I work on this project. We categorize a language as
| low-resource if there are fewer than 1M publicly available, de-
| duplicated bitext samples.
|
| also see section 3, table 1 in the paper:
| https://research.facebook.com/publications/no-language-left-...
| maestrae wrote:
| hey, this sounds silly but I can't seem to find a list of all
| 200 languages covered. I've looked at the website and the blog
| post and neither has a readily available link, which seems
| like a major oversight. There is of course a drop-down in
| both, but it shows far fewer than 200 languages. I'm
| particularly interested in a list of the 55 African languages,
| for example.
| hello_im_angela wrote:
| We have a full list here (copy pastable):
| https://github.com/facebookresearch/flores/tree/main/flores2...
| and Table 1 of our paper
| (https://research.facebook.com/publications/no-language-
| left-...) has a complete list as well.
| goodside wrote:
| Nice to see Esperanto made the cut -- the only artificial
| language to do so, AFAICT.
| hello_im_angela wrote:
| ha yes, that's correct. If you have thoughts on specific
| constructed languages where having translation would
| really help people, let us know!
| maestrae wrote:
| thank you!
| protomyth wrote:
| Looking at the list, I see a lack of Native American
| languages. Did anyone try to contact the tribes during this?
| hello_im_angela wrote:
| We interviewed speakers of low-resource languages from all
| over the world to understand the human need for this kind
| of technology --- what do people actually want, how would
| they use it, and what's the quality they would find useful?
| Many low-resource languages lack data online but are
| spoken by millions; many indigenous languages, however, are
| spoken by smaller numbers of people. We are definitely
| interested in partnering with local communities to co-develop
| technology and have been actively investigating these
| collaborations, but we don't have much to share yet.
| vjerancrnjak wrote:
| What are hardware requirements to run this?
|
| I see the mixture model is ~ 300 GB and was trained on 256 GPUs.
|
| I assume distilled versions can easily be run on one GPU.
| hello_im_angela wrote:
| We release several smaller models as well:
| https://github.com/facebookresearch/fairseq/tree/nllb/exampl...
| that are 1.3B and 615M parameters. These are usable on smaller
| GPUs. To create these smaller models but retain good
| performance, we use knowledge distillation. If you're curious
| to learn more, we describe the process and results in Section
| 8.6 of our paper:
| https://research.facebook.com/publications/no-language-left-...
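As a rough sanity check on whether the distilled checkpoints fit on a small GPU: holding fp16 weights alone takes about 2 bytes per parameter. Activations, beam search state, etc. come on top, so this is a lower bound of my own, not the team's official requirement.

```python
def weight_gigabytes(n_params, bytes_per_param=2):
    """Memory to hold just the model weights (fp16 by default),
    excluding activations, optimizer state, and decoding buffers."""
    return n_params * bytes_per_param / 1e9

# The two distilled checkpoints mentioned above, assuming fp16 weights.
print(f"1.3B: ~{weight_gigabytes(1.3e9):.2f} GB")   # ~2.60 GB
print(f"615M: ~{weight_gigabytes(615e6):.2f} GB")   # ~1.23 GB
```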
| jkw wrote:
| Hey all, I work on this project. Full list of languages can be
| found here:
| https://github.com/facebookresearch/flores/tree/main/flores2...
|
| As well as in the research paper:
| https://research.facebook.com/publications/no-language-left-...
| jw4ng wrote:
| Jeff Wang here with my fellow Meta AI colleague Angela Fan from
| No Language Left Behind, seeing the comments flowing through. If
| you want to ask us anything, go for it!
| dangom wrote:
| What is the greatest insight you gained and could share with
| non-experts from working on this project?
| jw4ng wrote:
| I gained a deeper understanding of what it truly means to be
| inclusive. Every language is unique just like everybody, and
| making sure content works for all and including as many
| people as possible is really, really hard, but through this
| project I'm hopeful we are taking it one step further.
| Jabbles wrote:
| > Every language is unique just like everybody
|
| TBH it just sounds like you've redefined the word "unique".
| mike8889 wrote:
| pagekicker wrote:
| Hi, I'm putting together an online event called 31 Days of AI
| for Book-Lovers to coincide with US National Book Month,
| October 2022. I was struck by the specific call-out to
| translating literature on your demo page and would like to
| feature a specifically book-related application of NLLB on one
| of 'anchor days'. Can someone work with me on this?
| shuraih wrote:
| Hey Jeff, I'm a native speaker of Dhivehi -- the language
| spoken by the people of Maldives. Since I couldn't find a full
| list of supported languages I was wondering if Dhivehi is /
| would be integrated.
| jkw wrote:
| Dhivehi is currently not supported, unfortunately. We view
| this as a starting point and are committed to expanding to
| many other languages as in the spirit of our project name.
|
| Full list of currently supported languages can be found here:
| https://github.com/facebookresearch/flores/tree/main/flores2...
| jefflombardjr wrote:
| Gangi ther vel! ("Good luck!" in Icelandic)
| pesenti wrote:
| Are all the 200x200 translations going directly or is English
| (or another language) used as an intermediate for some of them?
| jw4ng wrote:
| All translation directions are direct from language X to
| language Y, with no intermediary. We evaluate the quality
| through 40,602 different translation directions using
| FLORES-200. 2,440 directions contain supervised training data
| created through our data effort, and the remaining 38,162 are
| zero-shot.
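The arithmetic behind those direction counts is worth making explicit: with n languages there are n*(n-1) ordered translation directions, and the quoted 40,602 is exactly 202 x 201, consistent with 202 supported languages. This back-calculation is mine, not a claim from the thread.

```python
# n languages give n * (n - 1) ordered translation directions.
n_langs = 202                # back-calculated from the quoted totals
total = n_langs * (n_langs - 1)
supervised = 2_440           # directions with supervised training data
zero_shot = total - supervised
print(total, zero_shot)  # -> 40602 38162
```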
| btheshoe wrote:
| I'm not entirely sure why low resource languages are seen as such
| a high priority for AI research. It seems that by definition
| there's little payoff to solving translation for them.
| goodside wrote:
| "Low-resource language" isn't just a euphemism for "language
| almost nobody speaks". There are many languages that are widely
| spoken but nonetheless are hard to obtain training data for.
| Getting something like Wikipedia going for a minority language
| can be a difficult chicken-and-egg problem because users will
| use English for its completeness/recency, despite their limited
| fluency, and the native-language Wikipedia remains neglected.
| So you can end up in a situation where users use one language
| for social media and another for news/research, and Facebook is
| in a unique position to care about the former.
| quink wrote:
| The examples given are, with native-speaker numbers, Assamese
| (15 million), Catalan (4 million) and Kinyarwanda (10 million).
| Together that's more than the population of Australia.
|
| Furthermore, Facebook considers the internet to consist of
| Facebook and Wikipedia (Zero).
|
| I view this as just another extension of their Next Billion
| initiative, an effort to ensure that another billion people are
| monopolised by Facebook.
|
| That's the payoff.
| dunefox wrote:
| Small data, big meaning is much more important than big data,
| little meaning. Much closer to real intelligence.
| onurcel wrote:
| hi @btheshoe, I work on the data side of this project. As
| others mentioned, the amount of data available for a language
| is not correlated with its number of speakers, which explains
| the potential impact of focusing on these languages.
| Jabbles wrote:
| Surely the fact that they did all the high-resource languages
| first and are only now getting round to the less-popular ones
| demonstrates that that is not, in fact, the case?
| tehsauce wrote:
| I think the reason low resource languages are prioritized is to
| compensate for the fact that AI research normally has a
| tendency to marginalize these languages.
| btheshoe wrote:
| yes, but what principles justify the importance placed on low
| resource languages?
| froskur wrote:
| Low resource in this context means that there are few
| resources available to train a neural network with, not
| that there are few speakers. Although many low resource
| languages have relatively few speakers, there are also ones
| with tens of millions of speakers.
|
| The reason for emphasis is in my opinion twofold: 1)
| Allowing these people to use the fancy language technology
| in their own language is good in and of itself. 2) Training
| neural networks on fewer resources is more difficult than
| using more resources and therefore a fun and interesting
| challenge.
| macintux wrote:
| Plus presumably we learn more from solving harder
| problems, and we prepare for one day needing to translate
| some alien language in a hurry.
| jw4ng wrote:
| We think it's important for AI to truly support everyone in the
| world. A world where AI only serves a subset of the population
| is not ideal. In machine translation, this means supporting as
| many languages as possible at high quality. We also imagine a
| future where anyone will be able to communicate with anyone
| else seamlessly; this also means solving translations for all
| languages.
| daniel-cussen wrote:
| Wouldn't that also entail a bot speaking in any language?
| wilde wrote:
| The point is that there are lots of humans who speak these
| languages and use tech. They just don't use Wikipedia so
| getting a good translation corpus going was harder.
| gwern wrote:
| And it's both cumulative across all those languages (see
| above), cheap/amortized (if you can do a good multilingual
| NMT for 50 languages, how hard can 50+1 languages be?), and
| many of those languages are likely to grow both in terms of
| sheer population and in GDP. (Think about South Asian or
| African countries like Indonesia or Nigeria.) The question
| isn't why are FB & Google investing so much in powerful
| multilingual models which handle hundreds of languages, but
| why aren't other entities as well?
| ausbah wrote:
| what other entities would really have access to the text
| resources that FB & Google have? Outside of a few other large
| companies, I can't imagine many.
| munificent wrote:
| Cynical answer: It's good PR.
| albertzeyer wrote:
| I don't really remember the exact numbers anymore, but covering
| only the top 5 languages will cover maybe 40% of the world
| population, while covering the top 200 languages (many of them
| low resource) will cover maybe 90% of the world population.
|
| Some numbers (though you cannot directly derive such cumulative
| figures from them):
| https://en.wikipedia.org/wiki/List_of_languages_by_total_num...
|
| Some more numbers from here:
| https://www.sciencedirect.com/science/article/pii/S016763931...
|
| "96% of the world's languages are spoken by only 4% of its
| people."
|
| Although this statement is more about the tail of the roughly
| 7,000 languages.
| bvanderveen wrote:
| Great! Facebook no longer has to provide content moderation in
| all the various corners of the world where they could
| accidentally enable the dissemination of misinformation and hate
| speech in minority languages. They can simply transform it into
| English and run it back through the existing moderation tooling!
|
| Understanding foreign culture is about reading automated
| translations of online comments into your native language. It has
| nothing to do with putting the effort into learning a language
| and understanding the nuances and current events and issues of
| the culture it embeds.
|
| The ESL (English as a single language) speakers over at Facebook
| don't even need to understand foreign cultures, because they
| already know everyone in the world needs to spend their lives
| staring into the Metaverse. So grateful that they are working on
| the world's fattest pipeline for exporting Anglophone culture to
| every corner of the planet!
| LtWorf wrote:
| Facebook translations are horrifying for the mainstream languages
| already. They go from completely wrong to kinda understandable
| but still wrong.
| rmbyrro wrote:
| Looks like they're investing to get better. The model is also
| available, and they've called for contributions to improve it.
___________________________________________________________________
(page generated 2022-07-06 23:00 UTC)