[HN Gopher] Show HN: LibreTranslate - Open-source neural machine...
___________________________________________________________________
Show HN: LibreTranslate - Open-source neural machine translation
API
Author : pjfin123
Score : 91 points
Date : 2021-02-06 18:48 UTC (4 hours ago)
(HTM) web link (libretranslate.com)
(TXT) w3m dump (libretranslate.com)
| fartcannon wrote:
| Good. Now can we finally stop pretending Cantonese doesn't exist?
| bmicraft wrote:
| Well, it would be nice if it worked, but it couldn't even
| translate "merry christmas" into German; it just left it as is.
| Apparently it needs the C to be capitalized ...
| dheera wrote:
| On a positive note, I think it's great that we're seeing
| efforts in this direction.
|
| Fixing capitalization and spelling is a fairly easy thing to
| do: just put a spell-checker in front of the input. Maybe that
| would be a good pull request.
| pjfin123 wrote:
| There's a more in-depth discussion of this issue here:
| https://github.com/uav4geo/LibreTranslate/issues/20
|
| In some cases, using all lowercase can help avoid this issue
| when capitalization isn't important.
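|
| For anyone who wants to try that workaround programmatically, a
| minimal sketch against the public /translate endpoint (the
| lowercasing step is the workaround being discussed; the
| trade-off is losing case information):
|
|     import requests
|
|     def translate_lowercased(text, source="en", target="de"):
|         # Lowercase the input to sidestep capitalization-
|         # sensitive tokenization, at the cost of case info.
|         resp = requests.post(
|             "https://libretranslate.com/translate",
|             json={"q": text.lower(), "source": source,
|                   "target": target},
|         )
|         resp.raise_for_status()
|         return resp.json()["translatedText"]
|
|     print(translate_lowercased("Merry Christmas"))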
| jarym wrote:
| Interesting: if I do English to French on the following:
|
| _Hello Sarah, what time is it?_
|
| it translates to
|
| _Bonjour Sarah, quelle heure est-il ?_
|
| Now if I change the input to
|
| _Hello Sara, what time is it?_
|
| it translates instead to:
|
| _Quelle heure est-il ?_
|
| Any idea why the one-character difference in the name affects
| the translation in this way?
| pjfin123 wrote:
| The process for translation is to "tokenize" a stream of
| characters into "tokens" like this:
|
| "Hello Sarah, what time is it?"
|
| <Hello><_Sarah><,><_what><_time><_is><_it><?>
|
| Then the tokens are used for inference with a Transformer
| network. There's no guarantee that the output will be stable
| under even small changes to the input. The network, based on
| the data it was trained on (and luck), has slightly different
| connections for the <_Sara> token than for the <_Sarah> token,
| leading to different output.
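|
| The <_...> markers above look like SentencePiece-style word-
| boundary symbols. For illustration, a minimal sketch with the
| sentencepiece Python library (the model file name is
| hypothetical, and the exact pieces depend on the trained
| vocabulary):
|
|     import sentencepiece as spm
|
|     # Load a trained SentencePiece model (hypothetical file).
|     sp = spm.SentencePieceProcessor()
|     sp.Load("en.model")
|
|     # "▁" marks a word boundary in the output pieces.
|     print(sp.EncodeAsPieces("Hello Sarah, what time is it?"))
|     # e.g. ['▁Hello', '▁Sarah', ',', '▁what', '▁time',
|     #       '▁is', '▁it', '?']
|
| A rarer spelling like "Sara" may even be split into several
| subword pieces, which is one more way a one-character change
| can push the model onto a different path.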
|
| Here's a video of some Linux YouTubers reviewing Argos
| Translate (https://github.com/argosopentech/argos-translate),
| the underlying translation library, and getting unexpected
| outputs: https://www.youtube.com/watch?v=geMs9dxl1N8
| tejtm wrote:
| Pigeonhole principle?
|
| More verb tenses in French than in English mean ambiguity
| about where things may come from and go to.
| yorwba wrote:
| Hmm... "Let's see how well it works." seems to be handled
| correctly when translating from English into any language except
| Chinese, where the apostrophe is turned into <unk> and the
| sentence is otherwise untranslated.
|
| Does that mean there's a different model for each language pair?
| pjfin123 wrote:
| There are different models for each language pair. Currently
| there are only pre-trained models to and from English, and
| other language pairs "pivot" through English.
|
| ex:
|
| es -> en -> fr
|
| Chinese is currently the weakest language pair, but I'm
| working on improving it:
| https://github.com/argosopentech/argos-translate/issues/17
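|
| A minimal sketch of what the pivot amounts to, written as a
| plain composition of two translation functions (the function
| arguments here are hypothetical, not Argos Translate's actual
| API):
|
|     def pivot_translate(text, translate_to_en,
|                         translate_from_en):
|         """Translate via English by chaining two models."""
|         english = translate_to_en(text)    # es -> en
|         return translate_from_en(english)  # en -> fr
|
| Each hop runs a separate model, so errors compound: anything
| the es -> en model drops is invisible to the en -> fr model.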
| yorwba wrote:
| Thanks for the explanation. Pivoting through English isn't
| ideal, but I'm just glad someone is working on this at all.
|
| Thinking about it a bit more, it's a bit weird that the
| failure mode of a weak model would be to regurgitate the
| input unchanged. I'd rather have expected random Chinese
| gibberish in that case. Doesn't that mean the model has seen
| at least a few cases where English sentences were left
| untranslated in the training data?
|
| I wanted to download the training data to check, but the
| instructions here
| https://github.com/argosopentech/onmt-models#download-data
| say to use OPUS-Wikipedia, which has no en-zh pairs, so the
| Chinese data must be from some other source.
| pjfin123 wrote:
| Pivoting through English isn't inherent to Argos Translate;
| you could train a French-German model or whatever you want.
| I've just been focusing on training models to add new
| languages. The ideal strategy is to have models that know
| multiple languages.
|
| Quoting a previous HN comment:
|
| I think cloud translation is still pretty valuable in a lot
| of cases, since the model for a single translation direction
| is ~100MB. In addition to offering more language options
| without a large download, cloud translation lets you use
| more specialized models, for example French to Spanish. I
| just have a model to and from English for each language, and
| any other translations have to "pivot" through English. For
| cloud translations you can also use one model with multiple
| input and output languages, which gives you better-quality
| translation between languages that don't have as much data
| available and lets you support direct translation between a
| large number of languages. Here's a talk where Google
| explains how they do this for Google Translate:
| https://youtu.be/nR74lBO5M3s?t=1682. You could do this
| locally, but it would have its own set of challenges around
| getting the right model for the languages you want to
| translate.
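|
| For what it's worth, the "one model, multiple languages" setup
| in the linked talk is typically implemented by prepending a
| target-language token to the source text (as in Google's
| zero-shot NMT work); a purely illustrative sketch, with a
| made-up token format:
|
|     def add_target_token(text, target_lang):
|         # "<2xx>" is illustrative; the real convention depends
|         # on how the multilingual model was trained.
|         return f"<2{target_lang}> {text}"
|
|     add_target_token("Hello", "fr")  # '<2fr> Hello'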
|
| > Thinking about it a bit more, it's a bit weird that the
| failure mode of a weak model would be to regurgitate the
| input unchanged. I'd rather have expected random Chinese
| gibberish in that case. Doesn't that mean the model has
| seen at least a few cases where English sentences were left
| untranslated in the training data?
|
| This was added last week; it's just not live on
| libretranslate.com yet:
|
| https://github.com/uav4geo/LibreTranslate/issues/33
|
| The training scripts are just an example for English-Spanish;
| OPUS (http://opus.nlpl.eu/) has data for English-Chinese.
| otagekki wrote:
| I tried translating a test sentence and got a rate-limit
| error...
|
| Would be glad to see this support Malagasy, though
| yorwba wrote:
| Data availability is going to be a problem, I think. Checking
| Malagasy Wikipedia https://mg.wikipedia.org , there are only
| 93k articles. That's even fewer than Latin Wikipedia
| https://la.wikipedia.org at 134k. And much of the text in these
| articles probably isn't a direct translation of an article in
| another language, so the amount usable for parallel-text mining
| is going to be very small.
| yamrzou wrote:
| Well done. The UI is nice and easy to use. The results looked
| good for the few sentences I tried (Arabic <--> English).
|
| May I know which datasets were used to train the models?
| pjfin123 wrote:
| OPUS parallel corpus: http://opus.nlpl.eu/
|
| It's really great; they have a large amount of data and
| organize it to make it easy to access.
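|
| OPUS distributes most corpora in a "Moses" plain-text format:
| two files with one aligned sentence per line. A minimal sketch
| of loading such a pair (file names hypothetical):
|
|     def load_parallel(src_path, tgt_path):
|         """Yield aligned (source, target) sentence pairs."""
|         with open(src_path, encoding="utf-8") as src, \
|              open(tgt_path, encoding="utf-8") as tgt:
|             for s, t in zip(src, tgt):
|                 yield s.strip(), t.strip()
|
|     pairs = list(load_parallel("corpus.en", "corpus.es"))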
| Mizza wrote:
| Very glad to see something like this; it was on my high-
| priority Free software needs list.
|
| I'd very much like it if it could be used programmatically,
| and not just via the API - C/Python/Rust bindings, etc. I'd
| like to build some automatically-translating forum software
| with it.
| rahimnathwani wrote:
| It's based on argos-translate, which has python bindings:
|
| https://github.com/argosopentech/argos-translate
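|
| Roughly the usage pattern from the README (this sketch assumes
| an English<->Spanish package is already installed; the API may
| have changed since):
|
|     from argostranslate import translate
|
|     # Lists whatever language packages are installed locally.
|     langs = translate.load_installed_languages()
|     print([str(lang) for lang in langs])
|     # e.g. ['English', 'Spanish']
|
|     en_to_es = langs[0].get_translation(langs[1])
|     print(en_to_es.translate("Hello World!"))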
| pjfin123 wrote:
| And a native PyQt desktop app.
| drusepth wrote:
| OT: I'd definitely be interested in seeing the rest of your
| high-priority Free software needs list!
___________________________________________________________________
(page generated 2021-02-06 23:00 UTC)