[HN Gopher] Mozilla Common Voice Adds 16 New Languages and 4,600...
___________________________________________________________________
Mozilla Common Voice Adds 16 New Languages and 4,600 New Hours of
Speech
Author : heyhillary
Score : 554 points
Date : 2021-08-05 12:54 UTC (10 hours ago)
(HTM) web link (foundation.mozilla.org)
(TXT) w3m dump (foundation.mozilla.org)
| pkz wrote:
| Openly licensed speech data for smaller languages is great! I
| hope as many as possible contribute in order to get better
| representation across ages and pronunciation. In the end, this
| may be what is needed for the hyperscale companies to support
| speech assistants in more languages?
| dabinat wrote:
| Common Voice is a great project that I'm glad Mozilla kept alive.
|
| One problem is that data for speech recognition needs to be
| extremely accurate (i.e. the speech matches the transcript
| perfectly), but the human review process is fallible, and
| quite a number of bad clips have made it past review (to be
| fair, Mozilla provides no official guidance to reviewers or
| recorders).
|
| Plus in the early days, they were recording the same small
| sentence pool over and over again, so the first 700 hours or so
| are duplicates.
|
| I hope there will be efforts in the future to clean up the
| existing dataset to improve its quality.
| lunixbochs wrote:
| I'm an ASR researcher shipping high quality English models
| trained on limited resources, and while I've needed to include
| other datasets to make the model more robust to different kinds
| of text, Common Voice is a substantial part of my training
| process. I did not do any manual cleanup. Most of my automated
| cleanup was done with very basic (low quality) models. My
| latest models trained this way are competitive with e.g. Google
| or Apple English speech recognition accuracy.
|
| I'm going to disagree that there's ultimately a need for
| perfect training data in ASR. I'm sure it helps with some model
| types and training processes, but it simply hasn't been a
| factor in my use of Common Voice (English). I'll also note my
| best model can hit around 10% WER on Common Voice Test without
| any language model, which is better than any public numbers
| I've seen posted for it so far (I'm not even using a separate
| transformer decoder or RNN decoder layers for this number, just
| the raw output of CTC greedy decode).
|
| None of the above even factors in techniques like wav2vec and
| IPL (iterative pseudo labeling) with noisy student, which
| suggest you can hit extremely competitive accuracy with very
| little correctly labeled data. These techniques are the
| underpinnings of the current state of the art models.
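The "raw output of CTC greedy decode" the commenter mentions is simple enough to sketch: take the per-frame argmax label, collapse consecutive repeats, then drop the blank token. This is an illustrative sketch under assumed vocabulary and blank index, not the commenter's actual pipeline.

```python
BLANK = 0  # index of the CTC blank token (assumption)

def ctc_greedy_decode(frame_argmax, blank=BLANK):
    """frame_argmax: per-frame argmax label indices from an acoustic model."""
    out = []
    prev = None
    for label in frame_argmax:
        if label != prev:       # collapse runs of the same label
            if label != blank:  # drop blanks
                out.append(label)
        prev = label
    return out

# Toy vocabulary for illustration; frames h h _ e e _ l l _ l _ o
vocab = {0: "_", 1: "h", 2: "e", 3: "l", 4: "o"}
frames = [1, 1, 0, 2, 2, 0, 3, 3, 0, 3, 0, 4]
decoded = "".join(vocab[i] for i in ctc_greedy_decode(frames))  # "hello"
```

Note the blank token is what lets the decoder emit the doubled "l": without a blank separating the two runs of label 3, they would collapse into one.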
| ma2rten wrote:
| Why does data for speech recognition need to be perfect?
| That's certainly not the case for other machine learning
| applications. Can you train on the less clean data and
| fine-tune on a clean subset?
| dabinat wrote:
| Well that was kind of my point: you need to manually figure
| out what's clean and what isn't.
| stegrot wrote:
| Here are some draft guidelines for validation that have been
| translated a lot: https://discourse.mozilla.org/t/discussion-
| of-new-guidelines...
|
| But you are right, the process has some flaws. Maybe we can
| automatically check the dataset for some common errors once an
| STT system is ready for a language?
|
| The only other option I can think of is a validation process
| that involves more people per sentence. Right now, only two
| people validate a sentence, and if they disagree a third
| person decides. We could at least double-check sentences with
| one "no" vote one more time.
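The voting flow described above (two reviewers per clip, a third breaking ties) can be sketched as a small aggregation function. Function and status names here are illustrative, not actual Common Voice code.

```python
def clip_status(votes):
    """votes: list of booleans, True meaning 'yes, the clip matches the text'."""
    yes = sum(votes)
    no = len(votes) - yes
    if yes >= 2:
        return "valid"
    if no >= 2:
        return "invalid"
    # fewer than two votes cast, or a 1-1 split awaiting the tiebreaker
    return "needs_more_votes"

# A 1-1 split stays pending until a third reviewer decides:
clip_status([True, False])         # "needs_more_votes"
clip_status([True, False, False])  # "invalid"
```

The "double-check sentences with one 'no' vote" idea would amount to routing clips that reach "valid" with a dissenting vote back into the review queue instead of accepting them outright.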
| dabinat wrote:
| The community guidelines are good but they're hidden away on
| the forum. I was asking them for years to just make those the
| official guidelines and link them prominently on the CV site
| but they never did.
|
| However, Hillary, the new community manager, seems good and
| she's making a lot of positive changes so hopefully this will
| be addressed soon.
|
| Long-term the best approach may be some kind of user
| onboarding before they can record / validate.
| fareesh wrote:
| Is voice transcription accessible to mere mortals yet?
|
| I have tried pretty much every API offered by big tech, and also
| various open source models. All of them seem to have incredibly
| high word error rates. This is mostly for conversations with
| various Indian accents.
| nshm wrote:
| Did you try the Vosk Indian English model? It is built
| specifically for Indian-accented English:
|
| https://alphacephei.com/vosk/models/vosk-model-en-in-0.4.zip
|
| If you want more accuracy, you can share an example file and
| we can take a look at how to get the best accuracy.
|
| For Indian ASR it is also worth mentioning the recently
| introduced Vakyansh project, which builds models for major
| Indian languages:
|
| https://github.com/Open-Speech-EkStep/vakyansh-models
| Edman274 wrote:
| I'm guessing that of the 4,600 new hours of speech, maybe 4,100
| of those hours are of men's voices and 500 hours are of women's
| voices, yeah?
| LoriP wrote:
| To be fair, I'm not sure that's the best guess :) to me there
| seem to be more female voices than male ones. Anyhow, I'd
| wager it's at least a 50:50 mix.
| johnnyApplePRNG wrote:
| Just tried rating some of the English voices and I am conflicted.
|
| Most of them were definitely speaking English, but with an
| Indian intonation that I was barely able to understand, coming
| from an English-as-a-first-language country.
|
| Some of them were reading words syllable by syllable, which is
| definitely English, but I would hate to have to listen to an
| ebook or webpage read aloud to me in that manner.
|
| By clicking yes am I training the system to speak English with an
| Indian intonation?
|
| Should I click no, not English?
|
| Should/does English even have a "proper" intonation?
| jturpin wrote:
| Wow you're right. This is conflicting as many of the words are
| not pronounced properly at all. Maybe it doesn't matter to the
| accuracy of the speech-to-text system, but it feels like
| training it with bad data.
| ohgodplsno wrote:
| Different accents isn't bad data. Your vision of the world of
| "english is only spoken with an american accent" is what
| leads to horrendous speech recognition APIs, like Google's.
|
| If your ML model can't handle multiple accents, it is
| worthless.
| topspin wrote:
| "english is only spoken with an american accent"
|
| Which american accent?
| jturpin wrote:
| There's a difference between an accent and pronouncing
| words wrong. I would expect an English speech recognition
| system to handle the various accents there are in the world
| (the US has several accents of course), but it shouldn't
| handle incorrect pronunciation of syllables if it comes at
| the expense of recognizing clean data. If it doesn't come
| at its expense then I guess it's fine.
| ma2rten wrote:
| I think this dataset is mainly for speech recognition and not
| text to speech. Speech recognition should be able to recognize
| as many different accents as possible.
| marc_abonce wrote:
| From https://commonvoice.mozilla.org/en/criteria
|
| > Varying Pronunciations
|
| > Be cautious before rejecting a clip on the ground that the
| reader has mispronounced a word, has put the stress in the
| wrong place, or has apparently ignored a question mark. There
| are a wide variety of pronunciations in use around the world,
| some of which you may not have heard in your local community.
| Please provide a margin of appreciation for those who may speak
| differently from you.
|
| > On the other hand, if you think that the reader has probably
| never come across the word before, and is simply making an
| incorrect guess at the pronunciation, please reject. If you are
| unsure, use the skip button.
| magicalhippo wrote:
| Common Voice is not for generating speech, it's for detecting
| speech.
|
| So don't worry about weird intonation as long as they
| correctly pronounce the sentences; that way even more people
| can enjoy the fruits of this labor.
| fisxoj wrote:
| If anyone is interested in contributing, I've found this app
| for Android that makes it very easy:
| https://www.saveriomorelli.com/commonvoice/
| nmstoker wrote:
| Why on Earth would anyone use an app for this when mobile
| browsers work perfectly well for adding audio to Common Voice?
|
| We could possibly give the developer the benefit of the doubt
| that they're not doing anything inappropriate with the data,
| but frankly, why pass your data through a third party that's
| not part of the project?
|
| And why install an app requiring access to your shared local
| storage? The GitHub repo claims the website and its animations
| are slow, which sounds like BS to me. It works fine on the
| five-year-old phone I use for submitting.
|
| Just contribute here if you're so inclined, much more sensible:
|
| https://commonvoice.mozilla.org/en
| commoner wrote:
| The unofficial CV Project Android app is entirely open source
| and available on F-Droid:
|
| https://github.com/Sav22999/common-voice-android
|
| https://f-droid.org/packages/org.commonvoice.saverio/
| nmstoker wrote:
| Yes, I referenced the GitHub repo comments.
|
| Sure, you can get the source, but as I said, it's still a
| pointless step to go via a third party.
| totetsu wrote:
| Because Mozilla fired the whole CV team, and the app is under
| active development?
| nmstoker wrote:
| You aren't distinguishing the projects correctly. The CV
| project isn't the same as the DeepSpeech project (even
| though they were related).
|
| And your point makes little sense, because if the site was not
| working, how could the app get voice data into the project?
| I've had some involvement with these projects over the years,
| so I'm not just firing off armchair comments on this. They
| wouldn't have been able to add this new voice data if the site
| was as underdeveloped as you imply.
| stegrot wrote:
| The app has a few nice features the website doesn't have,
| such as changing the playback speed during validation. It
| surprises me as well, but many people hate using web apps on
| mobile. I don't really know why; they simply ask for an app
| and refuse to use a browser.
| alpb wrote:
| This may be off-topic but: What's the relationship between Coqui
| (an OSS TTS startup) https://coqui.ai/about and Mozilla? I recall
| that the project at one point was called mozilla/TTS
| (https://github.com/mozilla/TTS/) and now I see that has a fork
| in the startup's own repo (https://github.com/coqui-ai/TTS).
| Presumably Common Voice is used to train mozilla/TTS and other
| OSS TTS solutions?
| ftyers wrote:
| Common Voice is mostly used for STT not TTS. TTS requires
| single speaker, clean audio. STT requires multi speaker, noisy
| audio.
| arghwhat wrote:
| People seem to speak extremely mechanically in these samples,
| which I suspect may lead to training bias against natural
| speech if used.
|
| I think it should be explained that one should speak naturally
| when reading the lines.
| tsjq wrote:
| Nice!
|
| News from the past about this:
|
| Initial Release of Mozilla's Open Source Speech Recognition Model
| and Voice Data : https://news.ycombinator.com/item?id=15808124
|
| Mozilla releases the largest to-date public domain transcribed
| voice dataset https://news.ycombinator.com/item?id=19270646
| junon wrote:
| Historically not been the biggest fan of Mozilla but I really,
| really love this project. I'm glad they're keeping it alive.
| ftyers wrote:
| One of the most noticeable additions in my opinion is Guarani,
| the first Indigenous language of the Americas to be added.
| Indigenous languages are extremely poorly supported and
| forgotten by all of the major platforms and companies, and
| it's great to see one getting the attention it deserves.
| (Disclaimer: I was involved)
| runarberg wrote:
| As an Icelander I am always really impressed with how well my
| language--a language spoken by a few hundred thousand people
| worldwide--is supported on various platforms and technologies.
| This is probably in no small part thanks to active
| participation by native speakers and even some government
| funding.
|
| At the same time, however, I'm also deeply disappointed by the
| lack of support for Iceland's closest neighbour's language--
| Greenlandic--which is an indigenous language and the sole
| official language of an autonomous country.
| matsemann wrote:
| I saw the same when I was younger for Norwegian. Bokmal is
| the most commonly written form of Norwegian, but New
| Norwegian is used by about ~15%. Most software included
| Bokmal support, but you could bet some hardcore user of New
| Norwegian had made a language pack available as well.
| necovek wrote:
| Ah, I remember "Nynorsk" (sorry for the bad spelling and
| ASCIIation) localisation of GNOME from early 2000s!
|
| Generally, it takes only a few dedicated people to get
| software localised if good enough infrastructure is
| provided by the community!
|
| I hope that's what we see with Mozilla Common Voice too!
| Sharlin wrote:
| "Nynorsk" is correct, no non-ASCII shenanigans in that
| word :)
| neartheplain wrote:
| Whoah, 6.5 million native speakers! That's several orders of
| magnitude more than I was expecting. It's also significantly
| larger than the native-speaking populations of languages like
| Catalan, Basque, or Romansh, which might be more familiar to
| North Americans or Europeans.
| victorlf wrote:
| Catalan has about 10 million speakers.
| andrepd wrote:
| >It is one of the official languages of Paraguay (along with
| Spanish), where it is spoken by the majority of the
| population, and where half of the rural population is
| monolingual.
|
| Wow, I had no idea
| djoldman wrote:
| Or, 20x more than Icelandic:
|
| https://en.wikipedia.org/wiki/Icelandic_language
| hkt wrote:
| Without wishing to get political, is the difference that
| Iceland is a country but Guarani speakers don't have a
| nation-state of their own? Or something else?
| moron4hire wrote:
| Nation-states are political entities, so choosing
| languages by such a distinction would absolutely be
| political.
| air7 wrote:
| Like any feature, perhaps it has to do with the volume of
| anticipated use vs. the effort to support it.
| arp242 wrote:
| Note that Icelandic is currently not well supported
| either ("In progress" with 384/5000 sentences and 86%
| Localized). Actually, Guarani is better supported at the
| moment, and quite a number of other common smaller-ish
| languages aren't well supported yet either such as
| Hebrew, Danish, and even Korean (which is not small or
| even small-ish at all). Some other smaller languages are,
| such as Breton or Irish. Overall, it's a bit
| inconsistent. I suppose that this is because in the end,
| these things depend on the number of people contributing;
| there's a reason Esperanto is near the top, as it has a
| very active community of enthusiasts who love to promote
| the language.
| chudi wrote:
| It's an official language of Paraguay
| rudyfink wrote:
| In case anyone else wanted to know more, there are,
| apparently, 2 official languages and the other is
| Spanish.
| https://www.servat.unibe.ch/icl/pa00000_.html#A140_
| interactivecode wrote:
| The difference is completely and inherently political.
| caymanjim wrote:
| I think this is overly dismissive of other factors.
| Whether or not a language is supported by something on
| the Internet has a lot more to do with financial
| incentives than politics. If there were a huge consumer
| market clamoring to give their money to a site and the
| only barrier were language, it'd get exploited pretty
| quickly.
| runarberg wrote:
| No, it has a lot to do with politics as well. A sovereign
| nation may find it important to have its languages supported
| widely on the internet, so it might put some public funds into
| translation efforts and voice recognition / speech synthesizer
| contributions.
|
| I know the Icelandic government spends some money on this and
| it shows. This tiny language has way more support than other,
| far more widely spoken languages. If the Norwegian government
| wanted, I bet the Sami languages could have just as good
| support as Icelandic. Or if the Greenlandic government had
| more funds available, I bet we would see Kalaallisut in more
| places online.
| ftyers wrote:
| The Norwegian government and Sami parliament put a lot of
| effort into language technology for the Sami languages. A big
| problem is lack of openness in platform support. E.g. Google
| and Apple make it very difficult for external developers to do
| localisation.
| necovek wrote:
| What you are saying is that a small, relatively rich
| country can invest in supporting their own language:
| that, to me, is not political, but as raised previously,
| financial. It's also a good incentive for other big
| players (Google, Microsoft, Apple) to invest in a
| language that has prospective customers willing to spend
| more.
|
| Serbian government would certainly support Serbian
| language voice recognition and synthesis, but probably
| not with as much money as Iceland would.
| monocasa wrote:
| > Politics (from Greek: Politika, politika, 'affairs of
| the cities') is the set of activities that are associated
| with making decisions in groups, or other forms of power
| relations between individuals, such as the distribution
| of resources or status.
|
| It certainly sounds like a political situation to me, almost
| to a tautology. The fact that these decisions were made on the
| basis of financial gain doesn't make them any less political.
| eropple wrote:
| _> that, to me, is not political, but as raised
| previously, financial_
|
| The idea that there is a difference between these two
| things is one of the more pernicious ones of the last
| hundred years.
|
| Money is power. The exercise of power is politics. They
| can't be separated.
| singlow wrote:
| I'm sure having a nation-state is a major factor, but I
| bet it also has to do with the average wealth, geographic
| location, historical alliances. However, I'd put my money
| on skin color as the biggest factor.
| runarberg wrote:
| As an example in favor of your conclusion, I propose
| Greenlandic. It is geographically really close to Iceland, is
| the sole official language of an autonomous country, and has
| significant cultural heritage (with even a famous [possible]
| dwarf planet named after one of its historic gods). However--
| unlike Iceland--Greenland is not a wealthy country, and
| Greenlanders tend to have darker skin color than Icelanders.
| puchatek wrote:
| Autonomous territory, not a country.
| kspacewalk2 wrote:
| There are a number of Native American languages that have
| numerous speakers, but until recently have been marginalized,
| repressed and ignored (and some to this day). Guarani is the
| most numerous, but also Quechua, Nahuatl, and the various
| Mayan languages (spoken by around half of Guatemalans, and
| another 2.5 million Mexicans).
| olejorgenb wrote:
| I find the recording UI a bit annoying. They make it
| unnecessarily hard to re-record a clip. Re-recording the
| previous clip is likely to be a common thing to do. Instead of
| providing a shortcut for this, they have shortcuts for
| re-recording each of the five individual clips.
|
| It's also impossible (?) to undo a clip. Eg.: If I've already
| recorded 3 clips and mistakenly begin a clip I simply can't
| pronounce correctly, there's no way of removing that clip without
| discarding the whole set. (EDIT: it is possible by re-recording
| that clip and pressing skip)
| Vinnl wrote:
| Re-recording a clip is very rare for me. Keep in mind that it's
| supposed to emulate real-world conditions, with all its
| messiness.
| jalopy wrote:
| Going along with this: What are the latest and greatest open
| source speech-to-text models and/or tools out there?
|
| Would love to hear from experienced practitioners and a bit of
| detail on the experience.
|
| Thanks HN community!
| thom wrote:
| Same question for text-to-speech!
| orra wrote:
| Mozilla announced Deep Speech[1] around the same time as Common
| Voice.
|
| Mozilla Deep Speech is an open source speech recognition
| engine, based upon Baidu's Deep Speech research paper[2].
|
| Unsurprisingly, Deep Speech requires a corpus such as... Common
| Voice.
|
| [1] https://github.com/mozilla/DeepSpeech
|
| [2] https://arxiv.org/abs/1412.5567
| rasz wrote:
| They killed this after the Nvidia grant.
| orra wrote:
| Ah, damn. Didn't realise.
|
| It also looks like Baidu are now developing their Deep
| Speech as open source?
| https://github.com/PaddlePaddle/DeepSpeech
| mazoza wrote:
| https://github.com/coqui-ai/STT
| zerop wrote:
| Vosk is my favourite. I have used deep speech too. Vosk works
| better.
| nshm wrote:
| Thank you. I deeply appreciate you mentioning our efforts. We
| spent quite some time and knowledge building accurate speech
| recognition. It's not that easy to get as many mentions as
| Mozilla does, so we are thankful for every single one!
| kcorbitt wrote:
| I've had good results with
| https://github.com/flashlight/flashlight/blob/master/flashli...
| It seems to work well with spoken English in a variety of
| accents. The biggest limitation is that the architecture they
| have pretrained models for doesn't really work well with clips
| longer than ~15 seconds, so you have to segment your input
| files.
| blackcat201 wrote:
| I created edgedict [0] a year ago as part of my side projects.
| At the time, it was the only open source STT with streaming
| capabilities. If anyone is interested, the pretrained weights
| for English and Chinese are available.
|
| [0] https://github.com/theblackcat102/edgedict
| woodson wrote:
| NVidia NeMo: https://github.com/NVIDIA/NeMo
| jononor wrote:
| Have used VOSK a bit recently. The out-of-the-box experience
| was great compared to earlier projects (looking at you, Kaldi
| and Sphinx...). Word-level audio segmentation was one use
| case: https://stackoverflow.com/a/65370463/1967571
| woodson wrote:
| Vosk is built on Kaldi.
| stegrot wrote:
| Kdenlive supports automatic subtitles created with VOSK now
| btw. This makes it a lot more accessible for non-tech folks.
| [deleted]
| rasz wrote:
| What's the point when they killed DeepSpeech in exchange for
| adopting a closed Nvidia thing?
|
| https://venturebeat.com/2021/04/12/mozilla-winds-down-deepsp...
|
| https://blog.mozilla.org/en/mozilla/mozilla-partners-with-nv...
|
| $1.5 million for shutting down an open source initiative --
| almost half of the CEO's salary right there.
| jononor wrote:
| What closed NVidia thing did they adopt? I don't see any
| evidence of that here.
| option wrote:
| https://github.com/NVIDIA/NeMo which is open source, Pytorch
| based and regularly publishes new models and checkpoints.
| Seirdy wrote:
| The source code is under a FLOSS license, but it only works
| on Nvidia GPUs and uses proprietary Nvidia-specific
| technologies like CUDA.
|
| It's significantly closer to "nonfree" on the free-nonfree
| spectrum than it should be, and is another example of the
| difference between the guiding philosophies behind "free
| software" and "open source".
| yorwba wrote:
| Can't you run it on CPU? And looking at the code, it
| seems like they're using Numba to JIT their CUDA kernels,
| so I guess someone could come along and provide a
| compatibility shim to make the kernels run on a non-CUDA
| accelerator?
| rasz wrote:
| Im sure they signed on adopting "something", otherwise it
| would be receiving $1.5 million grant for closing open source
| initiative. $3 million a year lawyer wouldn never be this
| blatant.
| stegrot wrote:
| DeepSpeech is still alive in a way; the team founded the
| company coqui.ai after the Mozilla layoffs, and they keep
| everything open source.
| mazoza wrote:
| I know the old speech team continues as Coqui
| https://github.com/coqui-ai/
| tmalsburg2 wrote:
| About their TTS system: "These models provide speech synthesis
| with ~0.12 real-time factor on a GPU and ~1.02 on a CPU." The
| quality of the samples is really impressive, but, wow, isn't
| this computationally too expensive for many applications?
| jononor wrote:
| Open-source speech recognition is doing pretty well with
| projects such as VOSK, Athena, ESPNet and SpeechBrain. These
| days models are the easy part of ML, and data is the hard
| part. So for Mozilla to focus on Common Voice over DeepSpeech
| seems reasonable.
| tkinom wrote:
| Could one use YouTube as training data?
|
| Especially the videos with closed captions....
|
| As simple as extracting the audio and the CC text?
| soapdog wrote:
| You can't really do it because of licensing reasons. One
| cool thing Common Voice brings to the table, besides all
| the fantastic data, is the licensing.
| anonymfus wrote:
| YouTube still allows uploaders to mark their videos as CC
| BY 3.0 licensed, and it's still possible to check that
| via YouTube's API.
|
| (See https://support.google.com/youtube/answer/2797468
| and the part about status.license here:
| https://developers.google.com/youtube/v3/docs/videos)
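The license check described above boils down to filtering video metadata on `status.license`, which the YouTube Data API v3 docs list as `"youtube"` or `"creativeCommon"`. The dicts below are mock API responses for illustration, not live API calls; `cc_licensed` is a hypothetical helper name.

```python
def cc_licensed(videos):
    """Keep the IDs of videos whose uploader marked them CC BY 3.0."""
    return [v["id"] for v in videos
            if v.get("status", {}).get("license") == "creativeCommon"]

# Mock metadata shaped like the API's videos.list response items:
videos = [
    {"id": "abc", "status": {"license": "creativeCommon"}},
    {"id": "def", "status": {"license": "youtube"}},  # standard license
]
cc_licensed(videos)  # ["abc"]
```

Only the CC BY uploads would be candidates for a redistributable corpus; everything under the standard YouTube license would still raise the licensing concerns discussed above.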
| NavinF wrote:
| This is incorrect. Pretty much every state of the art
| model uses copyrighted data. This is considered fair use
| and it has never been a problem outside of concern
| trolling.
| ma2rten wrote:
| Are you sure it's not fair use? I believe most legal
| experts agree that language models such as GPT-3 are not
| violating copyright due to fair use.
| amelius wrote:
| Source?
| hkt wrote:
| Having an open corpus means that researchers building the next
| thing in voice research - which may or may not follow
| DeepSpeech - have something to work with. This is enormously
| important and their change of direction lets a thousand flowers
| bloom. Meanwhile, their partnership with Nvidia provides a
| fertile ground to prove the value of the open corpus in action.
| Nvidia get access to Mozilla's (presumably superior) ability to
| build said corpus, while Mozilla lay the foundations for others
| to contribute work in the open. It is a great example of
| comparative advantage, and a win win choice, IMO.
| rasz wrote:
| So in other words we provide data for free to Mozilla, and
| Mozilla turns around and sells it for millions to Nvidia to
| fund ... not open source, they killed that so umm ee, to fund
| ceo salary?
| nmstoker wrote:
| You seem to imply that Nvidia are paying for data that is
| freely available.
|
| Anyone can use the Common Voice data within the terms of
| the license and NVIDIA contributing towards the continued
| gathering of data (that will continue to be made publicly
| available) won't change that.
|
| It's a huge shame that Mozilla didn't continue the
| DeepSpeech project but Coqui is taking on the mantle there
| and there are plenty of others working on open source
| solutions too, all whilst the existence of CV will make a
| big difference to research, in the academic, commercial and
| open source spheres.
| robbedpeter wrote:
| Coqui is phenomenally good and well done, so this new
| data should lower the barrier to entry for the
| represented languages.
| danShumway wrote:
| > and sells it
|
| If that was true that would be a profoundly bad purchase
| for NVidia since the data is already freely licensed and
| available for anyone to use at no cost.
|
| This is like saying that Epic "bought" Blender when they
| gave it a development grant, or that Google contributing
| patches to upstream Linux means they own it now. Mozilla
| didn't give NVidia any kind of special license, when NVidia
| contributes data to Common Voice they're doing so under
| _Common Voice's_ license, not their own.
|
| We want to encourage more companies to treat software and
| training data as a public commons that is collectively
| maintained, this is a good thing.
| rasz wrote:
| Its the kind of "bad" Nvidia purchase like when they pay
| game publishers for incorporation of
| physx/cuda/hairworks/gameworks resulting in
|
| https://techreport.com/news/14707/ubisoft-comments-on-
| assass...
|
| https://techreport.com/review/21404/crysis-2-tessellation
| -to...
|
| https://arstechnica.com/gaming/2015/05/amd-says-nvidias-
| game...
|
| Here it appears they purchased this
| https://venturebeat.com/2021/04/12/mozilla-winds-down-
| deepsp...
| moralestapia wrote:
| Lol, these guys sell themselves for peanuts.
| say_it_as_it_is wrote:
| "The top five languages by total hours are English (2,630 hours),
| Kinyarwanda (2,260) , German (1,040), Catalan (920), and
| Esperanto (840)."
|
| How did they get almost as much training data for Kinyarwanda
| as they have for English?
| stegrot wrote:
| The German Federal Ministry for Economic Cooperation and
| Development supported this language:
| https://www.bmz.de/de/aktuelles/intelligente-sprachtechnolog...
| say_it_as_it_is wrote:
| Interesting! There's a market for this kind of audio data
| entry? What was the total cost for that many hours? The
| English data was entirely volunteer driven, correct? Maybe
| it's worth funding the English corpus for the additional
| hours needed to reach the sweet spot?
| russian_nukes wrote:
| What is this voice database? Do they have Russian voices?
| bravura wrote:
| Is anyone aware of classification (e.g. word prediction) datasets
| for low-resource and endangered languages?
|
| If so, we would like to use it for the HEAR NeurIPS competition:
| https://github.com/microsoft/DNS-Challenge/tree/master/datas...
|
| The challenge is restricted only to classification tasks, and
| sequence modeling like full ASR is unfortunately beyond the scope
| of the competition.
| danShumway wrote:
| I don't really have anything of substance to add here, but I'm
| very happy to see Mozilla continuing to put effort into this,
| happy to see effort being put into broadening the support beyond
| just English and major languages, and I'm grateful for the work
| that people (inside and outside of Mozilla) have already put into
| getting the project this far.
| mgarciaisaia wrote:
| You arguably have something of substance to add -- you can
| help improve the datasets by speaking or validating phrases on
| the project's website:
|
| https://commonvoice.mozilla.org/
|
| There are many languages available to pick from.
| orra wrote:
| Indeed, it's great to see open data corpuses expand.
| _gtly wrote:
| A direct link to where you can donate your voice here:
| https://commonvoice.mozilla.org/en
| donhaker wrote:
| Let's take a moment to appreciate Mozilla's effort. By adding
| new languages, including some from minority communities, they
| show they are continuously putting effort into the community.
| Jnr wrote:
| The great open source community around Mozilla helps a lot.
|
| When I did not see my own language in the list a year ago, and
| I had no clue how to get it there, I reached out to my
| university contacts that I know used to translate Firefox years
| ago.
|
| With their help we quickly translated the whole common voice
| site (it was a prerequisite to start contributing a language)
| and provided first sets of text to start contributing.
|
| In about a week we started contributing voice for a new
| language. The Common Voice project is awesome and very well
| made.
| satya71 wrote:
| > The top five languages by total hours are English (2,630
| hours), Kinyarwanda (2,260), German (1,040), Catalan (920), and
| Esperanto (840)
|
| Some unusual suspects among the top languages, there!
| ftyers wrote:
| That's what happens when people have the opportunity and tools
| to support their own languages and don't have to rely on
| handouts from big tech :)
| umeshunni wrote:
| Ah yes, major world languages with 10s or 100s of millions of
| speakers (Bengali, Korean, Malayalam) are ignored or are
| perpetually stuck "in progress" while hobby languages like
| Esperanto are supported.
| stegrot wrote:
| Hey, I work on the Esperanto version of CV. You are right,
| many languages should be bigger than Esperanto, and we
| never planned to become this big; it just happened. We are
| around ten active people and a Telegram group with a few
| hundred motivated donors. Plus, we write about the project
| in Esperanto magazines and talk about it at Esperanto
| congresses.
|
| The point is: the only reason Bengali, Korean, and Malayalam
| are stuck "in progress" is that no one is working on them.
| No language but English is actively supported by Mozilla;
| it all comes from the communities. And the success of
| Esperanto shows that any language can make it. I hope
| that people take our work as motivation. Every language
| can become big if a few motivated people work on it for a
| year or two. Even the smallest language can make it. You
| just need a lot of public-domain sentences, a few thousand
| donors, and some technical knowledge, and then your language
| will grow as well :)
| umeshunni wrote:
| Sure, I was responding to the facetious comment above.
|
| When I can use Google or Facebook in any of these
| languages, and have been able to for 10+ years, it's silly
| of this project to claim some high moral ground when it
| can't support some of the most widely spoken languages in
| the world and sticks to languages that hipsters in San
| Francisco think are cool.
| yorwba wrote:
| It _can_ support those languages; they just need some
| people who actually speak them to come along and make it
| happen. If you can help, I'm sure it will be
| appreciated.
| Anon1096 wrote:
| Esperanto is a hobby language for upper-middle class people
| in developed countries. It isn't anyone's "own language".
| ndkwj wrote:
| Is "upper-middle class in developed countries" meant as an
| insult?
| bradrn wrote:
| Well, it has native speakers:
| https://en.wikipedia.org/wiki/Native_Esperanto_speakers
| crvdgc wrote:
| > Esperanto is a hobby language for upper-middle class
| people in developed countries.
|
| I wonder what gave you such an impression of Esperanto. My
| personal experience of Esperanto is quite different.
|
| I started casually self-learning Esperanto about one year
| ago as my second foreign language after English. After
| about half a year, I was confident enough to join online
| Esperanto communities, and they gave me a surprisingly
| diverse experience, more so than any community I had
| encountered on the Internet.
|
| For example, in an online chat group, active users mainly
| come from the US, South America, and Russia. As a person
| from East Asia, I would otherwise have little chance to get
| in touch with the latter two groups. And there are often
| new users from South America who speak only Spanish and
| Esperanto.
|
| I myself do not identify as an upper-middle-class person,
| and I don't know enough to assess other Esperanto speakers'
| class status.
|
| The impression of Esperanto speakers being upper-middle
| class may come from the fact that people learn Esperanto as
| a hobby. But people outside the upper-middle class have
| other hobbies too, so why is Esperanto different? It doesn't
| come with the many benefits that people may expect from
| learning a "practical" language, but it also takes
| significantly less effort. I'd say it's about as hard as
| learning a new instrument. So it is not exclusive to upper-
| middle-class people.
|
| After one year of casual learning, I am now able to
| contribute to the Common Voice project in Esperanto (175
| recordings and 123 validations) and I actually use it as a
| source of learning material.
| krrrh wrote:
| Technically there are a few hundred L1 speakers of
| Esperanto, but that doesn't really contradict your point.
|
| https://cogsci.ucsd.edu/~bkbergen/papers/NEJCL.pdf
| stegrot wrote:
| You are not wrong, but besides the upper-middle-class hobby
| people, there is also a 130-year-old culture that exists in
| parallel. I've met a few native Esperanto speakers,
| and for them Esperanto is their identity. Traditional
| Esperanto clubs exist in countries like Iran, Japan,
| China, Burundi, Nigeria, and many more. So Esperanto is
| both a nerdy hobby and an old culture.
| hkt wrote:
| Weirdly judgemental.
|
| Esperanto was designed to be easy to learn. It isn't an
| elite pursuit in the way you suggest, because its community
| isn't gatekept. I personally have met people of all social
| classes who have been interested in it.
|
| It was also never meant to be a first language; it is an
| auxiliary language. It is possible for an English speaker
| to have a conversation with a Mandarin speaker with no
| intermediary if both know the (comparatively easy-to-learn)
| Esperanto. Its original purpose wasn't trivial either: it
| was created to stop groups without a common language in the
| same city (Warsaw, I think?) from fighting, on the theory
| that they'd stop if only they could speak a common
| language.
|
| Think of it as JVM bytecode for people.
| least wrote:
| Auxiliary languages are inherently doomed to fail at
| their intended function, because for them to succeed,
| governments with sufficient influence would need to
| commit to adopting them multilaterally. If today the
| United States and China bilaterally decided to force
| Esperanto into their school curricula, it'd likely be
| adopted very quickly by everyone else, but that isn't
| the case and I doubt it ever would be under almost any
| circumstance, because learning English is just immediately
| more practical, even if it's a significantly more difficult
| language to pick up.
|
| And that's how it's played out. Nearly every developed
| nation teaches English as a second language or is a
| native population of English speakers. The universal
| language is English. The JVM bytecode for people is
| English.
| voidnullnil wrote:
| > The JVM bytecode for people is English.
|
| What are you telling me? That I need to drop English?
| jl6 wrote:
| My takeaway is that nobody should speak English, but
| instead people should compose their sentences in a
| different language and then translate them to English at
| the point of speaking (with small pauses in the
| conversation for you to collect your thoughts on this
| garbage).
| hkt wrote:
| Spoken like an anglophone. Tell that to Latin America and
| East Asia...
| least wrote:
| I don't have to; you can look at pretty much any of their
| language curricula and find a huge presence of English in
| nearly all their education systems.
|
| Certainly you will find people learning other languages
| for trade depending on the region, but even in East Asia,
| as you say, English is taught in China, Japan, and Korea.
| In Singapore English is the language everyone learns (and
| is taught in). In Vietnam the primary foreign language
| taught is English. In the Philippines one of its official
| languages is English. Argentina teaches English in
| elementary school. In Brazil students from grade 6 have
| to learn a language, which is usually English. In
| Venezuela English is taught from age 5.
|
| So what exactly do I have to tell them?
| yongjik wrote:
| Not sure about Latin America, but bring someone from each
| of China/Japan/Korea and they'll talk to each other in
| English.
| samtheDamned wrote:
| They weren't exclusively talking about Esperanto. I read it
| as a reference to Kinyarwanda and Catalan more than
| anything else. In the bigger scheme of things there are a
| lot of languages here that are definitely a product of
| being able to share your own language. There are multiple
| native languages being shared here, like the thread above
| about Guarani.
| 1-6 wrote:
| You have a point there. I've been disappointed that Korean
| has been stuck in the 'In Progress' state. The Korean tech
| giants already have APIs for common speech recognition. I
| hope more Korean grassroots efforts focus on open and
| accessible tools so they can be built to scale and improve.
| yorwba wrote:
| It looks like Korean still needs a fully localized
| interface and a sufficiently large collection of sentences
| to record. You can help by translating the interface
| https://pontoon.mozilla.org/projects/common-voice/ and
| collecting public-domain sentences
| https://commonvoice.mozilla.org/sentence-collector/ and of
| course by getting Koreans you know excited about the
| project so they'll help, too.
| fleaaaa wrote:
| Thank you for pointing it out. I had no idea, but I'd be
| happy to contribute to this one. There is indeed a decent
| Korean natural-language-processing engine, but it's tightly
| tied to its own ecosystem AFAIK.
|
| https://papago.naver.com/
| yorwba wrote:
| The project seems to have some serious government backing in
| Rwanda: https://digitalumuganda.com/
| nyx-aiur wrote:
| I love the datasets, but they are still way too small,
| especially for exotic languages.
| [deleted]
| LoriP wrote:
| Tips & Tricks incoming... I find that if I can't sleep and want
| something that's kind of useful to do without getting too
| involved, contributing to Common Voice is a great way to spend
| half an hour and relax/forget whatever it is I was churning
| about. I would recommend it for that, plus it's a great project.
| Both listening and voicing...
___________________________________________________________________
(page generated 2021-08-05 23:00 UTC)