[HN Gopher] Common Voice
___________________________________________________________________
Common Voice
Author : oblib
Score : 174 points
Date : 2023-12-05 16:13 UTC (6 hours ago)
(HTM) web link (commonvoice.mozilla.org)
(TXT) w3m dump (commonvoice.mozilla.org)
| rwmj wrote:
| I wish they'd concentrate on the browser.
| dingnuts wrote:
| voice integration in a browser for control and feedback would
| be great if you were blind
| culi wrote:
| And text-to-speech. Which is already a standard:
| https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...
|
| The web is in a hilarious state where it's harder to style an
| option in a drop down than it is to generate speech from some
| text
| dragonwriter wrote:
| Better in the DE than an app, even the browser, unless it's
| like ChromeOS and the browser is the DE.
| joomooru wrote:
| Accessibility is an important part of the browser :)
| OfSanguineFire wrote:
| Mycroft users really wished that Mozilla had kept up efforts in
| this direction, because otherwise the only option for reliable
| speech-to-text is uploading every command you give your agent
| to Google or Baidu. The browser is important, and I don't
| support Mozilla's vacuous projects for social-justice cred, but
| there are a handful of areas where we need some non-profit to
| provide a privacy-respecting solution.
| rwmj wrote:
| That is indeed important, so I take it back (can't edit
| original post now).
| user_7832 wrote:
| Didn't mozilla also have a related speech to text software that
| got canned/moved to a different company? Or was that different?
| salynchnew wrote:
| DeepSpeech? https://github.com/mozilla/DeepSpeech
| posguy wrote:
| Mozilla didn't want to fund further development, most of the
| team ended up at Coqui.ai
| rasz wrote:
| Mozilla shut that project down the same day (Apr 12, 2021) as:
| "Mozilla is partnering with NVIDIA, which is investing $1.5
| million in Mozilla Common Voice,". Aka they got paid off by
| Nvidia to not compete.
| bitvoid wrote:
| This is an open dataset of voice samples to train models, so
| not really STT/TTS software.
| sxp wrote:
| FF's TTS is an important project for anyone who wants a trivial
| to use text-to-speech system. It's built into the browser so you
| can just run
|
|     wss = window.speechSynthesis;
|     for (let i = 0; i < wss.getVoices().length; ++i) {
|       str = `Voice ${i} is ${wss.getVoices()[i].name}`;
|       s = new SpeechSynthesisUtterance(str);
|       s.voice = wss.getVoices()[i];
|       wss.speak(s);
|       console.log(str);
|     }
|
| in the console to get various TTS examples. For some browsers,
| this can be done offline while others use a cloud based TTS
| system.
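Editor's sketch, not from the thread: in some browsers `getVoices()` can return an empty list until the `voiceschanged` event has fired, so a loop like the one above may print nothing on the first try. The version below waits for the list and then keeps only the voices the browser reports as on-device through the `localService` flag, which relates to the offline versus cloud point. The helper names `loadVoices` and `offlineVoices` are made up for illustration, and a browser that never fires `voiceschanged` would leave the promise pending.

```javascript
// Sketch only: loadVoices and offlineVoices are illustrative names, not part of any API.

// Resolve with the voice list, waiting for "voiceschanged" if the list is still empty.
function loadVoices(synth) {
  return new Promise((resolve) => {
    const voices = synth.getVoices();
    if (voices.length > 0) {
      resolve(voices);
      return;
    }
    synth.addEventListener(
      "voiceschanged",
      () => resolve(synth.getVoices()),
      { once: true }
    );
  });
}

// Keep only the voices the browser reports as synthesized on the device.
function offlineVoices(voices) {
  return voices.filter((voice) => voice.localService);
}

// Browser-only usage; skipped when there is no window (for example under Node).
if (typeof window !== "undefined" && window.speechSynthesis) {
  loadVoices(window.speechSynthesis).then((voices) => {
    for (const voice of offlineVoices(voices)) {
      console.log(`${voice.name} (${voice.lang})`);
    }
  });
}
```

Pasting this into the browser console should list the on-device voices, if the platform provides any.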
| j45 wrote:
| This is handy to know, thanks. I was just trying out Common
| Voice a few days ago.
|
| They have a good example of a community page for folks wanting
| to help with a particular language.
|
| I was just thinking today that Firefox is worthy of switching
| back to because it was so fast, except I hadn't had a chance to
| do it.
|
| If anyone else thinks it's important for there to be an
| independent browser dedicated to privacy and security (and
| independence), they could as many casual browser switchers. I'm
| happy to be back on a few FF extensions that didn't work quite
| the same on any chrome based browser.
| vlod wrote:
| This also works in Chrome (My version is: 119.0.6045.199)
|
| FF has 8611 voices, chrome has 19.
| joshstrange wrote:
| That's odd, my Chrome (119.0.6045.199) has 176 voices. Not
| all are English though.
| vlod wrote:
| Maybe it's because I'm on Linux? (Pop!_OS 22.04 LTS)
|
| Also I have 3 English only.
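Editor's sketch, not from the thread: one way to compare voice counts like the ones quoted above is to group the list by language. `countVoicesByLanguage` is a made-up helper name; it relies only on the `lang` property of each voice, which holds a language tag such as "en-US".

```javascript
// Sketch only: countVoicesByLanguage is an illustrative helper name.
function countVoicesByLanguage(voices) {
  const counts = new Map();
  for (const voice of voices) {
    // Keep the primary subtag ("en" from "en-US"); tolerate "_" as a separator too.
    const language = voice.lang.split(/[-_]/)[0].toLowerCase();
    counts.set(language, (counts.get(language) || 0) + 1);
  }
  return counts;
}

// Browser-only usage; skipped when there is no window (for example under Node).
if (typeof window !== "undefined" && window.speechSynthesis) {
  const voices = window.speechSynthesis.getVoices();
  console.log(`${voices.length} voices in total`);
  console.table(Object.fromEntries(countVoicesByLanguage(voices)));
}
```

Since the voice list depends on the operating system and what is installed there, the totals will differ from machine to machine.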
| rollcat wrote:
| On macOS, it's
|
|     say "enter text here"
|
| To pick a different voice:
|
|     say -v Fred "enter text here"
|
| To list voices:
|
|     say -v "?"
|
| (The quoting is necessary to prevent ZSH from interpreting the
| question mark as a glob.)
|
| I hear Firefox's TTS is important, yet prior to your comment I
| didn't even know it existed. This sort of stuff should be more
| discoverable, and have a more accessible (ahem) API.
| fzzzy wrote:
| It's part of the web apis, it's not just firefox. Chrome and
| Safari have supported it since 2013/2014.
| marcellus23 wrote:
| It looks like speechSynthesis is supported in all the major
| browsers, not just FF.
| https://developer.mozilla.org/en-US/docs/Web/API/Window/spee...
| dan-robertson wrote:
| Do you know if it's been extracted into a standalone library?
| The state of the open source TTS seems to not be great.
| Presumably the data for a voice is harder to put together than
| training a speech to text system like whisper.
| miki123211 wrote:
| The voices don't come from the browsers themselves, but from
| operating systems and their underlying TTS APIs, SAPI on
| Windows, Speech Dispatcher on Linux and AVSpeechSynthesizer
| on Apple Devices. If you install a third-party voice
| compatible with one of these, the browsers will pick that up.
| amelius wrote:
| Is there a handy demo website somewhere to access that?
| imjonse wrote:
| While this dataset is orders of magnitude smaller than what
| recent speech models like Whisper and Seamless got trained on,
| and while it is meant for supervised as opposed to self-
| supervised learning where data is more abundant, it can still be
| useful for finetuning an existing model for improving its score
| on a specific language.
| skrebbel wrote:
| I'm sad that this is English only. I'd love to contribute lots
| of voice for a Dutch TTS from a nonprofit org like Mozilla
| meepmorp wrote:
| They do collect other languages - there's a setting for it in
| the annotation section, and the dataset downloads let you
| choose other languages.
|
| e.g.: https://commonvoice.mozilla.org/nl/listen
| skrebbel wrote:
| Woops! Thanks :-)
| meepmorp wrote:
| Don't feel bad - it's not especially obvious. I only
| thought about it because I'm already familiar with the
| project.
| dabinat wrote:
| Although English is the most-contributed language, one of the
| goals of Common Voice is to support languages that wouldn't
| normally receive attention from commercial providers.
| yorwba wrote:
| The most-contributed language is Catalan with 3678 hours
| recorded vs. 3395 hours in English
| https://commonvoice.mozilla.org/en/languages (The language
| list sorts your browser's UI languages ahead of all others,
| which is why English may appear on top for you.)
| zerotolerance wrote:
| https://commonvoice.mozilla.org/en/about?tab=how-add-languag...
| dang wrote:
| Related. Others?
|
| _Mozilla Common Voice Adds 16 New Languages and 4,600 New Hours
| of Speech_ - https://news.ycombinator.com/item?id=28073016 - Aug
| 2021 (170 comments)
|
| _Firefox Voice_ - https://news.ycombinator.com/item?id=24096082
| - Aug 2020 (154 comments)
|
| _Firefox Voice: Browse the web with your voice_ -
| https://news.ycombinator.com/item?id=23902560 - July 2020 (2
| comments)
|
| _Mozilla Common Voice Dataset: More data, more languages_ -
| https://news.ycombinator.com/item?id=23695377 - June 2020 (41
| comments)
|
| _The Common Voice Project by Mozilla reached its first goal: 1k
| hours in englisch_ -
| https://news.ycombinator.com/item?id=23051756 - May 2020 (1
| comment)
|
| _Common Voice: A Massively-Multilingual Speech Corpus_ -
| https://news.ycombinator.com/item?id=21887693 - Dec 2019 (9
| comments)
|
| _Common Voice - Mozilla's initiative to help teach machines how
| real people speak_ -
| https://news.ycombinator.com/item?id=21268579 - Oct 2019 (49
| comments)
|
| _Mozilla releases the largest to-date public domain transcribed
| voice dataset_ - https://news.ycombinator.com/item?id=19270646 -
| Feb 2019 (61 comments)
|
| _Mozilla Overhauls Speech-To-Text Contribution Interface_ -
| https://news.ycombinator.com/item?id=17436958 - July 2018 (42
| comments)
|
| _Initial Release of Mozilla's Open Source Speech Recognition
| Model and Voice Data_ -
| https://news.ycombinator.com/item?id=15808124 - Nov 2017 (88
| comments)
|
| _Project Common Voice_ -
| https://news.ycombinator.com/item?id=14794654 - July 2017 (57
| comments)
|
| _Mozilla: Project Common Voice_ -
| https://news.ycombinator.com/item?id=14786881 - July 2017 (1
| comment)
| vidarh wrote:
| I submitted a request for Norwegian Bokmål, and realised a
| complication which I'm sure must affect other languages too:
|
| Norway has two separate official languages. They are unusually
| close - one is relatively close to Danish, and the other started
| as a collection of dialects, but technically they are written
| languages, _especially Bokmål_ which basically means "book
| language".
|
| I'm _unusual_ in that I speak close to "pure" Bokmål. Thanks to
| expectations at school etc., a lot of speakers who write Bokmål
| will adjust or tone down their dialect if asked to read a text
| that is written in grammatically and orthographically correct
| Bokmål, but will otherwise speak in a manner that can deviate
| fairly significantly from the written language.
|
| As such, depending on whether your goal is text to speech or
| speech recognition, the pronunciation you will need is very
| different.
|
| E.g. people I know who write Bokmål might _say_ something like
| "hva erredu ser på a?" ("what are you looking at?") with hardly
| any gaps between words, while I would stick close to the written
| "hva er det du ser på?" with clear gaps. In recognition you need
| to handle both (and many other variations), while for generation
| you'd at least by default usually want the latter unless there
| are indications the text is written in dialect.
|
| It strikes me you'd _really_ want people to write more detail
| about what it is they are speaking and/or let people tag/label
| data with additional info about accents. Not just for this, but
| for other multi-lingual speakers as well. E.g. it'd be helpful to
| have many foreign accents in the English (and other languages)
| dataset for recognition, but as much as I want speech recognition
| to understand me, I'm not particularly interested in teaching it
| to speak English with a strong Norwegian accent.
|
| That is _less_ of an issue than the dialects in some languages
| that can involve much more than just speaking the same words
| differently.
|
| To take another example "Jeg åpnet døren og gikk ut i solen" and
| "Jeg åpna døra og gikk ut i sola" are both valid Bokmål.
| Depending on _context_ a reader may stick strictly to the text or
| swap åpnet<->åpna, døren<->døra, solen<->sola, and _every
| permutation is valid_. Which exact set you use differs and some
| speakers will write one but use the other when speaking. E.g. I
| would _say_ åpna, døra, sola, but write åpnet, døren, solen. The
| latter is more formal and/or old-fashioned in some parts of the
| country, but the perception of that also varies by region. And
| this totally leaves out all the dialect variations used by people
| who'd say their language is Bokmål, and would be recognized as
| such by Norwegian speakers, but who use variants of words or
| conjugations that aren't technically recognized as valid Bokmål.
|
| The former is more "modern" (several of the forms are only valid
| Bokmål as a result of successive language reforms), more common
| in the Eastern part of Norway outside of the posher parts of Oslo
| and other wealthy regions, and (weirdly) more common in 1970's
| radical left-wing academics (especially people involved with the
| Maoist Workers Communist Party/AKP-ML) as an
| affectation/sociolect, with each of these groups also deviating
| in other aspects....
|
| If you want to maximize the utility of a dataset like this, you
| _really_ would want to let each speaker at least assign a lot of
| tags/labels to their profile; even if you don't want to deal
| with the hornet nest of trying to figure out all the
| distinctions, even unstructured labels would be a start, and
| ideally allowing people to tag individual recordings as well,
| because there are a _lot_ more variations than just "language"
| and "accent" here.
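Editor's sketch, not from the thread: the tagging idea above could look roughly like this. Speakers and individual recordings both carry free-form tags, and a consumer picks a subset depending on whether it is training recognition or generation. Every field name, tag, and language code here is hypothetical; none of it reflects the actual Common Voice schema.

```javascript
// Hypothetical data shapes: none of these field names come from the Common Voice dataset.
const speakers = {
  s1: { language: "nb", tags: ["native", "near-written-standard"] },
  s2: { language: "nb", tags: ["native", "dialect"] },
  s3: { language: "en", tags: ["non-native", "norwegian-accent"] },
};

const recordings = [
  { id: "r1", speaker: "s1", tags: [] },
  { id: "r2", speaker: "s2", tags: ["reading-voice"] },
  { id: "r3", speaker: "s2", tags: ["spontaneous"] },
  { id: "r4", speaker: "s3", tags: [] },
];

// A recording inherits its speaker's tags and adds its own.
function effectiveTags(recording) {
  return new Set([...speakers[recording.speaker].tags, ...recording.tags]);
}

// Select recordings in one language that carry all of mustHave and none of mustNot.
function selectRecordings(language, { mustHave = [], mustNot = [] } = {}) {
  return recordings.filter((recording) => {
    if (speakers[recording.speaker].language !== language) return false;
    const tags = effectiveTags(recording);
    return (
      mustHave.every((tag) => tags.has(tag)) &&
      !mustNot.some((tag) => tags.has(tag))
    );
  });
}

// Recognition wants every variation; generation might default to the near-standard subset.
const forRecognition = selectRecordings("nb");
const forGeneration = selectRecordings("nb", { mustHave: ["near-written-standard"] });

console.log(forRecognition.map((recording) => recording.id));
console.log(forGeneration.map((recording) => recording.id));
```

Keeping the tags unstructured sidesteps having to agree on a taxonomy up front, at the cost of inconsistent labels that someone would have to reconcile later.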
| indigo945 wrote:
| This is a great argument.
|
| I particularly agree with your point regarding English - my
| German accent sounds jarring to probably most native English
| speakers, but it should still be understood. To add to your
| argument, I have sometimes tried to turn on subtitles for
| Youtube videos in some accent of English that I haven't had
| much contact with (such as Nigerian English), but the auto-
| generated closed captions turned out to be even more useless
| than my own comprehension.
|
| However, one should keep in mind that Mozilla's main goal here
| is accessibility, with the implication that they mean
| accessibility for blind and deaf people in particular - as
| opposed to accessibility for stunted multilinguals like us. For
| these purposes, being able to transcribe mainly mainstream uses
| of the language is fine, and so is being able to generate
| speech in a hodge-podge averaged dialect. I highly doubt most
| blind people care about whether their TTS engine speaks The
| Queen's English or not, as long as it is clear and
| understandable.
| vidarh wrote:
| What is "clear and understandable" varies greatly, though.
| E.g. Nigerian English is often subtitled in the UK, but
| fairly often so is Scottish English... Both often to the
| great dismay of speakers of the two who sometimes are very
| annoyed at the expectation that people might not understand
| them.
|
| Nigerian English is actually fascinating in that there's a
| whole spectrum from Nigerian Pidgin, which ranges from nearly
| unintelligible to English speakers, to "mostly British
| English" in terms of orthography and grammar, but which still
| tends to incorporate words from several different Nigerian
| languages and pidgin. (e.g. abeg, don't give me any wahala;
| Please, don't give me any trouble)
|
| Now consider Nigeria is about to become the country with the
| second largest number of English speakers worldwide (it's
| close to tied with India, depending which sources and level
| of proficiency you consider, and Nigeria's population is
| growing far faster than India's). While it's still quite
| far behind the UK for people speaking it as their _first_
| language, current population growth and increasing use
| of English (e.g. my ex-wife's first language is English
| because her parents' first languages were Igbo and Yoruba, and
| that kind of situation is driving adoption) are likely to cause
| Nigeria to become the second largest on that measure as well.
|
| So handling a broader range of dialects will matter, at least
| in terms of recognition - I do agree that there's _more_
| flexibility for generation, though even there if you try to feed
| a broader Nigerian English pidgin to a TTS engine and it
| doesn't know what to do with the words it might well end up
| being unintelligible both to e.g. American or British English
| speakers and Nigerian English speakers.
| OfSanguineFire wrote:
| Are you autistic? I ask because this is HN where lots of people
| are, and choosing to speak the literary norm in countries with
| diglossia is often associated with autism. For example,
| foreigners in Finland are urged to quickly get to grips with
| _puhekieli_ (spoken Finnish) because speaking _kirjakieli_ (the
| literary norm) in everyday contexts, or writing it in chats, is
| "something only autistic people do".
| vidarh wrote:
| Not to my knowledge, though I may have some traits.
|
| That said, in Norway the literary form is/was spoken on e.g.
| TV and radio similar to how RP (received pronunciation)
| is/was spoken on the BBC, more so (in both cases) before than
| now where dialects are more broadly tolerated. On top of
| that, in affluent areas of Western Oslo and adjoining
| affluent areas the dialect sits mostly within what is
| "allowed" in Bokmål, and actually mostly towards a more
| conservative end of the allowed range than where I sit, and
| it's somewhat political, in that more conservative forms of
| Bokmål historically tended to be associated with social
| status (or aspirations...).
|
| It's unusual more in that the pockets and social groups where
| dialects overlap fully or almost entirely with Bokmål
| are fairly small.
|
| My spoken dialect is within that spectrum, exacerbated by
| reading _a lot_ of older literature at an early age that used
| quite old-fashioned forms of Bokmål, and picking up more
| formal language than many of my peers spoke through that, but
| I tend to be closer to the more affluent dialect in writing
| than spoken.
|
| (EDIT: My spoken dialect would probably fit as a somewhat
| "posh" version of Urban East Norwegian[1] today, with
| somewhat more conservative word choices in places where
| contemporary Urban East Norwegian would have deviated from
| Bokmål in minor ways in the 70's and 80's by being somewhat
| more "relaxed" in ways that have since been accepted in
| subsequent adjustments of the rules)
|
| If you heard me alongside my dad there'd be relatively minor
| differences between our dialects, and I'd probably sound
| marginally less formal as I adopted some spoken patterns from
| the more working class area I grew up in outside Oslo, while
| he at least when younger would be recognisable as having
| grown up on the Western edges of Oslo.
|
| Beyond that, language has always fascinated me, and I tended
| to take a certain level of delight in torturing my Norwegian
| teacher who favoured the other official language - Nynorsk.
| Nynorsk and Bokmål overlap very significantly, and more so
| after recent language reforms which have tended towards
| allowing more Nynorsk forms of words, or ones closer to them,
| in Bokmål. Our Norwegian teacher very much wanted us to use
| those forms (that'd be favouring "sola" over "solen" etc.),
| and I used to express my distaste for Nynorsk by instead
| exaggerating my preference for the more conservative Bokmål
| forms.
|
| [1] https://en.wikipedia.org/wiki/Urban_East_Norwegian
| CoBE10 wrote:
| I'd like to give a shout-out to Common Voice Android:
| https://github.com/Sav22999/common-voice-android
|
| It's a handy app for those interested in contributing to the
| project. You can record voices for the languages you speak and
| validate other user contributions. I used to be a frequent
| contributor about two years ago, and this app had a much more
| user-friendly design compared to the official website version.
|
| Additionally, check out the official Common Voice Matrix channel:
| https://chat.mozilla.org/#/room/#common-voice:mozilla.org
| jeena wrote:
| Why then is the text2speech in reader mode (which other than that
| is excellent) on a Linux Firefox so extremely bad? Much worse
| than Stephen Hawking's text2speech.
| spadufed wrote:
| Crowdsourced datasets like this and the ones produced by the
| OpenAssistant project could easily become the ONLY way to build
| foundational models if the courts decide that what OpenAI and co
| are doing is not Fair-Use. I don't think I would call this
| scenario unlikely, either.
| pimlottc wrote:
| With recent events in AI and deepfake technology, I would need to
| see some assurances before I agreed to "donate my voice" to
| something like this. It seems like the project is for voice
| recognition, not generation, but it's not immediately clear.
| thih9 wrote:
| What assurances would you like to see?
| moron4hire wrote:
| > Voice datasets also underrepresent: non-English speakers,
| people of colour, disabled people, women and LGBTQIA+ people.
|
| How does being gay change your voice?
| pseudalopex wrote:
| https://en.wikipedia.org/wiki/LGBT_linguistics#Accents_of_En...
| moron4hire wrote:
| I'm aware of the trope. I've yet to meet anyone that adheres
| to it, though. Always thought it was just one of those things
| that Hollywood overemphasizes to "other" gay people.
| pseudalopex wrote:
| > Always thought it was just one of those things that
| Hollywood overemphasizes to "other" gay people.
|
| Did you think they overemphasized it or did you think they
| made it up?
___________________________________________________________________
(page generated 2023-12-05 23:00 UTC)