[HN Gopher] Voice2json: Offline speech and intent recognition on...
___________________________________________________________________
Voice2json: Offline speech and intent recognition on Linux
Author : easrng
Score : 326 points
Date : 2021-05-21 16:14 UTC (6 hours ago)
(HTM) web link (voice2json.org)
(TXT) w3m dump (voice2json.org)
| marcodiego wrote:
| Good FLOSS speech recognition and TTS is badly needed. Such
| interaction should not be left to an oligopoly with a bad history
| of not respecting users' freedoms and privacy.
| GekkePrutser wrote:
| Indeed, and it doesn't have to be as "machine learning" as the
| big ones.
|
| A FLOSS system would only have my voice to recognise, and I
| would be willing to spend some time training it. That's a very
| different use case from a massive cloud that has to recognise
| everyone's voice and accent.
| sodality2 wrote:
| Mozilla CommonVoice is definitely trying. I always do a few
| validations and a few clips if I have a few minutes to spare,
| and I recommend everyone does. They need volunteers to validate
| and upload speech clips to create a dataset.
|
| https://commonvoice.mozilla.org/en
| jfarina wrote:
| I wonder if they use movies and TV: recordings where the
| script is already available.
| kelnos wrote:
| I expect that wouldn't be perfect, though. Sometimes the
| cut that makes it into the final product doesn't exactly
| match the script. Sometimes it's due to an edit; other
| times an actor says something similar to, but not exactly,
| what the script says and the director decides to just go
| with it.
|
| What might work better is using closed captions or
| subtitles, but I've also seen enough cases where those
| don't exactly match the actual speech either.
| taneq wrote:
| They might work even better for interpreting the intent
| of spoken text. Not great for dictation though.
| habibur wrote:
| He meant subtitles when he said script.
| wongarsu wrote:
| That's fine for training your own model, but I don't think
| you could distribute the training set. That seems like a
| clear copyright violation, against one of the groups that
| cares most about copyright.
|
| Maybe you could convince a couple of indie creators or
| state-run programs to licence their audio? But I'm not sure
| if negotiating that is more efficient than just recording a
| bit more audio, or promoting the project to get more
| volunteers.
| dec0dedab0de wrote:
| _That's fine for training your own model, but I don't
| think you could distribute the training set. That seems
| like a clear copyright violation, against one of the
| groups that cares most about copyright._
|
| I'm not sure that is a clear copyright violation. Sure,
| at a glance it seems like a derivative work, but it may
| be altered enough that it is not. I believe that
| collages and reference guides like Cliff Notes are both
| legal.
|
| I think a bigger problem would be that the scripts, and
| even the closed captioning, rarely match the recorded
| audio 100%.
| Wowfunhappy wrote:
| And also... it's not like the program actually contains a
| copy of the training data, right? The training data is a
| tool which is used to build a model.
| taneq wrote:
| How is it different from things like GPT3 which (unless
| I'm mistaken) is trained on a giant web scrape? I thought
| they didn't release the model out of concerns for what
| people would do with a general prose generator rather
| than any copyright concerns?
| sodality2 wrote:
| Does using copyrighted works to train a machine learning
| model make that model infringing?
| marcodiego wrote:
| GP is not talking about the model but about the training
| data set.
| sodality2 wrote:
| I am aware; I'm asking whether the model itself is
| infringing. Surely you can't distribute the works in a
| dataset, but is training on copyrighted data legal, and
| can you distribute the resulting model?
| _jal wrote:
| All text written by a human in the US is automatically
| copyrighted to the author. So if an engine trained on
| works under copyright is a derivative work, GPT3 and
| friends have serious problems.
| wongarsu wrote:
| Generally an ML model transforms the copyrighted material
| to the point where it isn't recognizable, so it should be
| treated as its own unrelated work that isn't infringing
| or derivative. But then you have e.g. GPT that is
| reproducing some (largeish) parts of the training set
| word-for-word, which might be infringing.
|
| Also I don't think there have been any major court cases
| about this, so there's no clear precedent in either
| direction.
| visarga wrote:
| > But then you have e.g. GPT that is reproducing some
| (largeish) parts of the training set word-for-word, which
| might be infringing.
|
| Easy fix: keep a Bloom filter of hashed n-grams, ensuring
| you don't repeat more than N words from the training set.
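|
| A minimal sketch of the idea (the BloomFilter below is a toy,
| and training_corpus is assumed to be an iterable of token
| lists; a real system would tune sizes and hash counts):
|
|     import hashlib
|
|     class BloomFilter:
|         # Toy Bloom filter over byte strings.
|         def __init__(self, size_bits=1 << 24, num_hashes=4):
|             self.size, self.k = size_bits, num_hashes
|             self.bits = bytearray(size_bits // 8)
|
|         def _positions(self, item):
|             for i in range(self.k):
|                 h = hashlib.blake2b(item, digest_size=8,
|                                     salt=bytes([i])).digest()
|                 yield int.from_bytes(h, "big") % self.size
|
|         def add(self, item):
|             for p in self._positions(item):
|                 self.bits[p // 8] |= 1 << (p % 8)
|
|         def __contains__(self, item):
|             return all(self.bits[p // 8] & (1 << (p % 8))
|                        for p in self._positions(item))
|
|     N = 8  # never emit a verbatim 8-gram from training
|
|     def key(tokens):
|         return " ".join(tokens).encode()
|
|     seen = BloomFilter()
|     for doc in training_corpus:  # index all training N-grams
|         for i in range(len(doc) - N + 1):
|             seen.add(key(doc[i:i + N]))
|
|     # at decode time, veto a candidate token if it would
|     # complete an N-gram seen during training
|     def allowed(context, candidate):
|         window = context[-(N - 1):] + [candidate]
|         return len(window) < N or key(window) not in seen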
| sodality2 wrote:
| Thanks!
| akiselev wrote:
| It would likely be a lot easier for someone from within
| the BBC, CBC, PBS, or another public broadcaster to
| convince their employer to contribute to the models.
| These organizations often have accessibility mandates
| with real teeth, and real costs in meeting them.
| The work of closed captioning, for example, can
| realistically be improved by excellent open source speech
| recognition and TTS models without handing all of the
| power over to YouTube and the like.
|
| It would still be an uphill battle to convince them to
| hand over the training set but the legal department can
| likely be convinced if the data set they contribute back
| is heavily chopped up audio of the original content,
| especially if they have the originals before mixing. I
| imagine short audio files without any of the music, sound
| effects, or visual content are pretty much worthless as
| far as IP goes.
| teraflop wrote:
| I like the idea, and decided to try doing some validation.
| The first thing I noticed is that it asks me to make a
| yes-or-no judgment of whether the sentence was spoken
| "accurately", but nowhere on the site is it explained what
| "accurate" means, or how strict I should be.
|
| (The first clip I got was spoken more or less correctly, but
| a couple of words are slurred together and the prosody is
| awkward. Without having a good idea of the standards and
| goals of the project, I have no idea whether including this
| clip would make the overall dataset better or worse. My gut
| feeling is that it's good for training recognition, and bad
| for training synthesis.)
|
| This seems to me like a major issue, since it should take a
| relatively small amount of effort to write up a list of
| guidelines, and it would be hugely beneficial to establish
| those guidelines _before_ asking a lot of volunteers to
| donate their time. I don't find it encouraging that this has
| been an open issue for four years, with apparently no action
| except a bunch of bikeshedding:
| https://github.com/common-voice/common-voice/issues/273
| cptskippy wrote:
| After listening to about 10 clips, your point becomes
| abundantly clear.
|
| One speaker, who sounded like they were from the Midwestern
| United States, was dropping the S off words in a couple of
| clips. I wasn't sure if they were misreads or some accent
| I'd never heard.
|
| Another speaker, with a thick accent that sounded European,
| sounded out all the vowels in "circuit". Had I not had the
| line being read in front of me, I don't think I'd have
| understood the word.
|
| I heard a speaker with an Indian accent who added a
| preposition to the sentence that was inconsequential but
| incorrect nonetheless.
|
| I hear these random prepositions added as flourishes
| frequently with some Indian coworkers; does anyone know
| the reason? It's kind of like how Americans interject
| "Umm..." or drop prepositions (e.g. "Are you done your
| meal?"), and I almost didn't pick up on it. For that
| matter, where did the American habit of dropping
| prepositions come from? It seems like it's primarily
| people in the North East.
| OJFord wrote:
| I can't quite imagine superfluous prepositions (could you
| give an example?) but I have found it slightly amusing
| learning Hindi and coming across things where I think Oh!
| _That's_ why you sometimes hear X from Indian English
| speakers: it's just a slightly 'too' literal [1] mapping
| from Hindi, or trying to use a grammatical construction
| that doesn't really exist in English, like 'topic
| marking'.
|
| [1] If that's even fair given it's a dialect in its own
| right - Americans also say things differently than I
| would as a 'Britisher'
| [deleted]
| xwx wrote:
| I downloaded the (unofficial) Common Voice app [1] and it
| provides a link to some guidelines [2], which also aren't
| official but look sensible and seem like the best there is
| at the moment.
|
| [1] https://f-droid.org/packages/org.commonvoice.saverio/
|
| [2] https://discourse.mozilla.org/t/discussion-of-new-guidelines...
| wcarss wrote:
| I've used the deepspeech project a fair amount and it is
| good. It's not _perfect_, certainly, and it honestly isn't
| good enough yet for accurate transcription in my mind, but
| it's good. Easy to work with, pretty good results, and all
| the right kinds of free.
|
| Thanks for taking time to contribute!
| tootie wrote:
| If you read the docs, voice2json is a layer on top of the
| actual voice recognition engine. It supports Mozilla
| DeepSpeech, PocketSphinx, and a few others as the
| underlying engine.
| cerved wrote:
| Weird sentences
| londons_explore wrote:
| Good speech recognition generally requires _massive_ mountains
| of training data, both labelled and unlabelled.
|
| Massive mountains of data tend to be incompatible with
| opensource projects. Even Mozilla collecting user statistics is
| pretty controversial. Imagine someone like Mozilla trying to
| collect hundreds of voice clips from each of tens of millions
| of users!!
| timvisee wrote:
| Another problem is that the models tend to get very, very
| large from what I've seen. A gigabyte to tens of gigabytes is
| an undesirable requirement on your local machine.
| londons_explore wrote:
| With insane amounts of computation, it's possible to make
| models much smaller with minimal impact on performance.
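|
| As a toy illustration of one shrinking technique,
| post-training quantization alone cuts float32 weights to a
| quarter of their size; this is a generic sketch, not any
| particular project's method:
|
|     import numpy as np
|
|     def quantize_int8(w):
|         # symmetric int8 quantization of one weight matrix
|         scale = float(np.abs(w).max()) / 127.0 or 1e-12
|         q = np.round(w / scale).astype(np.int8)
|         return q, scale  # reconstruct with q * scale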
| kelnos wrote:
| Not sure about others, but DeepSpeech also distributes a
| "lite" model that's much smaller and suitable for mobile
| devices. Not sure how its accuracy compares to the full
| model though.
| GekkePrutser wrote:
| Well, speech recognition for personal use doesn't have to
| recognise everyone. In fact, it's a feature, not a bug, if
| it recognises only me as the user.
| marcodiego wrote:
| Really complicated question, but considering the free world
| got Wikipedia and OpenStreetMap, I'd bet we'll find a way.
| JadeNB wrote:
| > Really complicated question, but considering the free
| world got Wikipedia and OpenStreetMap, I'd bet we'll find
| a way.
|
| Both of those involve entering data about _external_
| things. Asking people to share their _own_ data is another
| thing entirely; I suspect most people, me included, are
| much more suspicious about that.
| posmonerd wrote:
| Not an expert on any of this, but wouldn't already published
| content (public or proprietary) such as YouTube videos,
| audiobooks, TV interviews, movies, TV programs, radio
| programs, podcasts, etc. be useful and exempt from privacy
| concerns?
|
| Do user-collected clips have something so special about them
| that it's critical to collect them?
| eliaspro wrote:
| Movies etc. would need to be transcribed accurately to be
| useful for training, and even then they'd provide just a
| single sample for each specific item.
| sodality2 wrote:
| > Imagine someone like Mozilla trying to collect hundreds of
| voice clips from each of tens of millions of users!!
|
| They do, and it's working! https://commonvoice.mozilla.org/en
| londons_explore wrote:
| Except they have 12k hours of audio, when really they could
| do with 12B hours of audio...
| sodality2 wrote:
| Good point. I'm doing my part to contribute to it,
| though, not much else I can do!
| woodson wrote:
| Then you need a lot of people to listen to those 12B
| hours of audio, and multiple listeners to agree, for each
| chunk of audio, that what is spoken corresponds to the
| transcript.
| londons_explore wrote:
| Lots of machine learning systems can use unsupervised and
| semi-supervised learning. Then nobody has to listen to
| and annotate all that audio.
| Animats wrote:
| That's not what this is. This is more like the system you use
| for phone-answering systems ("Do you want help with a bill,
| payment, order, or refund?")
| GekkePrutser wrote:
| Indeed, this is what I got from it too. It seems like an
| alternative to the VoiceXML used by companies like Nuance.
| londons_explore wrote:
| Speech recognition algorithms today require lots of data, lots
| of training computation, and a decent design.
|
| Decent designs are in published papers all over the place, so
| that's a solved issue.
|
| Lots of compute requires lots of $$$, which isn't
| opensource-friendly.
|
| Lots of data also isn't really opensource-friendly.
|
| Sadly this is a niche that the opensource business model
| doesn't really fit.
| marcodiego wrote:
| People would probably have said the same about Wikipedia 20
| years ago. People said similar things about GNU, GCC and
| Linux 30 years ago.
| sildur wrote:
| > Lots of compute requires lots of $$$, which isn't
| opensource-friendly.
|
| Not really, look up BOINC.
| jauer wrote:
| There's more involved than just raw CPU cycles. It's not
| something that is easily adapted to BOINC, but offloading
| things to BOINC to free up clusters better suited to
| training models might make sense.
| airstrike wrote:
| Sounds like a viable model for certain universities, though.
| asdfman123 wrote:
| > should not be left to an oligopoly with a bad history of
| not respecting users' freedoms and privacy
|
| So companies with a lot of data, then.
| yewenjie wrote:
| How does it compare to Vosk and other open source models/APIs?
| robmsmt wrote:
| I am working on something to compare at least 10 different
| ASRs, both open source and production ones.
| a-dub wrote:
| neat. would be even neater if it used state to provide a prior
| on likely intents. (i.e. in its simplest form, if you know the
| light is on, "turn on the light" has a prior of 0)
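|
| a rough sketch of that rescoring idea, applied to the n-best
| list from the recognizer before committing to an intent (the
| intent names and list shape here are invented):
|
|     def state_prior(intent, state):
|         # P(intent | world state); 0 prunes impossible commands
|         if intent == "LightOn" and state["light"] == "on":
|             return 0.0
|         if intent == "LightOff" and state["light"] == "off":
|             return 0.0
|         return 1.0
|
|     def rescore(nbest, state):
|         # nbest: [(intent, recognizer_score), ...]
|         scored = [(i, s * state_prior(i, state)) for i, s in nbest]
|         scored = [h for h in scored if h[1] > 0]
|         return max(scored, key=lambda h: h[1], default=None)
|
|     # light already on: "turn on the light" is pruned
|     rescore([("LightOn", 0.7), ("LightOff", 0.3)], {"light": "on"})
|     # -> ("LightOff", 0.3)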
| jedimastert wrote:
| Things like state would probably be within the scope of
| whatever you're feeding the intents into
| a-dub wrote:
| yes, but by the time you've generated an intent it's too late
| to improve recognition accuracy using the prior.
| bmn__ wrote:
| Has anyone had any success getting the software to work?
|
| It's entirely unpackaged:
| https://repology.org/projects/?search=voice2json
| https://pkgs.org/search/?q=voice2json
|
| Docker image is broken, how'd that happen?
|
|     $ voice2json --debug train-profile
|     ImportError: numpy.core.multiarray failed to import
|     Traceback (most recent call last):
|       File "/usr/lib/voice2json/.venv/lib/python3.7/site-packages/deepspeech/impl.py", line 14, in swig_import_helper
|         return importlib.import_module(mname)
|       File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
|         return _bootstrap._gcd_import(name[level:], package, level)
|       File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
|       File "<frozen importlib._bootstrap>", line 983, in _find_and_load
|       File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
|       File "<frozen importlib._bootstrap>", line 670, in _load_unlocked
|       File "<frozen importlib._bootstrap>", line 583, in module_from_spec
|       File "<frozen importlib._bootstrap_external>", line 1043, in create_module
|       File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
|     ImportError: numpy.core.multiarray failed to import
| nerdponx wrote:
| The source package does have installation instructions and
| appears to use Autotools:
| https://voice2json.org/install.html#from-source. Hopefully at
| least building from source works.
| mdaniel wrote:
| Building the v2.0 tag (or even master) using docker does not:
|
|     E: The repository 'http://security.ubuntu.com/ubuntu
|     eoan-security Release' does not have a Release file.
|
| And just bumping the image tag to ":groovy" caused subsequent
| silliness, so this project is obviously only for folks who
| enjoy fighting with build systems (and that matches my
| experience of anything in the world that touches Numpy and
| friends).
| xrd wrote:
| I tried Docker (both Debian versions of the Dockerfile) and
| building from scratch; none of them work.
| hirundo wrote:
| I wonder if it would be possible to map vim keybindings to sounds
| and effectively drive the editor with the mouth when the hands
| are otherwise occupied. It might be possible to use sounds that
| compose into pronounceable words with minimal syllables for
| combinations. What would vim bindings look like as a concise
| command language suited to human vocalization?
|
| E.g. maybe "dine" maps to d$ and "chine" to c$. So as in keyboard
| vim you can guess what "dend" and "chend" do.
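|
| A hypothetical sketch of the plumbing (the spoken vocabulary
| below is invented, and it assumes a vim built with
| +clientserver and started as 'vim --servername EDIT'):
|
|     import subprocess
|
|     VOICE_TO_KEYS = {
|         "dine":  "d$",  # delete to end of line
|         "chine": "c$",  # change to end of line
|         "yine":  "y$",  # yank to end of line
|     }
|
|     def send_to_vim(spoken_word, servername="EDIT"):
|         keys = VOICE_TO_KEYS.get(spoken_word)
|         if keys:
|             subprocess.run(["vim", "--servername", servername,
|                             "--remote-send", keys], check=True)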
| krysp wrote:
| I do this successfully for work using https://talonvoice.com/ -
| initial learning curve is steep, but once you learn how to
| configure and hack on the commands, you can be very effective.
| I use it maybe half the day to combat lingering RSI symptoms,
| and with some work I could probably use it for 98% of input for
| the computer. Some people do use it for 100% afaik
| [deleted]
| twobitshifter wrote:
| https://youtu.be/8SkdfdXWYaI?t=600
|
| This guy is already there: Slurp slap scratch buff yank
| skratlo wrote:
| I now get the joke about Emacs being an OS
| intrepidhero wrote:
| Would love to see a demo integrating this with an IDE, either
| for voice-to-code or for voice commands to navigate menus. I
| think the killer application would layer voice on top of
| traditional input rather than replace it.
| jrm4 wrote:
| Excellent! I just installed Mycroft the other day to play around
| with it; while it looks like a great start, two odd things stood
| out. The first is obvious: the online/offline thing.
|
| The second was a little surprising (and maybe I missed it?):
| there wasn't much in the way of easily getting transcribed
| output to and from shell scripts.
| varispeed wrote:
| It's not quite clear, but do you need to sacrifice your privacy
| in any way to use it? E.g., sending data to some service in
| order to get a trained model?
| nmstoker wrote:
| The description clarifies the underlying systems:
|
| >> Supported speech to text systems include:
|
| >> CMU's pocketsphinx
|
| >> Dan Povey's Kaldi
|
| >> Mozilla's DeepSpeech 0.6
|
| >> Kyoto University's Julius
|
| In case you're not aware, those are all locally run (thus not
| sending data off, not sacrificing privacy as you mention)
| nwalker85 wrote:
| Really interesting use of intents and entities. I feel like some
| of this is reinventing the wheel, since there is already a
| grammar specification, but the use of intents/entities is novel.
| https://www.w3.org/TR/speech-grammar/
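|
| (For anyone unfamiliar, an SRGS grammar looks roughly like
| this; a minimal hand-written example, not taken from the
| spec:)
|
|     <grammar xmlns="http://www.w3.org/2001/06/grammar"
|              version="1.0" root="light" xml:lang="en-US">
|       <rule id="light">
|         turn <one-of><item>on</item><item>off</item></one-of>
|         the light
|       </rule>
|     </grammar>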
| Edman274 wrote:
| Yeah, in my experience no one uses or supports that
| specification, which is a shame, because if you're using
| something like AWS Connect with AWS Lex for telephony IVR, you
| can't just create a grammar and then have AWS Lex figure out
| how to turn its recognized speech-to-text into something that
| matches a grammar rule. Thus, Lex will return speech-to-text
| results that follow general English grammar rules rather than
| what you might have prompted the user to reply with. You'll be
| unpleasantly surprised if you think that defining a custom
| entity as alphanumeric always prevents the utterance "[w^n]"
| from sometimes matching "won" instead of "one" or "1".
|
| Edit - Sorry, I realize that's a tangent. What I'm saying is
| that when I was evaluating speech to text engines for things
| like IVR systems using AWS and Google, neither of them
| supported SRGS. Microsoft does, I think, but they didn't have a
| telephony component, and IBM was ignored from the get go, so
| "no one" really means "two very large companies."
| offtop5 wrote:
| Fantastic.
|
| Might use this with a Raspberry Pi to set up some projects around
| the house. Is it possible to buy higher quality voice data?
| nmstoker wrote:
| If you're interested in projects on a Pi then you might just be
| interested in this: https://github.com/rhasspy/rhasspy
|
| It's from the same author.
| GekkePrutser wrote:
| I like rhasspy, but the problem I have with it is that it's
| too much of a toolkit and less of an application. There are
| too many choices to pick for the different components. I
| think they should pick one of each and really tune it so it
| works really well. That way they'd take a lot of complexity
| away from the user.
| marcodiego wrote:
| For those who care: MIT license.
| synesthesiam wrote:
| Author here. Thanks to everyone for checking out voice2json!
|
| The TLDR of this project is: a unified command-line interface to
| different offline speech recognition projects, with the ability
| to train your own grammar/intent recognizer in one step.
|
| My apologies for the broken packages; I'll get those fixed
| shortly. My focus lately has been on Rhasspy
| (https://github.com/rhasspy/rhasspy), which has a lot of the same
| ideas but a larger scope (full voice assistant).
|
| Questions, comments, and suggestions are welcomed and
| appreciated!
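|
| For a quick taste, here's roughly what the template format and
| CLI look like (the example intent is illustrative; see the
| site for the exact sentences.ini syntax):
|
|     # sentences.ini in your profile directory
|     [LightState]
|     states = (on | off)
|     turn (<states>){state} [the] light
|
|     $ voice2json train-profile
|     $ voice2json transcribe-wav < turn-on-the-light.wav \
|           | voice2json recognize-intent
|     {"text": "turn on the light",
|      "intent": {"name": "LightState"},
|      "slots": {"state": "on"}, ...}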
| bachmitre wrote:
| This should come with pre-trained templates to create new
| templates via voice commands ;)
| [deleted]
___________________________________________________________________
(page generated 2021-05-21 23:00 UTC)