[HN Gopher] Voice2json: Offline speech and intent recognition on...
       ___________________________________________________________________
        
       Voice2json: Offline speech and intent recognition on Linux
        
       Author : easrng
       Score  : 326 points
       Date   : 2021-05-21 16:14 UTC (6 hours ago)
        
 (HTM) web link (voice2json.org)
 (TXT) w3m dump (voice2json.org)
        
       | marcodiego wrote:
        | Good FLOSS speech recognition and TTS is badly needed. Such
        | interaction should not be left to an oligopoly with a bad
        | history of not respecting users' freedoms and privacy.
        
         | GekkePrutser wrote:
         | Indeed, and it doesn't have to be as "machine learning" as the
         | big ones.
         | 
          | A FLOSS system would only have my voice to recognise, and I
          | would be willing to spend some time training it. That's a very
          | different use case from a massive cloud service that has to
          | recognise everyone's voice and accent.
        
         | sodality2 wrote:
         | Mozilla CommonVoice is definitely trying. I always do a few
         | validations and a few clips if I have a few minutes to spare,
         | and I recommend everyone does. They need volunteers to validate
         | and upload speech clips to create a dataset.
         | 
         | https://commonvoice.mozilla.org/en
        
           | jfarina wrote:
            | I wonder if they use movies and TV: recordings where the
            | script is already available.
        
             | kelnos wrote:
              | I expect that wouldn't be perfect, though. Sometimes the
              | cut that makes it into the final product doesn't exactly
              | match the script. Sometimes it's due to an edit; other
              | times an actor says something similar to, but not exactly,
              | what the script says and the director decides to just go
              | with it.
             | 
             | What might work better is using closed captions or
             | subtitles, but I've also seen enough cases where those
             | don't exactly match the actual speech either.
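              | 
              | At least extracting candidate clips from subtitles is
              | mechanical. A rough sketch in Python (assuming well-formed
              | .srt input; a real pipeline would also force-align the
              | audio and filter out mismatches):
              | 
              |     import re
              | 
              |     # "00:01:02,500 --> 00:01:05,000" style cue timings
              |     TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+) --> "
              |                       r"(\d+):(\d+):(\d+)[,.](\d+)")
              | 
              |     def to_seconds(h, m, s, ms):
              |         return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0
              | 
              |     def parse_srt(path):
              |         """Return (start_sec, end_sec, text) segments that
              |         could be cut from the audio track as labelled
              |         training clips."""
              |         segments = []
              |         with open(path, encoding="utf-8") as f:
              |             blocks = f.read().split("\n\n")
              |         for block in blocks:
              |             lines = [l for l in block.splitlines() if l.strip()]
              |             if len(lines) < 2:
              |                 continue
              |             m = TIME.match(lines[1])  # line 0 is the cue index
              |             if not m:
              |                 continue
              |             start = to_seconds(*m.groups()[:4])
              |             end = to_seconds(*m.groups()[4:])
              |             segments.append((start, end, " ".join(lines[2:])))
              |         return segments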
        
               | taneq wrote:
               | They might work even better for interpreting the intent
               | of spoken text. Not great for dictation though.
        
               | habibur wrote:
                | He meant subtitles when he talked of the script.
        
             | wongarsu wrote:
             | That's fine for training your own model, but I don't think
             | you could distribute the training set. That seems like a
             | clear copyright violation, against one of the groups that
             | cares most about copyright.
             | 
             | Maybe you could convince a couple of indie creators or
             | state-run programs to licence their audio? But I'm not sure
             | if negotiating that is more efficient than just recording a
             | bit more audio, or promoting the project to get more
             | volunteers.
        
               | dec0dedab0de wrote:
                | _That's fine for training your own model, but I don't
                | think you could distribute the training set. That seems
                | like a clear copyright violation, against one of the
                | groups that cares most about copyright._
               | 
                | I'm not sure that is a clear copyright violation. Sure,
                | at a glance it seems like a derivative work, but it may
                | be altered enough that it is not. I believe that
                | collages, and reference guides like CliffsNotes, are
                | both legal.
               | 
                | I think a bigger problem would be that the scripts, and
                | even the closed captioning, rarely match the recorded
                | audio 100%.
        
               | Wowfunhappy wrote:
               | And also... it's not like the program actually contains a
               | copy of the training data, right? The training data is a
               | tool which is used to build a model.
        
               | taneq wrote:
                | How is it different from things like GPT-3, which
                | (unless I'm mistaken) is trained on a giant web scrape?
                | I thought they didn't release the model out of concern
                | for what people would do with a general prose generator,
                | rather than any copyright concerns?
        
               | sodality2 wrote:
               | Does using copyrighted works to train a machine learning
               | model make that model infringing?
        
               | marcodiego wrote:
               | GP is not talking about the model but about the training
               | data set.
        
               | sodality2 wrote:
                | I am aware; I'm asking whether the model, however, is
                | infringing. Surely you can't distribute the works in a
                | dataset, but is training on copyrighted data legal, and
                | can you distribute the resulting model?
        
               | _jal wrote:
                | All text written by a human in the US is automatically
                | copyrighted by its author. So if an engine trained on
                | works under copyright is a derivative work, GPT-3 and
                | friends have serious problems.
        
               | wongarsu wrote:
                | Generally, an ML model transforms the copyrighted
                | material to the point where it isn't recognizable, so it
                | should be treated as its own unrelated work that isn't
                | infringing or derivative. But then you have e.g. GPT
                | that is reproducing some (largeish) parts of the
                | training set word-for-word, which might be infringing.
               | 
               | Also I don't think there have been any major court cases
               | about this, so there's no clear precedent in either
               | direction.
        
               | visarga wrote:
               | > But then you have e.g. GPT that is reproducing some
               | (largeish) parts of the training set word-for-word, which
               | might be infringing.
               | 
                | Easy fix: keep a Bloom filter of hashed n-grams,
                | ensuring you never repeat more than N consecutive words
                | from the training set.
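                | 
                | A minimal sketch of the idea, using a plain set where a
                | real system would use an actual Bloom filter (the names
                | and the n-gram length are illustrative):
                | 
                |     import hashlib
                | 
                |     N = 8  # never emit N+ consecutive training words
                | 
                |     def ngram_hashes(words, n=N):
                |         """Stable hash for every n-gram in a token list."""
                |         for i in range(len(words) - n + 1):
                |             gram = " ".join(words[i:i + n])
                |             yield hashlib.md5(gram.encode()).hexdigest()
                | 
                |     # Build the filter once from the training corpus.
                |     training_words = open("corpus.txt").read().split()
                |     seen = set(ngram_hashes(training_words))
                | 
                |     def repeats_training_set(generated_words):
                |         """True if any n-gram of the generated text occurs
                |         verbatim in the training corpus."""
                |         return any(h in seen
                |                    for h in ngram_hashes(generated_words))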
        
               | sodality2 wrote:
               | Thanks!
        
               | akiselev wrote:
               | It would likely be a lot easier for someone from within
               | the BBC, CBC, PBS, or another public broadcaster to
               | convince their employer to contribute to the models.
                | These organizations often have accessibility mandates
                | with real teeth, and real costs in implementing them.
               | The work of closed captioning, for example, can
               | realistically be improved by excellent open source speech
               | recognition and TTS models without handing all of the
                | power over to YouTube and the like.
               | 
               | It would still be an uphill battle to convince them to
               | hand over the training set but the legal department can
               | likely be convinced if the data set they contribute back
               | is heavily chopped up audio of the original content,
               | especially if they have the originals before mixing. I
               | imagine short audio files without any of the music, sound
               | effects, or visual content are pretty much worthless as
               | far as IP goes.
        
           | teraflop wrote:
           | I like the idea, and decided to try doing some validation.
           | The first thing I noticed is that it asks me to make a yes-
           | or-no judgment of whether the sentence was spoken
           | "accurately", but nowhere on the site is it explained what
           | "accurate" means, or how strict I should be.
           | 
           | (The first clip I got was spoken more or less correctly, but
           | a couple of words are slurred together and the prosody is
           | awkward. Without having a good idea of the standards and
           | goals of the project, I have no idea whether including this
           | clip would make the overall dataset better or worse. My gut
           | feeling is that it's good for training recognition, and bad
           | for training synthesis.)
           | 
           | This seems to me like a major issue, since it should take a
           | relatively small amount of effort to write up a list of
           | guidelines, and it would be hugely beneficial to establish
            | those guidelines _before_ asking a lot of volunteers to
            | donate their time. I don't find it encouraging that this has
            | been an open issue for four years, with apparently no action
            | except a bunch of bikeshedding:
            | https://github.com/common-voice/common-voice/issues/273
        
             | cptskippy wrote:
              | After listening to about 10 clips, your point becomes
              | abundantly clear.
             | 
              | One speaker, who sounded like they were from the Midwest
              | United States, was dropping the S off words in a couple of
              | clips. I wasn't sure if it was misreads or some accent I'd
              | never heard.
             | 
              | Another speaker, with a thick accent that sounded
              | European, sounded out all the vowels in "circuit". Had I
              | not had the line being read, I don't think I'd have
              | understood the word.
             | 
              | I heard a speaker with an Indian accent who added a
              | preposition to the sentence that was inconsequential but
              | incorrect nonetheless.
             | 
              | I frequently hear these random prepositions added as
              | flourishes by some Indian coworkers; does anyone know the
              | reason? It's kind of like how Americans interject "Umm..."
              | or drop prepositions (e.g. "Are you done your meal?"), and
              | I almost didn't pick up on it. For that matter, where did
              | the American habit of dropping prepositions come from? It
              | seems to be primarily people in the Northeast.
        
               | OJFord wrote:
                | I can't quite imagine superfluous prepositions (could
                | you give an example?), but I have found it slightly
                | amusing learning Hindi and coming across things where I
                | think: Oh! _That's_ why you sometimes hear X from Indian
                | English speakers; it's just a slightly 'too' literal [1]
                | mapping from Hindi, or an attempt at a grammatical
                | construction that doesn't really exist in English, like
                | 'topic marking'.
               | 
                | [1] If that's even fair, given it's a dialect in its own
                | right - Americans also say things differently than I
                | would as a 'Britisher'.
        
             | [deleted]
        
             | xwx wrote:
             | I downloaded the (unofficial) Common Voice app [1] and it
             | provides a link to some guidelines [2], which also aren't
             | official but look sensible and seem like the best there is
             | at the moment.
             | 
             | [1] https://f-droid.org/packages/org.commonvoice.saverio/
             | 
              | [2] https://discourse.mozilla.org/t/discussion-of-new-guidelines...
        
           | wcarss wrote:
            | I've used the DeepSpeech project a fair amount and it is
            | good. It's not _perfect_, certainly, and in my mind it
            | honestly isn't yet good enough for accurate transcription,
            | but it's good. Easy to work with, pretty good results, and
            | all the right kinds of free.
           | 
           | Thanks for taking time to contribute!
        
           | tootie wrote:
            | If you read the docs, voice2json is a layer on top of an
            | actual voice recognition engine. It supports Mozilla
            | DeepSpeech, PocketSphinx, and a few others as the underlying
            | engine.
        
           | cerved wrote:
           | Weird sentences
        
         | londons_explore wrote:
          | Good speech recognition generally requires _massive_ mountains
          | of training data, both labelled and unlabelled.
         | 
          | Massive mountains of data tend to be incompatible with open-
          | source projects. Even Mozilla collecting user statistics is
          | pretty controversial. Imagine someone like Mozilla trying to
          | collect hundreds of voice clips from each of tens of millions
          | of users!!
        
           | timvisee wrote:
            | Another problem is that the models tend to get very, very
            | large, from what I've seen. A gigabyte to tens of gigabytes
            | is an undesirable requirement on your local machine.
        
             | londons_explore wrote:
              | With insane amounts of computation, it's possible to make
              | models much smaller with minimal impact on performance.
        
             | kelnos wrote:
             | Not sure about others, but DeepSpeech also distributes a
             | "lite" model that's much smaller and suitable for mobile
             | devices. Not sure how its accuracy compares to the full
             | model though.
        
           | GekkePrutser wrote:
            | Well, speech recognition for personal use doesn't have to
            | recognise everyone. In fact, it's a feature, not a bug, if
            | it recognises only me as the user.
        
           | marcodiego wrote:
            | Really complicated question, but considering the free world
            | got Wikipedia and OpenStreetMap, I'd bet we'll find a way.
        
             | JadeNB wrote:
              | > Really complicated question, but considering the free
              | world got Wikipedia and OpenStreetMap, I'd bet we'll find
              | a way.
             | 
             | Both of those involve entering data about _external_
             | things. Asking people to share their _own_ data is another
             | thing entirely--I suspect most people, me included, are
             | much more suspicious about that.
        
           | posmonerd wrote:
            | Not an expert on any of this, but wouldn't already published
            | content (public or proprietary) such as YouTube videos,
            | audiobooks, TV interviews, movies, TV programs, radio
            | programs, podcasts, etc. be useful and exempt from privacy
            | concerns?
            | 
            | Do user-collected clips have something so special that it's
            | critical to collect them?
        
             | eliaspro wrote:
              | Movies etc. would need to be transcribed accurately to be
              | useful for training, and even then would provide just a
              | single sample for each specific item.
        
           | sodality2 wrote:
           | > Imagine someone like Mozilla trying to collect hundreds of
           | voice clips from each of tens of millions of users!!
           | 
           | They do, and it's working! https://commonvoice.mozilla.org/en
        
             | londons_explore wrote:
             | Except they have 12k hours of audio, when really they could
             | do with 12B hours of audio...
        
               | sodality2 wrote:
               | Good point. I'm doing my part to contribute to it,
               | though, not much else I can do!
        
               | woodson wrote:
                | Then you need a lot of people to listen to those 12B
                | hours of audio, with multiple listeners agreeing for
                | each chunk of audio that what is spoken corresponds to
                | the transcript.
        
               | londons_explore wrote:
               | Lots of machine learning systems can use unsupervised and
               | semi-supervised learning. Then nobody has to listen to
               | and annotate all that audio.
        
         | Animats wrote:
          | That's not what this is. This is more like what you'd use for
          | phone-answering systems ("Do you want help with a bill,
          | payment, order, or refund?").
        
           | GekkePrutser wrote:
            | Indeed, this is what I got from it too. It seems to be an
            | alternative to the VoiceXML used by companies like Nuance.
        
         | londons_explore wrote:
         | Speech recognition algorithms today require lots of data, lots
         | of training computation, and a decent design.
         | 
          | Decent designs are in published papers all over the place, so
          | that's a solved issue.
          | 
          | Lots of compute requires lots of $$$, which isn't open-source-
          | friendly.
          | 
          | Lots of data also isn't really open-source friendly.
          | 
          | Sadly, this is a niche that the open-source business model
          | doesn't really fit.
        
           | marcodiego wrote:
            | People would probably have said the same about Wikipedia 20
            | years ago. People said similar things about GNU, GCC, and
            | Linux 30 years ago.
        
           | sildur wrote:
           | > Lots of compute requires lots of $$$, which isn't
           | opensource-friendly.
           | 
           | Not really, look up BOINC.
        
             | jauer wrote:
              | There's more involved than just raw CPU cycles. It's not
              | something that is easily adapted to BOINC, but trying to
              | offload things to BOINC to free up clusters better suited
              | to training models might make sense.
        
           | airstrike wrote:
           | Sounds like a viable model for certain universities, though.
        
         | asdfman123 wrote:
         | > should not be left to an oligoply with bad history of not
         | respecting users freedoms and privacy
         | 
         | So companies with a lot of data, then.
        
       | yewenjie wrote:
       | How does it compare to Vosk and other open source models/APIs?
        
         | robmsmt wrote:
          | I am working on something to compare at least 10 different
          | ASRs, both open-source and production ones.
        
       | a-dub wrote:
        | neat. would be even neater if it used state to provide a prior
        | on likely intents. (i.e. in its simplest form, if you know the
        | light is on, "turn on the light" has a prior of 0)
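        | 
        | a minimal sketch of that re-ranking (the intent names and the
        | candidate format here are made up for illustration, not
        | voice2json's actual output):
        | 
        |     light_is_on = True  # known device state
        | 
        |     def state_prior(intent_name):
        |         """Zero out intents impossible in the current state."""
        |         if intent_name == "LightOn" and light_is_on:
        |             return 0.0
        |         if intent_name == "LightOff" and not light_is_on:
        |             return 0.0
        |         return 1.0
        | 
        |     # (intent, recognizer confidence) pairs
        |     candidates = [("LightOn", 0.55), ("LightOff", 0.45)]
        |     best = max(candidates, key=lambda c: c[1] * state_prior(c[0]))
        |     # light already on: "LightOff" wins despite lower confidence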
        
         | jedimastert wrote:
          | Things like state would probably fall under the scope of
          | whatever you're feeding the intents into.
        
           | a-dub wrote:
           | yes, but by the time you've generated an intent it's too late
           | to improve recognition accuracy using the prior.
        
       | bmn__ wrote:
       | Has anyone had any success getting the software to work?
       | 
       | It's entirely unpackaged:
       | https://repology.org/projects/?search=voice2json
       | https://pkgs.org/search/?q=voice2json
       | 
        | Docker image is broken; how'd that happen?
        | 
        |     $ voice2json --debug train-profile
        |     ImportError: numpy.core.multiarray failed to import
        |     Traceback (most recent call last):
        |       File "/usr/lib/voice2json/.venv/lib/python3.7/site-packages/deepspeech/impl.py", line 14, in swig_import_helper
        |         return importlib.import_module(mname)
        |       File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
        |         return _bootstrap._gcd_import(name[level:], package, level)
        |       File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
        |       File "<frozen importlib._bootstrap>", line 983, in _find_and_load
        |       File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
        |       File "<frozen importlib._bootstrap>", line 670, in _load_unlocked
        |       File "<frozen importlib._bootstrap>", line 583, in module_from_spec
        |       File "<frozen importlib._bootstrap_external>", line 1043, in create_module
        |       File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
        |     ImportError: numpy.core.multiarray failed to import
        
         | nerdponx wrote:
         | The source package does have installation instructions and
         | appears to use Autotools:
         | https://voice2json.org/install.html#from-source. Hopefully at
         | least building from source works.
        
           | mdaniel wrote:
            | Building the v2.0 tag (or even master) using docker does
            | not:
            | 
            |     E: The repository 'http://security.ubuntu.com/ubuntu eoan-security Release' does not have a Release file.
            | 
            | And just bumping the image tag to ":groovy" caused
            | subsequent silliness, so this project is obviously only for
            | folks who enjoy fighting with build systems (and that
            | matches my experience of anything in the world that touches
            | NumPy and friends).
        
         | xrd wrote:
          | I tried Docker (both Debian versions of the Dockerfile) and
          | building from scratch; none of them work.
        
       | hirundo wrote:
        | I wonder if it would be possible to map vim keybindings to
        | sounds and effectively drive the editor with the mouth when the
        | hands are otherwise occupied. It might be possible to use sounds
        | that compose into pronounceable words, with minimal syllables
        | for combinations. What would vim bindings look like as a concise
        | command language suited to human vocalization?
       | 
       | E.g. maybe "dine" maps to d$ and "chine" to c$. So as in keyboard
       | vim you can guess what "dend" and "chend" do.
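        | 
        | A hypothetical mapping along those lines, reading "-ine" as $
        | (end of line) and guessing "-end" as e (end of word); the word
        | list is invented for illustration:
        | 
        |     VOICE_TO_VIM = {
        |         "dine": "d$",   # delete to end of line
        |         "chine": "c$",  # change to end of line
        |         "dend": "de",   # delete to end of word
        |         "chend": "ce",  # change to end of word
        |     }
        | 
        |     def keys_for(utterance):
        |         """Translate a spoken phrase into vim keystrokes."""
        |         return "".join(VOICE_TO_VIM[w]
        |                        for w in utterance.split()
        |                        if w in VOICE_TO_VIM)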
        
         | krysp wrote:
          | I do this successfully for work using https://talonvoice.com/
          | - the initial learning curve is steep, but once you learn how
          | to configure and hack on the commands, you can be very
          | effective. I use it maybe half the day to combat lingering RSI
          | symptoms, and with some work I could probably use it for 98%
          | of my computer input. Some people do use it for 100% afaik.
        
         | [deleted]
        
         | twobitshifter wrote:
         | https://youtu.be/8SkdfdXWYaI?t=600
         | 
          | this guy is already there: "slurp slap scratch buff yank"
        
           | skratlo wrote:
            | I now get the joke about Emacs being an OS
        
       | intrepidhero wrote:
        | Would love to see a demo integrating this with an IDE, either
        | for voice-to-code or for voice commands to navigate menus. I
        | think the killer application would layer voice over traditional
        | input rather than replace it.
        
       | jrm4 wrote:
        | Excellent! I just installed Mycroft the other day to play around
        | with it; while it looks like a great start, I noticed two odd
        | things. The first is obvious: the online/offline thing.
        | 
        | The second was a little surprising (and maybe I missed it?):
        | there was not much in the way of easily accessing transcribed
        | output to and from shell scripts.
        
       | varispeed wrote:
        | It's not quite clear, but do you need to sacrifice your privacy
        | in any way to use it? E.g., sending data to some service in
        | order to get a trained model?
        
         | nmstoker wrote:
         | The description clarifies the underlying systems:
         | 
         | >> Supported speech to text systems include:
         | 
         | >> CMU's pocketsphinx
         | 
         | >> Dan Povey's Kaldi
         | 
         | >> Mozilla's DeepSpeech 0.6
         | 
         | >> Kyoto University's Julius
         | 
          | In case you're not aware, those all run locally (thus not
          | sending data off and not sacrificing privacy, as you mention).
        
       | nwalker85 wrote:
        | Really interesting use of intents and entities. I feel like some
        | of this is reinventing the wheel, since there is already a
        | grammar specification, but the use of intents/entities is novel.
        | https://www.w3.org/TR/speech-grammar/
        
         | Edman274 wrote:
          | Yeah, in my experience no one uses or supports that
          | specification, which is a shame: if you're using something
          | like AWS Connect with AWS Lex for telephony IVR, you can't
          | just create a grammar and then have AWS Lex figure out how to
          | turn its recognized speech-to-text into something that matches
          | a grammar rule. Thus, Lex will return speech-to-text results
          | that follow general English grammar, rather than what you
          | might have prompted the user to reply with. You'll be
          | unpleasantly surprised if you think that defining a custom
          | entity as alphanumeric prevents the utterance "[w^n]" from
          | sometimes matching "won" instead of "one" or "1".
         | 
          | Edit - Sorry, I realize that's a tangent. What I'm saying is
          | that when I was evaluating speech-to-text engines for things
          | like IVR systems using AWS and Google, neither of them
          | supported SRGS. Microsoft does, I think, but they didn't have
          | a telephony component, and IBM was ignored from the get-go,
          | so "no one" really means "two very large companies."
        
       | offtop5 wrote:
       | Fantastic.
       | 
        | Might use this with a Raspberry Pi to set up some projects
        | around the house. Is it possible to buy higher-quality voice
        | data?
        
         | nmstoker wrote:
          | If you're interested in projects on a Pi, then you might also
          | be interested in this: https://github.com/rhasspy/rhasspy
         | 
         | It's from the same author.
        
           | GekkePrutser wrote:
            | I like Rhasspy, but the problem I have with it is that it's
            | too much of a toolkit and not enough of an application.
            | There are too many choices to make for the different
            | components. I think they should pick one of each and really
            | tune it so it works really well. That way they'd take a lot
            | of complexity away from the user.
        
       | marcodiego wrote:
       | For those who care: MIT license.
        
       | synesthesiam wrote:
       | Author here. Thanks to everyone for checking out voice2json!
       | 
       | The TLDR of this project is: a unified command-line interface to
       | different offline speech recognition projects, with the ability
       | to train your own grammar/intent recognizer in one step.
       | 
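        | As a rough example, a profile's sentences.ini template like
        | 
        |     [LightState]
        |     states = (on | off)
        |     turn (<states>){state} [the] light
        | 
        | can be trained and used from the shell (output abbreviated, and
        | exact fields may differ by profile):
        | 
        |     $ voice2json train-profile
        |     $ voice2json transcribe-wav < turn-on-the-light.wav | \
        |         voice2json recognize-intent
        |     {"text": "turn on the light",
        |      "intent": {"name": "LightState"},
        |      "slots": {"state": "on"}}
        | 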
       | My apologies for the broken packages; I'll get those fixed
       | shortly. My focus lately has been on Rhasspy
       | (https://github.com/rhasspy/rhasspy), which has a lot of the same
       | ideas but a larger scope (full voice assistant).
       | 
       | Questions, comments, and suggestions are welcomed and
       | appreciated!
        
       | bachmitre wrote:
       | This should come with pre-trained templates to create new
       | templates via voice commands ;)
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2021-05-21 23:00 UTC)