[HN Gopher] Numen: Voice Control for Handsfree Computing
       ___________________________________________________________________
        
       Numen: Voice Control for Handsfree Computing
        
       Author : memorable
       Score  : 98 points
       Date   : 2023-02-16 07:24 UTC (2 days ago)
        
 (HTM) web link (numenvoice.com)
 (TXT) w3m dump (numenvoice.com)
        
       | penjelly wrote:
       | interesting... i just broke my arm so this is potentially useful
       | for me. The words you use will take some getting used to though.
        
       | teknopaul wrote:
       | This is soo needed.
       | 
       | All big tech's use of voice has so far required Internet access
       | and is creepy. Google's is appalling in that it changes, so things
       | that did work stop working.
       | 
       | What voice needed was for humans to adjust a little to make the
       | computer work easier. e.g. "Computer" "file save" is much more
       | efficient all round than sending off audio to the bork for AI to
       | try to work out what it means.
        
       | jzellis wrote:
       | Dunno what the video is, but it's broken on Firefox mobile at
       | least.
        
         | pfortuny wrote:
         | Doesn't work on ipad safari either...
        
         | pimlottc wrote:
         | Broken in Safari on iPhone as well
        
         | 58x14 wrote:
         | Same here - but I bet it's the HN hug
        
       | zerop wrote:
       | Interesting. Why would someone use handsfree computing? It's slow;
       | I would rather type.
        
         | replicanteven wrote:
         | TBF, you need working hands to use hands-on computers.
         | 
         | Plus, the offline part could make a good starting point for a
         | DIY personal assistant.
         | 
         | That said, their "getting started" sounds...esoteric.
         | 
         | >There normally isn't any output but you should be able to type
         | "hey" by saying "hoof eve yank" and transcribe a sentence after
         | saying "scribe". You can terminate it by pressing Ctrl+c or
         | saying "troll cap".
        
         | simplyinfinity wrote:
         | Because they might be physically impaired in some way. Or have
         | severe Repetitive Strain Injury (RSI).
        
         | [deleted]
        
       | gramiro wrote:
       | Interesting project for providing better accessibility!
       | 
       | Reminded me a bit of those scenes in Blade Runner where Deckard
       | is asking the computer to zoom in on a certain area and enhance
       | the image :D
        
         | noduerme wrote:
         | > Deckard
         | 
         | It does. Bloody awesome. I'm re-watching this video trying to
         | understand some of the shorthand being used. There's "bang" for
         | exclamation mark; "cap drum" (?) for `cd`. I can't figure out
         | what words he uses to invoke `git clone` at 1:27 but it's
         | incredibly futuristic. I wish my daily driver wasn't a Mac
         | these days =(
        
           | ArchieMaclean wrote:
           | It looks like they have a word (or multiple) for each letter
           | of the alphabet. So CD is "change drum", git clone is "guest
           | ice traps space cap look [Ctrl right - autocomplete]", where
           | you can read the commands from the first letters of each
           | word.
           | 
           | Edit: the default 'phrases' are here:
           | https://git.sr.ht/~geb/numen/tree/master/item/phrases
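
A minimal sketch of the "first letter of each word" reading described above. The word set is an assumption assembled from the words quoted in this thread, not the project's actual phrase table (the real defaults are in the linked phrases file), and `decode` is a hypothetical helper name.

```python
# Hypothetical alphabet words, taken only from examples quoted in this thread.
# The real mapping lives in the project's phrases file and may differ.
ALPHABET_WORDS = {"guest", "ice", "traps", "cap", "look", "drum", "hoof", "eve", "yank"}

# Words that stand for something other than their first letter (assumed).
SPECIAL_WORDS = {"space": " "}


def decode(spoken: str) -> str:
    """Turn a spoken word sequence into typed text, one character per word."""
    typed = []
    for word in spoken.split():
        if word in SPECIAL_WORDS:
            typed.append(SPECIAL_WORDS[word])
        elif word in ALPHABET_WORDS:
            # Each alphabet word contributes its first letter.
            typed.append(word[0])
        else:
            raise ValueError(f"unknown word: {word!r}")
    return "".join(typed)


print(decode("guest ice traps space cap look"))  # git cl
print(decode("hoof eve yank"))                   # hey
```

Rejecting unknown words instead of guessing mirrors the strict, fixed-vocabulary style discussed elsewhere in the thread.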
        
             | simultsop wrote:
             | I find it counterintuitive, like we have to memorize new
             | constants that a programmer defines... But having this with
             | English words like next line or page down or page up would be a
             | gamechanger.
        
       | ArchieMaclean wrote:
       | I like this a lot. This is built upon Vosk [0], open source voice
       | recognition. I must try it for some of my own projects!
       | 
       | [0] https://alphacephei.com/vosk/
        
       | CodexArcana wrote:
       | I'm only interested if you have to activate it by saying
       | "Hello... Numen..." a la Seinfeld.
        
         | TheHumanist wrote:
         | [dead]
        
       | nathias wrote:
       | this looks much better than any voice control I've seen so far, I
       | wonder if it requires tiles or you can integrate it with other
       | tiling managers
        
       | cube2222 wrote:
       | It's worth mentioning Talon[0] here, which is a system for
       | offline voice control as well, with great python-based scripting
       | (and also supports eye tracking, though I haven't used it
       | myself).
       | 
       | Using your computer or programming with it works like a charm,
       | with some interesting and impressive projects based on it coming
       | out as well, like Cursorless[1].
       | 
       | There's a great Strange Loop talk[2] demonstrating Talon and the
       | actual state of voice coding, which is how I discovered it (hint:
       | it's much better than you'd expect, and straightforward to learn
       | at that).
       | 
       | [0]: https://talonvoice.com/
       | 
       | [1]: https://github.com/cursorless-dev/cursorless
       | 
       | [2]: https://youtu.be/YKuRkGkf5HU
       | 
       | Disclaimer: not affiliated, just a happy occasional user
        
         | 2Gkashmiri wrote:
         | I can go back to win 7 and it had "speech recognition". Before
         | that in xp days I dabbled with offline dragon and stuff.
         | 
         | Point is, I've been bugged with this problem.
         | 
         | "I need dictation software that reads back to me what it
         | understood and typed." ALL the software assumes you are
         | looking at the screen, like the win7 one (scratch that). I don't
         | want that.
         | 
         | Let me say "I was walking and running besides the train."
         | <pause> "I was walking and besides The train." would be the
         | response, so I would say "scratch that" and repeat it,
         | or ask for help and so on.
         | 
         | Why isn't such a system there?
         | 
         | Think of it as a person doing the typing. You write a line,
         | they read back what you said, okay, next. Otherwise fix that
         | like this
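
A rough sketch of the read-back loop described above, assuming recognized text arrives as plain strings from some speech engine and that the returned string would be handed to a text-to-speech voice. `DictationBuffer` and `hear` are hypothetical names; no real dictation API is implied.

```python
class DictationBuffer:
    """Keeps dictated lines and answers each one with what should be read back."""

    def __init__(self) -> None:
        self.lines: list[str] = []

    def hear(self, recognized: str) -> str:
        """Store a recognized line, or undo the last one on "scratch that"."""
        command = recognized.strip().lower().rstrip(".")
        if command == "scratch that":
            if self.lines:
                removed = self.lines.pop()
                return f"Removed: {removed}"
            return "Nothing to remove."
        self.lines.append(recognized)
        # Read back exactly what was understood so the speaker can check it
        # without looking at the screen.
        return recognized

    def text(self) -> str:
        return " ".join(self.lines)


buf = DictationBuffer()
print(buf.hear("I was walking and besides The train."))  # misheard line is read back
print(buf.hear("scratch that."))                         # speaker rejects it
print(buf.hear("I was walking and running besides the train."))
print(buf.text())
```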
        
           | pcdoodle wrote:
           | It seems SAPI might be removed from the latest versions of
           | windows. It was pretty simple to use in VB6 in pure dictation
           | mode or you could even load a dictionary of listen words for
           | even higher false positives. Any replacements that anyone is
           | aware of for offline dictation / dictionary?
        
         | comfypotato wrote:
         | Was hoping for a comparison to Talon. Talon is incredible. I'm
         | particularly interested to see if any project spawns focused
         | around augmenting the keyboard as opposed to replacing it in a
         | programming context.
        
           | rom-antics wrote:
           | You might be interested in Cursorless's experimental keyboard
           | mode:
           | https://www.cursorless.org/docs/user/experimental/keyboard/m...
        
         | orbisvicis wrote:
         | The talon demonstration from the last link was inspiring, but
         | it works in the exact opposite fashion that I would have
         | imagined. The code-development examples are command-based, with
         | a command to enter phrase mode. I'd have expected with
         | technology such as tree-sitter and IntelliJ etc, that by
         | parsing the syntax tree of the current computer language for
         | completions, development could occur completely in phrase mode
         | with only a few commands for handling unknown inputs such as
         | new variable names.
         | 
         | I'm curious if anyone has ever tried implementing the latter,
         | or compared the two approaches. I'm sure there would be many
         | obstacles I haven't considered.
        
           | lunixbochs wrote:
           | Fixed commands are fast, precise, and predictable.
           | 
           | Assuming you mean speaking in natural language, that's slower
           | to say, and likely less precise and predictable if you want
           | to be able to just say "anything" and have a result.
           | 
           | You need a command system either way. If you want to express
           | some precise intention, you need to understand what the
           | command system will do.
           | 
           | There is a combined "mixed mode" system I've been testing in
           | the talon beta where you can use both phrases and commands
           | without switching modes.
        
         | unshavedyak wrote:
         | Wow eyetracking is not something i thought of.. and now i want
         | it.
         | 
         | I wonder if we could replace mouse with eyetracking? I wouldn't
           | expect it to be accurate enough though, given the micro movements
         | that eyes do.. and in general erratic movements.. but i'd love
         | to be wrong.
        
           | orbisvicis wrote:
           | Eye tracking is useful if you can or want to sit in front of
           | a desk. I'm concerned at the lack of diversity in eye-
           | tracking manufacturers. Tobii is the only commercial brand
           | I'm aware of or that Talon supports and initial setup
           | requires Windows (I don't know if recalibration also requires
           | Windows).
           | 
           | I haven't used eye tracking but I'd imagine that commands
           | could be given in the short time that an on-screen element is
           | focused... and the rest of the time the cursor jumps
           | erratically.
        
           | russellbeattie wrote:
           | I've been researching eye tracking for my own project for the
           | past year. I have a Tobii eye tracker which is probably the
           | best eye tracking device for consumers currently (or the only
           | one really). It's much more accurate than trying to repurpose
           | a webcam.
           | 
           | So the problem with eye tracking is what's called the "midas
           | touch" problem. Everything you look at is potentially a
           | target. If you were to simply connect your mouse pointer to
           | your gaze, for example, any sort of hover effect on a web
           | page would be activated simply by glancing at it. [1]
           | 
           | Additionally, our eyes are constantly making small movements
           | called saccades [2]. If you track eye movement perfectly, the
           | target will wobble all over the screen like mad. The ways to
           | alleviate this are expanding the target visually so that
           | the small movements are contained within a "bubble", or
           | delaying the target slightly so the movements can be smoothed
           | out, which naturally causes inaccuracy and latency. [3] There
           | are efforts to predict the eyes' movements to give the user
           | the impression of lower latency, but it's an imperfect solution.
           | 
           | Another issue is gaze activation. Computers can't read our
           | minds, so systems which require one to stare fixedly at an
           | object in order to activate an interface are common. The
           | problem with this is both the delay and the effort required.
           | You can easily get a headache from the effort of trying to
           | fixate your eyes on a target. Eye tracking in VR and AR has
           | similar problems.
           | 
           | There are other forms of activation - if you open your
           | iPhone's accessibility menu in the settings, you'll see a
           | bunch of options including head nods, facial gestures, eye
           | blinks and more. [4]
           | 
           | The future of eye tracking is definitely multimodal. A
           | specific gaze target combined with a gesture or hotword is
           | the way humans naturally interact with other humans (you look
           | at a person, get confirmation through eye contact or a nod,
           | and then speak or gesture.) What's amazing is the amount of
           | redundant effort being made in this area. Some of this stuff
           | has been known for a decade or more. There are tons of
           | research papers and thousands of patents to explore which
           | cover the topic in great detail. There is very little that
           | hasn't already been solved.
           | 
           | 1. https://uxdesign.cc/the-midas-touch-effect-the-most-
           | unknown-...
           | 
           | 2. https://en.m.wikipedia.org/wiki/Saccade
           | 
           | 3. https://help.tobii.com/hc/en-us/articles/210245345-How-to-
           | se...
           | 
           | 4. https://support.apple.com/accessibility
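
The smoothing trade-off mentioned in that comment can be illustrated with an exponential moving average over gaze samples. This is only a sketch of one common filter, not what Tobii or Talon actually do; `smooth_gaze` and the `alpha` value are assumptions chosen for illustration.

```python
from typing import Iterable, Iterator, Tuple

Point = Tuple[float, float]


def smooth_gaze(samples: Iterable[Point], alpha: float = 0.2) -> Iterator[Point]:
    """Exponentially smooth raw (x, y) gaze samples.

    A smaller alpha damps jitter more strongly but makes the pointer lag
    further behind the eye, which is the latency cost described above.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    sx = sy = None
    for x, y in samples:
        if sx is None:
            # The first sample initializes the filter.
            sx, sy = x, y
        else:
            sx = alpha * x + (1 - alpha) * sx
            sy = alpha * y + (1 - alpha) * sy
        yield (sx, sy)


# Jittery samples around x = 100: the smoothed x values stay closer together.
raw = [(100, 50), (106, 50), (94, 50), (105, 50), (95, 50)]
for point in smooth_gaze(raw):
    print(point)
```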
        
           | lunixbochs wrote:
           | Talon's eye tracking functions as a mouse replacement. Is
           | there a specific demo you'd like to see? I can record one.
        
         | 58x14 wrote:
         | That strangeloop talk inspired me to explore a lot of things,
         | including my methodology for writing command phrases that are
         | phonetically distinct and succinct.
         | 
         | Glad to hear Talon is still around! Their slack has grown and
         | they really seem like they have a product now.
        
         | theusus wrote:
         | I tried this and the speech recognition is really poor.
        
           | lunixbochs wrote:
           | The Talon model is fairly accurate, but it can be confusing
           | for new users to use the command system correctly. I posted a
           | sibling reply about this, but the most common reason for
           | Talon users to complain about the recognition is that they
           | are in the strict "command mode" and say things that aren't
           | actually commands.
           | 
           | If you encounter what feels like poor recognition in Talon, I
           | recommend enabling Save Recordings and zipping+sharing some
           | examples on the Slack and asking for advice.
           | 
           | The current command set is definitely harder to learn than a
           | system designed for chat/email where "what you say is what
           | you get", but it's much more powerful for tasks like
           | programming once you learn it.
           | 
           | I'm dubious about what kind of general command accuracy Numen
           | is able to get with the Vosk models, as Vosk to my
           | understanding is more designed for natural language than
           | commands.
        
         | yewenjie wrote:
         | Last time I checked Talon's models were very bad at recognizing
         | my voice. Does it support better models now, for example
         | OpenAI's Whisper?
        
           | caternoster wrote:
           | The creator of Talon has tested the Whisper models
           | extensively[0].
           | 
           | [0]:
           | https://twitter.com/lunixbochs/status/1574848899897884672
        
             | orbisvicis wrote:
             | I don't know what type of speech each dataset represents,
             | but the talon results are extremely impressive... I assume
             | it wasn't trained on at least some subset (depending on the
             | train/test split) of this data?
        
               | lunixbochs wrote:
               | A handful of the datasets I tested are fully held out (I
               | have reason to believe none of the models have trained on
               | them), and talon was trained on none of the dev or test
               | data of any of the datasets in question.
               | 
               | Due to whisper's weakly supervised training on a large
               | amount of automatically scraped data and reliance on a
               | bigger language model, it's far more likely whisper had
               | seen some of the test data before.
        
           | lunixbochs wrote:
           | Depending on when that was: in 2018 the free model was the
           | macOS speech engine, in 2019 it was a fast but relatively
           | weak model, and as of late 2021 it's a much stronger model.
           | I'm currently working on the next model series with a lot
           | more resources than I had before.
           | 
           | It's also worth saying that if you only tried things out
           | briefly, there are a handful of reasons recognition may have
           | seemed worse. Talon uses a strict command system by default,
           | because that improves precision and speed for trained users,
           | but the tradeoff there is it's more confusing for people who
           | haven't learned it yet.
           | 
           | For example, Talon isn't in "dictation mode" by default, so
           | you need to switch to that if you're trying to write email-
           | like text and don't want to prefix your phrases with a
           | command like "say".
           | 
           | The timeout system may also be confusing at first. When you
           | pause, Talon assumes you were done speaking and tries to run
           | whatever you said. You can mitigate this by speaking faster
           | or increasing the timeout.
           | 
           | The default commands (like the alphabet) may also just not be
           | very good for some accents, and that will be the case for any
           | speech engine - you will likely need to change some commands
           | if they're hard to enunciate in your accent.
           | 
           | I recommend joining the slack [1] and asking there if you
           | want more specific feedback. I definitely want to support
           | many accents and even have some users testing Talon with
           | other spoken languages.
           | 
           | [1] https://talonvoice.com/chat
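
To make the timeout behaviour described in that comment concrete, here is a toy sketch that splits timestamped words into utterances whenever the pause between them exceeds a timeout. It is not Talon's implementation (which works on audio, not word lists); `split_utterances`, the tuple layout, and the timeout values are assumptions for illustration.

```python
from typing import List, Tuple

# (text, start_seconds, end_seconds), sorted by start time (assumed input shape).
TimedWord = Tuple[str, float, float]


def split_utterances(words: List[TimedWord], timeout: float = 0.3) -> List[str]:
    """Group words into utterances, ending one whenever a pause exceeds timeout."""
    utterances: List[str] = []
    current: List[str] = []
    last_end = None
    for text, start, end in words:
        if last_end is not None and start - last_end > timeout:
            # The speaker paused too long: treat what came before as finished.
            utterances.append(" ".join(current))
            current = []
        current.append(text)
        last_end = end
    if current:
        utterances.append(" ".join(current))
    return utterances


words = [("go", 0.0, 0.2), ("left", 0.3, 0.6), ("save", 1.5, 1.8)]
print(split_utterances(words, timeout=0.3))  # ['go left', 'save']
print(split_utterances(words, timeout=1.0))  # ['go left save']
```

With the longer timeout the same speech is treated as one utterance, which is the effect of "increasing the timeout" for slower speakers.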
        
       | Xevi wrote:
       | Impressive, I'm looking forward to seeing more of this project.
       | Did you draw inspiration from Talon? There are a lot of
       | similarities when it comes to the voice commands.
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2023-02-18 23:01 UTC)