[HN Gopher] Transcribro: On-device Accurate Speech-to-text
       ___________________________________________________________________
        
       Transcribro: On-device Accurate Speech-to-text
        
       Author : thebiblelover7
       Score  : 155 points
        Date   : 2024-07-18 17:25 UTC (1 day ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | crancher wrote:
       | Accrescent hype is comically overdone.
        
         | free_bip wrote:
          | I looked in the GitHub issues and there's a closed issue for
          | F-Droid inclusion. The author states that F-Droid "doesn't
          | meet their requirements" but doesn't elaborate. I wonder what
          | F-Droid is missing that they need so much?
        
           | ementally wrote:
            | Reason: https://www.privacyguides.org/en/android/#f-droid
        
           | okso wrote:
           | Link: https://github.com/soupslurpr/Transcribro/issues/9
        
           | okso wrote:
            | F-Droid only packages open-source software and rebuilds it
            | from source, while installing from Accrescent would place all
            | trust in the developer, including if the license later
            | changes to something proprietary.
            | 
            | I understand that the author trusts themselves more than
            | F-Droid, but as a user the opposite seems more relevant to
            | me.
        
         | mijoharas wrote:
          | I only just heard of it, from this project.
          | 
          | I see the features listed[0], which seem like a reasonable
          | feature set, but nothing unusual afaict.
          | 
          | If there has been a lot of hype, can you tell me what people
          | find compelling about it?
         | 
         | [0] https://accrescent.app/
        
       | flax wrote:
        | The documentation is severely lacking. I wanted to know whether
        | this does streaming or only batch transcription, and I wanted
        | examples of integrating it with Android apps.
        
         | pants2 wrote:
          | Considering it uses Whisper, it's probably not streaming.
        
           | refulgentis wrote:
           | I did some core work on TTS at Google, at several layers, and
           | I've never quite understood what people mean by streaming vs.
           | not.
           | 
            | In each and every case I'm familiar with, streaming means
            | "send the whole audio captured thus far to the inference
            | engine, run inference on it, and send back the
            | transcription."
           | 
           | I have a Flutter library that does the same flow as this
           | (though via ONNX, so I can cover all platforms), and Whisper
           | + Silero is ~identical to the interfaces I used at Google.
           | 
            | If the idea is that streaming means each audio byte is only
            | sent to the server once, there's still an audio buffer being
            | accumulated -- it's just on the server.
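            | 
            | To make that concrete, every "streaming" client loop I've
            | worked with reduces to something like this sketch (Python
            | here just for brevity; transcribe() is a stand-in for
            | whatever engine or endpoint actually sits behind it, not any
            | particular API):
            | 
            |   import numpy as np
            | 
            |   SAMPLE_RATE = 16_000
            |   buffer = np.zeros(0, dtype=np.float32)  # all audio so far
            | 
            |   def on_audio_chunk(chunk, transcribe):
            |       """Accumulate, then re-run inference over everything
            |       captured so far; the 'partial result' is just the
            |       latest full-pass transcription."""
            |       global buffer
            |       buffer = np.concatenate([buffer, chunk])
            |       return transcribe(buffer)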
        
             | iamjackg wrote:
             | I think in practical terms (at least for me):
             | 
             | - streaming == I talk and the text appears as I talk
             | 
             | - batched == I talk, and after I'm done talking some
             | processing happens and the text gets populated
        
               | refulgentis wrote:
                | Gotcha, then it's "not even wrong" in the Pauli sense to
                | say Whisper isn't streaming.
        
               | opprobium wrote:
                | It is not streaming in the way people normally use this
                | term. It's a fuzzy notion but typically streaming means
                | something encompassing:
                | 
                | - Processing and emitting results on something closer to
                | word by word level
                | 
                | - Allowing partial results while the user is still
                | speaking and mid-segment
                | 
                | - Not relying on an external segmenter to determine the
                | chunking (and therefore also latency) of the output.
        
               | refulgentis wrote:
               | This is fascinating because if your hint in another
               | comment indicates you worked on this at Google, it's
               | entirely possible I have this all wrong because I'm
               | missing the _actual ML_ part - I wrote the client encoder
               | & server decoder for Opus and the client-side UI for the
               | SODA launch, and I'm honestly really surprised to hear
               | Google has different stuff. The client-side code loop
               | AGSA used is 100% replicated, in my experience, by using
               | Whisper.
               | 
                | I don't want to make too strong a claim given NDAs (in
                | reality, my failing memory :P), but I'm 99% sure inference
                | on-device is just as fast as SODA. I don't know what to
                | say because I'm flummoxed; it makes sense to me that
                | Whisper isn't as good as SODA, and I don't want to start
                | banging the table insisting it's no different from a user
                | or client perspective, because I don't think that's fair.
                | There's a difference in model architecture and it matters.
                | I think it's at least a couple points of WER behind.
               | 
                | But then where are the better STT solutions? Are all the
                | obviously much better solutions really locked up?
                | Picovoice is the only closed solution I know of that's
                | available for local dev, and even by their own account
                | it's only better than the worst Whisper. The smallest
                | Whisper is 70 MB in ONNX vs. 130 MB for the next step up;
                | both run inference fine with ~600 ms latency from audio
                | byte at the mic to text on screen, everywhere from WASM
                | in a web browser to a 3-year-old Android phone.
        
               | regularfry wrote:
               | Something to keep an eye on is that Whisper is strongly
               | bound to processing a 30-second window at a time. So if
               | you send it 30 seconds of audio, and it decodes it, then
               | you send it another one second of audio, the only
                | sensible way it can work is to reprocess seconds 2-30 in
                | addition to the new data at second 31. If there were a
                | way to have it process just the update, there's every
                | possibility it could avoid a lot of work.
               | 
               | I suspect that's what people are getting at by saying
               | it's "not streaming": it's built as a batch process but,
               | under some circumstances, you can run it fast enough to
               | get away with pretending that it isn't.
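                | 
                | As a rough sketch of what that naive re-decode loop looks
                | like (assuming the openai-whisper Python package and 16
                | kHz float32 audio; not how Transcribro itself does it):
                | 
                |   import numpy as np
                |   import whisper
                | 
                |   SR = 16_000            # Whisper expects 16 kHz mono
                |   WINDOW = 30 * SR       # its fixed 30-second window
                | 
                |   model = whisper.load_model("base")
                |   audio = np.zeros(0, dtype=np.float32)
                | 
                |   def on_new_audio(chunk):
                |       """Re-decode the trailing 30 s on every update:
                |       seconds 2-30 get processed again just to pick up
                |       second 31."""
                |       global audio
                |       audio = np.concatenate([audio, chunk])
                |       result = model.transcribe(audio[-WINDOW:],
                |                                 fp16=False)
                |       return result["text"]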
        
               | opprobium wrote:
               | You are missing the speech decoding part. I can't speak
               | to why the clients you were working on were doing what
                | they were doing. For a different reference point, see the
                | Cloud streaming API.
               | 
               | This is a good public reference:
               | https://research.google/blog/an-all-neural-on-device-
               | speech-...
               | 
                | Possible confusions from that doc: "RNN-T" is entirely
                | orthogonal to RNNs (and not the only streamable model).
                | Attention is also orthogonal to streaming. A chunked or
                | sliding window attention can stream, a bi-directional RNN
                | cannot. How you think about streaming for an encoder vs.
                | a decoder is also different.
               | 
                | At a practical level, if a model is fast enough and VAD
                | is doing an adequate job, you can get something that
                | looks like "streaming" with a non-streaming model. If a
                | streaming model has tons of look-ahead or a very large
                | input chunk size, its latency may not feel a lot better.
                | 
                | Where the difference is sharp is where VAD is not
                | adequate: users speak in continuous streams of audio,
                | they leave unusual gaps within sentences and run
                | sentences together. A non-streaming system either hurts
                | quality because sentences (or even words) get broken up
                | that shouldn't be, or it has to wait forever and never
                | gets a chance to run, while a streaming system would
                | already have been producing output.
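                | 
                | To make that trade-off concrete, here's a toy endpointer
                | sketch (hypothetical per-frame VAD probabilities in,
                | segment boundaries out); the hangover is exactly the
                | latency-vs-quality knob being described:
                | 
                |   def segment(vad, frame_ms=30, hangover_ms=600, thr=0.5):
                |       """Toy endpointer over per-frame VAD probabilities.
                |       A segment closes only after hangover_ms of silence:
                |       too short and sentences get cut at natural pauses,
                |       too long and every result waits that much extra."""
                |       segs, start, quiet = [], None, 0
                |       for i, p in enumerate(vad):
                |           if p >= thr:
                |               if start is None:
                |                   start = i
                |               quiet = 0
                |           elif start is not None:
                |               quiet += frame_ms
                |               if quiet >= hangover_ms:
                |                   segs.append((start, i))
                |                   start, quiet = None, 0
                |       if start is not None:
                |           segs.append((start, len(vad)))
                |       # frame indices; multiply by frame_ms for times
                |       return segs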
               | 
                | And to your points about echo cancellation and
                | interference: there are many text-only operations that
                | benefit from being able to start early in the audio
                | stream, not late.
               | 
                | I just went through the process of helping someone stand
                | up an interactive system with Whisper etc., and the lack
                | of an open-sourced Whisper-quality streaming system is
                | such a bummer, because it really is so much laggier than
                | it has to be.
        
             | flax wrote:
             | "streaming" in this case is like another reply said:
             | transcriptions appear as I talk. Compared to not-streaming
             | in which the service waits for silence, then processes the
             | captured speech, then returns some transcription.
             | 
             | Is your Flutter library available? And does it run locally?
             | I'm looking for a good Flutter streaming (in the sense
              | above) speech recognition library. Vosk looks good, but it
              | lacks some configurability, such as selecting the audio
              | source.
        
               | refulgentis wrote:
                | FONNX. I haven't gone out of my way to make it trivial[1],
                | but it's very good and battle-tested on every single
                | platform. (And yes, it runs locally.)
                | 
                | [1] The example app shows how to do everything and there's
                | basic doc, but man, the amount of nonsense you need to
                | know to pull it all together is just too hard to document
                | without a specific question. Do feel free to file an
                | issue.
        
             | opprobium wrote:
              | Streaming for TTS doesn't matter, but for speech-to-text it
              | is more meaningful in interactive cases. There, the user's
              | speech is arriving in real time, and streaming can mean
              | things at a couple of levels:
             | 
              | - Overlap compute with the user speaking: Not having to
              | wait until all the speech has been acquired can massively
              | reduce latency at the end of speech and allow a larger
              | model to be used. This doesn't have to be the whole system;
              | for instance, an encoder can run in this fashion along the
              | audio as it comes in, even if the final step of the system
              | then runs in a non-streaming fashion.
              | 
              | - Produce partial results while the user is speaking: This
              | can be just a UI nice-to-have, but it can also be much
              | deeper, e.g. a system can be activating on words or phrases
              | in the input before the user is finished speaking, which
              | can dramatically change latency.
              | 
              | - Better segmentation: Whisper + Silero is just using VAD
              | to make segments for Whisper; this is not at all the best
              | you can do if you are actually decoding as you go. Looking
              | at the results as you go allows you to make much better
              | and faster segmentation decisions.
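              | 
              | For reference, the Whisper + Silero pattern being described
              | is roughly the following (a sketch assuming the silero-vad
              | torch.hub loader and the openai-whisper package; not
              | Transcribro's actual code):
              | 
              |   import numpy as np
              |   import torch
              |   import whisper
              | 
              |   SR = 16_000
              |   vad, utils = torch.hub.load("snakers4/silero-vad",
              |                               "silero_vad")
              |   get_speech_timestamps = utils[0]
              |   asr = whisper.load_model("base")
              | 
              |   def transcribe(audio: np.ndarray):
              |       """VAD alone picks the segment boundaries; the
              |       decoder sees nothing until a segment is closed, so
              |       segmentation quality caps both latency and
              |       accuracy."""
              |       stamps = get_speech_timestamps(
              |           torch.from_numpy(audio), vad, sampling_rate=SR)
              |       return [asr.transcribe(audio[s["start"]:s["end"]],
              |                              fp16=False)["text"]
              |               for s in stamps]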
        
               | refulgentis wrote:
                | The only models that do what you're poking at holistically
                | are 4o (claimed) and that French company with the 7B one.
                | They're also bleeding edge, either unreleased or released
                | and way wilder, e.g. the French one interrupts too much
                | and occasionally screams back in an alien language.
                | 
                | Until these, you'd use echo cancellation to try and allow
                | interruptible dialogue, and that's unsolved; you need a
                | consistently cooperative chipset vendor for that (read: it
                | wasn't possible even at scale, with carrots, presumably
                | sticks, and much cajoling. So it only works consistently
                | on iPhones.)
               | 
                | On every stack I've seen that is described as streaming,
                | the partial results are obtained by running inference on
                | the entire audio so far, and silence is determined by
                | VAD.
                | 
                | I find it hard to believe that Google and Apple
                | specifically, and every other audio stack I've seen, are
                | choosing to do "not the best they can at all".
        
               | opprobium wrote:
                | This is exactly what Google ASR does. Give it a try and
                | watch how the results flow back to you; it certainly is
                | not waiting for VAD segment breaking. I should know.
                | 
                | Streaming used to be something people cared about more.
                | VAD is always part of those systems as well; you want to
                | use it to start segments and to hard cut off, but it is
                | just the starting-off point. It's kind of a big gap (to
                | me) that's been missing in available models since Whisper
                | came out, partly I think because streaming does add to
                | the complexity of using the model, and latency has to be
                | tuned/traded off against quality.
        
               | Nimitz14 wrote:
                | This is a complete non sequitur lol. FYI, Whisper is not a
                | streaming model, though it can, with some work, be adapted
                | to be one.
        
               | refulgentis wrote:
                | You and I agree fully, then. IMHO it's not too much work
                | at all: 400 LOC and someone else's models. Of course, as
                | in that old saw, the art is knowing exactly those models,
                | knowing what ONNX is, etc. etc.; that's what makes it
                | fast.
                | 
                | The non sequitur is because I can't feel out what's going
                | on from their perspective; the hedging left a huge range
                | where they could have been saying "I saw the GPT-4o demo
                | and there's another way that lets you have a more natural
                | conversation", or "hey, think of an LSTM-like model, like
                | Silero; there are voice recognizers that let you
                | magically get a state and the current transcription out",
                | or, in between, "yeah, in reality the models are f(audio
                | bytes) => transcription", which appears to be closer to
                | your position, given your "it's not a streaming model,
                | though it can be adapted".
        
               | r2_pilot wrote:
                | Thank you for your insight. It confirms some of my
                | suspicions from working in this area (you wouldn't happen
                | to know anybody who makes anything more modern than the
                | ReSpeaker 4-mic array?). My biggest problem is that even
                | with AEC, the voice output triggers the VAD, so it
                | continually thinks it's getting interrupted by a human.
                | My next attempt will be to signal true VAD only if
                | there's also sound coming from anywhere but behind, where
                | the speaker is. It's been an interesting challenge so far
                | though.
        
               | refulgentis wrote:
                | Re: mic, alas, no. BigCo kinda sucked: I had to go way
                | out of my way to get to work on interesting stuff, it
                | never mattered, and even when you did, you never got over
                | the immediate wall of your own org, except for brief
                | moments. I.e. we never ever had anyone even close to
                | knowing anything about the microphones we'd be using;
                | they were shocked to hear what AEC was, even when what we
                | were working on was a marketing tentpole for Pixel. Funny
                | place.
               | 
               | I'm really glad you saw this. So, so, so much time and
               | hope was wasted there on the Nth team of XX people saying
               | "how hard can it be? given physics and a lil ML, we can
               | do $X", and inevitably reality was far more complicated,
               | and it's important to me to talk about it so other people
               | get a sense it's not them, it's the problem. Even
               | unlimited resources and your Nth fresh try can fail.
               | 
                | FWIW my mind's been grinding on how I'd get my little
                | Silero x Whisper on-device gAssistant replica pulling off
                | something akin to the gpt4o demo. I keep coming back to
                | speaker ID: replace Silero with some newer models I'm
                | seeing hit ONNX. Super handwave-y, but I can't help
                | thinking this does an end-around both AEC being shit on
                | presumably most non-Apple devices, and the poor
                | interactions from trying to juggle two things operating
                | differently (VAD and AEC). """Just""" detect when there
                | are >= 2 simultaneous speakers with > 20% confidence ---
                | of course, tons of bits are missing from there; ideally
                | you'd be resilient to e.g. a TV in the background. Sigh.
                | Tough problems.
        
               | azeirah wrote:
                | I'm not particularly experienced, but I did have good
                | experiences with Picovoice's services. It's a business
                | specialised in programmatically available audio, TTS, and
                | VAD services, etc.
                | 
                | They have a VAD that is trained on a 10-second clip of
                | -your- voice, and it is then only activated by -your-
                | voice. It works quite well in my experience, although it
                | does add a little bit of additional latency before it
                | starts detecting your voice (which is reasonably easy to
                | overcome by keeping a 1 s buffer of audio ready at all
                | times: if the VAD is active, just add the past 100-200 ms
                | of the buffer to the recorded audio. Works perfectly
                | fine. It's just that the UI showing "voice detected" or
                | "voice not detected" might lag behind by 100-200 ms.)
                | 
                | Source: I worked on a VAD + Whisper + LLM demo project
                | this year and ran into some VAD issues myself too.
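                | 
                | For anyone curious, the buffer trick is roughly this (a
                | sketch; the frame size and 200 ms pre-roll are just
                | numbers I landed on, nothing Picovoice-specific):
                | 
                |   from collections import deque
                |   import numpy as np
                | 
                |   SR = 16_000
                |   PRE_ROLL = int(0.2 * SR)  # splice in ~200 ms on wake
                |   ring = deque(maxlen=SR)   # rolling ~1 s of recent audio
                |   recording, active = [], False
                | 
                |   def on_frame(frame: np.ndarray, vad_active: bool):
                |       """Mic callback. The VAD fires a beat late, so when
                |       it flips on, prepend the last ~200 ms from the ring
                |       so the onset of the first word isn't clipped."""
                |       global active
                |       if vad_active and not active:
                |           pre = np.array(ring, dtype=np.float32)
                |           recording.append(pre[-PRE_ROLL:])
                |       active = vad_active
                |       if active:
                |           recording.append(frame)
                |       ring.extend(frame)  # keep the rolling history fresh
                | 
                | np.concatenate(recording) is then what gets handed to the
                | recognizer.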
        
       | james2doyle wrote:
       | Looks similar to the new FUTO keyboard:
       | https://voiceinput.futo.org/
        
         | iamjackg wrote:
         | I've been using this for a while (the voice input, not their
         | keyboard) and it's so refreshing to be able to just speak and
         | have the output come out as fully formed, well punctuated
         | sentences with proper capitalization.
        
           | james2doyle wrote:
            | I agree. No more "speaking punctuation". Just talk as normal
            | and it comes out fully formed.
        
             | freedomben wrote:
              | I actually don't mind speaking punctuation; in fact, it
              | kind of helps. What I really hate is the middle spot where
              | we are right now, where it tries to place punctuation and
              | sucks badly at it.
        
               | infinitezest wrote:
                | In my experience, FUTO is actually pretty good at just
                | knowing the right punctuation to use.
        
         | leobg wrote:
         | Anything like that available for iOS?
        
           | crazygringo wrote:
           | iOS already has on-device dictation built into the standard
           | keyboard.
           | 
           | Years ago it got sent to the cloud, but as long as you have
           | an iPhone from the past few years it's on-device.
        
             | ttla wrote:
             | You're right that it exists, but it's complete crap outside
             | a quiet environment. Try to use it while walking around
             | outside or in any semi-noisy area and it fails horribly
             | (iPhone 13, so YMMV if you have a newer one).
             | 
             | You cannot use an iPhone as a dictation device without
             | reviewing the transcribed text, which IMO defeats the
             | purpose of dictation.
             | 
              | Meanwhile, I've gotten excellent results on the iPhone from
              | a Whisper->LLM pipeline.
        
               | crazygringo wrote:
               | I've never found real-time dictation software that
               | doesn't need to be reviewed.
               | 
               | I'm definitely waiting for Apple to upgrade their
               | dictation software to the next generation -- I have my
               | own annoyances with it -- but I haven't found anything
               | else that works way better, in real time, on a phone,
               | that runs in the background (like as part of the
               | keyboard).
               | 
               | You talk about Whisper but that doesn't even work in real
               | time, much less when you have to run it through an LLM.
        
           | b33f wrote:
            | Aiko is a free app for iOS and macOS that also uses Whisper
            | for local speech-to-text.
        
           | brylie wrote:
           | Aiko, mentioned elsewhere, includes a local copy of the
           | OpenAI Whisper model:
           | https://apps.apple.com/app/aiko/id1672085276
        
         | yjftsjthsd-h wrote:
          | But this is open source, which is a pretty big difference.
        
           | grandma_tea wrote:
           | FUTO and Transcribro are open source.
        
             | yencabulator wrote:
             | FUTO is not open source.
             | 
             | https://gitlab.futo.org/alex/voiceinput/-/blob/master/LICEN
             | S...
             | 
             | > FUTO Source First License 1.0
             | 
             | > You may use or modify the software only for non-
             | commercial purposes
        
             | Humbly8967 wrote:
             | No, FUTO made a new "Source First License"[1] that is not
             | Open Source by the OSI definition.
             | 
             | [1] https://github.com/futo-org/android-
             | keyboard/blob/master/LIC...
        
               | grandma_tea wrote:
               | Oh, that's lame.
        
               | observationist wrote:
               | I can get behind people doing their own custom "licenses"
               | that amount to throwing their work into the public
               | domain, but if someone builds their own limited licenses
               | around a thing, I won't touch their product. This FUTO
               | license is garbage. Use a real license and either be open
               | source or not; inventing new personal licenses doesn't do
               | anyone any good.
        
         | kolme wrote:
         | This looks great! I've been wanting to drop the Swipe keyboard
         | ever since I saw sneaky ads on it (like me typing "Google Maps"
         | and getting "Bing Maps" as a "suggestion").
        
       | yewenjie wrote:
       | Seems like Gboard is incompatible with it. Is there a good enough
       | open source alternative to Gboard in 2024 that has smooth glide-
       | typing and a similar layout?
        
         | SparkyMcUnicorn wrote:
         | Any of these should work.
         | 
         | https://github.com/Helium314/HeliBoard
         | 
         | https://github.com/openboard-team/openboard
         | 
         | https://github.com/rkkr/simple-keyboard (guessing, since AOSP
         | Keyboard works and this is a fork)
         | 
         | Not open source: https://www.microsoft.com/en-us/swiftkey
         | 
          | Does not have glide/swipe (reserved for symbols), but I just
          | installed it and am giving it a shot:
          | https://github.com/Julow/Unexpected-Keyboard
        
           | Grimblewald wrote:
           | Unexpected keyboard is unexpectedly awesome. Looks a bit
           | dated, but boy does it have some functionality packed into
           | it.
        
           | nine_k wrote:
           | My choice is
           | https://github.com/AnySoftKeyboard/AnySoftKeyboard/
           | 
            | It does have glide typing, even though I don't use it.
            | 
            | Rather, it uses long-tap to access multiple symbols, and can
            | be split or pushed to a corner on devices with a big screen.
        
       | lawgimenez wrote:
        | This is cool. I get to read another Jetpack Compose codebase,
        | since I am halfway through migrating our app to Jetpack Compose.
        | So this helps a lot.
        
       | tmaly wrote:
       | I wish there was something where I could transcribe iPhone voice
       | memos to text.
       | 
       | I would pay for an app that did this.
        
         | hidelooktropic wrote:
         | The microphone icon on the keyboard does this.
        
         | cee_el123 wrote:
          | Google has an app called Live Transcribe on Android, but
          | there's no iPhone version.
          | 
          | This looks like an unaffiliated version:
          | https://apps.apple.com/us/app/live-transcribe/id1471473738
        
       | swyx wrote:
        | Is there an iPhone version of this? A custom keyboard?
        
       | smeej wrote:
       | Not sure what I'm doing wrong, but I tried installing it on a
       | GrapheneOS device _with_ Play Services installed and nothing
       | happened. When I pushed the mic button, it changed to look
       | pressed for a second, and went back to normal. Nothing happened
       | when I spoke. Tried holding it down while speaking. Still
       | nothing.
       | 
        | I'm very interested in using this, but I can't even find a way to
        | _try_ to troubleshoot it. I'm not finding usage instructions,
        | never mind any kind of error messages. It just doesn't do
        | anything.
       | 
       | This is especially interesting to me because the screenshot on
       | the repo is from Vanadium, which strongly suggests to me that
       | it's from a GrapheneOS device itself.
        
         | soupslurpr wrote:
          | You're correct, I do use GrapheneOS. Hm, do you have the global
          | microphone toggle off? There's an upstream issue that causes
          | SpeechRecognizer implementations to silently fail when the
          | microphone toggle is off. You may have to force-stop
          | Transcribro after turning it back on.
         | 
         | https://github.com/soupslurpr/Transcribro/issues/3
        
           | smeej wrote:
            | I didn't think I did, but cycling it a couple of times and
            | restarting did fix it! Great guess!
           | 
            | The thing I'm tripping over now is just that I keep pressing
            | the button more than once when I'm done speaking, because
            | it's not clear that it registered the first time. If it could
            | even just stay "pressed" or something while it processes the
            | text, I think that would make it clearer. Any third state for
            | the button would do, I think.
           | 
           | Looking forward to using this! Thanks!
        
       ___________________________________________________________________
       (page generated 2024-07-19 23:11 UTC)