[HN Gopher] Transcribro: On-device Accurate Speech-to-text
___________________________________________________________________
Transcribro: On-device Accurate Speech-to-text
Author : thebiblelover7
Score : 155 points
Date : 2024-07-18 17:25 UTC (1 day ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| crancher wrote:
| Accrescent hype is comically overdone.
| free_bip wrote:
| I looked in the GitHub issues and there's a closed issue for
| F-Droid inclusion. The author states that F-Droid "Doesn't meet
| their requirements" but doesn't elaborate. I wonder what
| F-Droid is missing that they need so much?
| ementally wrote:
| Reason: https://www.privacyguides.org/en/android/#f-droid
| okso wrote:
| Link: https://github.com/soupslurpr/Transcribro/issues/9
| okso wrote:
| F-Droid only packages open-source software and rebuilds it
| from source, while installing from Accrescent would move all
| trust to the developer, even if the license changes to
| proprietary.
|
| I understand that the author trusts themselves more than
| F-Droid, but as a user the opposite seems more relevant.
| mijoharas wrote:
| I only just heard of it from this project.
|
| I see the features listed[0] which seems like a reasonable
| feature set, but nothing unusual afaict.
|
| If there has been a lot of hype can you tell me what people
| find compelling about it?
|
| [0] https://accrescent.app/
| flax wrote:
| The documentation is severely lacking. I wanted to know whether
| this does streaming or only batch processing, and to see
| examples of integrating it with Android apps.
| pants2 wrote:
| Considering it uses Whisper, it's probably not streaming
| refulgentis wrote:
| I did some core work on TTS at Google, at several layers, and
| I've never quite understood what people mean by streaming vs.
| not.
|
| In each and every case I'm familiar with, streaming means
| "send the whole audio thus far to the inference engine, run
| inference on it, and send back the transcription".
|
| I have a Flutter library that does the same flow as this
| (though via ONNX, so I can cover all platforms), and Whisper
| + Silero is ~identical to the interfaces I used at Google.
|
| If the idea is that streaming means each audio byte is only
| sent once to the server, there's still an audio buffer
| accumulated -- it's just on the server.
| iamjackg wrote:
| I think in practical terms (at least for me):
|
| - streaming == I talk and the text appears as I talk
|
| - batched == I talk, and after I'm done talking some
| processing happens and the text gets populated
| refulgentis wrote:
| Gotcha, then, it's "not even wrong" in the Pauli sense to
| say Whisper isn't streaming
| opprobium wrote:
| It is not streaming in the way people normally use this
| term. It's a fuzzy notion but typically streaming means
| something encompassing:
|
| - Processing and emitting results on something closer to a
| word-by-word level
|
| - Allowing partial results while the user is still speaking
| and mid-segment
|
| - Not relying on an external segmenter to determine the
| chunking (and therefore also the latency) of the output.
| refulgentis wrote:
| This is fascinating, because if your hint in another comment
| means you worked on this at Google, it's entirely possible I
| have this all wrong because I'm missing the _actual ML_ part
| - I wrote the client encoder & server decoder for Opus and
| the client-side UI for the SODA launch, and I'm honestly
| really surprised to hear Google has different stuff. The
| client-side code loop AGSA used is, in my experience, 100%
| replicated by using Whisper.
|
| I don't want to make too strong a claim given NDAs (in
| reality, my failing memory :P), but I'm 99% sure on-device
| inference is just as fast as SODA. I don't know what to say
| because I'm flummoxed; it makes sense to me that Whisper
| isn't as good as SODA, and I don't want to start banging the
| table insisting it's no different from a user or client
| perspective -- I don't think that would be fair. There's a
| difference in model architecture and it matters. I think
| Whisper is at least a couple points of WER behind.
|
| But then where are the better STT solutions? Are all the
| obviously much better solutions really locked up? Picovoice
| is the only closed solution I know of available for local
| dev, and even by their own numbers it's only better than the
| worst Whisper. The smallest Whisper is 70 MB in ONNX vs. 130
| MB for the next step up; both run inference fine with ~600
| ms latency from audio byte at the mic to text on screen,
| everywhere from WASM in a web browser to a 3-year-old
| Android phone.
| regularfry wrote:
| Something to keep an eye on is that Whisper is strongly
| bound to processing a 30-second window at a time. So if
| you send it 30 seconds of audio, and it decodes it, then you
| send it another second of audio, the only sensible way it
| can work is to reprocess seconds 2-30 in addition to the new
| data at second 31. If there were a way to have it process
| just the update, there's every possibility it could avoid a
| lot of work.
|
| I suspect that's what people are getting at by saying
| it's "not streaming": it's built as a batch process but,
| under some circumstances, you can run it fast enough to
| get away with pretending that it isn't.
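|
| A minimal sketch of that naive re-decode loop, using the
| openai-whisper Python package (the chunk handling here is an
| illustrative assumption, not any particular app's actual
| implementation):
|
|   import numpy as np
|   import whisper
|
|   SAMPLE_RATE = 16000   # Whisper expects 16 kHz mono float32
|   WINDOW_SECONDS = 30   # Whisper's fixed input window
|
|   model = whisper.load_model("base")
|   buffer = np.zeros(0, dtype=np.float32)
|
|   def on_new_audio(chunk: np.ndarray) -> str:
|       """Append ~1 s of new audio and re-decode the last 30 s.
|
|       Everything already inside the window is reprocessed on
|       every call -- the redundant work a true streaming model
|       would avoid.
|       """
|       global buffer
|       buffer = np.concatenate([buffer, chunk])
|       window = buffer[-WINDOW_SECONDS * SAMPLE_RATE:]
|       return model.transcribe(window, fp16=False)["text"]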
| opprobium wrote:
| You are missing the speech decoding part. I can't speak to
| why the clients you were working on were doing what they
| were doing. For a different reference point, see the cloud
| streaming API.
|
| This is a good public reference:
| https://research.google/blog/an-all-neural-on-device-
| speech-...
|
| Possibly confusions from that doc: "RNN-T" is entirely
| orthogonal to RNNs (and not the only streamable model).
| Attention is also orthogonal to streaming. A chunked or
| sliding window attention can stream, a bi-directional RNN
| cannot. How you think about streaming for an encoder vs. for
| a decoder is also different.
|
| At a practical level, if a model is fast enough and VAD is
| doing an adequate job, you can get something that looks like
| "streaming" with a non-streaming model. If a streaming model
| has tons of look-ahead or a very large input chunk size, its
| latency may not feel a lot better.
|
| Where the difference is sharp is where VAD is not adequate:
| users speak in continuous streams of audio, they leave
| unusual gaps within sentences and run sentences together. A
| non-streaming system either hurts quality, because sentences
| (or even words) get broken up that shouldn't be, or it has
| to wait forever and never gets a chance to run, when a
| streaming system would already have been producing output.
|
| And to your points about echo cancellation and interference:
| there are many text-only operations that benefit from being
| able to start early in the audio stream, not late.
|
| I just went through the process of helping someone stand up
| an interactive system with Whisper etc., and the lack of an
| open-source Whisper-quality streaming system is such a
| bummer, because the result is so much laggier than it has
| to be.
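|
| For reference, the VAD-gated batch approach being contrasted
| here looks roughly like this -- a minimal sketch assuming
| Silero VAD loaded via torch.hub plus the openai-whisper
| package; the 0.5 speech threshold and ~0.5 s silence gap are
| illustrative assumptions:
|
|   import numpy as np
|   import torch
|   import whisper
|
|   SR = 16000
|   SILENCE_CHUNKS = 15  # ~0.5 s of 512-sample chunks
|
|   vad, _ = torch.hub.load("snakers4/silero-vad", "silero_vad")
|   asr = whisper.load_model("base")
|   segment, silent = [], 0
|
|   def feed(chunk: np.ndarray) -> None:
|       """Gate audio with VAD; decode once silence closes a segment.
|
|       chunk: 512 float32 samples at 16 kHz, as Silero expects.
|       """
|       global segment, silent
|       speech_prob = vad(torch.from_numpy(chunk), SR).item()
|       if speech_prob > 0.5:
|           segment.append(chunk)
|           silent = 0
|       elif segment:
|           silent += 1
|           if silent >= SILENCE_CHUNKS:   # user paused: decode now
|               print(asr.transcribe(np.concatenate(segment),
|                                    fp16=False)["text"])
|               segment, silent = [], 0
|
| No text appears until the silence gap closes -- exactly the
| lag described above.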
| flax wrote:
| "streaming" in this case is like another reply said:
| transcriptions appear as I talk. Compared to not-streaming
| in which the service waits for silence, then processes the
| captured speech, then returns some transcription.
|
| Is your Flutter library available? And does it run locally?
| I'm looking for a good Flutter streaming (in the sense
| above) speech recognition library. vosk looks good, but
| it's lacking some configurability such as selecting audio
| source.
| refulgentis wrote:
| FONNX. I haven't gone out of my way to make it trivial[1],
| but it's very good, battle-tested on every single platform.
| (And yes, it runs locally.)
|
| [1] The example app shows how to do everything, and there's
| basic documentation, but man, the amount of nonsense you
| need to know to pull it all together is just too hard to
| document without a specific question. Do feel free to file
| an issue.
| opprobium wrote:
| Streaming for TTS doesn't matter, but for speech-to-text it
| is more meaningful in interactive cases. There the user's
| speech is arriving in real time, and streaming can mean a
| couple levels of things:
|
| - Overlap compute with the user speaking: not having to wait
| until all the speech has been acquired can massively reduce
| latency at the end of speech and allow a larger model to be
| used. This doesn't have to be the whole system; for
| instance, an encoder can run in this fashion over the audio
| as it comes in, even if the final step of the system runs in
| a non-streaming fashion.
|
| - Produce partial results while the user is speaking: this
| can be just a UI nice-to-have, but it can also be much
| deeper; e.g. a system can act on words or phrases in the
| input before the user is finished speaking, which can
| dramatically change latency.
|
| - Better segmentation: Whisper + Silero just uses VAD to
| make segments for Whisper, which is not at all the best you
| can do if you are actually decoding as you go. Looking at
| the results as you go allows you to make much better and
| faster segmentation decisions.
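|
| A rough sketch of the middle level -- emit cheap partial
| hypotheses mid-utterance, finalize on an endpoint; the 500 ms
| refresh interval and the callback names are illustrative
| assumptions:
|
|   import numpy as np
|   import whisper
|
|   SR = 16000
|   REFRESH = SR // 2   # re-decode every ~500 ms of new audio
|
|   model = whisper.load_model("base")
|   buf = np.zeros(0, dtype=np.float32)
|   pending = 0
|
|   def feed(chunk: np.ndarray, endpoint: bool,
|            on_partial=print, on_final=print) -> None:
|       """Emit partials as audio arrives; finalize on a VAD endpoint."""
|       global buf, pending
|       buf = np.concatenate([buf, chunk])
|       pending += len(chunk)
|       if endpoint:   # VAD says the utterance ended
|           on_final(model.transcribe(buf, fp16=False)["text"])
|           buf = np.zeros(0, dtype=np.float32)
|           pending = 0
|       elif pending >= REFRESH:   # mid-utterance partial update
|           on_partial(model.transcribe(buf, fp16=False)["text"])
|           pending = 0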
| refulgentis wrote:
| The only models that do what you're poking at holistically
| are 4o (claimed) and that French company's 7B one. They're
| also bleeding edge, either unreleased or released and way
| wilder; e.g. the French one interrupts too much and
| occasionally screams back in an alien language.
|
| Until these, you'd use echo cancellation to try and allow
| interruptible dialogue, and that's unsolved; you need a
| consistently cooperative chipset vendor for that (read: it
| wasn't possible even at scale, with carrots, presumably
| sticks, and much cajoling. So it only works consistently on
| iPhones.)
|
| The partial results are obtained by running inference on the
| entire audio so far, and silence is determined by VAD, on
| every stack I've seen that is described as streaming.
|
| I find it hard to believe that Google and Apple
| specifically, and every other audio stack I've seen, are
| choosing to do "not the best they can at all".
| opprobium wrote:
| This is exactly what Google ASR does. Give it a try and
| watch how the results flow back to you, it certainly is
| not waiting for VAD segment breaking. I should know.
|
| Streaming used to be something people cared about more. VAD
| is always part of those systems as well -- you want to use
| it to start segments and to hard cut off -- but it is just
| the starting point. It's kind of a big gap (to me) in the
| models available since Whisper came out, partly I think
| because streaming does add to the complexity of using the
| model, and latency has to be tuned and traded off against
| quality.
| Nimitz14 wrote:
| This is a complete non sequitur lol. FYI, Whisper is not a
| streaming model, though it can, with some work, be adapted
| into one.
| refulgentis wrote:
| You and I agree fully, then. IMHO it's not too much work at
| all: 400 LOC and someone else's models. Of course, as in
| that old saw, the art is knowing exactly which models,
| knowing what ONNX is, etc. -- that's what makes it fast.
|
| The non sequitur is because I can't feel out what's going on
| from their perspective. The hedging left a huge range where
| they could have been saying "I saw the gpt4o demo and
| there's another way that lets you have more natural
| conversation", or "hey, think like an LSTM model, like
| Silero; there are voice recognizers that let you magically
| get a state and current transcription out", or, in between,
| "yeah, in reality the models are f(audio bytes) =>
| transcription", which appears to be closer to your position,
| given your "it's not a streaming model, though it can be
| adapted".
| r2_pilot wrote:
| Thank you for your insight. It confirms some of my
| suspicions working in this area (you wouldn't happen to
| know anybody who makes anything more modern than the
| Respeaker 4-mic array?). My biggest problem is that even
| with AEC, the voice output triggers the VAD, so the system
| continually thinks it's getting interrupted by a human.
| My next attempt will be to try to only signal true VAD if
| there's also sound coming from anywhere but behind, where
| the speaker is. It's been an interesting challenge so far
| though.
| refulgentis wrote:
| Re: the mic, alas, no. BigCo kinda sucked; I had to go way
| out of my way to get to work on interesting stuff, it never
| mattered, and even when you did, you never got past the
| immediate wall of your own org except for brief moments.
| I.e. we never had anyone even close to knowing anything
| about the microphones we'd be using; they were shocked to
| hear what AEC was, even when what we were working on was a
| marketing tentpole for Pixel. Funny place.
|
| I'm really glad you saw this. So, so, so much time and
| hope was wasted there on the Nth team of XX people saying
| "how hard can it be? given physics and a lil ML, we can
| do $X", and inevitably reality was far more complicated,
| and it's important to me to talk about it so other people
| get a sense it's not them, it's the problem. Even
| unlimited resources and your Nth fresh try can fail.
|
| FWIW my mind's been grinding on how I'd get my little
| Silero x Whisper gAssistant on-device replica pulling off
| something akin to the gpt4o demo. I keep coming back to
| speaker ID: replace Silero with some of the newer models I'm
| seeing hit ONNX. Super handwave-y, but I can't help thinking
| this does an end run around both AEC being shit on
| presumably most non-Apple devices, and the poor interactions
| you get from trying to juggle two things operating
| differently (VAD and AEC). """Just""" detect when there are
| >= 2 simultaneous speakers with > 20% confidence --- of
| course, tons of bits are missing from there; ideally you'd
| be resilient to e.g. a TV in the background. Sigh. Tough
| problems.
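|
| A very rough sketch of that idea. "overlap.onnx" and its
| input/output contract are hypothetical, standing in for
| whatever newer speaker models land in ONNX:
|
|   import numpy as np
|   import onnxruntime as ort
|
|   # Hypothetical model: takes a mono 16 kHz frame, returns
|   # probabilities for 0, 1, and 2+ simultaneous speakers.
|   session = ort.InferenceSession("overlap.onnx")
|
|   def interrupted(frame: np.ndarray) -> bool:
|       """Flag >= 2 simultaneous speakers at > 20% confidence."""
|       probs = session.run(None, {"audio": frame[None, :]})[0][0]
|       return probs[2] > 0.20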
| azeirah wrote:
| I'm not particularly experienced, but I did have good
| experiences with Picovoice's services. It's a business
| specialized in programmatically available audio services:
| TTS, VAD, etc.
|
| They have a VAD that is trained on a 10-second clip of
| -your- voice, and it is then only activated by -your- voice.
| It works quite well in my experience, although it does add a
| little additional latency before it starts detecting your
| voice (which is reasonably easy to overcome by keeping a 1 s
| buffer of audio ready at all times: if the VAD activates,
| just prepend the past 100-200 ms from the buffer to the
| recorded audio, as sketched below. Works perfectly fine;
| it's just that the UI showing "voice detected" or "voice not
| detected" might lag behind by 100-200 ms.)
|
| Source: I worked on a VAD + whisper + LLM demo project
| this year and ran into some VAD issues myself too.
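|
| A minimal sketch of that pre-roll buffer (the 200 ms length
| and function names are illustrative assumptions):
|
|   from collections import deque
|
|   import numpy as np
|
|   SR = 16000
|   PREROLL = int(0.2 * SR)   # keep the last 200 ms on hand
|
|   preroll = deque(maxlen=PREROLL)   # rolling recent samples
|   recording: list[np.ndarray] = []
|   active = False
|
|   def feed(chunk: np.ndarray, vad_active: bool) -> None:
|       """Prepend the pre-roll when VAD fires, so the utterance
|       onset the VAD missed isn't clipped off."""
|       global active
|       if vad_active and not active:   # VAD just fired (late)
|           recording.append(np.fromiter(preroll, dtype=np.float32))
|           active = True
|       if active:
|           recording.append(chunk)
|       preroll.extend(chunk)   # always keep the buffer warm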
| james2doyle wrote:
| Looks similar to the new FUTO keyboard:
| https://voiceinput.futo.org/
| iamjackg wrote:
| I've been using this for a while (the voice input, not their
| keyboard) and it's so refreshing to be able to just speak and
| have the output come out as fully formed, well punctuated
| sentences with proper capitalization.
| james2doyle wrote:
| I agree. No more "speaking punctuation". Just talk as normal
| and it comes out fully formed
| freedomben wrote:
| I actually don't mind speaking punctuation; in fact it kind
| of helps. What I really hate is the middle spot where we are
| right now, where it tries to place punctuation and sucks
| badly at it.
| infinitezest wrote:
| In my experience, FUTO is actually pretty good at just
| knowing the right punctuation to use.
| leobg wrote:
| Anything like that available for iOS?
| crazygringo wrote:
| iOS already has on-device dictation built into the standard
| keyboard.
|
| Years ago it got sent to the cloud, but as long as you have
| an iPhone from the past few years it's on-device.
| ttla wrote:
| You're right that it exists, but it's complete crap outside
| a quiet environment. Try to use it while walking around
| outside or in any semi-noisy area and it fails horribly
| (iPhone 13, so YMMV if you have a newer one).
|
| You cannot use an iPhone as a dictation device without
| reviewing the transcribed text, which IMO defeats the
| purpose of dictation.
|
| Meanwhile, I've gotten excellent results on the iPhone from
| a Whisper->LLM pipeline.
| crazygringo wrote:
| I've never found real-time dictation software that
| doesn't need to be reviewed.
|
| I'm definitely waiting for Apple to upgrade their
| dictation software to the next generation -- I have my
| own annoyances with it -- but I haven't found anything
| else that works way better, in real time, on a phone,
| that runs in the background (like as part of the
| keyboard).
|
| You talk about Whisper but that doesn't even work in real
| time, much less when you have to run it through an LLM.
| b33f wrote:
| Aiko is a free app for iOS and macOS that also uses Whisper
| for local speech-to-text.
| brylie wrote:
| Aiko, mentioned elsewhere, includes a local copy of the
| OpenAI Whisper model:
| https://apps.apple.com/app/aiko/id1672085276
| yjftsjthsd-h wrote:
| But open source, which is a pretty big difference
| grandma_tea wrote:
| FUTO and Transcribro are open source.
| yencabulator wrote:
| FUTO is not open source.
|
| https://gitlab.futo.org/alex/voiceinput/-/blob/master/LICEN
| S...
|
| > FUTO Source First License 1.0
|
| > You may use or modify the software only for non-
| commercial purposes
| Humbly8967 wrote:
| No, FUTO made a new "Source First License"[1] that is not
| Open Source by the OSI definition.
|
| [1] https://github.com/futo-org/android-
| keyboard/blob/master/LIC...
| grandma_tea wrote:
| Oh, that's lame.
| observationist wrote:
| I can get behind people doing their own custom "licenses"
| that amount to throwing their work into the public
| domain, but if someone builds their own limited licenses
| around a thing, I won't touch their product. This FUTO
| license is garbage. Use a real license and either be open
| source or not; inventing new personal licenses doesn't do
| anyone any good.
| kolme wrote:
| This looks great! I've been wanting to drop the Swipe keyboard
| ever since I saw sneaky ads on it (like me typing "Google Maps"
| and getting "Bing Maps" as a "suggestion").
| yewenjie wrote:
| Seems like Gboard is incompatible with it. Is there a good enough
| open source alternative to Gboard in 2024 that has smooth glide-
| typing and a similar layout?
| SparkyMcUnicorn wrote:
| Any of these should work.
|
| https://github.com/Helium314/HeliBoard
|
| https://github.com/openboard-team/openboard
|
| https://github.com/rkkr/simple-keyboard (guessing, since AOSP
| Keyboard works and this is a fork)
|
| Not open source: https://www.microsoft.com/en-us/swiftkey
|
| Does not have glide/swipe (reserved for symbols), but I just
| installed it and am giving it a shot:
| https://github.com/Julow/Unexpected-Keyboard
| Grimblewald wrote:
| Unexpected keyboard is unexpectedly awesome. Looks a bit
| dated, but boy does it have some functionality packed into
| it.
| nine_k wrote:
| My choice is
| https://github.com/AnySoftKeyboard/AnySoftKeyboard/
|
| It does have glide typing, even though I don't use it.
|
| It also uses long-tap to access multiple symbols, and can be
| split or pushed to a corner on devices with a big screen.
| lawgimenez wrote:
| This is cool. I get to read another Jetpack Compose
| codebase, since I am halfway through migrating our app to
| Compose, so this helps a lot.
| tmaly wrote:
| I wish there were something that could transcribe iPhone
| voice memos to text.
|
| I would pay for an app that did this.
| hidelooktropic wrote:
| The microphone icon on the keyboard does this.
| cee_el123 wrote:
| Google has an app called Live Transcribe on Android, but
| there's no iPhone version.
|
| This looks like an unaffiliated version:
| https://apps.apple.com/us/app/live-transcribe/id1471473738
| swyx wrote:
| is there an iPhone version of this? custom keyboard?
| smeej wrote:
| Not sure what I'm doing wrong, but I tried installing it on a
| GrapheneOS device _with_ Play Services installed and nothing
| happened. When I pushed the mic button, it changed to look
| pressed for a second, and went back to normal. Nothing happened
| when I spoke. Tried holding it down while speaking. Still
| nothing.
|
| I'm very interested in using this, but I can't even find a way to
| _try_ to troubleshoot it. I'm not finding usage instructions,
| never mind any kind of error messages. It just doesn't do
| anything.
|
| This is especially interesting to me because the screenshot on
| the repo is from Vanadium, which strongly suggests to me that
| it's from a GrapheneOS device itself.
| soupslurpr wrote:
| You're correct, I do use GrapheneOS. Hm, do you have the
| global microphone toggle off? There's an upstream issue that
| causes SpeechRecognizer implementations to silently fail
| when the microphone toggle is off. You may have to
| force-stop Transcribro after turning the toggle back on.
|
| https://github.com/soupslurpr/Transcribro/issues/3
| smeej wrote:
| I didn't think I did, but cycling it a couple of times and
| restarting did fix it! Great guess!
|
| The thing I'm tripping over now is just that I keep pressing
| the button more than once when I'm done speaking because it's
| not clear that it registered the first time. If it could even
| just stay "pressed" or something while it processes the text,
| I think that would make it clearer. Any third state for the
| button would do, I think.
|
| Looking forward to using this! Thanks!
___________________________________________________________________
(page generated 2024-07-19 23:11 UTC)