[HN Gopher] Omni SenseVoice: High-Speed Speech Recognition with Word Timestamps
___________________________________________________________________
Omni SenseVoice: High-Speed Speech Recognition with Word
Timestamps
Author : ringer007
Score : 151 points
Date : 2024-10-13 00:48 UTC (21 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| modeless wrote:
| Looks cool! Combine this with the new TTS that was released
| today (which looks really good) and an LLM, and you'd have a
| pretty good all-local voice assistant!
| https://github.com/SWivid/F5-TTS
| staticautomatic wrote:
| I've been building a production app on top of ASR and find the
| range of models kind of bewildering compared to LLMs and video.
| The commercial offerings seem to be custom or built on top of
| Whisper or maybe NVIDIA Canary/Parakeet, and then you have
| stuff like SpeechBrain that seems to run on top of lots of
| different open models for different tasks. Sometimes it's
| genuinely hard to tell what's a foundation model and what
| isn't.
|
| Separately, I wonder if this is the model Speechmatics uses.
| leetharris wrote:
| We released a new SOTA ASR as open source just a couple of
| weeks ago. https://www.rev.com/blog/speech-to-text-
| technology/introduci...
|
| Take a look. We'll be open sourcing more models very soon!
| mkl wrote:
| > These models are accessible under a non-commercial license.
|
| That is not open source.
| threeseed wrote:
| Exactly. It is source available but not open source:
|
| https://opensource.org/osd
| staticautomatic wrote:
| I'll check it out.
|
| FWIW, in terms of benchmarking, I'm more interested in
| benchmarks against Gladia, Deepgram, Pyannote, and
| Speechmatics than whatever is built into the hyperscaler
| platforms. But I end up doing my own anyway so whatevs.
|
| Also, you guys need any training data? I have >10K hrs of
| conversational iso-audio :)
| yalok wrote:
| That's great to hear! Amazing performance from the model!
|
| For voice chat bots, however, shorter input utterances are the
| norm (anywhere from 1-10 seconds), with lots of silence in
| between, so this limitation is a bit sad:
|
| > On the Gigaspeech test suite, Rev's research model is worse
| than other open-source models. The average segment length of
| this corpus is 5.7 seconds; these short segments are not a
| good match for the design of Rev's model. These results
| demonstrate that despite its strong performance on long-form
| tests, Rev is not the best candidate for short-form
| recognition applications like voice search.
| woodson wrote:
| There's just not a single one-size-fits-all model/pipeline. You
| choose the right one for the job, depending on whether you need
| streaming (i.e., low latency; words output right as they're
| spoken), to run on device (e.g., a phone) or on a server, which
| languages/dialects you need, and whether the audio is
| conversational or more "produced" like a news broadcast or
| podcast. The best way is to benchmark with data in your target
| domain.
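|
| A minimal WER-benchmark sketch with jiwer; the file layout (one
| reference line and one hypothesis line per utterance) is an
| assumption:
|
|     # Score each candidate system on in-domain test data.
|     from jiwer import wer
|
|     refs = [line.strip() for line in open("refs.txt")]
|     for name in ["model_a", "model_b"]:
|         hyps = [line.strip() for line in open(f"{name}_hyps.txt")]
|         print(name, wer(refs, hyps))  # lower WER wins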
| staticautomatic wrote:
| Sure, you're just going to try lots of things and see what
| works best. But it's confusing to compare things at such
| different levels of abstraction, where a lot of the time you
| don't even know what you're comparing and it's impossible to
| do apples-to-apples even on your own test data. If your need
| is "speaker identification", you're going to end up comparing
| commercial black boxes like Speechmatics (probably custom) vs
| commercial translucent boxes like Gladia (some custom blend
| of Whisper + Pyannote + etc.) vs
| [asr_api]/[some_specific_sepformer_model]. Like, I can
| observe that products I know to be built on top of Whisper
| don't seem to handle overlapping speaker diarization that
| well, but I don't actually have any way of knowing whether
| that has anything to do with Whisper.
| satvikpendem wrote:
| Can it diarize?
| staticautomatic wrote:
| Apparently not. See https://github.com/lifeiteng/OmniSenseVoice
| /blob/main/src/om.... See also FunASR running SenseVoice but
| using Kaldi for speaker identification
| https://github.com/modelscope/FunASR/blob/cd684580991661b9a0...
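|
| For reference, a hedged sketch of that blend: pyannote.audio
| produces speaker turns, and each ASR word is assigned the turn
| it overlaps most. The (start, end) word-timestamp shape is an
| assumption, not OmniSenseVoice's actual output format:
|
|     from pyannote.audio import Pipeline
|
|     # Off-the-shelf diarization pipeline (needs an HF token).
|     pipeline = Pipeline.from_pretrained(
|         "pyannote/speaker-diarization-3.1",
|         use_auth_token="HF_TOKEN")
|     turns = [(t.start, t.end, spk) for t, _, spk in
|              pipeline("audio.wav").itertracks(yield_label=True)]
|
|     def speaker_for(start, end):
|         # Pick the speaker turn with the largest time overlap.
|         def overlap(turn):
|             return min(end, turn[1]) - max(start, turn[0])
|         best = max(turns, key=overlap)
|         return best[2] if overlap(best) > 0 else None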
| deegles wrote:
| Does it do diarization?
| staticautomatic wrote:
| Apparently not. See my reply to satvikpendem.
| steinvakt wrote:
| How does the accuracy compare to Whisper?
| Etheryte wrote:
| This uses SenseVoice under the hood, which claims to have
| better accuracy than Whisper. Not sure how accurate that
| statement is, though, since I haven't seen a third-party
| comparison; in this space it's very easy to toot your own
| horn.
|
| [0] https://github.com/FunAudioLLM/SenseVoice
| pferdone wrote:
| I mean, they make a bold claim up top just to walk it back a
| little further down with: "[...] In terms of Chinese and
| Cantonese recognition, the SenseVoice-Small model has
| advantages."
|
| It feels dishonest to me.
|
| [0] https://github.com/FunAudioLLM/SenseVoice?tab=readme-ov-
| file...
| jmward01 wrote:
| This uses SenseVoice-Small under the hood. They claim their
| large model is better than Whisper large-v3, not the small
| version. This small version is definitely worse than Whisper
| large-v3 but still usable, and the extra annotation it does is
| interesting.
| khimaros wrote:
| This claims to have speaker diarization, which is a
| potentially killer feature missing from most Whisper
| implementations.
| ks2048 wrote:
| I've been doing some things with Whisper and find the accuracy
| very good, BUT I've found the timestamps to be pretty bad. For
| example, using the timestamps directly to clip words or
| phrases often clips off the end of a word (even in simple
| cases where it is followed by silence). Since this emphasizes
| word timestamps, I may give it a try.
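|
| A common workaround is to pad the clip boundaries. A minimal
| sketch, assuming word-level (start, end) timestamps in seconds:
|
|     import soundfile as sf
|
|     def clip_word(path, start, end, pad=0.15, out="word.wav"):
|         # Widen the window slightly so trailing phonemes the
|         # timestamp misses aren't cut off.
|         audio, sr = sf.read(path)
|         a = max(0, int((start - pad) * sr))
|         b = min(len(audio), int((end + pad) * sr))
|         sf.write(out, audio[a:b], sr)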
| frozencell wrote:
| Does it work with chorus?
| mrkramer wrote:
| With timestamps?! I gotta try this.
| jbellis wrote:
| OOMs even in quantized mode on a 3090. What's a better option for
| personal use?
|
| > torch.OutOfMemoryError: CUDA out of memory. Tried to allocate
| 43.71 GiB. GPU 0 has a total capacity of 24.00 GiB of which 20.74
| GiB is free.
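|
| An allocation that big suggests the whole file hits the GPU at
| once; chunked inference is one workaround. A sketch only:
| `model.transcribe` is a hypothetical stand-in for whatever API
| the project actually exposes:
|
|     import torch
|     import torchaudio
|
|     def transcribe_chunked(model, path, chunk_s=30):
|         wav, sr = torchaudio.load(path)  # (channels, frames)
|         wav = wav.mean(dim=0)            # downmix to mono
|         step = chunk_s * sr
|         texts = []
|         for i in range(0, wav.numel(), step):
|             # Hypothetical call; substitute the real API.
|             texts.append(model.transcribe(wav[i:i + step], sr))
|             torch.cuda.empty_cache()     # free inter-chunk buffers
|         return " ".join(texts)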
| yellow_lead wrote:
| Not sure if you mean in general, or options for this particular
| project, but Whisper should work for you.
| unshavedyak wrote:
| Can't wait for a bundle of something like this with screen
| capture. I'd love to pipe my convos/habits/apps/etc. to a
| local index for search. Seems we're getting close.
| throwaway2016a wrote:
| This looks really nice. What I find interesting is that it
| seems to advertise itself for the transcription use case, but
| if it is "lightning fast" I wonder if there are better use
| cases for it.
|
| I use AWS Transcribe[1] primarily. It costs me $0.024 per
| minute of video and also provides timestamps. It's unclear to
| me, without running the numbers, whether this model could do
| any better than that, seeing as it needs a GPU to run.
|
| With that said, I always love to see these things in the Open
| Source domain. Competition drives innovation.
|
| Edit: Doing some math, with spot instances on EC2 or
| serverless GPU on some other platforms, it could be relatively
| price-competitive with AWS Transcribe if the throughput is
| even slightly fast (about 2 hours of audio transcribed per
| GPU-hour to break even). Of course the devops work for running
| your own model is higher.
|
| [1] https://aws.amazon.com/transcribe/
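|
| The break-even arithmetic spelled out; the GPU price is an
| assumption back-solved from the 2x figure above:
|
|     transcribe_per_min = 0.024         # AWS Transcribe, $/min
|     per_audio_hour = transcribe_per_min * 60  # $1.44/audio-hour
|
|     gpu_per_hour = 2.88                # assumed GPU cost, $/hr
|     # Audio-hours needed per GPU-hour to break even:
|     break_even = gpu_per_hour / per_audio_hour  # = 2.0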
| ChrisMarshallNY wrote:
| _> better use cases for it._
|
| I want my babelfish!
| riiii wrote:
| Which languages does it support?
___________________________________________________________________
(page generated 2024-10-13 22:01 UTC)