[HN Gopher] Omni SenseVoice: High-Speed Speech Recognition with Words Timestamps
       ___________________________________________________________________
        
       Omni SenseVoice: High-Speed Speech Recognition with Words
       Timestamps
        
       Author : ringer007
       Score  : 151 points
       Date   : 2024-10-13 00:48 UTC (21 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | modeless wrote:
        | Looks cool! Combine this with the new TTS that was released
        | today (it looks really good) and an LLM, and you'd have a
        | pretty good all-local voice assistant!
        | https://github.com/SWivid/F5-TTS
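        | 
        | A minimal sketch of that pipeline, assuming hypothetical
        | transcribe/chat/synthesize wrappers around OmniSenseVoice, a
        | local LLM, and F5-TTS (the function names are mine, not from
        | those projects):
        | 
        |   import sounddevice as sd  # mic capture + playback
        | 
        |   SR = 16_000  # sample rate most ASR models expect
        | 
        |   def transcribe(audio, sr): ...  # hypothetical ASR wrapper
        |   def chat(text): ...             # hypothetical LLM wrapper
        |   def synthesize(text): ...       # hypothetical TTS wrapper
        | 
        |   while True:
        |       # record a short utterance from the default microphone
        |       audio = sd.rec(int(5 * SR), samplerate=SR, channels=1)
        |       sd.wait()
        |       reply = chat(transcribe(audio, SR))  # ASR -> LLM
        |       sd.play(synthesize(reply), SR)       # LLM -> TTS
        |       sd.wait()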
        
       | staticautomatic wrote:
        | I've been building a production app on top of ASR and find
        | the range of models kind of bewildering compared to LLMs and
        | video. The commercial offerings seem to be custom, or built
        | on top of Whisper or maybe NVIDIA Canary/Parakeet, and then
        | you have stuff like SpeechBrain that seems to run on top of
        | lots of different open models for different tasks. Sometimes
        | it's genuinely hard to tell what's a foundation model and
        | what isn't.
        | 
        | Separately, I wonder if this is the model Speechmatics uses.
        
         | leetharris wrote:
         | We released a new SOTA ASR as open source just a couple of
         | weeks ago. https://www.rev.com/blog/speech-to-text-
         | technology/introduci...
         | 
         | Take a look. We'll be open sourcing more models very soon!
        
           | mkl wrote:
           | > These models are accessible under a non-commercial license.
           | 
           | That is not open source.
        
             | threeseed wrote:
             | Exactly. It is source available but not open source:
             | 
             | https://opensource.org/osd
        
           | staticautomatic wrote:
           | I'll check it out.
           | 
           | FWIW, in terms of benchmarking, I'm more interested in
           | benchmarks against Gladia, Deepgram, Pyannote, and
           | Speechmatics than whatever is built into the hyperscaler
           | platforms. But I end up doing my own anyway so whatevs.
           | 
           | Also, you guys need any training data? I have >10K hrs of
           | conversational iso-audio :)
        
           | yalok wrote:
            | That's great to hear! Amazing performance from the model!
            | 
            | For voice chat bots, however, shorter input utterances are
            | the norm (anywhere from 1-10 sec), with lots of silence in
            | between, so this limitation is a bit sad:
           | 
           | > On the Gigaspeech test suite, Rev's research model is worse
           | than other open-source models. The average segment length of
           | this corpus is 5.7 seconds; these short segments are not a
           | good match for the design of Rev's model. These results
           | demonstrate that despite its strong performance on long-form
           | tests, Rev is not the best candidate for short-form
           | recognition applications like voice search.
        
         | woodson wrote:
         | There's just not a single one-size-fits-all model/pipeline. You
         | choose the right one for the job, depending on whether you need
         | streaming (i.e., low latency; words output right when they're
         | spoken), run on device (e.g. phone) or server, what
         | languages/dialects, conversational or more "produced" like a
         | news broadcast or podcast, etc. Best way is to benchmark with
         | data in your target domain.
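          | 
          | A minimal benchmarking sketch, assuming jiwer for the WER
          | metric and a hypothetical model.transcribe() call standing
          | in for whichever model or API you're evaluating:
          | 
          |   from jiwer import wer  # pip install jiwer
          | 
          |   # (reference transcript, audio path) pairs drawn from
          |   # your own target domain
          |   test_set = [
          |       ("turn the lights off", "utt1.wav"),
          |       ("what's the weather tomorrow", "utt2.wav"),
          |   ]
          | 
          |   def evaluate(model):
          |       refs, hyps = [], []
          |       for ref, path in test_set:
          |           refs.append(ref)
          |           hyps.append(model.transcribe(path))  # assumed API
          |       return wer(refs, hyps)  # corpus word error rate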
        
           | staticautomatic wrote:
            | Sure, you're just going to try lots of things and see
            | what works best, but it's confusing to compare things at
            | such different levels of abstraction, where a lot of the
            | time you don't even know what you're comparing, and it's
            | impossible to do apples-to-apples even on your own test
            | data. If your need is "speaker identification", you're
            | going to end up comparing commercial black boxes like
            | Speechmatics (probably custom) vs commercial translucent
            | boxes like Gladia (some custom blend of whisper +
            | pyannote + etc.) vs
            | [asr_api]/[some_specific_sepformer_model]. I can observe
            | that products I know to be built on top of Whisper don't
            | seem to handle overlapping speaker diarization that well,
            | but I have no way of knowing whether that has anything to
            | do with Whisper itself.
        
       | satvikpendem wrote:
       | Can it diarize?
        
         | staticautomatic wrote:
         | Apparently not. See https://github.com/lifeiteng/OmniSenseVoice
         | /blob/main/src/om.... See also FunASR running SenseVoice but
         | using Kaldi for speaker identification
         | https://github.com/modelscope/FunASR/blob/cd684580991661b9a0...
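          | 
          | If you need it anyway, one common workaround is to run a
          | separate diarizer and join its speaker turns to the ASR
          | word timestamps by time overlap; a minimal sketch (the
          | tuple formats here are assumptions, not FunASR's actual
          | output):
          | 
          |   # words: (start, end, text) from the ASR
          |   # turns: (start, end, speaker) from the diarizer
          |   def assign_speakers(words, turns):
          |       labeled = []
          |       for w_start, w_end, text in words:
          |           # pick the turn with the largest time overlap
          |           best = max(
          |               turns,
          |               key=lambda t: min(w_end, t[1]) - max(w_start, t[0]),
          |           )
          |           labeled.append((w_start, w_end, text, best[2]))
          |       return labeled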
        
       | deegles wrote:
       | Does it do diarization?
        
         | staticautomatic wrote:
         | Apparently not. See my reply to satvikpendem.
        
       | steinvakt wrote:
       | How does the accuracy compare to Whisper?
        
         | Etheryte wrote:
          | This uses SenseVoice under the hood, which claims to have
          | better accuracy than Whisper. Not sure how accurate that
          | claim is, though, since I haven't seen a third-party
          | comparison; in this space it's very easy to toot your own
          | horn.
         | 
         | [0] https://github.com/FunAudioLLM/SenseVoice
        
           | pferdone wrote:
            | I mean, they make a bold claim up top, only to walk it
            | back a little further down with: "[...] In terms of
            | Chinese and Cantonese recognition, the SenseVoice-Small
            | model has advantages."
            | 
            | It feels dishonest to me.
           | 
           | [0] https://github.com/FunAudioLLM/SenseVoice?tab=readme-ov-
           | file...
        
           | jmward01 wrote:
            | This uses SenseVoice-Small under the hood. The
            | better-than-Whisper claim is for their large model vs
            | Whisper large v3, not for this small version. The small
            | version is definitely worse than Whisper large v3, but
            | it's still usable, and the extra annotation it produces
            | is interesting.
        
           | khimaros wrote:
            | This claims to have speaker diarization, which is a
            | potentially killer feature missing from most Whisper
            | implementations.
        
         | ks2048 wrote:
          | I've been doing some things with Whisper and find the
          | accuracy very good, BUT I've found the timestamps to be
          | pretty bad. For example, using the timestamps directly to
          | clip words or phrases often clips off the end of a word
          | (even in simple cases where it's followed by silence).
          | Since this emphasizes word timestamps, I may give it a try.
        
       | frozencell wrote:
       | Does it work with chorus?
        
       | mrkramer wrote:
       | With timestamps?! I gotta try this.
        
       | jbellis wrote:
       | OOMs even in quantized mode on a 3090. What's a better option for
       | personal use?
       | 
       | > torch.OutOfMemoryError: CUDA out of memory. Tried to allocate
       | 43.71 GiB. GPU 0 has a total capacity of 24.00 GiB of which 20.74
       | GiB is free.
        
         | yellow_lead wrote:
         | Not sure if you mean in general, or options for this particular
         | project, but Whisper should work for you.
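          | 
          | If you'd rather stay with this model, chunking the input so
          | no single forward pass sees the whole file is the usual way
          | to bound peak memory; a sketch with torchaudio (the 30 s
          | chunk length is a guess, and transcribe() stands in for
          | whatever call is OOMing):
          | 
          |   import torchaudio
          | 
          |   def transcribe_chunked(path, transcribe, chunk_s=30):
          |       # peak GPU memory now scales with chunk_s, not with
          |       # the length of the whole file
          |       wav, sr = torchaudio.load(path)
          |       step = chunk_s * sr
          |       texts = []
          |       for i in range(0, wav.shape[1], step):
          |           texts.append(transcribe(wav[:, i:i + step], sr))
          |       return " ".join(texts)
          | 
          | Note that word timestamps from each chunk would need to be
          | offset by i / sr to stay globally correct.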
        
       | unshavedyak wrote:
        | Can't wait for a bundle of something like this with screen
        | capture. I'd love to pipe my convos/habits/apps/etc. to a
        | local index for search. Seems we're getting close.
        
       | throwaway2016a wrote:
        | This looks really nice. What I find interesting is that it
        | advertises itself for the transcription use case, but if it
        | really is "lightning fast" I wonder if there are better use
        | cases for it.
        | 
        | I use AWS Transcribe[1] primarily. It costs me $0.024 per
        | minute of video and also provides timestamps. It's unclear to
        | me, without running the numbers, whether I could do any
        | better with this model, seeing as it needs a GPU to run.
        | 
        | With that said, I always love to see these things in the Open
        | Source domain. Competition drives innovation.
        | 
        | Edit: Doing some math, with spot instances on EC2 or
        | serverless GPU on some other platforms, it could be
        | relatively price-competitive with AWS Transcribe if the
        | throughput is even modestly fast (2 hours of audio
        | transcribed per GPU-hour to break even). Of course the devops
        | work of running your own model is higher.
       | 
       | [1] https://aws.amazon.com/transcribe/
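        | 
        | The break-even arithmetic, spelled out (the GPU price below
        | is an illustrative assumption, not a quote):
        | 
        |   aws_per_min = 0.024                    # $/audio-minute
        |   aws_per_audio_hour = aws_per_min * 60  # $1.44
        | 
        |   gpu_per_hour = 2.88  # assumed GPU instance price, $/hr
        | 
        |   # audio-hours to transcribe per GPU-hour to match AWS
        |   break_even = gpu_per_hour / aws_per_audio_hour
        |   print(break_even)  # 2.0 -> "2 hours per hour"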
        
         | ChrisMarshallNY wrote:
          | _> better use cases for it._
         | 
         | I want my babelfish!
        
       | riiii wrote:
       | Which languages does it support?
        
       ___________________________________________________________________
       (page generated 2024-10-13 22:01 UTC)