[HN Gopher] Show HN: Self-host Whisper As a Service with GUI and...
       ___________________________________________________________________
        
       Show HN: Self-host Whisper As a Service with GUI and queueing
        
       Schibsted created a transcription service for our journalists to
       transcribe audio interviews and podcasts really quickly.
        
       Author : olekenneth
       Score  : 231 points
       Date   : 2023-02-13 07:00 UTC (16 hours ago)
        
        
       | henry_viii wrote:
       | By the way, there is also another project called whisper.cpp:
       | 
       | https://github.com/ggerganov/whisper.cpp
       | 
       | It uses 8x less memory than the Python implementation for the
       | tiny model. It would be a good idea to keep an eye on it, since
       | Python bindings are planned on the roadmap:
       | 
       | https://github.com/ggerganov/whisper.cpp#bindings
        
         | kardianos wrote:
         | I used whisper.cpp to build a tool to transcribe audio files,
         | either as a one off or as a folder watcher:
         | 
         | https://github.com/kardianos/audioclerk
         | 
         | Built in Go/cgo.
        
         | Metus wrote:
         | Is there a fork running it on Apple's Neural Engine?
        
         | sorenjan wrote:
         | And there's a fork of that which uses DirectCompute to run it on
         | GPUs without CUDA on Windows:
         | 
         | https://github.com/Const-me/Whisper
        
       | jonititan wrote:
       | That's very interesting. I've also been using Whisper via pip, but
       | I'm surprised you haven't sought to optimize Whisper at all.
       | 
       | I've been looking at using compilation in torch, but without
       | success so far; otherwise it can take a while to run.
       | https://pytorch.org/tutorials/intermediate/torch_compile_tut...
        
       | INTPenis wrote:
       | Has anyone looked at the code? Because as a Swedish citizen I
       | must say that anything I use by Schibsted is a hot mess from a UX
       | perspective.
        
         | olekenneth wrote:
         | That's just the Swedish part of Schibsted. We're vg.no and
         | Norwegian.
        
       | MitPitt wrote:
       | There's no GUI though
        
         | olekenneth wrote:
         | Working on it. https://github.com/schibsted/WAAS/pull/101
        
       | mkl wrote:
       | Looks interesting. I noticed that the README says "containe" or
       | "containes" several times, where I think you mean "container(s)".
        
         | olekenneth wrote:
         | Thanks for the feedback. Looks like the README needs more work.
        
       | tsycho wrote:
       | Is there an open source speech recognition model which can be
       | restricted to a smaller domain-specific dictionary?
       | 
       | Use case: I want to transcribe my poker hands while playing, e.g.:
       | "Flop was 2 of spades, 3 of diamonds and King of spades", "Button
       | raised to $20" etc.
       | 
       | When I tried using Whisper and some other model, the recognition
       | accuracy was atrocious, and it kept finding non-poker words that
       | sounded similar to poker words. I want to restrict its search
       | space to my own list of poker words which should significantly
       | increase the accuracy (theoretically).
       | 
       | Any suggestions on how to go about this?
        
         | levpopov wrote:
         | You can prefix a prompt for Whisper with a small text section
         | containing desired vocab, and it will likely improve accuracy
         | for that specific domain.
         | 
         | Whisper source is very readable, check out
         | https://github.com/openai/whisper/blob/main/whisper/decoding...
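
A minimal sketch of the prompt-prefix idea above, assuming the openai-whisper Python package, whose `transcribe` method accepts an `initial_prompt` string. The helper names `build_initial_prompt` and `transcribe_with_vocab`, the "base" model choice, and the cap of 50 terms are illustrative assumptions, not something from the thread.

```python
from typing import Iterable


def build_initial_prompt(vocab: Iterable[str], max_terms: int = 50) -> str:
    """Join domain terms into one short prompt string, dropping duplicates."""
    seen = set()
    terms = []
    for term in vocab:
        cleaned = term.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            terms.append(cleaned)
    # Assumption: keep the prompt short; the cap is arbitrary.
    return ", ".join(terms[:max_terms])


def transcribe_with_vocab(audio_path: str, vocab: Iterable[str], model_name: str = "base"):
    """Transcribe one file, biasing decoding toward the given vocabulary."""
    import whisper  # imported lazily so the helper above works without it

    model = whisper.load_model(model_name)
    return model.transcribe(audio_path, initial_prompt=build_initial_prompt(vocab))
```

The prompt only nudges the decoder toward these words; it does not restrict the output to them, so out-of-domain words can still appear.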
        
         | nshm wrote:
         | Vosk
         | 
         | https://alphacephei.com/vosk/lm
         | 
         | You can restrict the vocabulary the way you like, for example,
         | here is the chess app built with Vosk
         | 
         | https://www.chessvis.com/
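
A rough sketch of what restricting the vocabulary can look like with Vosk's Python bindings, assuming the `KaldiRecognizer` constructor is given a JSON list of allowed phrases as its third argument. The helper names, the 16 kHz sample rate, and the poker phrases in the usage note are assumptions for illustration; as far as I know this only works with models that support runtime vocabulary changes.

```python
import json
from typing import Iterable


def build_vosk_grammar(phrases: Iterable[str], allow_unknown: bool = True) -> str:
    """Build the JSON grammar string from a list of domain phrases."""
    words = sorted({word.lower() for phrase in phrases for word in phrase.split()})
    grammar = [" ".join(words)]
    if allow_unknown:
        # "[unk]" lets the recognizer mark words outside the list.
        grammar.append("[unk]")
    return json.dumps(grammar)


def make_recognizer(model_dir: str, phrases: Iterable[str], sample_rate: int = 16000):
    """Create a recognizer limited to the given phrases (hypothetical helper)."""
    from vosk import KaldiRecognizer, Model  # lazy import; needs the vosk package

    return KaldiRecognizer(Model(model_dir), sample_rate, build_vosk_grammar(phrases))
```

For the poker case, something like `build_vosk_grammar(["king of spades", "button raised to twenty"])` would produce the word list to pass in.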
        
       | sebastianvoelkl wrote:
       | The only thing Whisper misses is speaker diarization. I'm
       | currently working on a model that uses Whisper + pyannote to
       | transcribe interviews and also detect who is speaking. It's
       | working, but damn, it takes so long.
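
One common way to combine the two outputs is to label each transcript segment with the speaker whose turn overlaps it the most. The sketch below assumes Whisper-style segments (dicts with "start", "end" and "text") and a hypothetical list of `(start, end, speaker)` turns from a diarizer such as pyannote; it is not the commenter's actual code.

```python
from typing import Dict, List, Tuple

Turn = Tuple[float, float, str]  # (start_sec, end_sec, speaker_label), assumed shape


def assign_speakers(segments: List[Dict], turns: List[Turn]) -> List[Dict]:
    """Attach a "speaker" key to each segment based on the largest time overlap."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = "unknown", 0.0
        for start, end, speaker in turns:
            # Length of the intersection of the two intervals (negative if disjoint).
            overlap = min(seg["end"], end) - max(seg["start"], start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append({**seg, "speaker": best_speaker})
    return labeled
```

Segments that overlap no turn at all keep the label "unknown", which makes gaps in the diarization easy to spot.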
        
         | graderjs wrote:
         | Can you not separate into two phases? Speech separation to get
         | source per speaker, and then whisper on each in isolation
         | (maybe interlacing prompts)?
        
         | olekenneth wrote:
         | Ok. Our service is pretty fast. Also, the M-series Macs are
         | really fast, imo.
        
         | nasir wrote:
         | I'm badly looking for that! Is there a repo I can follow?
        
           | swyx wrote:
           | not GP (hoping he responds tho) but i've been collecting a
           | couple of diarization options:
           | https://github.com/sw-yx/ai-notes/blob/main/AUDIO.md
           | 
           | basically whisper.cpp has some support but it's not great
           | (based on my own testing)
           | 
           | - https://huggingface.co/spaces/vumichien/whisper-speaker-diar...
           | 
           | - https://github.com/Majdoddin/nlp pyannote diarization
           | 
           | - whisperX with diarization
           | https://twitter.com/maxhbain/status/1619698716914622466
           | https://github.com/m-bain/whisperX
        
           | sebastianvoelkl wrote:
           | I can share my repo when it's finished. In the meantime, you
           | can take a look at this:
           | https://huggingface.co/spaces/vumichien/whisper-speaker-diar...
        
             | sebastianvoelkl wrote:
             | My goal for my project is to build a tool that transcribes
             | interviews (e.g., in sales or recruiting) and puts the
             | transcription through ChatGPT (waiting for the API atm) to
             | make a summary that looks like the notes of the call.
             | Speaker diarization is important, so that I don't send more
             | than 4000 tokens of input to ChatGPT. I will see how it
             | goes, but if it's reliable enough (looks like it so far),
             | it will save the time it takes to write meeting notes and
             | rewrite them to send to someone after the call (hiring
             | managers etc.). Imagine a 10x Otter.ai or something like
             | that.
        
               | jotnguyen wrote:
               | Why are you waiting for the API? The OpenAI Playground
               | has API examples you can copy paste. You can go over 4000
               | tokens if you have a business justification and payment
               | method. You have access to most of their models even the
               | new Codex ones
               | 
               | Edit: Looked at your link and I misunderstood. I think I
               | understand you're waiting for the ChatGPT specific model
               | now?
        
               | tikkun wrote:
               | > You can go over 4000 tokens if you have a business
               | justification and payment method.
               | 
               | That's incorrect
        
               | jotnguyen wrote:
               | You are correct that I was incorrect. Thank you for
               | correcting me. I misread their documentation. Sounds like
               | they might increase the token limit in the future, but
               | right now it's 4097 tokens shared with the prompt
        
               | epoch_100 wrote:
               | Ha. I'm also doing something similar with a friend at
               | https://www.paxo.ai. Funny that we all seemed to have a
               | similar idea, all at once.
        
               | moneywoes wrote:
               | What did you build the landing page with?
        
               | shiv86 wrote:
               | from the source code <!DOCTYPE html><!-- This site was
               | created in Webflow. https://www.webflow.com --><!--
        
               | moneywoes wrote:
               | Sounds interesting, do you have a page?
        
               | cjonas wrote:
               | I also started building the same thing. Crazy that
               | something that used to be nearly impossible will soon be
               | a "hello world" type project
        
         | jalino23 wrote:
         | What's your training rig like?
        
       | monkeydust wrote:
       | I run this locally for a few work-related tasks. One useful
       | feature is being able to provide your own 'jargon' in the initial
       | prompt, which improves recognition quality (--initial_prompt
       | 'jargon1 jargon2 ...').
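
For the command-line route, the quoting of the jargon string is the part that is easy to get wrong. A small sketch, assuming the `whisper` CLI from the openai-whisper package is on the PATH; `build_whisper_command` and the "base" model default are made-up names and values for illustration.

```python
import shlex
from typing import List, Sequence


def build_whisper_command(audio_path: str, jargon: Sequence[str], model: str = "base") -> List[str]:
    """Build an argv list; the whole jargon string is one argument."""
    return ["whisper", audio_path, "--model", model, "--initial_prompt", " ".join(jargon)]


def as_shell_line(command: Sequence[str]) -> str:
    """Render the argv list as a safely quoted shell command line."""
    return shlex.join(command)
```

Passing the list to `subprocess.run(command, check=True)` avoids shell quoting altogether; `as_shell_line` is only there for copy-pasting into a terminal.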
        
       | deskamess wrote:
       | Related/Off Topic: Is there a documented way to improve the
       | accuracy of a particular language model? Say we can put in the
       | effort to collect thousands of verified/transcribed samples of a
       | language that is currently scored poorly (WER). What steps do I
       | have to take to get those improvements into the system?
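
For reference, WER is the word-level edit distance between the verified transcript and the model output, divided by the number of reference words. A minimal sketch follows; it splits on whitespace only, whereas real evaluations usually normalize case and punctuation first, which is left out here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref words and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

Measuring this on a held-out set before and after fine-tuning is one way to check whether the collected samples actually helped.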
        
         | binarymax wrote:
         | Yes, you need to fine-tune the model with your data. This might
         | be easy or hard, depending on your experience level and
         | complexity of the model and available tooling.
         | 
         | For this model specifically (https://github.com/openai/whisper)
         | it would be a significant challenge for a newcomer. Luckily
         | Huggingface has a blog post that will get you started:
         | https://huggingface.co/blog/fine-tune-whisper
        
           | deskamess wrote:
           | Thank you. I do consider myself able to program, but I'm new to
           | the ML ecosystem.
        
       | adlpz wrote:
       | I understand this is self-hosting the OpenAI Whisper model (which
       | I see is fully MIT-licensed, weights and all). So not calling any
       | OpenAI APIs like other GPT-related tools do.
       | 
       | Am I correct on this? The README is not explicit.
        
         | olekenneth wrote:
         | Yes. This runs the OpenAI Whisper model locally.
        
       | sgt wrote:
       | Is the Whisper model better than say Youtube's auto transcribing?
       | I hope it is because the one on YT gets so much wrong it's almost
       | comical.
        
         | olekenneth wrote:
         | But more impressive is that it understands a lot of languages
         | other than just English. I'm really impressed with how it handles
         | hard-to-understand Norwegian dialects.
        
         | beardedetim wrote:
         | We use Whisper at $dayjob and found that it is far better than
         | Azure's transcripts; we also found that Azure and GCP had about
         | the same correctness.
         | 
         | I'd assume whisper will be better than YT auto ones for sure,
         | especially if you choose the right model
        
         | AB1908 wrote:
         | It is pretty good. I tried it a few times and there were minor
         | mistakes at best. The Whisper readme itself shows a live
         | transcription use case.
        
         | Jach wrote:
         | Generally yes when it produces sane output at all, but while YT
         | can get stuff comically wrong I've never seen it just go off
         | the rails and start hallucinating and mindlessly repeating
         | itself, which Whisper sometimes does _especially_ if you're
         | also trying to get it to translate something. Like Whisper will
         | sometimes output a stream of things like "Please subscribe to
         | my channel and follow me on Twitter!" or "Thank you for
         | watching.".
         | 
         | On one source I tried the other day, the first 90 seconds or so
         | is just generic opening music, no speech, but it "transcribes"
         | it as "This is the end of the video. Thank you for watching.
         | Please subscribe to the channel if you like. See you in the
         | next video. Thank you for watching. Please subscribe to the
         | channel if you like. Thank you for watching. ..." If you help
         | it along by cutting up the source into only spoken segments you
         | can get it to do better but just throwing it at a directory of
         | material is probably going to leave you with some
         | disappointment.
         | 
         | Then sometimes it does something surprising: on a J-pop song,
         | after hallucinating a bit during the intro, it spat out a
         | translation in the form you might find on a lyrics site, that
         | is, each line was "japanese-characters romaji-version
         | english-translation". I haven't been able to get it to do it
         | again (even for the same source).
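
A possible post-processing pass for the failure mode described above: drop segments the model itself flags as probably not speech, and collapse runs of identical lines. This sketch assumes Whisper-style segment dicts with "text" and "no_speech_prob" keys; both thresholds are illustrative guesses, not tuned values.

```python
from typing import Dict, List


def drop_suspect_segments(
    segments: List[Dict],
    max_no_speech_prob: float = 0.6,
    max_repeats: int = 1,
) -> List[Dict]:
    """Filter likely hallucinations from a list of transcript segments."""
    kept = []
    last_text, run = None, 0
    for seg in segments:
        # Skip segments that were most likely music or silence.
        if seg.get("no_speech_prob", 0.0) > max_no_speech_prob:
            continue
        text = seg["text"].strip().lower()
        run = run + 1 if text == last_text else 1
        last_text = text
        # Keep only the first max_repeats copies of a repeated line.
        if run > max_repeats:
            continue
        kept.append(seg)
    return kept
```

This will not catch a hallucinated line that appears only once, so cutting the audio down to spoken segments beforehand is still worthwhile.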
        
       | teucris wrote:
       | Very cool - I have a homegrown setup where a script scans my
       | iCloud audio notes directory and generates transcriptions for any
       | new notes. Works like a charm.
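
A setup like this can be fairly small. The sketch below is not the commenter's script; it assumes transcripts are stored next to the audio as `.txt` files, that the listed suffixes cover the notes, and that `transcribe` is any callable you supply (for example a thin wrapper around Whisper).

```python
from pathlib import Path
from typing import Callable, List

AUDIO_SUFFIXES = {".m4a", ".mp3", ".wav"}  # assumed set of audio formats


def pending_audio(notes_dir) -> List[Path]:
    """Return audio files that do not yet have a sibling .txt transcript."""
    root = Path(notes_dir)
    return sorted(
        path for path in root.iterdir()
        if path.is_file()
        and path.suffix.lower() in AUDIO_SUFFIXES
        and not path.with_suffix(".txt").exists()
    )


def transcribe_new_notes(notes_dir, transcribe: Callable[[Path], str]) -> List[Path]:
    """Transcribe every pending file and write the text next to it."""
    done = []
    for audio in pending_audio(notes_dir):
        audio.with_suffix(".txt").write_text(transcribe(audio), encoding="utf-8")
        done.append(audio)
    return done
```

Running `transcribe_new_notes` from cron or a folder watcher gives the "any new notes" behavior, since files with an existing transcript are skipped.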
        
       | elliotpage wrote:
       | This looks really good, thanks! Really appreciate this and all
       | the other Whisper implementations in this thread as I am sorting
       | out transcriptions for my 120+ podcast episodes.
        
         | olekenneth wrote:
         | Awesome!
        
       | magicseth wrote:
       | Is it possible to create a streaming endpoint that returns real-
       | time transcriptions?
        
         | nojs wrote:
         | I was working on this yesterday. It seems that the most common
         | approach with Whisper is simply to break the audio into chunks
         | and transcribe each one separately. This works but as you'd
         | expect sometimes has trouble at the edges. The segments also
         | have to be sufficiently long (like 10s) or the accuracy
         | suffers, meaning it's not truly real-time.
         | 
         | You could do better by overlapping the segments, except then
         | stitching the transcriptions together becomes an issue since
         | whisper doesn't provide reliable per-token timestamps [0], and
         | the output of the common part of overlapping segments isn't
         | necessarily the same. I can imagine a cool approach where you
         | transcribe long, overlapping chunks in real-time and
         | intelligently merge the stream of words somehow though.
         | 
         | Some more useful discussion here (whisper.cpp project, but
         | still relevant) [1].
         | 
         | 0. https://github.com/openai/whisper/discussions/332
         | 
         | 1. https://github.com/ggerganov/whisper.cpp/issues/10
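
To make the overlapping-chunk idea concrete, here is a naive sketch: one helper computes overlapping windows, the other merges two word lists by looking for the longest run where the end of one matches the start of the next. The 30 s window, 5 s overlap and 20-word search limit are arbitrary assumptions, and because overlapping transcriptions do not always agree, the merge falls back to plain concatenation when no exact match is found.

```python
from typing import List, Tuple


def overlapping_windows(total_sec: float, window_sec: float = 30.0,
                        overlap_sec: float = 5.0) -> List[Tuple[float, float]]:
    """Return (start, end) windows covering the audio with a fixed overlap."""
    if not 0 <= overlap_sec < window_sec:
        raise ValueError("overlap must be non-negative and smaller than the window")
    step = window_sec - overlap_sec
    windows, start = [], 0.0
    while start < total_sec:
        windows.append((start, min(start + window_sec, total_sec)))
        if start + window_sec >= total_sec:
            break
        start += step
    return windows


def merge_words(left: List[str], right: List[str], max_overlap: int = 20) -> List[str]:
    """Join two transcripts, removing the longest exact suffix/prefix overlap."""
    limit = min(max_overlap, len(left), len(right))
    for size in range(limit, 0, -1):
        if [w.lower() for w in left[-size:]] == [w.lower() for w in right[:size]]:
            return left + right[size:]
    return left + right  # no agreement found: keep everything
```

A smarter merge would tolerate small disagreements in the shared region, for instance with a fuzzy sequence alignment instead of an exact match.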
        
       | silviot wrote:
       | People interested in this might also be interested in transcribe-
       | anything [1].
       | 
       | It automates video fetching and uses whisper to generate .srt,
       | .vtt and .txt files.
       | 
       | [1] https://github.com/zackees/transcribe-anything
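
For anyone curious what producing an .srt file involves, here is a small sketch that turns Whisper-style segments (dicts with "start", "end" and "text") into SubRip text. It is an illustration of the format, not how transcribe-anything does it.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp style SRT uses."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    """Render numbered subtitle blocks separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```

The .vtt format is similar but uses a `WEBVTT` header and a period instead of a comma in the timestamps.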
        
         | olekenneth wrote:
         | This has the same output formats, plus its own Jojo format to
         | open in a currently internal Mac app and in the editor feature
         | branch.
        
       | raybb wrote:
       | Whisper-UI is also looking really nice lately but I think it's
       | still pretty early in development. The ability to click on the
       | transcript and hear the sound of that particular moment is great.
       | https://github.com/hayabhay/whisper-ui
        
       ___________________________________________________________________
       (page generated 2023-02-13 23:02 UTC)