[HN Gopher] Show HN: Self-host Whisper As a Service with GUI and...
___________________________________________________________________
Show HN: Self-host Whisper As a Service with GUI and queueing
Schibsted created a transcription service for our journalists to
transcribe audio interviews and podcasts really quickly.
Author : olekenneth
Score : 231 points
Date : 2023-02-13 07:00 UTC (16 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| henry_viii wrote:
| By the way there is also another project called Whisper.cpp:
|
| https://github.com/ggerganov/whisper.cpp
|
| Which uses 8x less memory than the Python implementation for the
| tiny model. It would be a good idea to keep an eye on it since
| there are Python bindings planned on the roadmap:
|
| https://github.com/ggerganov/whisper.cpp#bindings
| kardianos wrote:
| I used whisper.cpp to build a tool to transcribe audio files,
| either as a one off or as a folder watcher:
|
| https://github.com/kardianos/audioclerk
|
| Built in Go/cgo.
| Metus wrote:
| Is there a fork running it on Apple's Neural Engine?
| sorenjan wrote:
| And there's a fork of that that uses DirectCompute to run it on
| GPUs without Cuda on Windows:
|
| https://github.com/Const-me/Whisper
| jonititan wrote:
| That's very interesting. I've been using whisper via pip too, but
| I'm surprised you haven't sought to optimize whisper at all?
|
| I've been looking at using compilation in torch, so far without
| success, since otherwise it can take a while to run.
| https://pytorch.org/tutorials/intermediate/torch_compile_tut...
| INTPenis wrote:
| Has anyone looked at the code? Because as a Swedish citizen I
| must say that anything I use by Schibsted is a hot mess from a UX
| perspective.
| olekenneth wrote:
| That's just the Swedish part of Schibsted. We're vg.no and
| Norwegian.
| MitPitt wrote:
| There's no GUI though
| olekenneth wrote:
| Working on it. https://github.com/schibsted/WAAS/pull/101
| mkl wrote:
| Looks interesting. I noticed that the README says "containe" or
| "containes" several times, where I think you mean "container(s)".
| olekenneth wrote:
| Thanks for the feedback. Looks like the README needs more work.
| tsycho wrote:
| Is there an open source speech recognition model which can be
| restricted to a smaller domain-specific dictionary?
|
| Use case: I want to transcribe my poker hands while playing, eg:
| "Flop was 2 of spades, 3 of diamonds and King of spades", "Button
| raised to $20" etc.
|
| When I tried using Whisper and some other model, the recognition
| accuracy was atrocious, and it kept finding non-poker words that
| sounded similar to poker words. I want to restrict its search
| space to my own list of poker words which should significantly
| increase the accuracy (theoretically).
|
| Any suggestions on how to go about this?
| levpopov wrote:
| You can prefix a prompt for Whisper with a small text section
| containing desired vocab, and it will likely improve accuracy
| for that specific domain.
|
| Whisper source is very readable, check out
| https://github.com/openai/whisper/blob/main/whisper/decoding...
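A minimal sketch of the vocabulary-prompt idea, assuming the
openai-whisper Python package is installed. build_vocab_prompt and
transcribe_with_vocab are hypothetical helper names, and the model
size is a placeholder. The prompt only biases decoding toward the
listed words; it does not restrict the output to them.

```python
def build_vocab_prompt(terms):
    """Join domain terms into one short passage, dropping blanks and repeats."""
    seen = []
    for term in terms:
        term = term.strip()
        if term and term not in seen:
            seen.append(term)
    return ", ".join(seen)


def transcribe_with_vocab(audio_path, terms, model_name="base"):
    """Transcribe audio_path, priming Whisper with the domain vocabulary."""
    # Imported lazily so the prompt helper works without whisper installed.
    import whisper

    model = whisper.load_model(model_name)
    # initial_prompt is passed to transcribe() as a decoding option.
    result = model.transcribe(audio_path,
                              initial_prompt=build_vocab_prompt(terms))
    return result["text"]
```

For the poker case this might be called as
transcribe_with_vocab("hand.wav", ["flop", "button", "spades"]),
where hand.wav is a made-up file name.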
| nshm wrote:
| Vosk
|
| https://alphacephei.com/vosk/lm
|
| You can restrict the vocabulary the way you like, for example,
| here is the chess app built with Vosk
|
| https://www.chessvis.com/
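A rough sketch of restricting Vosk to a fixed phrase list, assuming
the vosk Python package and a model that supports runtime vocabulary
(as far as I know, only the small models do). build_grammar and
recognize are hypothetical helper names; the WAV file is assumed to
be 16-bit mono PCM, and model_dir is a path to a downloaded model.

```python
import json
import wave


def build_grammar(phrases):
    """Return a JSON list of allowed phrases, plus "[unk]" for everything else."""
    cleaned = [p.lower().strip() for p in phrases if p.strip()]
    return json.dumps(cleaned + ["[unk]"])


def recognize(wav_path, model_dir, phrases):
    """Run Vosk over a WAV file with decoding limited to the given phrases."""
    # Imported lazily so build_grammar works without vosk installed.
    from vosk import Model, KaldiRecognizer

    with wave.open(wav_path, "rb") as wf:
        # The third argument is the grammar: a JSON array of phrases.
        rec = KaldiRecognizer(Model(model_dir), wf.getframerate(),
                              build_grammar(phrases))
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            rec.AcceptWaveform(data)
    return json.loads(rec.FinalResult())["text"]
```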
| sebastianvoelkl wrote:
| The only thing Whisper misses is speaker diarization. I'm
| currently working on a model that uses Whisper + pyannote to
| transcribe interviews and also detect who is speaking. It's
| working, but damn, it takes so long.
| graderjs wrote:
| Can you not separate into two phases? Speech separation to get
| source per speaker, and then whisper on each in isolation
| (maybe interlacing prompts)?
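Another way to keep the two phases independent, sketched under
assumptions: run diarization and Whisper separately over the same
audio, then give each transcript segment the speaker whose turn
overlaps it most. This is not the separate-then-transcribe order
suggested above. Segments are assumed to be dicts with start, end
and text keys (the shape of Whisper's result["segments"]), and turns
are assumed to be (start, end, speaker) tuples already extracted from
a diarization model. label_segments is a hypothetical name.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments, turns):
    """Pair each transcript segment with the speaker that overlaps it most."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = "unknown", 0.0
        for t_start, t_end, speaker in turns:
            shared = overlap(seg["start"], seg["end"], t_start, t_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append((best_speaker, seg["text"].strip()))
    return labeled
```

Segments that overlap no turn at all are labeled "unknown" rather
than guessed.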
| olekenneth wrote:
| Ok. Our service is pretty fast. Also, the M-series Macs are really
| fast imo.
| nasir wrote:
| I'm badly looking for that! Is there a repo I can follow?
| swyx wrote:
| not GP (hoping he responds tho) but i've been collecting a
| couple of diarization options:
| https://github.com/sw-yx/ai-notes/blob/main/AUDIO.md
|
| basically whisper.cpp has some support but it's not great
| (based on my own testing)
|
| - https://huggingface.co/spaces/vumichien/whisper-speaker-diar...
|
| - https://github.com/Majdoddin/nlp pyannote diarization
|
| - whisperX with diarization
| https://twitter.com/maxhbain/status/1619698716914622466
| https://github.com/m-bain/whisperX
| sebastianvoelkl wrote:
| I can share my repo when it's finished. In the meantime, you
| can take a look at this:
| https://huggingface.co/spaces/vumichien/whisper-speaker-diar...
| sebastianvoelkl wrote:
| My goal for my project is to build a tool that transcribes
| Interviews (e.g., in Sales or Recruiting) and puts the
| Transcription through ChatGPT (Waiting for the API atm) to
| make a summary that looks like the notes of the call.
| Speaker diarization is important, so I don't have more than
| 4000 tokens input in ChatGPT. I will see how it goes, but
| if it's reliable enough (looks like it so far), it will
| save the time it takes to write meeting notes and rewrite
| them to send them to someone after the call (Hiring
| Managers etc.) Imagine a 10x Otter.ai or something like
| that.
| jotnguyen wrote:
| Why are you waiting for the API? The OpenAI Playground
| has API examples you can copy paste. You can go over 4000
| tokens if you have a business justification and payment
| method. You have access to most of their models even the
| new Codex ones
|
| Edit: Looked at your link and I misunderstood. I think I
| understand you're waiting for the ChatGPT specific model
| now?
| tikkun wrote:
| > You can go over 4000 tokens if you have a business
| justification and payment method.
|
| That's incorrect
| jotnguyen wrote:
| You are correct that I was incorrect. Thank you for
| correcting me. I misread their documentation. Sounds like
| they might increase the token limit in the future, but
| right now it's 4097 tokens shared with the prompt
| epoch_100 wrote:
| Ha. I'm also doing something similar with a friend at
| https://www.paxo.ai. Funny that we all seemed to have a
| similar idea, all at once.
| moneywoes wrote:
| What did you build the landing page with?
| shiv86 wrote:
| From the source code: <!DOCTYPE html><!-- This site was
| created in Webflow. https://www.webflow.com -->
| moneywoes wrote:
| Sounds interesting. Do you have a page?
| cjonas wrote:
| I also started building the same thing. Crazy that
| something that used to be nearly impossible will soon be
| a "hello world" type project
| jalino23 wrote:
| What's your training rig like?
| monkeydust wrote:
| I run this locally for a few work-related tasks. One useful
| feature is being able to provide your own 'jargon' in the initial
| prompt, which improves recognition quality (--initial_prompt
| 'jargon1 jargon2 ...').
| deskamess wrote:
| Related/Off Topic: Is there a documented way to improve the
| accuracy of a particular language model? Say we can put in the
| effort to collect thousands of verified/transcribed samples of a
| language that is currently scored poorly (WER). What steps do I
| have to take to get those improvements into the system?
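For reference, WER (word error rate) is the word-level edit distance
between a reference transcript and the model's output, divided by the
number of reference words. A minimal sketch; word_error_rate is a
hypothetical name, and the only normalization here is lowercasing and
whitespace splitting, which is cruder than what published benchmarks
typically apply.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

Scoring a held-out set of verified transcripts this way before and
after any fine-tuning gives a number to compare.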
| binarymax wrote:
| Yes, you need to fine-tune the model with your data. This might
| be easy or hard, depending on your experience level and
| complexity of the model and available tooling.
|
| For this model specifically (https://github.com/openai/whisper)
| it would be a significant challenge for a newcomer. Luckily
| Huggingface has a blog post that will get you started:
| https://huggingface.co/blog/fine-tune-whisper
| deskamess wrote:
| Thank you. I consider myself a capable programmer, but I'm new
| to the ML ecosystem.
| adlpz wrote:
| I understand this is self-hosting the OpenAI Whisper model (which
| I see is fully MIT-licensed, weights and all). So not calling any
| OpenAI APIs like other GPT-related tools do.
|
| Am I correct on this? The README is not explicit.
| olekenneth wrote:
| Yes. This runs the OpenAI Whisper model locally.
| sgt wrote:
| Is the Whisper model better than say Youtube's auto transcribing?
| I hope it is because the one on YT gets so much wrong it's almost
| comical.
| olekenneth wrote:
| But more impressive is that it understands a lot of languages
| other than just English. Really impressed with how it handles
| hard-to-understand Norwegian dialects.
| beardedetim wrote:
| We use Whisper at $dayjob and found that it is far better than
| Azure's transcripts. We also found that Azure and GCP had about
| the same correctness.
|
| I'd assume whisper will be better than YT auto ones for sure,
| especially if you choose the right model
| AB1908 wrote:
| It is pretty good. I tried it a few times and there were minor
| mistakes at best. The Whisper readme itself shows a live
| transcription use case.
| Jach wrote:
| Generally yes when it produces sane output at all, but while YT
| can get stuff comically wrong I've never seen it just go off
| the rails and start hallucinating and mindlessly repeating
| itself, which Whisper sometimes does _especially_ if you're
| also trying to get it to translate something. Like Whisper will
| sometimes output a stream of things like "Please subscribe to
| my channel and follow me on Twitter!" or "Thank you for
| watching.".
|
| On one source I tried the other day, the first 90 seconds or so
| is just generic opening music, no speech, but it "transcribes"
| it as "This is the end of the video. Thank you for watching.
| Please subscribe to the channel if you like. See you in the
| next video. Thank you for watching. Please subscribe to the
| channel if you like. Thank you for watching. ..." If you help
| it along by cutting up the source into only spoken segments you
| can get it to do better but just throwing it at a directory of
| material is probably going to leave you with some
| disappointment.
|
| Then sometimes it does something surprising: on a J-pop song,
| after hallucinating a bit during the intro, it spit out a
| translation in the form you might find on a lyrics site, that
| is, each line was "japanese-characters romaji-version
| english-translation". I haven't been able to get it to do it
| again (even for the same source).
| teucris wrote:
| Very cool - I have a homegrown setup where a script scans my
| iCloud audio notes directory and generates transcriptions for any
| new notes. Works like a charm.
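A setup like this can be sketched in a few lines, under assumptions:
a note counts as "new" when no .txt file with the same stem sits next
to it, and the transcription step is passed in as a callable (for
example a wrapper around Whisper). pending_notes, transcribe_new and
the extension list are hypothetical, not the script described above.

```python
from pathlib import Path

# Assumed set of audio extensions; adjust to whatever the notes app writes.
AUDIO_EXTENSIONS = {".m4a", ".mp3", ".wav"}


def pending_notes(directory):
    """Audio files in directory that have no sibling .txt transcript yet."""
    return sorted(
        path for path in Path(directory).iterdir()
        if path.is_file()
        and path.suffix.lower() in AUDIO_EXTENSIONS
        and not path.with_suffix(".txt").exists()
    )


def transcribe_new(directory, transcribe):
    """Write a .txt transcript beside every pending audio file.

    transcribe is any callable taking a Path and returning text.
    Returns the names of the files that were processed.
    """
    done = []
    for audio in pending_notes(directory):
        audio.with_suffix(".txt").write_text(transcribe(audio),
                                             encoding="utf-8")
        done.append(audio.name)
    return done
```

Running transcribe_new from cron or a similar scheduler would
approximate a folder watcher without extra dependencies.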
| elliotpage wrote:
| This looks really good, thanks! Really appreciate this and all
| the other Whisper implementations in this thread as I am sorting
| out transcriptions for my 120+ podcast episodes.
| olekenneth wrote:
| Awesome!
| magicseth wrote:
| Is it possible to create a streaming endpoint that returns
| real-time transcriptions?
| nojs wrote:
| I was working on this yesterday. It seems that the most common
| approach with Whisper is simply to break the audio into chunks
| and transcribe each one separately. This works but as you'd
| expect sometimes has trouble at the edges. The segments also
| have to be sufficiently long (like 10s) or the accuracy
| suffers, meaning it's not truly real-time.
|
| You could do better by overlapping the segments, except then
| stitching the transcriptions together becomes an issue since
| whisper doesn't provide reliable per-token timestamps [0], and
| the output of the common part of overlapping segments isn't
| necessarily the same. I can imagine a cool approach where you
| transcribe long, overlapping chunks in real-time and
| intelligently merge the stream of words somehow though.
|
| Some more useful discussion here (whisper.cpp project, but
| still relevant) [1].
|
| 0. https://github.com/openai/whisper/discussions/332
|
| 1. https://github.com/ggerganov/whisper.cpp/issues/10
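The overlapping-chunk idea can be sketched as follows. This is a
naive version under stated assumptions: chunk and overlap lengths are
arbitrary defaults, and the merge only removes an exact
(case-insensitive) word match between the end of one transcript and
the start of the next. Since the two transcriptions of the shared
audio are not guaranteed to agree, a real merge would need fuzzier
matching. chunk_bounds and merge_words are hypothetical names.

```python
def chunk_bounds(total_s, chunk_s=30.0, overlap_s=5.0):
    """(start, end) windows covering total_s seconds, each overlapping the last."""
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap_s must be >= 0 and smaller than chunk_s")
    step = chunk_s - overlap_s
    bounds, start = [], 0.0
    while True:
        end = min(start + chunk_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            return bounds
        start += step


def merge_words(left, right, max_overlap=20):
    """Append right to left, dropping the longest prefix of right that
    repeats the tail of left (compared case-insensitively)."""
    limit = min(len(left), len(right), max_overlap)
    for k in range(limit, 0, -1):
        tail = [w.lower() for w in left[-k:]]
        head = [w.lower() for w in right[:k]]
        if tail == head:
            return left + right[k:]
    # No shared words found: fall back to plain concatenation.
    return left + right
```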
| silviot wrote:
| People interested in this might also be interested in
| transcribe-anything [1].
|
| It automates video fetching and uses whisper to generate .srt,
| .vtt and .txt files.
|
| [1] https://github.com/zackees/transcribe-anything
| olekenneth wrote:
| This has the same output formats, plus its own Jojo format that
| opens in a currently internal Mac app and in the editor feature
| branch.
| raybb wrote:
| Whisper-UI is also looking really nice lately but I think it's
| still pretty early in development. The ability to click on the
| transcript and hear the sound of that particular moment is great.
| https://github.com/hayabhay/whisper-ui
___________________________________________________________________
(page generated 2023-02-13 23:02 UTC)