[HN Gopher] Show HN: I made a free transcription service powered...
       ___________________________________________________________________
        
       Show HN: I made a free transcription service powered by Whisper AI
        
       Author : mayeaux
       Score  : 197 points
       Date   : 2022-11-18 22:33 UTC (1 day ago)
        
 (HTM) web link (freesubtitles.ai)
 (TXT) w3m dump (freesubtitles.ai)
        
       | smallerfish wrote:
       | I've been testing whisper on AWS - the g4dn machines are the
       | sweet spot of price/performance. It's extremely good, and there
       | will be rapid consolidation in the transcription market as a
       | result of it existing (its one major missing feature is the
       | ability to supply custom dictionaries). The fact that it does a
       | credible job at translation to English is a cherry on top.
       | 
       | Anyway, I'd love to get it running well on g5g, but they seem
       | extremely temperamental. If anybody has, please let me know your
       | secret. :)
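       | 
       | (For reference, the two tasks mentioned above, via the stock
       | openai-whisper Python package - a minimal sketch with
       | placeholder file names:)
       | 
       |   import whisper
       | 
       |   model = whisper.load_model("medium")  # or "large"
       | 
       |   # plain transcription in the source language
       |   print(model.transcribe("meeting.mp3")["text"])
       | 
       |   # translation: any supported language -> English
       |   print(model.transcribe("interview.mp3", task="translate")["text"])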
        
       | pi3rre wrote:
       | Where is your server hosted?
        
         | mayeaux wrote:
         | A 2x RTX A6000 server on Vast.ai, with another server
         | running nginx as a reverse proxy.
        
           | [deleted]
        
       | gamegoblin wrote:
       | I recently tried Whisper to transcribe our local Seattle Fire
       | Department radio scanner -- unfortunately it was not reliable
       | enough for my use case, e.g. "adult male hit by car" gets
       | transcribed as "don't mail it by car".
       | 
       | I imagine future models will allow the user to input some context
       | to disambiguate. Like if I could give it the audio along with the
       | context "Seattle Fire Department and EMS radio traffic", it would
       | bias towards the type of things you'd likely hear on such a
       | channel.
        
         | [deleted]
        
         | RockRobotRock wrote:
         | Was there a big difference in accuracy depending on which model
         | you used?
        
           | gamegoblin wrote:
           | Yes, large was by far the best, but still not accurate enough
           | that I'd be willing to put it into a fully automated
           | pipeline. It would have gotten it right probably 75% of the
           | time. Anything other than the large model was far too bad to
           | even think about using.
        
             | mayeaux wrote:
             | Whisper does pretty well, even with background music and
             | things like that. I think you're working with a pretty
             | unusual subsection of recorded audio; for that edge case
             | to work you'll very likely need to train your own model.
        
             | kkielhofner wrote:
             | What was the performance, resource usage, etc of doing this
             | with large? What's the speed like?
             | 
             | I'm still getting spun up on this but base delivers a
             | pretty impressive 5-20x realtime on my RTX 3090. I haven't
             | gotten around to trying the larger models and with only
             | 24GB of VRAM I'm not sure what kind of success I'll have
             | anyway...
             | 
             | In my case the goal was to actually generate tweets based
             | on XYZ. As I've already said there were serious technical
             | challenges so I abandoned the project but I was also a
             | little concerned about the privacy, safety, etc issues of
             | realtime or near-realtime reporting on public safety
             | activity. I also streamed to Broadcastify and it really
             | seems like they insert an artificial delay because of
             | these concerns.
        
               | IanCal wrote:
               | You can run the larger models just fine on a 3090. Large
               | takes about 10G for transcribing English.
               | 
               | For a 1:17 file it takes:
               | 
               | 6s for base.en, I think 2s to load the model based on the
               | sound of my power supply.
               | 
               | 33s for large, I think 11s of which is loading the model.
               | 
               | Varies a lot with how dense the audio file is, this was
               | me giving a talk so not the fastest and quite clean
               | audio.
               | 
               | While I saw near perfect or perfect performance on many
               | things with smaller models, the largest really are
               | better. I'll upload a gist in a bit with Rap God passed
               | through base.en and large.
               | 
               | edit -
               | 
               | Timings (explicitly marked as language en and task
               | transcribe):
               | 
               | base.en => 23s
               | 
               | large => 2m50
               | 
               | Audio length 6m10
               | 
               | Results (nsfw, it's Rap God by Eminem):
               | https://gist.github.com/IanCal/c3f9bcf91a79c43223ec59a56569c...
               | 
               | Base model does well, given that it's a rap. Large model
               | just does incredibly, imo. Audio is very clear, but it
               | does have music too.
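               | 
               | (For anyone reproducing the timings, a sketch of pinning
               | the language and task with the Python API - the file
               | name is a placeholder:)
               | 
               |   import time
               |   import whisper
               | 
               |   model = whisper.load_model("large")  # ~10G VRAM
               | 
               |   start = time.time()
               |   result = model.transcribe("talk.mp3", language="en",
               |                             task="transcribe")
               |   print(f"took {time.time() - start:.1f}s")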
        
               | kkielhofner wrote:
               | > based on the sound of my power supply
               | 
               | Hah, I love that - "benchmark by fan speed".
               | 
               | Good to know - I've tried large and it works but in my
               | case I'm using whisper-asr-webservice[0] which loads the
               | configured model for each of the workers on startup. I
               | have some prior experience with Gunicorn and other WSGI
               | implementations so there's some playing around and
               | benchmarking to be done on the configured number of
               | workers as the GPU utilization of Whisper is a little
               | spiky and whisper-asr-webservice does file format
               | conversion on CPU via ffmpeg. Default was two workers, is
               | now one but I've found as many as four with base can
               | really improve overall utilization, response time, and
               | scale (which certainly won't be possible with large).
               | 
               | OP's node+express implementation shells out to Whisper
               | which gives more control (like runtime specification of
               | model) but almost certainly has to end up slower and less
               | efficient in the long run as the model is obviously
               | loaded from scratch on each invocation. I'm front-ending
               | whisper-asr-webservice with traefik so I could certainly
               | do something like having two separate instances (one for
               | base, another for large) at different URL paths but like
               | I said I need to do some playing around with it. The
               | other issue is that if this is being made available to
               | the public, I doubt I'd be comfortable without front-
               | ending the entire thing with Cloudflare (or similar),
               | and Cloudflare (and others) have things like 100s
               | timeouts for the final HTTP response (websockets could
               | get around this).
               | 
               | Thanks for providing the Slim Shady examples, as a life-
               | long hip hop enthusiast I'm not offended by the content
               | in the slightest.
               | 
               | [0] - https://github.com/ahmetoner/whisper-asr-webservice
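               | 
               | (Calling such a service from Python looks roughly like
               | this - the endpoint and field names are what I
               | understand whisper-asr-webservice's defaults to be, so
               | double-check against its docs:)
               | 
               |   import requests
               | 
               |   with open("audio.mp3", "rb") as f:
               |       resp = requests.post(
               |           "http://localhost:9000/asr",
               |           params={"task": "transcribe", "language": "en"},
               |           files={"audio_file": f},
               |       )
               |   print(resp.text)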
        
         | sf4lifer wrote:
         | Why did you want to transcribe it? What would you do with the
         | output?
        
           | gamegoblin wrote:
           | I wanted to make a twitter bot that posted whenever a
           | pedestrian/cyclist got hit by a car.
           | 
           | I'd need to:
           | 
           | - Wait for a MED6 or AID response code to come across the
           | live event stream
           | 
           | - Listen to the radio chatter to see if it was a pedestrian
           | getting hit (use GPT3 on the transcription to determine if
           | the text was about a ped/cyclist getting hit)
           | 
           | - Maybe also correlate to SPD logging a 'mvc with injuries'
           | at the same location
        
             | [deleted]
        
         | mayeaux wrote:
         | I (used to) use simonsaysai.com to generate subtitles and
         | they had the functionality to input specialized vocabulary,
         | so I suppose it's possible in some sense, but I don't know
         | how it would work with Whisper; something to ask on their
         | GitHub if nobody else has yet, I suppose.
         | 
         | But, for me, the English model works really well. The
         | 'large' model works about perfectly; I can't think of
         | anything I thought it got too badly wrong. Is that the
         | model you tried?
        
           | gamegoblin wrote:
           | Yes, the problem is that the radio chatter is just very, very
           | low quality, for a lot of words your brain just needs to know
           | the context to fill in the gaps due to radio static and such.
           | Even as a human some parts are unintelligible.
        
             | mayeaux wrote:
             | Yeah it's a hard case. Whisper with the large model is
             | among the cutting edge in the business, so if the static
             | is bad and the quality is low there's not much you can do
             | but wait for better AI, or fix whatever it gets wrong by
             | hand. You might have to wait for a bit lol
        
         | pain_perdu wrote:
         | We built a working version of this using Assembly.ai
         | 
         | I am no longer involved in the project but you're welcome to
         | contact the CTO if you're curious how it worked:
         | 
         | https://skyscraper.ai/
        
         | skim_milk wrote:
         | Have you tried the --initial_prompt CLI arg? For my use, I put
         | a bunch of industry jargon and names that are commonly
         | misspelled in there and that fixes 1/3 to 1/2 of the errors.
         | 
         | I was initially going to use Azure Cognitive Services and
         | train it on a small amount of test data. After Whisper was
         | released for free, I now use Whisper + OpenAI GPT-3 trained
         | to fix the transcription errors by 1) taking a sample of
         | transcripts by Whisper, 2) fixing the errors, and 3) fine-
         | tuning GPT-3 using the unfixed transcriptions as the prompt
         | and the corrected transcripts as the result text.
         | 
         | Whisper with the --initial_prompt containing industry jargon
         | plus training GPT-3 to fix the transcription errors should be
         | nearly as accurate as using a custom-trained model in Azure
         | Cognitive Services but at 5-10% of the cost. Biggest downside
         | is the amount of labor to set that up, and the snail's pace of
         | Whisper transcriptions.
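         | 
         | (A rough sketch of that pipeline - the fine-tuned model id
         | is hypothetical, and this uses the 2022-era openai
         | Completion API:)
         | 
         |   import openai
         |   import whisper
         | 
         |   JARGON = "MED6, AID response, mvc with injuries, EMS"
         | 
         |   model = whisper.load_model("large")
         |   raw = model.transcribe("call.mp3",
         |                          initial_prompt=JARGON)["text"]
         | 
         |   # GPT-3 fine-tuned on (raw -> corrected) transcript pairs
         |   fixed = openai.Completion.create(
         |       model="davinci:ft-your-org-2022-11-18",  # hypothetical
         |       prompt=raw + "\n\n###\n\n",
         |       max_tokens=1024,
         |   )["choices"][0]["text"]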
        
           | gamegoblin wrote:
           | Thanks for the tip, that did improve accuracy a lot.
        
           | jerpint wrote:
           | There have been a lot of hacks to speed up whisper inference
        
             | skim_milk wrote:
             | Sweet! Do you have any links to resources on how to speed
             | it up? I couldn't find any while searching Google or the
             | Whisper discussion forums.
        
               | EMIRELADERO wrote:
               | Not a hack per se but a complete reimplementation.
               | 
               | https://github.com/ggerganov/whisper.cpp
               | 
               | This is a C/C++ version of Whisper which uses the CPU.
               | It's astoundingly fast. Maybe it won't work in your use
               | case, but you should try!
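               | 
               | (Usage is roughly the following, per its README at the
               | time - whisper.cpp wants 16 kHz mono WAV input, hence
               | the ffmpeg step; shown via Python subprocess:)
               | 
               |   import subprocess
               | 
               |   # first: ffmpeg -i in.mp3 -ar 16000 -ac 1 audio.wav
               |   subprocess.run(
               |       ["./main", "-m", "models/ggml-base.en.bin",
               |        "-f", "audio.wav"],
               |       check=True,
               |   )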
        
         | kkielhofner wrote:
         | The issue here is for most radio systems you end up with about
         | 3 kHz of effective audio bandwidth (sampling). Most ASR/STT
         | models are trained on at least 16 kHz audio.
         | 
         | Did you try a telephony oriented model like aspire or similar?
         | They're trained on sort-of 8 kHz audio and might work better.
         | 
         | I tried something similar for my SDR feeds and gave up
         | because it's just too challenging and niche - the sampling,
         | the jargon, the 10 codes, the background noise, static on
         | analog systems/drop-outs on digital systems, rate of speech,
         | etc. all contribute to very challenging issues for an ML
         | model.
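         | 
         | (If you do feed narrowband audio to Whisper it wants 16 kHz
         | input; a quick resampling sketch with librosa - illustrative
         | only, since upsampling adds no information the 8 kHz signal
         | didn't have:)
         | 
         |   import librosa
         |   import whisper
         | 
         |   audio, _ = librosa.load("scanner.wav", sr=8000, mono=True)
         |   audio16k = librosa.resample(audio, orig_sr=8000,
         |                               target_sr=16000)
         | 
         |   model = whisper.load_model("large")
         |   result = model.transcribe(audio16k)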
        
           | jcalvinowens wrote:
           | > the sampling, the jargon, the 10 codes, the background
           | noise, static on analog systems/drop outs on digital systems,
           | rate of speech, etc
           | 
           | Is the reduced bandwidth really the most significant problem?
           | Naively I'd think everything else you mentioned would matter
           | a lot more, I'm curious how much you experimented with that
           | specifically.
        
             | kkielhofner wrote:
             | When it all comes together it's kind of a nightmare for an
             | ASR model. There were plenty of times in reviewing the
             | recordings and ASR output where I'd listen to the audio and
             | have no idea what they said.
             | 
             | I'm not sure which contributes most but I know from my
             | prior experiences with ASR for telephony even clean speech
             | on pristine connections does much worse with models trained
             | on 16 kHz being fed native 8 kHz audio that gets resampled.
             | 
             | I've done some early work with Whisper in the telephony
             | domain (transcribing voicemails on Asterisk and Freeswitch)
             | and the accuracy already seems to be quite a bit worse.
        
               | ivalm wrote:
               | Could one train an interpolation layer (e.g. take a
               | bunch of 16k audio, downsample to 8k, train an 8k->16k
               | upsampler)? Or better yet (but more expensive), take
               | whisper, freeze it, and train the upsampler on
               | whisper's loss.
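               | 
               | (A toy version of the first idea - pair 16 kHz audio
               | with its own 8 kHz downsample and train a small
               | upsampler with L1 loss; entirely illustrative:)
               | 
               |   import torch
               |   import torch.nn as nn
               |   import torchaudio
               | 
               |   upsampler = nn.Sequential(
               |       nn.Upsample(scale_factor=2, mode="linear",
               |                   align_corners=False),
               |       nn.Conv1d(1, 32, 9, padding=4), nn.ReLU(),
               |       nn.Conv1d(32, 1, 9, padding=4),
               |   )
               |   opt = torch.optim.Adam(upsampler.parameters(),
               |                          lr=1e-4)
               | 
               |   wav16k, _ = torchaudio.load("clean.wav")  # (1, T)
               |   wav8k = torchaudio.functional.resample(wav16k,
               |                                          16000, 8000)
               | 
               |   pred = upsampler(wav8k.unsqueeze(0))      # (1, 1, ~T)
               |   pred = pred[..., :wav16k.shape[-1]]       # match length
               |   loss = nn.functional.l1_loss(pred,
               |                                wav16k.unsqueeze(0))
               |   loss.backward()
               |   opt.step()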
        
               | kkielhofner wrote:
               | "I understand some of those words."
               | 
               | Hah, in all seriousness I'm more of a practitioner in
               | this space. If this was something I absolutely needed to
               | get done, who knows where it would have gone. For a little
               | side hacking project once I encountered these issues I
               | moved on - back in the day expectations were lower for
               | telephony and the 8 kHz aspire models and kaldi were
               | adequate to get that "real work" done.
        
               | andrew3726 wrote:
               | Sure, that's called audio super resolution; there are a
               | few papers/projects doing that. I haven't really seen
               | models which are robust and have good generalization
               | though.
        
       | rexreed wrote:
       | What is the maximum length of audio allowed? What are your costs
       | in running this? Are the hardware requirements substantial?
        
         | mayeaux wrote:
         | Right now the limit is 100MB because of Cloudflare, with no
         | length limit. It costs $1.30/h to run this, which is enough
         | for a 2x RTX A6000 on Vast.ai - you can check out the specs
         | there.
        
       | lettergram wrote:
       | There are a lot of startups in the space offering
       | transcription.
       | 
       | Read.ai - https://www.read.ai/transcription
       | 
       | Provides transcription & diarization and the bot integrates into
       | your calendar. It joins all your meetings for zoom, teams, meet,
       | webex, tracks talk time, gives recommendations, etc.
       | 
       | It's amazing how quickly this space is moving. Particularly, with
       | the increase in remote work. Soon you'll be able to search all
       | your meetings and find exactly when a particular topic was
       | discussed! It's exciting.
        
         | mayeaux wrote:
         | Yeah, I was paying $100/month for transcription services and
         | it turns out Whisper with the large model was much more
         | accurate. I didn't like the UI either; I much prefer to just
         | use this app as opposed to the paid service. I chose that
         | service because it was the cheapest by far ($100/30h), as
         | opposed to most of the other paid services at $10 an hour,
         | which to me was a bit much really. Whisper is really a game
         | changer; I don't know how those companies stay in business.
        
       | pessimizer wrote:
       | Do you just hold the page open after upload and wait for it to
       | update?
        
         | mayeaux wrote:
         | Yeah, there is a websocket connection and when the
         | transcription is done it will update the frontend with the
         | links to .srt, .vtt and .txt file downloads
        
           | pessimizer wrote:
           | Thanks. Does the queue position update through the websocket
           | too?
        
             | mayeaux wrote:
             | Yeah it does. I think there's a bug in how it reports
             | your position, but when the others are done yours will
             | start correctly even if the frontend shows something like
             | position -2. There's 2 uploads in the queue atm so not bad
        
               | pessimizer wrote:
               | Thanks a lot for this. I've wanted to test whisper's
               | usefulness for vintage movie subtitling projects, but
               | haven't had such a straightforward, preconfigured
               | opportunity. I promise I'll beat the subs into some sort
               | of shape as long as the timings are at least vaguely
               | alright, and not waste your money.
        
               | mayeaux wrote:
               | Hey, glad I could be of use. The problem with Whisper
               | is that it needs a lot of GPU. My Mac can't even use
               | its GPU, so right away I had to get it up on a server.
               | Whisper is so powerful, and it's so amazing that it's
               | open source, that I'm surprised nobody did this yet. I
               | could see them charging for it, but may as well use it
               | anyway; the other services are insanely expensive
               | ($10/h?!) and I don't really like their UIs to boot lol
        
               | pessimizer wrote:
               | Nothing has come back from the two that I tried (one
               | medium in French, the other large in Spanish), meaning no
               | change on the page since I uploaded them an hour and a
               | half ago. I loaded the page again in another tab, though,
               | and after a few seconds "finishedProcessing" appeared
               | under the form. I suspect that means something.
               | 
               | On Firefox 102.4.0esr, also uBlock Origin.
        
               | mayeaux wrote:
               | It's probably due to me rebooting to load new code. I
               | will add a way to send a signal to the frontend to
               | inform users, but it's not implemented atm
        
               | pessimizer wrote:
               | I tried it again this morning. I'm getting all of the
               | output properly this time, but it has hung partway
               | through every time I tried.
        
               | pessimizer wrote:
               | It's me again. Ran it again, ran perfectly. Thanks for
               | all of your work.
               | 
               | edit: don't know if you'll see this any time soon, but
               | I've had it fail/hang again. You might want to take a
               | hash of uploads, so that if the lost connections still
               | end up getting transcribed, reuploads won't get
               | transcribed again.
               | 
               | Also I haven't had success in Firefox, only Chromium.
        
       | ako wrote:
       | Recently I've been using my iPad as a transcription "keyboard"
       | for my laptop when writing documents in Dropbox paper. Open the
       | document on both computers, then use dictate to enter all text,
       | and then correct if necessary with the laptop. I'm expecting to
       | move to iPad only when I can use stage manager with an external
       | monitor.
       | 
       | Dictation is also great when writing in a foreign language: I
       | speak German ok-ish, writing is harder. Dictation helps writing
       | more correct German.
        
       | Namidairo wrote:
       | Shouldn't the language and model inputs be dropdowns instead of
       | text input?
       | 
       | I'm going to hope/assume you're doing some sort of sanitisation
       | on those inputs.
       | 
       | Additionally, wouldn't you lose the language detection that's
       | done for no language input? (IIRC, it uses the first 30 seconds
       | to detect language if you don't specify one)
        
         | mayeaux wrote:
         | Yeah someone submitted a PR for those to be fixed, I'm just
         | wary about restarting the server because I haven't set up a
         | way to reboot without losing the websockets
         | 
         | Well those inputs should all error unless they are a valid
         | value.
         | 
         | Yes if nothing is input it will automatically detect the
         | language based on the first 30s of input
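         | 
         | (That detection step, more or less straight from Whisper's
         | README:)
         | 
         |   import whisper
         | 
         |   model = whisper.load_model("base")
         | 
         |   audio = whisper.load_audio("audio.mp3")
         |   audio = whisper.pad_or_trim(audio)  # first 30 seconds
         |   mel = whisper.log_mel_spectrogram(audio).to(model.device)
         | 
         |   _, probs = model.detect_language(mel)
         |   print(f"Detected language: {max(probs, key=probs.get)}")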
        
           | Namidairo wrote:
           | I wasn't even aware that it was on GitHub, which I suppose is
           | an issue in itself.
        
           | TheBrokenRail wrote:
           | > I'm just wary about restarting the server because I
           | haven't set up a way to reboot without losing the
           | websockets
           | 
           | Wait what. Not being able to safely restart the server sounds
           | like a disaster waiting to happen.
        
             | mayeaux wrote:
             | This was just a personal project a couple hours ago, so
             | it's not set up properly to do safe reboots and a lot of
             | other things. I was just using it locally and now it's in
             | the wild; it will take some time to get everything
             | refined and professional
        
               | bergenty wrote:
               | Isn't this very expensive to host? Are you aware this
               | could cost A LOT?
        
               | jazzyjackson wrote:
               | in another comment they state:
               | 
               | > I'm just running this off of a 2x RTX A6000 server on
               | Vast.ai at the moment, about $1.30/h
               | 
               | whether that's a lot is a matter of perspective
        
             | jazzyjackson wrote:
             | > losing the websockets
             | 
             | users would lose the session and have to start over, not
             | the end of the world
        
               | mayeaux wrote:
               | I'm not even using sessions, just localstorage and
               | websockets lol.
        
       | mayeaux wrote:
       | Rebooted the app due to queue upgrades; if you had a pending
       | upload please reupload it, thanks!
        
       | mayeaux wrote:
       | I've made some updates and the queue works how I wanted it
       | to. I rebooted the server and I think I will leave it like
       | this for a while
        
       | tomcam wrote:
       | Congratulations on taking this to completion and announcing here!
       | Love your approach to this!
        
         | mayeaux wrote:
         | Thanks! I wrote it for myself over a weekend and have really
         | enjoyed it ever since; I'm glad others were able to get
         | something out of it! It seems to run pretty well, but I have
         | some improvements planned. First, I will take the Whisper
         | output and feed it to you while you're in the queue so you
         | can see transcriptions progressing. It will be pretty trivial
         | to implement, and I am feeling bored in the queue at the site
         | atm, so that is the next killer feature lol
        
       | miki123211 wrote:
       | If somebody wants to run Whisper for dirt cheap, vast.ai is the
       | way to go.
       | 
       | I can usually get an (unverified) 1x RTX 3090 instance for about
       | $0.10/hr, and that processes audio at something like 1.5X speeds.
       | Unverified instances do crash once in a while, but as long as you
       | back up the output every few hours, it's fine, you just set up a
       | new one in case something happens. I wouldn't use this for
       | confidential company meetings, but it's good enough for
       | podcasts, YouTube videos and other public or semi-public stuff.
        
       | j45 wrote:
       | Exciting to see.
       | 
       | Curious if there was a benefit to using whisper over something
       | like vosk, which can transcribe on a mobile device pretty
       | decently.
       | 
       | Whisper has other interesting functionality but for straight
       | transcription it seems a bit heavy. Still learning about it and
       | putting it through its paces.
        
         | nshm wrote:
         | We did a comparison of recent Vosk and Whisper models here:
         | 
         | https://alphacephei.com/nsh/2022/10/22/whisper.html
         | 
         | In general, Whisper is more accurate but much more resource
         | heavy. Vosk runs on a single core while Whisper needs all
         | CPU cores.
         | 
         | The accuracy difference for clean speech between Vosk-small
         | and Whisper tiny is 2-3% absolute, 20% relative. Not sure
         | how important that is; I would claim it is not that critical.
         | 
         | Numbers there are for original Whisper. Whisper.cpp
         | recommended here is actually 10% worse than stock Whisper,
         | for speed considerations. Not that simple.
         | 
         | Vosk is a streaming design; you get results with a minimum
         | latency of 200ms. Whisper requires you to wait a significant
         | amount of time. If you refactor Whisper for lower latency
         | you will lose a lot of its accuracy advantage. Latency is
         | very important for interactive applications like assistants.
         | 
         | Whisper is multilingual and has punctuation, which is
         | clearly a good advantage. It also can use context properly,
         | improving results for long recordings.
         | 
         | So on mobile Vosk is still a viable option, as are many
         | other mobile-focused engines.
         | 
         | For server-based transcription Whisper is certainly better.
         | But not much better than Nvidia NeMo, for example. Not that
         | much publicity for the latter though.
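         | 
         | (To illustrate the streaming difference, Vosk consumes audio
         | in chunks and emits partials as it goes - a sketch with the
         | standard vosk package; the model path is a placeholder:)
         | 
         |   import wave
         |   from vosk import Model, KaldiRecognizer
         | 
         |   wf = wave.open("speech_16k.wav", "rb")  # 16 kHz mono PCM
         |   rec = KaldiRecognizer(Model("model"), wf.getframerate())
         | 
         |   while True:
         |       data = wf.readframes(4000)
         |       if len(data) == 0:
         |           break
         |       if rec.AcceptWaveform(data):
         |           print(rec.Result())         # finalized segment
         |       else:
         |           print(rec.PartialResult())  # low-latency partial
         |   print(rec.FinalResult())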
        
       | Void_ wrote:
       | I'm sorry to self-promote again - but: https://whispermemos.com
       | 
       | I'm in love with the idea of pressing a button on my Lock
       | Screen and getting a perfect transcription in my inbox.
       | 
       | Also, just added emoji summarization in email subject, a small
       | visual reminder of what your memo was about.
       | 
       | I hope this is useful to someone!
        
         | o_____________o wrote:
         | That's a cool idea. How about integrations like GitHub or
         | Notion which would write out a markdown file?
        
         | fire wrote:
         | is the app open source? I'm on android, so :c
        
           | nsriv wrote:
           | If you don't mind non-OSS, the Google Recorder App is
           | incredible.
        
       | krick wrote:
       | I didn't manage to transcribe anything (it just doesn't
       | remember that I submitted anything), but whatever, I didn't
       | need to anyway. I just wanted to ask: how good is Whisper with
       | non-English? At least "major" languages, like German, French,
       | Russian, Spanish?
        
         | smallerfish wrote:
         | Spanish is great.
        
         | mayeaux wrote:
         | I've had people test it, and usually with the 'medium' and
         | 'large' models it works really well. Honestly, I just use
         | the large model for everything, because you might as well
         | have the best quality if you're going to make the effort.
        
         | stefan_ wrote:
         | I've been using it on lots of Russian stuff and it's great,
         | even translates for you.
        
         | gdz wrote:
         | WER breakdown by language on the Fleurs dataset, using the large
         | model: https://github.com/openai/whisper/blob/main/language-
         | breakdo...
         | 
         | from https://github.com/openai/whisper#available-models-and-
         | langu...
        
         | jerpint wrote:
         | I've tried it on French and results were pretty good
        
       | braindead_in wrote:
       | This is cool. We have integrated Whisper with our human-in-
       | the-loop tech at Scribie [1] and the results have been great.
       | We offer free credits if you want to try it out.
       | 
       | [1] https://scribie.com
        
       | donqu1xote1 wrote:
       | This is an exciting one! We are building an open-source
       | low-code alternative to Retool, and I think we can build an
       | integration with your project! Take a look at ours and see if
       | you want to collaborate: https://github.com/illacloud/illa-builder
        
         | bl4ckneon wrote:
         | Please stop spamming your link everywhere. You know people can
         | see your post history?
        
       | si_164 wrote:
       | I've been running Whisper in the terminal with Python and
       | I've found it surprisingly accurate with transcription, even
       | from Chinese.
       | 
       | Just gave your site a try, nicely done. One piece of feedback:
       | it would be great to have a progress indicator on the
       | processing page; I have no idea what stage it's at or how much
       | longer I need to wait.
        
         | mayeaux wrote:
         | Yeah, Whisper is top of the line; they posted their
         | performance compared to the industry standard and it's right
         | in there at the top.
         | 
         | It _should_ show the data while processing... it's set up to
         | just take whatever stdout/stderr comes back from Whisper and
         | send it directly to the frontend via websockets. I'm
         | surprised you got stuck there :thinking:
        
       | mayeaux wrote:
       | Okay that should be my last reboot for a while, I've got it ready
       | how I'd like it for now, feel free to give any feedback!
        
       | marcooliv wrote:
       | Good luck with your AWS bill Haha. But seriously. How?
        
         | mayeaux wrote:
         | I am just paying for a somewhat expensive server and I love how
         | it's really fast but also I have a lot of free GPU time so
         | might as well let others use it too lol. It's an experiment to
         | see if people will use it productively or if someone will abuse
         | it and ruin it for others lol
        
           | marcooliv wrote:
           | I appreciate your initiative!
        
             | mayeaux wrote:
             | Thanks! Whisper is a lot of fun, but it didn't take long
             | before I wanted to build a frontend. And then I built
             | something that I think came out super nice, so why not
             | share it with people. I used to pay $100/month for
             | transcriptions and this works a lot better for me, so
             | might as well open-source transcription if I can. But I
             | give all the credit to Whisper; the model they put out is
             | amazing
        
               | sali0 wrote:
               | Very generous of you. I made a similar free service 3
               | years ago using much worse tools and it's so cool to see
               | whisper making it all so much better and more efficient.
               | efficient. Thanks for releasing it for free
        
               | mayeaux wrote:
               | No problem! I am just seeing how it runs, I might throw
               | up a referral link to Vast and put up a tutorial on how
               | to host your own service, maybe that can offset the cost
               | a bit? The current server is $700/month, maybe it could
               | just run off donations who knows
        
               | bayramn wrote:
               | $700/month? Why not DigitalOcean? I am new to Python
               | and ML, and curious to know why it costs that much.
        
           | knicholes wrote:
             | > someone will abuse it and ruin it for others
        
             | mayeaux wrote:
             | Maybe I will put in some mechanism to prevent that but for
             | now I just want to see if people could find it useful. I
             | also have the code open source and will write tutorials for
             | people to put up their own instance as well
        
               | rexreed wrote:
               | I'd love to read a tutorial on how to do this for myself.
        
               | trompetenaccoun wrote:
               | It would be more useful if one could directly paste
               | links to videos online as well. But yeah, in general
               | this is extremely useful. I'm looking forward to video
               | site integration. It would be great if YouTube could
               | finally retire their horrible auto-caption function for
               | something that actually works. Being able to easily
               | watch media in different languages from around the
               | world will be an absolute game changer.
        
               | mayeaux wrote:
               | Also, I have that tested (auto-download) with
               | youtube-dl; it works fine but I haven't put it into the
               | frontend yet. May as well though, it helps a lot on
               | your own instance so you don't have to download the
               | video first and then upload it
        
               | mayeaux wrote:
               | I also plan to support automatic language translation;
               | I have that working locally already, actually. And I
               | work for one of the big alt-video platforms, and rumour
               | has it that I will be shipping this feature for them
               | soon (auto transcription with auto-translated subtitles)
        
               | trompetenaccoun wrote:
               | Really cool, that would be a killer feature. Definitely
               | post here when this gets released!
        
               | mayeaux wrote:
               | Yeah, it's all ready to go using LibreTranslate; they
               | have about 25 languages. Maybe I'll finish that this
               | weekend and put it up. It's really inexpensive to make
               | the translations compared to making the original
               | transcription, so may as well. Coming soon!
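               | 
               | (For reference, LibreTranslate's HTTP API is a single
               | POST - a sketch against a self-hosted instance, URL is
               | a placeholder:)
               | 
               |   import requests
               | 
               |   resp = requests.post(
               |       "http://localhost:5000/translate",
               |       json={"q": "Hello, world", "source": "en",
               |             "target": "es", "format": "text"})
               |   print(resp.json()["translatedText"])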
        
               | tipsytoad wrote:
               | Would love to locally host this, do you have a source?
        
               | mayeaux wrote:
               | No docs or anything yet but:
               | https://github.com/mayeaux/generate-subtitles
        
               | Super_Jambo wrote:
               | thank you!
        
             | giantg2 wrote:
             | Not even abuse, but just intensive use cases. Like the
             | guy who posted a few days ago about recording and
             | transcribing all day.
        
               | mayeaux wrote:
               | I set up the server to only transcribe two files at a
               | time, so yeah, someone could abuse it for sure with two
               | big uploads and stick everyone else in the queue. But
               | even a 3 hour video transcribes with the large model in
               | about ~30 minutes, so it wouldn't be too bad. Hopefully
               | everyone is conscientious enough not to do that; so far
               | nobody has abused it, which is cool.
        
               | kkielhofner wrote:
               | Me again - why two at a time? In my initial testing with
               | whisper-asr-webservice and my RTX 3090 I could pretty
               | easily throw ~10 different files at it simultaneously as
               | there is some natural staggering between API entry, CPU
               | conversion/resampling/transcoding of audio, the actual
               | audio length, network effects like upload speed, etc.
               | 
               | I also implemented some anti-abuse-ish features between
               | traefik and Cloudflare that should help it stand up
               | better in the face of bad actors abusing it.
               | 
               | Certainly not something to necessarily depend on but I
               | thought I'd mention it.
        
               | oefrha wrote:
               | > I am just paying for a somewhat expensive server and I
               | love how it's really fast but also I have a lot of free
               | GPU time so might as well let others use it too lol.
               | 
               | They are donating some spare capacity.
        
       | jw1224 wrote:
       | Free startup idea: Use Whisper with pyannote-audio[0]'s speaker
       | diarization. Upload a recording, get back a multi-speaker
       | annotated transcription.
       | 
       | Make a JSON API and I'll be your first customer.
       | 
       | [0] https://github.com/pyannote/pyannote-audio
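       | 
       | (A rough sketch of the kludge - the pyannote 2.x pipeline
       | name is assumed, and the merge is a naive midpoint-overlap:)
       | 
       |   import whisper
       |   from pyannote.audio import Pipeline
       | 
       |   asr = whisper.load_model("large")
       |   # may need use_auth_token=... for the pretrained pipeline
       |   diar = Pipeline.from_pretrained("pyannote/speaker-diarization")
       | 
       |   segments = asr.transcribe("meeting.wav")["segments"]
       |   turns = [(t.start, t.end, spk) for t, _, spk in
       |            diar("meeting.wav").itertracks(yield_label=True)]
       | 
       |   def speaker_at(time):
       |       # naive: first speaker turn containing this timestamp
       |       return next((s for a, b, s in turns if a <= time <= b),
       |                   "UNKNOWN")
       | 
       |   for seg in segments:
       |       mid = (seg["start"] + seg["end"]) / 2
       |       print(f'{speaker_at(mid)}: {seg["text"].strip()}')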
        
         | mayeaux wrote:
         | This is my first time hearing about it, but it looks good.
         | Even if it does detect that separate entities are talking,
         | how does it label them in a way that's helpful/useful for
         | you? I guess it comes out as 'Speaker 1', 'Speaker 2', etc.
         | in the end? And you can find/replace the speakers with the
         | actual people?
        
         | kkielhofner wrote:
         | I think there's been talk to do speaker diarization with
         | whisper-asr-webservice[0] which is also written in python and
         | should be able to make use of goodies such as pyannote-audio,
         | py-webrtcvad, etc.
         | 
         | Whisper is great but at the point we get to kludging various
         | things together it might start to make more sense to use
         | something like Nvidia NeMo[1] which was built with all of this
         | in mind and more.
         | 
         | [0] - https://github.com/ahmetoner/whisper-asr-webservice
         | 
         | [1] - https://github.com/NVIDIA/NeMo
        
         | coder543 wrote:
         | It's not as if people aren't trying to do that:
         | https://github.com/openai/whisper/discussions/264
         | 
         | I tried out this notebook about a month ago, and it was
         | _rough_. After spending an evening improving it, I got
         | everything  "working", but pyannote was not reliable. I tried
         | it against an hour-ish audio sample, and I found no way to tune
         | pyannote to keep track of ~10 speakers over the course of that
         | audio. It would identify some of the earlier speakers, but then
         | it felt like it lost attention and would just start labeling
         | every new speaker as the same speaker. There is an option to
         | force the minimum number of speakers higher, and that just
         | caused it to split some of the earlier speakers into multiple
         | labels. It did nothing to address the latter half of the audio.
         | 
         | So, sure, someone should continue working on putting the pieces
         | together, and I'm sure the notebook in the discussion I linked
         | has probably improved since then, but I think pyannote itself
         | needs some improvement first.
         | 
         | Sadly, I think using separate models for transcription and
         | diarization ends up being clunky to the point that it won't
         | ever be polished, no matter how good pyannote might get. If you
         | have a podcast-like environment where people get excited and
         | start talking over each other, then even if pyannote correctly
         | identifies all of the speakers during the overlapping segments
         | and when they spoke... Whisper cannot be used to separate
         | speakers. You end up with either duplicate transcripts
         | attributed to everyone involved, or something worse.
         | Impressively, I _have_ seen pyannote do exactly that, when
         | it's working.
         | 
         | At the end of the day, I think someone is going to need to
         | either train Whisper to also perform diarization, or we're
         | going to need to wait until someone else open sources a model
         | that does both transcription and diarization simultaneously.
         | Unfortunately, it seems like most of these really big advances
         | in ML only happen when a corporate benefactor is willing to
         | dump money into the problem and then release the result, so we
         | might be waiting awhile. I'm trying to learn more about machine
         | learning, but I'm not at the point where I have any realistic
         | chance of making such an improvement to Whisper. Maybe someone
         | else around here can prove me wrong by just making it happen.
        
           | password4321 wrote:
           | Speaker recognition is another piece that isn't usually as
           | high a priority as recognizing the speech.
        
             | mayeaux wrote:
             | It's a new thing to me, I hadn't really considered it. Do
             | they have that for movies and stuff? I can't think of a
             | clear case when I've seen it
        
         | [deleted]
        
         | ramraj07 wrote:
         | And you expect the API to be free? If not why not use one of a
         | million other such services?
        
           | mayeaux wrote:
           | Well the code is open source, I don't know what I plan to do
           | with this, depends on how people like it, but for the
           | meantime you can use it to transcribe stuff for free which is
           | a victory unto itself
        
           | pessimizer wrote:
           | It's the idea that's free, not the API.
        
       | mayeaux wrote:
       | Yeah, I apologize, the queue is a little messed up. It's not
       | showing your progress properly and it's not stopping people's
       | processing if they leave (their websocket dies). I'm going to
       | fix these and reboot the server and the experience will be a
       | lot better, sorry. It is working properly though and
       | transcribing all this stuff, but the queue needs some TLC, brb.
        
       | seligman99 wrote:
       | It broke when I tried to feed it an entire podcast file, but
       | still, I took this as a push to try out Whisper AI for myself;
       | it turns out it's easier to use than I thought. Long story
       | short, I used it to transcriptify a podcast:
       | 
       | https://scotts-podcasts.s3.amazonaws.com/temp/whisper/Intern...
       | 
       | Not sure if there's a use for this that's not me, but I like the
       | idea of having subtitles for a podcast I'm listening to.
        
         | KevinBenSmith wrote:
         | In that case you could have a look at the Snipd podcast app.
         | They have Whisper built in :)
        
         | JanSt wrote:
         | What tool did you use for the player-text presentation?
        
           | seligman99 wrote:
           | Mostly just vanilla JS on that page, and a tiny bit of Python
           | glue code to turn the WebVTT output from Whisper into a data
           | format for the JS.
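           | 
           | (If anyone wants to replicate the glue, the webvtt-py
           | package makes it a few lines - field names per its docs:)
           | 
           |   import json
           |   import webvtt
           | 
           |   cues = [{"start": c.start, "end": c.end, "text": c.text}
           |           for c in webvtt.read("podcast.vtt")]
           |   with open("cues.json", "w") as f:
           |       json.dump(cues, f)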
        
         | bergenty wrote:
         | It's stunning how it just looks like human-written copy. The
         | punctuation is close to perfect and I couldn't find any
         | transcription mistakes.
        
       | n8cpdx wrote:
       | This is a cool project. I've been very happy with whisper as an
       | alternative to otter; it works better and solves real problems
       | for me.
       | 
       | I feel compelled to point out whisper.cpp. It may be cheaper for
       | the author but is relevant for others.
       | 
       | I was running whisper on a GTX 1070 to get decent performance;
       | it was terribly slow on an M1 Mac. Whisper.cpp has comparable
       | performance to the 1070 while running on M1 CPU. It is easy to
       | build and run and well documented.
       | 
       | https://github.com/ggerganov/whisper.cpp
       | 
       | I hope this doesn't come off the wrong way, I love this project
       | and I'm glad to see the technology democratized. Easily
       | accessible high-quality transcription will be a game changer for
       | many people and organizations.
        
         | emadda wrote:
         | How long would whisper.cpp take to transcribe 2 hours of audio
         | on M1?
        
           | kevinak wrote:
           | Not sure about M1, but on the Macbook Pro 14" with an M1 Max
           | using 8 threads I transcribed a 44 minute podcast in 16
           | minutes. So about 3x "real time" speed.
        
             | 19h wrote:
             | What model are you using? I guess large, as my M1 Max takes
             | about 1.4 min for a 4 min file (35% of total time)?
        
               | kevinak wrote:
               | Yep, large model.
        
         | tpetry wrote:
         | Thanks for sharing! I was looking for an M1 solution weeks
         | ago and couldn't find any working one. Will try that one now!
         | Looking around for servers with GPUs etc. had stopped me at
         | just playing around with it, as I got overwhelmed with
         | options.
        
       | kkielhofner wrote:
       | What resources do you have for hosting this?
       | 
       | I set up a whisper-asr-api backend this week with gobs of CPU
       | and RAM and an RTX 3090. I'd be interested in making the API endpoint
       | available to you and working on the overall architecture to
       | spread the load, improve scalability, etc.
       | 
       | Let me know!
        
         | mayeaux wrote:
         | I'm just running this off of a 2x RTX A6000 server on Vast.ai
         | at the moment, about $1.30/h and then using nginx on another
         | server to reverse proxy it to Vast
         | 
         | Open an issue on the Github repo and we can collab for sure!:
         | https://github.com/mayeaux/generate-subtitles/issues
        
           | kkielhofner wrote:
           | Cool - will do!
           | 
           | Through a series of events I'm in the beneficial position of
           | my hosting costs (real datacenter, gig port, etc) being zero
           | and the hardware has long since paid for itself. I'm almost
           | just looking for ways to make it more productive at this
           | point.
        
             | mayeaux wrote:
             | Hey, I know the feeling, I felt bad when I had my GPU just
             | sitting there and it's just a little Vast server lol. If
             | you want to use your hardware to run this software I'd be
             | more than happy to help get it setup!
        
               | kkielhofner wrote:
               | For what it's worth, my approach has been running a
               | tweaked whisper-asr-webservice[0] behind traefik behind
               | Cloudflare. Traefik enables end-to-end SSL (with
               | Cloudflare MITM, I know) and also helps put the brakes
               | on a little, so even legitimate traffic that makes it
               | through Cloudflare gets handled optimally and
               | gracefully. I could easily deploy your express + node
               | code instead (and probably will anyway because I just
               | like that approach more than python).
               | 
               | Anyway, I'll be making an issue soon!
               | 
               | [0] - https://github.com/ahmetoner/whisper-asr-webservice
        
               | mayeaux wrote:
               | Right on, looking forward to it! Yeah I saw that module
               | and was planning to use it but I just wrote up an
               | Express/Node implementation first and never really looked
               | back. But looking forward to collabing I will await your
               | issues, cheers!
        
       | notpushkin wrote:
       | Should this have the "Show HN" tag?
       | 
       | Also, labels' [for] attributes are all "file" instead of
       | "language" and "model", so all labels trigger the file selection
       | dialog on click :-)
       | 
       | UPD: https://github.com/mayeaux/generate-subtitles/pull/1
        
         | mayeaux wrote:
         | Merged, thanks. Do I just add 'Show HN:' to the beginning of
         | the title?
        
           | notpushkin wrote:
           | I think so, yeah! https://news.ycombinator.com/showhn.html
        
             | mayeaux wrote:
             | Done, thanks! I'm at the top, feels good! I got zero
             | traction on Reddit so glad to see Hacker News was there for
             | me lol
        
       | mayeaux wrote:
       | I just rebooted the server. Now if the websocket disconnects
       | (person closes the browser) it will kill the processing and
       | advance the queue, so that should help unclog it. I'm going to
       | add a couple more queue touchups and then it should be stable
       | again (no reboots), but it's running well in the meantime
        
       ___________________________________________________________________
       (page generated 2022-11-19 23:02 UTC)