[HN Gopher] Show HN: I made a free transcription service powered...
___________________________________________________________________
Show HN: I made a free transcription service powered by Whisper AI
Author : mayeaux
Score : 197 points
Date : 2022-11-18 22:33 UTC (1 day ago)
(HTM) web link (freesubtitles.ai)
(TXT) w3m dump (freesubtitles.ai)
| smallerfish wrote:
| I've been testing whisper on AWS - the g4dn machines are the
| sweet spot of price/performance. It's extremely good, and there
| will be rapid consolidation in the transcription market as a
| result of it existing (its one major missing feature is the
| ability to supply custom dictionaries). The fact that it does a
| credible job at translation to english is a cherry on top.
|
| Anyway, I'd love to get it running well on g5g, but they seem
| extremely temperamental. If anybody has, please let me know your
| secret. :)
| pi3rre wrote:
| Where is your server hosted?
| mayeaux wrote:
| A 2x RTX A6000 server on Vast.ai, with another server
| running nginx as a reverse proxy
| [deleted]
| gamegoblin wrote:
| I recently tried Whisper to transcribe our local Seattle Fire
| Department radio scanner -- unfortunately it was not reliable
| enough for my use case, e.g. "adult male hit by car" gets
| transcribed as "don't mail it by car".
|
| I imagine future models will allow the user to input some context
| to disambiguate. Like if I could give it the audio along with the
| context "Seattle Fire Department and EMS radio traffic", it would
| bias towards the type of things you'd likely hear on such a
| channel.
| [deleted]
| RockRobotRock wrote:
| Was there a big difference in accuracy depending on which model
| you used?
| gamegoblin wrote:
| Yes, large was by far the best, but still not accurate enough
| that I'd be willing to put it into a fully automated
| pipeline. It would have gotten it right probably 75% of the
| time. Anything other than the large model was far too
| inaccurate to even consider using.
| mayeaux wrote:
| Whisper does pretty well, even with background music and
| things like that. I think you're working with a pretty
| unusual subsection of recorded audio that just won't work;
| for that edge case you'll very likely need to train your
| own model.
| kkielhofner wrote:
| What was the performance, resource usage, etc of doing this
| with large? What's the speed like?
|
| I'm still getting spun up on this but base delivers a
| pretty impressive 5-20x realtime on my RTX 3090. I haven't
| gotten around to trying the larger models and with only
| 24GB of VRAM I'm not sure what kind of success I'll have
| anyway...
|
| In my case the goal was to actually generate tweets based
| on XYZ. As I've already said there were serious technical
| challenges so I abandoned the project but I was also a
| little concerned about the privacy, safety, etc issues of
| realtime or near-realtime reporting on public safety
| activity. I also streamed to Broadcastify, and it really
| seems like they insert artificial delay because of these
| concerns.
| IanCal wrote:
| You can run the larger models just fine on a 3090. Large
| takes about 10G for transcribing English.
|
| For a 1:17 file it takes:
|
| 6s for base.en, I think 2s to load the model based on the
| sound of my power supply.
|
| 33s for large, I think 11s of which is loading the model.
|
| Varies a lot with how dense the audio file is, this was
| me giving a talk so not the fastest and quite clean
| audio.
|
| While I saw near perfect or perfect performance on many
| things with smaller models, the largest really are better.
| I'll upload a gist in a bit with Rap God passed through
| base.en and large.
|
| edit -
|
| Timings (explicitly marked as language en and task
| transcribe):
|
| base.en => 23s
|
| large => 2m50
|
| Audio length 6m10
|
| Results (nsfw, it's Rap God by Eminem):
| https://gist.github.com/IanCal/c3f9bcf91a79c43223ec59a56569c...
|
| Base model does well, given that it's a rap. Large model
| just does incredibly, imo. Audio is very clear, but it
| does have music too.
| kkielhofner wrote:
| > based on the sound of my power supply
|
| Hah, I love that - "benchmark by fan speed".
|
| Good to know - I've tried large and it works, but in my
| case I'm using whisper-asr-webservice[0], which loads the
| configured model for each of the workers on startup. I
| have some prior experience with Gunicorn and other WSGI
| implementations, so there's some playing around and
| benchmarking to be done on the configured number of
| workers, as the GPU utilization of Whisper is a little
| spiky and whisper-asr-webservice does file format
| conversion on CPU via ffmpeg. The default was two workers;
| it's now one, but I've found as many as four with base can
| really improve overall utilization, response time, and
| scale (which certainly won't be possible with large).
|
| OP's node+express implementation shells out to Whisper,
| which gives more control (like runtime specification of
| the model) but almost certainly ends up slower and less
| efficient in the long run, as the model is loaded from
| scratch on each invocation. I'm front-ending
| whisper-asr-webservice with traefik, so I could certainly
| do something like having two separate instances (one for
| base, another for large) at different URL paths, but like
| I said I need to do some playing around with it. The other
| issue is that if this is being made available to the
| public, I doubt I'd be comfortable without front-ending
| the entire thing with Cloudflare (or similar), and
| Cloudflare (and others) have things like 100s timeouts
| for the final HTTP response (WebSockets could get around
| this).
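|
| To illustrate, keeping the model resident amounts to this
| (a minimal Python sketch, not OP's actual node code):
|
|     import whisper
|
|     # Load once at startup; every request after that only
|     # pays for inference, not for re-reading the weights.
|     model = whisper.load_model("large")
|
|     def transcribe(path, language=None):
|         return model.transcribe(path, language=language)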
|
| Thanks for providing the Slim Shady examples, as a life-
| long hip hop enthusiast I'm not offended by the content
| in the slightest.
|
| [0] - https://github.com/ahmetoner/whisper-asr-webservice
| sf4lifer wrote:
| Why did you want to transcribe it? What would you do with the
| output?
| gamegoblin wrote:
| I wanted to make a twitter bot that posted whenever a
| pedestrian/cyclist got hit by a car.
|
| I'd need to:
|
| - Wait for a MED6 or AID response code to come across the
| live event stream
|
| - Listen to the radio chatter to see if it was a pedestrian
| getting hit (use GPT3 on the transcription to determine if
| the text was about a ped/cyclist getting hit)
|
| - Maybe also correlate to SPD logging a 'mvc with injuries'
| at the same location
| [deleted]
| mayeaux wrote:
| I (used to) use simonsaysai.com to generate subtitles, and
| they had the functionality to input specialized vocabulary,
| so I suppose it's possible in some sense. I don't know how
| it would work with Whisper, though - something to ask on
| their GitHub if nobody else has yet, I suppose.
|
| But, for me, the English model works really well. The
| 'large' model works nearly perfectly for me; I can't think
| of anything I thought it got too badly wrong. Is that the
| model you tried?
| gamegoblin wrote:
| Yes, the problem is that the radio chatter is just very, very
| low quality, for a lot of words your brain just needs to know
| the context to fill in the gaps due to radio static and such.
| Even as a human some parts are unintelligible.
| mayeaux wrote:
| Yeah, it's a hard case. Whisper with the large model is
| among the cutting edge in the business, so if the static
| is bad and the quality is low there's not much you can do
| but wait for better AI, or fix whatever it gets wrong by
| hand. You might have to wait for a bit lol
| pain_perdu wrote:
| We built a working version of this using Assembly.ai
|
| I am no-longer involved in the project but you're welcome to
| contact the CTO if you're curious how it worked:
|
| https://skyscraper.ai/
| skim_milk wrote:
| Have you tried the --initial_prompt CLI arg? For my use, I put
| a bunch of industry jargon and names that are commonly
| misspelled in there and that fixes 1/3 to 1/2 of the errors.
|
| I was initially going to use Azure Cognitive Services and
| train it on a small amount of test data. After Whisper was
| released for free, I now use Whisper + OpenAI GPT-3 trained
| to fix the transcription errors, by 1) taking a sample of
| transcripts by Whisper, 2) fixing the errors, and 3)
| fine-tuning GPT-3 using the unfixed transcriptions as the
| prompt and the corrected transcripts as the result text.
|
| Whisper with the --initial_prompt containing industry jargon
| plus training GPT-3 to fix the transcription errors should be
| nearly as accurate as using a custom-trained model in Azure
| Cognitive Services but at 5-10% of the cost. Biggest downside
| is the amount of labor to set that up, and the snail's pace of
| Whisper transcriptions.
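|
| For reference, the Python call is roughly this (a sketch;
| the jargon list is made up):
|
|     import whisper
|
|     model = whisper.load_model("large")
|     result = model.transcribe(
|         "call.wav",
|         # bias decoding toward domain vocabulary that
|         # otherwise gets transcribed phonetically
|         initial_prompt="SCADA, PLC, Modbus, OPC-UA, HMI",
|     )
|     print(result["text"])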
| gamegoblin wrote:
| Thanks for the tip, that did improve accuracy a lot.
| jerpint wrote:
| There have been a lot of hacks to speed up whisper inference
| skim_milk wrote:
| Sweet! Do you have any links to resources on how to speed
| it up? I couldn't find any while searching Google or the
| Whisper discussion forums.
| EMIRELADERO wrote:
| Not a hack per se but a complete reimplementation.
|
| https://github.com/ggerganov/whisper.cpp
|
| This is a C/C++ version of Whisper which uses the CPU.
| It's astoundingly fast. Maybe it won't work in your use
| case, but you should try!
| kkielhofner wrote:
| The issue here is for most radio systems you end up with about
| 3 kHz of effective audio bandwidth (sampling). Most ASR/STT
| models are trained on at least 16 kHz audio.
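|
| (The resampling step itself is trivial - e.g. with
| torchaudio, as sketched below - the problem is that the
| information above ~4 kHz is simply gone:)
|
|     import torchaudio
|     import torchaudio.functional as F
|
|     # 8 kHz-ish scanner capture -> the 16 kHz input most
|     # ASR models expect; the sample rate is restored, the
|     # lost bandwidth is not.
|     wav, sr = torchaudio.load("scanner.wav")
|     wav16k = F.resample(wav, orig_freq=sr, new_freq=16000)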
|
| Did you try a telephony oriented model like aspire or similar?
| They're trained on sort-of 8 kHz audio and might work better.
|
| I tried something similar for my SDR feeds and gave up
| because it's just too challenging and niche - the sampling,
| the jargon, the 10 codes, the background noise, static on
| analog systems/drop outs on digital systems, rate of
| speech, etc. all contribute to very challenging issues for
| an ML model.
| jcalvinowens wrote:
| > the sampling, the jargon, the 10 codes, the background
| noise, static on analog systems/drop outs on digital systems,
| rate of speech, etc
|
| Is the reduced bandwidth really the most significant problem?
| Naively I'd think everything else you mentioned would matter
| a lot more, I'm curious how much you experimented with that
| specifically.
| kkielhofner wrote:
| When it all comes together it's kind of a nightmare for an
| ASR model. There were plenty of times in reviewing the
| recordings and ASR output where I'd listen to the audio and
| have no idea what they said.
|
| I'm not sure which contributes most but I know from my
| prior experiences with ASR for telephony even clean speech
| on pristine connections does much worse with models trained
| on 16 kHz being fed native 8 kHz audio that gets resampled.
|
| I've done some early work with Whisper in the telephony
| domain (transcribing voicemails on Asterisk and Freeswitch)
| and the accuracy already seems to be quite a bit worse.
| ivalm wrote:
| Could one train an interpolation layer (e.g. take a bunch
| of 16k audio, downsample to 8k, train an 8k->16k
| upsampler)? Or better yet (but more expensive), take
| Whisper, freeze it, and train the upsampler on Whisper's
| loss.
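|
| Hypothetically, something like this (a made-up PyTorch
| sketch; the layer sizes are arbitrary):
|
|     import torch.nn as nn
|     import torch.nn.functional as F
|
|     class Upsampler(nn.Module):
|         """Learned 8 kHz -> 16 kHz super-resolution."""
|         def __init__(self):
|             super().__init__()
|             self.refine = nn.Sequential(
|                 nn.Conv1d(1, 32, 9, padding=4), nn.ReLU(),
|                 nn.Conv1d(32, 1, 9, padding=4),
|             )
|
|         def forward(self, x):  # x: (batch, 1, samples)
|             x = F.interpolate(x, scale_factor=2,
|                               mode="linear",
|                               align_corners=False)
|             return x + self.refine(x)  # residual fix-up
|
| Training data would be free: downsample 16 kHz clips to
| 8 kHz and regress back to the originals.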
| kkielhofner wrote:
| "I understand some of those words."
|
| Hah, in all seriousness I'm more of a practitioner in
| this space. If this was something I absolutely needed to
| get done, who knows where it would have gone. For a little
| side hacking project, once I encountered these issues I
| moved on - back in the day expectations were lower for
| telephony, and the 8 kHz aspire models and Kaldi were
| adequate to get that "real work" done.
| andrew3726 wrote:
| Sure, that's called audio super resolution; there are a
| few papers/projects doing that. I haven't really seen
| models which are robust and generalize well, though.
| rexreed wrote:
| What is the maximum length of audio allowed? What are your costs
| in running this? Are the hardware requirements substantial?
| mayeaux wrote:
| Right now the limit is 100MB because of Cloudflare, with no
| length limit. It costs $1.30/h to run this - that's enough
| for a 2x RTX A6000 on Vast.ai, you can check out the specs
| there
| lettergram wrote:
| There are a lot of startups in this space offering
| transcription.
|
| Read.ai - https://www.read.ai/transcription
|
| Provides transcription & diarization and the bot integrates into
| your calendar. It joins all your meetings for zoom, teams, meet,
| webex, tracks talk time, gives recommendations, etc.
|
| It's amazing how quickly this space is moving. Particularly, with
| the increase in remote work. Soon you'll be able to search all
| your meetings and find exactly when a particular topic was
| discussed! It's exciting.
| mayeaux wrote:
| Yeah, I was paying $100/month for transcription services,
| and it turns out Whisper with the large model was much more
| accurate. I also didn't like the UI - I much prefer to just
| use this app as opposed to the paid service. I had chosen
| that one because it was the cheapest by far ($100/30h), as
| opposed to most of the other paid services, which were $10
| an hour - a bit much to me, really. But Whisper is really a
| game changer; I don't know how those companies stay in
| business.
| pessimizer wrote:
| Do you just hold the page open after upload and wait for it to
| update?
| mayeaux wrote:
| Yeah, there is a websocket connection and when the
| transcription is done it will update the frontend with the
| links to .srt, .vtt and .txt file downloads
| pessimizer wrote:
| Thanks. Does the queue position update through the websocket
| too?
| mayeaux wrote:
| Yeah it does. I think there's a bug in how it reports your
| position, but when the others are done yours will start
| correctly, even if the frontend shows something like
| position -2. There are 2 uploads in the queue atm, so not
| bad
| pessimizer wrote:
| Thanks a lot for this. I've wanted to test whisper's
| usefulness for vintage movie subtitling projects, but
| haven't had such a straightforward, preconfigured
| opportunity. I promise I'll beat the subs into some sort
| of shape as long as the timings are at least vaguely
| alright, and not waste your money.
| mayeaux wrote:
| Hey, glad I could be of use. The problem with Whisper is
| that it needs a lot of GPU - my Mac can't even use its GPU,
| so right away I had to get it up on a server. But Whisper
| is so powerful, and it's so amazing that it's open source,
| that I'm surprised nobody has done this yet. I could see
| them charging for it, but may as well use it anyway; the
| other services are insanely expensive ($10/h?!) and I don't
| really like their UIs to boot lol
| pessimizer wrote:
| Nothing has come back from the two that I tried (one
| medium in French, the other large in Spanish), meaning no
| change on the page since I uploaded them an hour and a
| half ago. I loaded the page again in another tab, though,
| and after a few seconds "finishedProcessing" appeared
| under the form. I suspect that means something.
|
| On Firefox 102.4.0esr, also uBlock Origin.
| mayeaux wrote:
| It's probably due to me rebooting to load new code. I
| will add a way to send a signal to the frontend to inform
| users, but it's not implemented atm
| pessimizer wrote:
| I tried it again this morning. I'm getting all of the
| output properly this time, but it has hung partway
| through every time I tried.
| pessimizer wrote:
| It's me again. Ran it again, ran perfectly. Thanks for
| all of your work.
|
| edit: don't know if you'll see this any time soon, but
| I've had it fail/hang again. You might want to take a
| hash of uploads, so that if the lost connections still end
| up getting transcribed, reuploads of the same files won't
| get transcribed again.
|
| Also I haven't had success in Firefox, only Chromium.
| ako wrote:
| Recently I've been using my iPad as a transcription "keyboard"
| for my laptop when writing documents in Dropbox paper. Open the
| document on both computers, then use dictation to enter all
| text, and then correct if necessary on the laptop. I'm
| expecting to
| move to iPad only when I can use stage manager with an external
| monitor.
|
| Dictation is also great when writing in a foreign language: I
| speak German ok-ish, but writing is harder. Dictation helps
| me write more correct German.
| Namidairo wrote:
| Shouldn't the language and model inputs be dropdowns instead of
| text input?
|
| I'm going to hope/assume you're doing some sort of sanitisation
| on those inputs.
|
| Additionally, wouldn't you lose the language detection that's
| done for no language input? (IIRC, it uses the first 30 seconds
| to detect language if you don't specify one)
| mayeaux wrote:
| Yeah, someone submitted a PR for those to be fixed; I'm
| just wary about restarting the server because I haven't set
| up a way to reboot without losing the websockets.
|
| Those inputs should all error unless they are a valid
| value.
|
| Yes, if nothing is input it will automatically detect the
| language based on the first 30s of audio
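|
| That part of Whisper's Python API looks roughly like this
| (adapted from their README):
|
|     import whisper
|
|     model = whisper.load_model("base")
|     audio = whisper.pad_or_trim(whisper.load_audio("a.mp3"))
|     mel = whisper.log_mel_spectrogram(audio).to(model.device)
|     _, probs = model.detect_language(mel)
|     print(max(probs, key=probs.get))  # e.g. 'en'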
| Namidairo wrote:
| I wasn't even aware that it was on GitHub, which I suppose is
| an issue in itself.
| TheBrokenRail wrote:
| > I'm just wary about restarting the server because I
| haven't set up a way to reboot without losing the
| websockets
|
| Wait what. Not being able to safely restart the server sounds
| like a disaster waiting to happen.
| mayeaux wrote:
| This was just a personal project a couple hours ago, so
| it's not set up properly to do safe reboots and a lot of
| other things. I was just using it locally and now it's out
| in the wild; it will take some time to get everything
| refined and professional
| bergenty wrote:
| Isn't this very expensive to host? Are you aware this
| could cost A LOT?
| jazzyjackson wrote:
| in another comment they state:
|
| > I'm just running this off of a 2x RTX A6000 server on
| Vast.ai at the moment, about $1.30/h
|
| whether that's a lot is a matter of perspective
| jazzyjackson wrote:
| > losing the websockets
|
| users would lose the session and have to start over, not
| the end of the world
| mayeaux wrote:
| I'm not even using sessions, just localStorage and
| websockets lol.
| mayeaux wrote:
| Rebooted the app due to queue upgrades; if you had a
| pending upload, please reupload it, thanks!
| mayeaux wrote:
| I've made some updates and the queue works how I wanted.
| I rebooted the server, and I think I will leave it like
| this for a while
| tomcam wrote:
| Congratulations on taking this to completion and announcing here!
| Love your approach to this!
| mayeaux wrote:
| Thanks! I wrote it for myself over a weekend and have
| really enjoyed it ever since, so I'm glad others were able
| to get something out of it! It seems to run pretty well,
| but I have some improvements planned: first, I will take
| the Whisper output and feed it to you while you're in the
| queue, so you can see transcriptions progressing. It will
| be pretty trivial to implement, and I'm feeling bored in
| the queue at the site atm, so that's the next killer
| feature lol
| miki123211 wrote:
| If somebody wants to run Whisper for dirt cheap, vast.ai is the
| way to go.
|
| I can usually get an (unverified) 1x RTX 3090 instance for about
| $0.10/hr, and that processes audio at something like 1.5X speeds.
| Unverified instances do crash once in a while, but as long as you
| back up the output every few hours, it's fine, you just set up a
| new one in case something happens. I wouldn't use this for
| confidential company meetings, but it's good enough for podcasts,
| YouTube videos and other public or semi-public stuff.
| j45 wrote:
| Exciting to see.
|
| Curious if there was a benefit to using Whisper over
| something like Vosk, which can transcribe on a mobile
| device pretty decently.
|
| Whisper has other interesting functionality but for straight
| transcription it seems a bit heavy. Still learning about it and
| putting it through its paces.
| nshm wrote:
| We did a comparison of recent Vosk and Whisper models here:
|
| https://alphacephei.com/nsh/2022/10/22/whisper.html
|
| In general, Whisper is more accurate but much more
| resource-heavy. Vosk runs on a single core while Whisper
| needs all CPU cores.
|
| The accuracy difference for clean speech between Vosk-small
| and Whisper tiny is 2-3% absolute, 20% relative. Not sure
| how important that is; I would claim it is not that
| critical.
|
| The numbers there are for original Whisper. Whisper.cpp, as
| recommended here, is actually 10% worse than stock Whisper
| due to speed-motivated tradeoffs. Not that simple.
|
| Vosk is a streaming design: you get results with a minimum
| latency of 200ms, while Whisper requires you to wait a
| significant amount of time. If you refactor Whisper for
| lower latency, you will lose a lot of its accuracy
| advantage. Latency is very important for interactive
| applications like assistants.
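|
| To make "streaming" concrete, the standard Vosk loop is:
|
|     import wave
|     from vosk import Model, KaldiRecognizer
|
|     wf = wave.open("speech.wav", "rb")
|     rec = KaldiRecognizer(Model("model"), wf.getframerate())
|     while True:
|         data = wf.readframes(4000)
|         if not data:
|             break
|         if rec.AcceptWaveform(data):  # results arrive as
|             print(rec.Result())       # the audio streams in
|     print(rec.FinalResult())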
|
| Whisper is multilingual and has punctuation, which is
| clearly a good advantage. It can also use context properly,
| improving results on long recordings.
|
| So on mobile, Vosk is still actually a viable option, as
| are many other mobile-focused engines.
|
| For server-based transcription Whisper is certainly better,
| but not much better than Nvidia NeMo, for example. Not that
| much publicity for the latter, though.
| Void_ wrote:
| I'm sorry to self promote again - but: https://whispermemos.com
|
| I'm in love with the idea of pressing a button on my Lock
| Screen and getting a perfect transcription in my inbox.
|
| Also, I just added emoji summarization in the email
| subject: a small visual reminder of what your memo was
| about.
|
| I hope this is useful to someone!
| o_____________o wrote:
| That's a cool idea. How about integrations like GitHub or
| Notion that would write out a markdown file?
| fire wrote:
| Is the app open source? I'm on Android, so :c
| nsriv wrote:
| If you don't mind non-OSS, the Google Recorder App is
| incredible.
| krick wrote:
| I didn't manage to transcribe anything (it just doesn't
| remember that I submitted anything), but whatever, I didn't
| need to anyway. I just wanted to ask: how good is Whisper
| with non-English? At least the "major" languages, like
| German, French, Russian, Spanish?
| smallerfish wrote:
| Spanish is great.
| mayeaux wrote:
| I've had people test it, and usually with the 'medium' and
| 'large' models it works really well. Honestly, I just use
| the large model for everything, because you might as well
| have the best quality if you're going to make the effort.
| stefan_ wrote:
| I've been using it on lots of Russian stuff and it's great,
| even translates for you.
| gdz wrote:
| WER breakdown by language on the Fleurs dataset, using the large
| model: https://github.com/openai/whisper/blob/main/language-
| breakdo...
|
| from https://github.com/openai/whisper#available-models-and-
| langu...
| jerpint wrote:
| I've tried it on French and results were pretty good
| braindead_in wrote:
| This is cool. We have integrated Whisper with our
| human-in-the-loop tech at Scribie [1] and the results have
| offer free credits if you want to try it out.
|
| [1] https://scribie.com
| donqu1xote1 wrote:
| This is an exciting one! We are building an open-source
| low-code alternative to Retool, and I think we could build
| an integration with your project! Take a look at ours and
| see if you want to collaborate. Here you go:
| https://github.com/illacloud/illa-builder
| bl4ckneon wrote:
| Please stop spamming your link everywhere. You know people can
| see your post history?
| si_164 wrote:
| I've been running Whisper in the terminal with Python, and
| I've found it surprisingly accurate with transcription,
| even from Chinese.
|
| Just gave your site a try - nicely done. One piece of
| feedback: it would be great to have a progress indicator on
| the processing page; I have no idea what stage it's at or
| how much longer I need to wait.
| mayeaux wrote:
| Yeah, Whisper is top of the line - they posted their
| performance compared to the industry standard and it's
| right in there at the top.
|
| It _should_ show the data while processing... it's set up
| to just take whatever stdout/stderr comes back from Whisper
| and send it directly to the frontend via websockets. I'm
| surprised you got stuck there :thinking:
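|
| (The app is node+express, but the plumbing amounts to
| this - sketched here in Python, with `ws` standing in for
| the websocket connection:)
|
|     import subprocess
|
|     proc = subprocess.Popen(
|         ["whisper", "upload.mp3", "--model", "large"],
|         stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
|         text=True,
|     )
|     for line in proc.stdout:  # whisper prints timestamped
|         ws.send(line)         # segments as it goes; relay
|                               # each line to the browser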
| mayeaux wrote:
| Okay that should be my last reboot for a while, I've got it ready
| how I'd like it for now, feel free to give any feedback!
| marcooliv wrote:
| Good luck with your AWS bill, haha. But seriously: how?
| mayeaux wrote:
| I am just paying for a somewhat expensive server, and I
| love how it's really fast, but I also have a lot of free
| GPU time, so might as well let others use it too lol. It's
| an experiment to see if people will use it productively or
| if someone will abuse it and ruin it for others lol
| marcooliv wrote:
| I appreciate your initiative!
| mayeaux wrote:
| Thanks! Whisper is a lot of fun, but it didn't take long
| before I wanted to build a frontend. And then I built
| something that I think came out super nice, so why not
| share it with people? I used to pay $100/month for
| transcriptions and this works a lot better for me, so I
| might as well open-source transcription if I can. But I
| give all the credit to Whisper - that module they put out
| is amazing
| sali0 wrote:
| Very generous of you. I made a similar free service 3
| years ago using much worse tools, and it's so cool to see
| Whisper making it all so much better and more efficient.
| Thanks for releasing it for free
| mayeaux wrote:
| No problem! I am just seeing how it runs. I might throw
| up a referral link to Vast and put up a tutorial on how
| to host your own service - maybe that can offset the cost
| a bit? The current server is $700/month; maybe it could
| just run off donations, who knows
| bayramn wrote:
| $700/month? Why not DigitalOcean? I am new to Python and
| ML, curious to know why.
| knicholes wrote:
| > someone will abuse it and ruin it for others
| mayeaux wrote:
| Maybe I will put in some mechanism to prevent that, but
| for now I just want to see if people find it useful. I
| also have the code open source and will write tutorials
| for people to put up their own instances as well
| rexreed wrote:
| I'd love to read a tutorial on how to do this for myself.
| trompetenaccoun wrote:
| It would be more useful if one could directly paste links
| to videos online as well. But yeah, in general this is
| extremely useful. I'm looking forward to video site
| integration. It would be great if YouTube could finally
| retire their horrible auto caption function for something
| that actually works. Being able to easily watch media in
| different languages from around the world will be an
| absolute game changer.
| mayeaux wrote:
| Also, I have tested that (auto download) with youtube-dl;
| it works fine but I haven't put it into the frontend yet.
| May as well though - it helps a lot on your own instance,
| so you don't have to download the video first and then
| upload it
| mayeaux wrote:
| I also plan to support automatic language translation - I
| have that working locally already, actually. I work for
| one of the big alt-video platforms, and rumour has it that
| I will be shipping this feature for them soon (auto
| transcription with auto-translated subtitles)
| trompetenaccoun wrote:
| Really cool, that would be a killer feature. Definitely
| post here when this gets released!
| mayeaux wrote:
| Yeah, it's all ready to go using LibreTranslate - they
| have about 25 languages. Maybe I'll finish that this
| weekend and put it up; it's really inexpensive to make the
| translations compared to making the original transcription,
| so may as well. Coming soon!
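|
| Their API is about as simple as it gets - roughly this,
| against a local instance (a sketch, not my final code):
|
|     import requests
|
|     r = requests.post("http://localhost:5000/translate",
|                       json={"q": "Hello, world!",
|                             "source": "en", "target": "es",
|                             "format": "text"})
|     print(r.json()["translatedText"])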
| tipsytoad wrote:
| Would love to locally host this, do you have a source?
| mayeaux wrote:
| No docs or anything yet but:
| https://github.com/mayeaux/generate-subtitles
| Super_Jambo wrote:
| thank you!
| giantg2 wrote:
| Not even abuse, but just intensive use cases - like the guy
| who posted a few days ago about recording and transcribing
| all day.
| mayeaux wrote:
| I set up the server to only transcribe two files at a
| time, so yeah, someone could abuse it for sure with two
| big uploads and stick everyone else in the queue. But even
| a 3 hour video transcribes with the large model in about
| 30 minutes, so it wouldn't be too bad. Hopefully everyone
| is conscientious enough not to do that; so far nobody has
| abused it, which is cool.
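|
| (The server is node, but the queueing logic amounts to a
| two-slot semaphore - sketched in Python below; run_whisper
| is a hypothetical wrapper:)
|
|     import asyncio
|
|     gpu_slots = asyncio.Semaphore(2)  # 2 concurrent jobs
|
|     async def handle_upload(path):
|         async with gpu_slots:        # later uploads wait
|             await run_whisper(path)  # hypothetical wrapper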
| kkielhofner wrote:
| Me again - why two at a time? In my initial testing with
| whisper-asr-webservice and my RTX 3090 I could pretty
| easily throw ~10 different files at it simultaneously as
| there is some natural staggering between API entry, CPU
| conversion/resampling/transcoding of audio, the actual
| audio length, network effects like upload speed, etc.
|
| I also implemented some anti-abuse-ish features between
| traefik and Cloudflare that should help it stand up
| better in the face of bad actors abusing it.
|
| Certainly not something to necessarily depend on but I
| thought I'd mention it.
| oefrha wrote:
| > I am just paying for a somewhat expensive server and I
| love how it's really fast but also I have a lot of free
| GPU time so might as well let others use it too lol.
|
| They are donating some spare capacity.
| jw1224 wrote:
| Free startup idea: Use Whisper with pyannote-audio[0]'s speaker
| diarization. Upload a recording, get back a multi-speaker
| annotated transcription.
|
| Make a JSON API and I'll be your first customer.
|
| [0] https://github.com/pyannote/pyannote-audio
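|
| The naive glue looks something like this (a sketch; it
| assumes access to the pretrained pyannote pipeline, and
| the speaker-matching heuristic is crude):
|
|     import whisper
|     from pyannote.audio import Pipeline
|
|     diarization = Pipeline.from_pretrained(
|         "pyannote/speaker-diarization")("meeting.wav")
|     segments = whisper.load_model("large").transcribe(
|         "meeting.wav")["segments"]
|
|     for seg in segments:
|         # label each segment with the speaker whose turn
|         # contains its midpoint (crude but simple)
|         mid = (seg["start"] + seg["end"]) / 2
|         speaker = next(
|             (who for turn, _, who in
|              diarization.itertracks(yield_label=True)
|              if turn.start <= mid <= turn.end),
|             "UNKNOWN")
|         print(speaker, seg["text"])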
| mayeaux wrote:
| This is my first time hearing about it, but it looks good.
| Even if it does detect separate entities talking, how does
| it label them in a way that's helpful/useful for you? I
| guess it comes out as 'Speaker 1', 'Speaker 2', etc. in the
| end, and you can find/replace the speakers with the actual
| people?
| kkielhofner wrote:
| I think there's been talk of doing speaker diarization
| with whisper-asr-webservice[0], which is also written in
| Python and should be able to make use of goodies such as
| pyannote-audio, py-webrtcvad, etc.
|
| Whisper is great but at the point we get to kludging various
| things together it might start to make more sense to use
| something like Nvidia NeMo[1] which was built with all of this
| in mind and more.
|
| [0] - https://github.com/ahmetoner/whisper-asr-webservice
|
| [1] - https://github.com/NVIDIA/NeMo
| coder543 wrote:
| It's not as if people aren't trying to do that:
| https://github.com/openai/whisper/discussions/264
|
| I tried out this notebook about a month ago, and it was
| _rough_. After spending an evening improving it, I got
| everything "working", but pyannote was not reliable. I tried
| it against an hour-ish audio sample, and I found no way to tune
| pyannote to keep track of ~10 speakers over the course of that
| audio. It would identify some of the earlier speakers, but then
| it felt like it lost attention and would just start labeling
| every new speaker as the same speaker. There is an option to
| force the minimum number of speakers higher, and that just
| caused it to split some of the earlier speakers into multiple
| labels. It did nothing to address the latter half of the audio.
|
| So, sure, someone should continue working on putting the pieces
| together, and I'm sure the notebook in the discussion I linked
| has probably improved since then, but I think pyannote itself
| needs some improvement first.
|
| Sadly, I think using separate models for transcription and
| diarization ends up being clunky to the point that it won't
| ever be polished, no matter how good pyannote might get. If you
| have a podcast-like environment where people get excited and
| start talking over each other, then even if pyannote correctly
| identifies all of the speakers during the overlapping segments
| and when they spoke... Whisper cannot be used to separate
| speakers. You end up with either duplicate transcripts
| attributed to everyone involved, or something worse.
| Impressively, I _have_ seen pyannote do exactly that, when
| it's working.
|
| At the end of the day, I think someone is going to need to
| either train Whisper to also perform diarization, or we're
| going to need to wait until someone else open sources a model
| that does both transcription and diarization simultaneously.
| Unfortunately, it seems like most of these really big advances
| in ML only happen when a corporate benefactor is willing to
| dump money into the problem and then release the result, so we
| might be waiting awhile. I'm trying to learn more about machine
| learning, but I'm not at the point where I have any realistic
| chance of making such an improvement to Whisper. Maybe someone
| else around here can prove me wrong by just making it happen.
| password4321 wrote:
| Speaker recognition is another piece that isn't usually as
| high a priority as recognizing the speech.
| mayeaux wrote:
| It's a new thing to me; I hadn't really considered it. Do
| they have that for movies and stuff? I can't think of a
| clear case where I've seen it
| [deleted]
| ramraj07 wrote:
| And you expect the API to be free? If not, why not use one of a
| million other such services?
| mayeaux wrote:
| Well, the code is open source. I don't know what I plan to
| do with this - it depends on how people like it - but in
| the meantime you can use it to transcribe stuff for free,
| which is a victory unto itself
| pessimizer wrote:
| It's the idea that's free, not the API.
| mayeaux wrote:
| Yeah, I apologize, the queue is a little messed up. It's
| not showing your progress properly, and it's not stopping
| people's processing if they leave (their websocket dies).
| I'm going to fix these and reboot the server, and the
| experience will be a lot better, sorry. It is working
| properly though, transcribing all this stuff, but the queue
| needs some TLC, brb.
| seligman99 wrote:
| It broke when I tried to feed it an entire podcast file,
| but still, I took this as a push to try out Whisper AI for
| myself - turns out it's easier to use than I thought. Long
| story short, I used it to transcriptify a podcast:
|
| https://scotts-podcasts.s3.amazonaws.com/temp/whisper/Intern...
|
| Not sure if there's a use for this that's not me, but I like the
| idea of having subtitles for a podcast I'm listening to.
| KevinBenSmith wrote:
| In that case you could have a look at the Snipd podcast app.
| They have Whisper built in :)
| JanSt wrote:
| What tool did you use for the player-text presentation?
| seligman99 wrote:
| Mostly just vanilla JS on that page, and a tiny bit of Python
| glue code to turn the WebVTT output from Whisper into a data
| format for the JS.
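|
| The glue is tiny - roughly this, using the webvtt-py
| package (a sketch of the idea, not my exact script):
|
|     import json
|     import webvtt
|
|     cues = [{"start": c.start, "end": c.end, "text": c.text}
|             for c in webvtt.read("podcast.vtt")]
|     with open("cues.json", "w") as f:
|         json.dump(cues, f)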
| bergenty wrote:
| It's stunning how it just looks like human-written copy.
| The punctuation is close to perfect and I couldn't find any
| transcription mistakes.
| n8cpdx wrote:
| This is a cool project. I've been very happy with whisper as an
| alternative to otter; it works better and solves real problems
| for me.
|
| I feel compelled to point out whisper.cpp. It may be
| cheaper for the author to run, and it's relevant for
| others.
|
| I was running Whisper on a GTX 1070 to get decent
| performance; it was terribly slow on an M1 Mac. Whisper.cpp
| has comparable performance to the 1070 while running on the
| M1 CPU. It is easy to build and run and well documented.
|
| https://github.com/ggerganov/whisper.cpp
|
| I hope this doesn't come off the wrong way, I love this project
| and I'm glad to see the technology democratized. Easily
| accessible high-quality transcription will be a game changer for
| many people and organizations.
| emadda wrote:
| How long would whisper.cpp take to transcribe 2 hours of audio
| on M1?
| kevinak wrote:
| Not sure about M1, but on the MacBook Pro 14" with an M1 Max
| using 8 threads I transcribed a 44 minute podcast in 16
| minutes. So about 3x "real time" speed.
| 19h wrote:
| What model are you using? I guess large, as my M1 Max takes
| about 1.4 min for a 4 min file (35% of total time)?
| kevinak wrote:
| Yep, large model.
| tpetry wrote:
| Thanks for sharing! I was looking for an M1 solution weeks
| ago and couldn't find a working one. Will try that one now!
| Looking around for servers with GPUs etc. had stopped me at
| just playing around with it, as I got overwhelmed with
| options.
| kkielhofner wrote:
| What resources do you have for hosting this?
|
| I set up a whisper-asr-api backend this week with gobs of CPU and
| RAM and an RTX 3090. I'd be interested in making the API endpoint
| available to you and working on the overall architecture to
| spread the load, improve scalability, etc.
|
| Let me know!
| mayeaux wrote:
| I'm just running this off of a 2x RTX A6000 server on Vast.ai
| at the moment, about $1.30/h and then using nginx on another
| server to reverse proxy it to Vast
|
| Open an issue on the Github repo and we can collab for sure!:
| https://github.com/mayeaux/generate-subtitles/issues
| kkielhofner wrote:
| Cool - will do!
|
| Through a series of events I'm in the beneficial position of
| my hosting costs (real datacenter, gig port, etc) being zero
| and the hardware has long since paid for itself. I'm almost
| just looking for ways to make it more productive at this
| point.
| mayeaux wrote:
| Hey, I know the feeling, I felt bad when I had my GPU just
| sitting there and it's just a little Vast server lol. If
| you want to use your hardware to run this software I'd be
| more than happy to help get it setup!
| kkielhofner wrote:
| For what it's worth, my approach has been running a
| tweaked whisper-asr-webservice[0] behind traefik behind
| Cloudflare. Traefik enables end to end SSL (with
| Cloudflare MITM, I know) and also helps put the brakes on
| a little so even legitimate traffic that makes it through
| Cloudflare gets handled optimally and gracefully. I could
| easily deploy your express + node code instead (and
| probably will anyway because I just like that approach
| more than python).
|
| Anyway, I'll be making an issue soon!
|
| [0] - https://github.com/ahmetoner/whisper-asr-webservice
| mayeaux wrote:
| Right on, looking forward to it! Yeah, I saw that module
| and was planning to use it, but I just wrote up an
| Express/Node implementation first and never really looked
| back. Looking forward to collabing - I will await your
| issues, cheers!
| notpushkin wrote:
| Should this have the "Show HN" tag?
|
| Also, labels' [for] attributes are all "file" instead of
| "language" and "model", so all labels trigger the file selection
| dialog on click :-)
|
| UPD: https://github.com/mayeaux/generate-subtitles/pull/1
| mayeaux wrote:
| Merged, thanks. Do I just add 'Show HN:' to the beginning of
| the title?
| notpushkin wrote:
| I think so, yeah! https://news.ycombinator.com/showhn.html
| mayeaux wrote:
| Done, thanks! I'm at the top, feels good! I got zero
| traction on Reddit, so I'm glad Hacker News was there for
| me lol
| mayeaux wrote:
| I just rebooted the server. Now if the websocket
| disconnects (the person closes the browser) it will kill
| the processing and move the queue along, which should help
| unclog it. I'm going to add a couple more queue touchups
| and then it should be stable again (no reboots), but it's
| running well in the meantime
___________________________________________________________________
(page generated 2022-11-19 23:02 UTC)