[HN Gopher] FFmpeg 8.0 adds Whisper support
___________________________________________________________________
FFmpeg 8.0 adds Whisper support
Author : rilawa
Score : 766 points
Date : 2025-08-13 10:19 UTC (12 hours ago)
(HTM) web link (code.ffmpeg.org)
(TXT) w3m dump (code.ffmpeg.org)
| ggap wrote:
| Very interesting to see this!
| zzsshh wrote:
| Does this finally enable dynamically generating subtitles for
| movies with AI?
| diggan wrote:
| Finally? I think VLC demo'd this a while ago at some conference
| where they had a table, if I remember correctly.
| SSLy wrote:
| VLC and ffmpeg are unrelated projects
| demurgos wrote:
| I'm not very familiar with them, but I always assumed that
| there is a lot of overlap between the maintainers of both
| projects.
| SSLy wrote:
| Well, they are just unrelated. VLC has a plugin to access
| ffmpeg codecs via libav*, that's about it.
| guipsp wrote:
| They are not completely unrelated. There is significant
| overlap. FFmpeg also uses libs from VLC.
| mmmpetrichor wrote:
| I've been waiting a while now for automatically translated
| subtitles in VLC. I thought it would be here by now. I'm
| probably underestimating the difficulty, but I'm surprised
| some video player hasn't done it by now (as far as I know).
| jeroenhd wrote:
| A lot of subtitles from commercial media use a subtitle
| format that's essentially a bitmap that the video player
| overlays on top of the video. There are tools to decode
| this using OCR, but it's not something I'd enable by
| default.
|
| For text/srt subtitles, translation would probably be
| easier. There's a plugin for that already if you're okay
| with online translation services:
| https://github.com/nopium/vlc-trans-lua
| jeroenhd wrote:
| Docs say:
|
|     destination
|         If set, the transcription output will be sent to the
|         specified file or URL (use one of the FFmpeg AVIO
|         protocols); otherwise, the output will be logged as info
|         messages. The output will also be set in the
|         "lavfi.whisper.text" frame metadata. If the destination
|         is a file and it already exists, it will be overwritten.
|
|     format
|         The destination format string; it could be "text" (only
|         the transcribed text will be sent to the destination),
|         "srt" (subtitle format) or "json". Default value: "text"
|
| I don't know if this can embed the subtitles, but it does
| support generating accompanying srt files.
|
| Of course, you could already do that by just manually calling
| whisper on files, but now you don't need to export parts or
| transformed media files to feed into whisper.
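|
| As a rough sketch (untested; the option names come from the filter
| docs above, and ggml-base.en.bin stands in for whichever
| whisper.cpp GGML model you've downloaded), a one-pass srt export
| could look like:
|
|     ffmpeg -i input.mp4 -vn \
|       -af whisper=model=ggml-base.en.bin:destination=out.srt:format=srt \
|       -f null -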
| regularfry wrote:
| If you have enough processing power. Without a GPU it's going
| to lag.
| KeplerBoy wrote:
| Whisper is pretty fast.
| jeroenhd wrote:
| In my experience, a small/tiny whisper model has pretty okay
| English decoding speed on something relatively modern even
| without GPU support. There's a bunch of latency in the
| process (because of technological limitations) but the
| optimised C++ version shouldn't pose too much of a problem
| unless you're running in power saving mode. Battery life may
| be a problem on older laptops, though.
| boutell wrote:
| Shut off the broken bot filter so we can read it please
| diggan wrote:
| Took my iPhone 12 Mini a whole 0.1 seconds to pass it. What
| hardware/OS are you using?
| johnisgood wrote:
| Took me 8 seconds on my shitty desktop.
| londons_explore wrote:
| Took about 30 secs for me (5 yr old intel cpu). Looked like
| there was a progress bar, but it didn't progress. Maybe the
| difficulty varies depending on IP address?
| jeroenhd wrote:
| Anubis has config for that:
| https://anubis.techaro.lol/docs/admin/policies#request-
| weigh...
|
| It's up to the site admin to configure it that way, but
| it's possible some IP ranges/user agents are more often
| used by bots and therefore have an increased weight.
|
| For old browsers there's also an option to use meta refresh
| instead of JS (https://anubis.techaro.lol/docs/admin/config
| uration/challeng...) but that's quite a recent addition and
| not enabled by default.
| diggan wrote:
| > Maybe the difficulty varies depending on IP address?
|
| I'm currently roaming in Finland with a Spanish SIM so
| would have expected the opposite in that case.
| ta1243 wrote:
| My i5-6200U with Firefox/Linux is about 10 years old. I
| use a variety of ad blocking and fingerprint blocking
| techniques. Cloudflare often complains and blocks me.
|
| This page loaded pretty much instantly (certainly in the
| time it took to switch to the background tab I loaded in).
| But then ffmpeg is written by old school engineers with old
| school ways of working. Their social media accounts are a
| hilarity of trolling worthy of slashdot at its peak.
| politelemon wrote:
| Took me zero seconds to be blocked with invalid response
| miloignis wrote:
| It also instantly blocks me on GrapheneOS, both Firefox and
| Vanadium. Very odd, as I've never had an issue with Anubis
| before.
| shaky-carrousel wrote:
| GrapheneOS here, with Vanadium in incognito, it doesn't
| block me, both in wifi and in mobile. Maybe it was a
| temporary hiccup.
| miloignis wrote:
| Thanks for checking! Incognito blocks me too, no idea
| what's up. Maybe I'm getting tripped up by IP reputation
| or something (though I shouldn't, normal residential
| connection).
| blahyawnblah wrote:
| The stock chrome browser Google news uses
| jeroenhd wrote:
| Check out commit 13ce36fef98a3f4e6d8360c24d6b8434cbb8869b from
| https://git.ffmpeg.org/ffmpeg.git if your web browser doesn't
| support Javascript. The linked page is just a git viewer for
| that specific commit.
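|
| Roughly (plain git, using the repo URL and commit hash above):
|
|     git clone https://git.ffmpeg.org/ffmpeg.git
|     cd ffmpeg
|     git show 13ce36fef98a3f4e6d8360c24d6b8434cbb8869b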
| yorwba wrote:
| Or read the documentation for the new whisper filter:
| https://ffmpeg.org/ffmpeg-filters.html#whisper-1
| jeroenhd wrote:
| That also works. I assumed the ffmpeg website would also be
| behind Anubis if the git server is, but it doesn't actually
| seem to be.
| majewsky wrote:
| Anubis is not all that useful for static websites since
| serving them does not generate high load (unlike when a
| bot traverses a Git server UI).
| QuantumNomad_ wrote:
| Archived snapshots of the linked page:
|
| https://web.archive.org/web/20250813104007/https://code.ffmp...
|
| https://archive.is/dmj17
|
| You can read it on one of these without having to pass that
| specific bot check
| majewsky wrote:
| From experience, these bot filters are usually installed
| because the site would be down entirely without rejecting AI
| scrapers, so the argument to shut it off to improve usability
| is rather silly.
| superkuh wrote:
| They don't need to shut off Anubis, they just need to configure
| it beyond the defaults. If they turned on the meta-refresh
| based challenge then all browsers could access it while still
| keeping most of the bots away. But few people ever configure
| these things and just accept the broken defaults.
|
| With the current broken default config my browser can't even
| run the JS challenge due to it using unsupported bleeding edge
| JS features.
| xena wrote:
| Hi, can you please paste the error message you get? This
| should be using features that are supported widely as of 2022
| and I regularly test on Firefox LTS.
| kwar13 wrote:
| Fantastic! I am working on a speech-to-text GNOME extension that
| would immensely benefit from this.
|
| https://github.com/kavehtehrani/gnome-speech2text
| dotancohen wrote:
| Why is this a Gnome extension? I would love to use this in KDE.
| guipsp wrote:
| Likely because they are a GNOME user and the APIs are DE
| specific.
| lawik wrote:
| I wonder if they'll be satisfied there or add a chunk of others
| now that they've started. Parakeet is supposed to be good?
|
| Should they add Voice Activity Detection? Are these separate
| filters or just making the whisper filter more fancy?
| shrx wrote:
| Voice Activity Detection support is already included.
| voxadam wrote:
| Am I correct in understanding that Whisper is a speech
| recognition AI model originally created by OpenAI?
|
| https://en.wikipedia.org/wiki/Whisper_(speech_recognition_sy...
| acidburnNSA wrote:
| Yes, according to the comments in the patch, you are correct.
| kwar13 wrote:
| yes.
| johnisgood wrote:
| Yes.
|
| From the documentation:
|
| > It runs automatic speech recognition using the OpenAI's
| Whisper model.
| voxadam wrote:
| Thanks, I was being tripped up by DDOS protection on
| code.ffmpeg.org for a minute and couldn't read the patch. The
| combo of Firefox and the fact that Quantum/Lumen/CenturyLink
| seems to get off by rotating my dynamic IP for no reason
| occasionally triggers various DDoS protection schemes.
| johnisgood wrote:
| No problem. :) Yeah, it took me 8 seconds to get through.
| It seems your issue was worse.
| Maxious wrote:
| yep, there's a c++ implementation to run it
| https://github.com/ggml-org/whisper.cpp
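|
| A minimal way to try it standalone (a sketch; assumes you've built
| the repo with cmake and downloaded a model, so paths may differ):
|
|     ./build/bin/whisper-cli -m models/ggml-base.en.bin \
|         -f samples/jfk.wav --output-srt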
| oezi wrote:
| Isn't WhisperX the canonical choice for running Whisper?
| sampullman wrote:
| Maybe for running locally? whisper.cpp is nice because you
| can embed it pretty easily in apps for various targets like
| iOS, OSX, Android, wasm, etc.
| 0points wrote:
| While whisper and whisperx are Python implementations,
| whisper.cpp wins the benchmarks.
| AlienRobot wrote:
| I think so, if I remember correctly PotPlayer also supports it
| for automatic subtitling.
| cess11 wrote:
| Kind of, it's a family of audio transcription models.
|
| https://huggingface.co/search/full-text?q=whisper
| londons_explore wrote:
| Does this have the ability to edit historic words as more info
| becomes available?
|
| Eg. If I say "I scream", it sounds phonetically identical to "Ice
| cream".
|
| Yet the transcription of "I scream is the best dessert" makes a
| lot less sense than "Ice cream is the best dessert".
|
| Doing this seems necessary to have both low latency _and_ high
| accuracy. Things like transcription on Android do that: you can
| see the guesses adjust as you talk.
| ph4evers wrote:
| Whisper works on 30 second chunks. So yes it can do that and
| that's also why it can hallucinate quite a bit.
| jeroenhd wrote:
| The ffmpeg code seems to default to three-second chunks
| (https://ffmpeg.org/ffmpeg-filters.html#whisper-1):
|
|     queue
|         The maximum size that will be queued into the filter
|         before processing the audio with whisper. Using a small
|         value the audio stream will be processed more often, but
|         the transcription quality will be lower and the required
|         processing power will be higher. Using a large value
|         (e.g. 10-20s) will produce more accurate results using
|         less CPU (as using the whisper-cli tool), but the
|         transcription latency will be higher, thus not useful to
|         process real-time streams. Consider using the vad_model
|         option associated with a large queue value. Default
|         value: "3"
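|
| So for offline files, the docs' suggestion of a large queue plus
| VAD might look roughly like this (untested sketch; the model and
| VAD filenames are placeholders for whatever files you've
| downloaded, and the output goes to the log/frame metadata unless
| a destination is set):
|
|     ffmpeg -i talk.mp4 -vn \
|       -af whisper=model=ggml-base.en.bin:queue=20:vad_model=silero-vad.bin \
|       -f null -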
| londons_explore wrote:
| so if "I scream" is in one chunk, and "is the best dessert"
| is in the next, then there is no way to edit the first
| chunk to correct the mistake? That seems... suboptimal!
|
| I don't think other streaming transcription services have
| this issue since, whilst they do chunk up the input, past
| chunks can still be edited. They tend to use "best of N"
| decoding, so there are always N possible outputs, each with
| a probability assigned, and as soon as one word is the same
| in all N outputs then it becomes fixed.
|
| The internal state of the decoder needs to be duplicated N
| times, but that typically isn't more than a few kilobytes
| of state so N can be hundreds to cover many combinations of
| ambiguities many words back.
| miki123211 wrote:
| The right way to do this would be to use longer,
| overlapping chunks.
|
| E.g. do transcription every 3 seconds, but transcribe
| the most recent 15s of audio (or less if it's the
| beginning of the recording).
|
| This would increase processing requirements
| significantly, though. You could probably get around some
| of that with clever use of caching, but I don't think any
| (open) implementation actually does that.
| superluserdo wrote:
| I basically implemented exactly this on top of whisper
| since I couldn't find any implementation that allowed for
| live transcription.
|
| https://tomwh.uk/git/whisper-chunk.git/
|
| I need to get around to cleaning it up but you can
| essentially alter the number of simultaneous overlapping
| whisper processes, the chunk length, and the chunk
| overlap fraction. I found that the `tiny.en` model is
| good enough with multiple simultaneous listeners to be
| able to have highly accurate live English transcription
| with 2-3s latency on a mid-range modern consumer CPU.
| dylan604 wrote:
| If real-time transcription is so bad, why force it to be
| real-time? What happens if you give it a 2-3 second
| delay? That's pretty standard in live captioning. I get
| real-time being the ultimate goal, but we're not there
| yet. So, working within the current limitations, is piss-
| poor transcription in real time really more desirable
| than better transcription with a 2-3 second delay?
| llarsson wrote:
| Attention is all you need, as the transformative paper
| (pun definitely intended) put it.
|
| Unfortunately, you're only getting attention in 3 second
| chunks.
| no_wizard wrote:
| That's because at the end of the day this technology
| doesn't "think". It simply holds context until the next
| thing without regard for the previous information
| abdullahkhalids wrote:
| Which other streaming transcription services are you
| referring to?
| londons_explore wrote:
| Googles speech to text API:
| https://cloud.google.com/speech-to-text/docs/speech-to-
| text-...
|
| The "alternatives" and "confidence" field is the result
| of the N-best decodings described elsewhere in the
| thread.
| jeroenhd wrote:
| I don't know an LLM that does context based rewriting of
| interpreted text.
|
| That said, I haven't run into the icecream problem with
| Whisper. Plenty of other systems fail but Whisper just
| seems to get lucky and guess the right words more than
| anything else.
|
| The Google Meet/Android speech recognition is cool but
| terribly slow in my experience. It also has a tendency to
| over-correct for some reason, probably because of the
| "best of N" system you mention.
| 0points wrote:
| So, yes, and also no.
| anonymousiam wrote:
| Whisper is excellent, but not perfect.
|
| I used Whisper last week to transcribe a phone call. In the
| transcript, the name of the person I was speaking with (Gem)
| was alternately transcribed as either "Jim" or "Jem", but
| never "Gem."
| JohnKemeny wrote:
| Whisper supports adding a context, and if you're
| transcribing a phone call, you should probably add
| _"Transcribe this phone call with Gem"_, in which case it
| would probably transcribe more correctly.
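|
| With OpenAI's reference CLI that's the --initial_prompt flag
| (untested sketch; whisper.cpp's whisper-cli has a similar
| --prompt option):
|
|     whisper call.mp3 --model small \
|         --initial_prompt "Phone call with Gem"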
| ctxc wrote:
| Thanks John Key Many!
| t-3 wrote:
| That's at least as good as a human, though. Getting to
| "better-than-human" in that situation would probably
| require lots of potentially-invasive integration to allow
| the software to make correct inferences about who the
| speakers are in order to spell their names correctly, or
| manually supplying context as another respondent mentioned.
| anonymousiam wrote:
| When she told me her name, I didn't ask her to repeat it,
| and I got it right through the rest of the call. Whisper
| didn't, so how is this "at least as good as a human"?
| t-3 wrote:
| I wouldn't expect any transcriber to know that the
| correct spelling in your case used a G rather than a J -
| the J is far more common in my experience. "Jim" would be
| an aberration that could be improved, but substituting
| "Jem" for "Gem" without any context to suggest the latter
| would be just fine IMO.
| shaunpud wrote:
| I Scream in the Sun
| https://carmageddon.fandom.com/wiki/I_Scream_in_the_Sun
| DiogenesKynikos wrote:
| This is what your brain does when it processes language.
|
| I find that in languages I don't speak well, my ability to
| understand degrades much more quickly as the audio quality goes
| down. But in my native language, even with piss poor audio
| quality, my brain fills in the garbled words with its prior
| expectation of what those words should be, based on context.
| mockingloris wrote:
| A slight segue to this; I was made aware of the phenomenon
| that the language you think in sets the constraints on how
| expansively the brain can think and parse information.
|
| Fortunately I think in English, an ever-evolving language
| that expands as the world does. That's compared to the
| majority of people where I'm from; English was a second
| language they had to learn, and the people that taught them
| weren't well equipped with the resources to do a good job.
|
| |
|
| +-- Dey well; Be well
| cyphar wrote:
| This is called linguistic relativity (née the Sapir-Whorf
| hypothesis), and the strong form you describe has fallen out
| of favour in modern linguistics.
|
| A surprising number of monolingual people think their own
| language is the most adaptable and modern language, but
| this is obviously untrue. All languages evolve to fit the
| needs of speakers.
|
| Also, the idea that people "think in language X" is heavily
| disputed. One obvious counterargument is that most people
| have experienced the feeling of being unable to express
| what they are thinking into words -- if you truly did think
| in the language you speak, how could this situation happen?
| My personal experience is that I do not actively hear any
| language in my head unless I actively try to think
| about it (at least, since I was a teenager).
|
| (This is all ignoring the comments about ESL speakers that
| I struggle to read as anything but racism. As someone who
| speaks multiple languages, it astounds me how many people
| seem to think that struggling to express something in your
| non-native language means that you're struggling to think
| and are therefore stupid.)
| codedokode wrote:
| My experience is that sometimes, for example, when I
| watch a lecture in a foreign language, there could be
| some terms for which I don't know the correct translation
| so I cannot think about or mention them in my native
| language, while I understand what they mean.
| numpad0 wrote:
| > if you truly did think in the language you speak, how
| could this situation happen?
|
| As far as how it happens to me is concerned, either
| something closer to speech than raw thought reports back
| that the data in shared memory is invalid for the selected
| language, or I find there's no text representation for
| what I am trying to say.
|
| The "raw" thoughts work with the currently active
| language, for me, so at least for me, I just know strong
| Sapir-Whorf hypothesis is not even a hypothesis, but just
| a reasonable verbalization closely matching my own
| observations.
|
| I don't get why people can't take it, even in the age of
| LLMs. It is what it is and that old guy is just never
| correct even for once.
| sigbottle wrote:
| I think it's more like, you have a thought X, that has so
| many dimensions to it, but the way you serialize it to
| something that's actually discussable and comparable to
| other thoughts is language. And sometimes that language
| naturally loves slicing one part of that thought one way
| or the other.
|
| (then there's also a feedback loop type of argument, that
| always happens when discussing any sort of perception-
| reality distinction, but let's ignore that for now)
|
| At least for me, my brain is so bad and it's hard for me
| to truly hold a single thought in my head for a long
| time. Maybe it eventually settles into my subconscious
| but I don't really have a way to verify that.
| lgessler wrote:
| I recommend having a look at 16.3 onward here if you're curious
| about this: https://web.stanford.edu/~jurafsky/slp3/16.pdf
|
| I'm not familiar with Whisper in particular, but typically what
| happens in an ASR model is that the decoder, speaking loosely,
| sees "the future" (i.e. the audio after the chunk it's trying
| to decode) in a sentence like this, and also has the benefit of
| a language model guiding its decoding so that grammatical
| productions like "I like ice cream" are favored over "I like I
| scream".
| didacusc wrote:
| what would it make of this?
| https://www.youtube.com/watch?v=zyvZUxnIC3k
| yvdriess wrote:
| A good opportunity to point people to the paper with my
| favorite title of all time:
|
| "How to wreck a nice beach you sing calm incense"
|
| https://dl.acm.org/doi/10.1145/1040830.1040898
| abound wrote:
| For folks like me puzzling over what the correct
| transcription of the title should be, I think it's "How to
| recognize speech using common sense"
| fiatjaf wrote:
| Thank you very much!
| strken wrote:
| Thank you! "Calm incense" makes very little sense when said
| in an accent where calm isn't pronounced like com.
| solardev wrote:
| How is calm pronounced in those accents?
| drited wrote:
| Cahm
| solardev wrote:
| Like the "cam" in "camera"?
| strken wrote:
| In Australian English, calm rhymes with farm and uses a
| long vowel, while com uses a short vowel and would rhyme
| with prom. (I know this doesn't help much because some
| American accents also rhyme prom with farm).
|
| Consider the way "Commonwealth Bank" is pronounced in
| this news story: https://youtube.com/watch?v=MhkuHGRAAbg.
| An Australian English speaker would consider (most)
| Americans to be saying something like "Carmenwealth"
| rather "Commonwealth". See also the pronunciation of dog
| vs father in
| https://www.goalsenglish.com/lessons/2020/5/4/australian-
| eng....
|
| It really ruins some poetry.
| efilife wrote:
| Thanks. Now I know that I'm not that stupid and this
| actually makes no sense
| chipsrafferty wrote:
| It actually does make sense. Not saying you're stupid,
| but in standard English, if you say it quickly, the two
| sentences are nearly identical.
| mjw_byrne wrote:
| They're pretty different in British English, I struggled
| to figure it out until I started thinking about how it
| would sound with an American accent.
| codedokode wrote:
| But in "you sing", "s" is pronounced as "s", not as "z"
| from "using", right?
| squeaky-clean wrote:
| I pronounce using with an S unless I'm saying it very
| slowly
| codedokode wrote:
| I only got the "How to recognize" part. Also I think
| "using" should sound more like "you zinc" than "you sing".
| wdaher wrote:
| This is the correct parsing of it. (I can't take credit for
| coming up with the title, but I worked on the project.)
| fmx wrote:
| The paper: https://sci-
| hub.st/https://dl.acm.org/doi/10.1145/1040830.10...
|
| (Agree that the title is awesome, by the way!)
| brcmthrowaway wrote:
| Does AI voice recognition still use Markov models for this?
| sva_ wrote:
| Whisper uses an encoder-decoder transformer.
| xyse53 wrote:
| My favorite is:
|
| "Threesomes, with and without blame"
|
| https://dl.acm.org/doi/10.1145/1570506.1570511
|
| (From a professor I worked with a bit in grad school)
| ThinkingGuy wrote:
| Also relevant: The Two Ronnies - "Four Candles"
|
| https://www.youtube.com/watch?v=gi_6SaqVQSw
| Fluorescence wrote:
| It makes me curious about how human subtitlers or even
| scriptwriters choose to transcribe intentionally ambiguous
| speech, puns and narratively important mishearings. It's like
| you need to subtitle what is heard not what is said.
|
| Do those born profoundly deaf specifically study word sounds in
| order to understand/create puns, rhymes and such so they don't
| need assistance understanding narrative mishearings?
|
| It must feel like a form of abstract mathematics without the
| experiential component... but then I suspect mathematicians
| manufacture an experiential phenomenon with their abstractions
| with their claims of a beauty like music... hmm!
| dylan604 wrote:
| I had similar thoughts when reading Huck Finn. It's not just
| phonetically spelled, it's much different. Almost like Twain
| came up with a list of words, and then had a bunch of 2nd
| graders tell him the spelling of words they had seen. I guess
| at some point, you just get good at bad spelling?
| spauldo wrote:
| Writing in the vernacular, I believe it's called. I do
| something like that if I'm texting.
|
| The book "Feersum Endjinn" by Iain M. Banks uses something
| like this for one of its characters to quite good effect.
| dylan604 wrote:
| Except it forces me to slow down to "decipher" the text
| and makes the reading labored. I understand the point as
| it is part of the character, but it is easier to
| understand someone speaking in that vernacular vs reading
| the forced misspellings. I definitely don't want to get
| to the point of being good at reading it though. I wonder
| if this is how second grade teachers feel reading the
| class' schoolwork?
| spauldo wrote:
| That's true. I'm sure Twain and Banks were aware of this,
| though. Apparently they considered the immersion to be
| worth a little extra work on the part of the reader.
| Whether the reader agrees is a different story.
|
| I try to limit my use of it to just enough for my accent
| and way of talking to bleed through. I don't go for full-
| on phonetics, but I'm often "droppin' my g's and usin'
| lotsa regional sayin's." It probably helps that the
| people I text have the same accent I do, though.
| 0cf8612b2e1e wrote:
| The quality of subtitles implies that almost no effort is
| being put into their creation. Watch even a high budget
| movie/TV show and be aghast at how frequently they diverge.
| smallpipe wrote:
| A good subtitle isn't a perfect copy of what was said.
| herbcso wrote:
| Tom Scott would agree with you.
| https://m.youtube.com/watch?v=pU9sHwNKc2c
| kstrauser wrote:
| Hard disagree. When I'm reading a transcript, I want
| word-for-word what the people said, not a creative edit.
| I want the speakers' voice, not the transcriptionist's.
|
| And when I'm watching subtitles in my own language (say
| because I want the volume low so I'm not disturbing
| others), I hate when the words I see don't match the
| words I hear. It's the quickest way I can imagine to get
| sucked out of the content and into awareness of the
| delivery of the content.
| stavros wrote:
| But then what about deliberate mishearings and ambiguous
| speech, like the GP said?
| crazygringo wrote:
| I mean, subtitles are _mostly_ the same.
|
| Sometimes they're edited down simply for space, because
| there wouldn't be time to easily read all the dialog
| otherwise. And sometimes repetition of words or phrases
| is removed, because it's clearer, and the emphasis is
| obvious from watching the moving image. And filler words
| like "uh" or "um" generally aren't included unless they
| were in the original script.
|
| Most interestingly, swearing is sometimes toned down,
| just by skipping it -- removing an f-word in a sentence
| or similar. Not out of any kind of puritanism, but
| because swear words genuinely come across as more
| powerful in print than they do in speech. What sounds
| right when spoken can sometimes look like too much in
| print.
|
| Subtitles are an art. Determining when to best time them,
| how to split up long sentences, how to handle different
| speakers, how to handle repetition, how to handle limited
| space. I used to want subtitles that were perfectly
| faithful to what was spoken. Then I actually got involved
| in making subtitles at one point, and was very surprised
| to discover that perfectly faithful subtitles didn't
| actually do the best job of communicating meaning.
|
| Fictional subtitles aren't court transcripts. They serve
| the purpose of storytelling, which is the combination of
| a visible moving image full of emotion and action, and
| the subtitles. Their interplay is complex.
| creesch wrote:
| > When I'm reading a transcript
|
| That's the thing though, subtitles _aren't intended as
| full transcripts_. They are intended to allow a wide
| variety of people to follow the content.
|
| A lot of people read slower than they would hear speech.
| So subtitles often need to condense or rephrase speech to
| keep pace with the video. The goal is usually to convey
| meaning clearly within the time available on screen. Not
| to capture every single word.
|
| If they tried to be fully verbatim, you'd either have
| subtitles disappearing before most viewers could finish
| reading them or large blocks of text covering the screen.
| Subtitlers also have to account for things like
| overlapping dialogue, filler words, and false starts,
| which can make exact transcriptions harder to read and
| more distracting in a visual medium.
|
| I mean, yeah in your own native language I agree it sort
| of sucks if you can still hear the spoken words as well.
| But, to be frank, you are also the minority group here as
| far as subtitle target audiences go.
|
| And to be honest, if they were fully verbatim, I'd wager
| you quickly would be annoyed as well. Simply because you
| will notice how much attention they then draw, making you
| less able to actually view the content.
| iczero wrote:
| I regularly enable YouTube subtitles. Almost always, they
| are a 100% verbatim transcription, excluding errors from
| auto-transcription. I am not annoyed in the slightest,
| and in fact I very much prefer that they are verbatim.
|
| If you are too slow at reading subtitles, you can either
| slow down the video or train yourself to read faster. Or
| you can just disable the subtitles.
| creesch wrote:
| > If you are too slow at reading subtitles, you can
| either slow down the video or train yourself to read
| faster. Or you can just disable the subtitles.
|
| That's just tone deaf, plain and simple. I was not
| talking about myself, or just YouTube. You are not
| everyone else; your use case is not everyone else's
| use case. It really isn't that difficult.
| numpad0 wrote:
| Aren't same-language subtitles supposed to be perfect
| literal transcripts, while cross-language subtitling is
| supposed to be compressed creative interpretations?
| re wrote:
| I've been playing with whisper to try to do local transcription
| of long videos, but one issue I've found is that long (>15
| seconds) spans without any speech tend to send it into
| hallucination loops that it often can't recover from. I wonder
| if, with direct integration into ffmpeg, they will be able to
| configure it in a way that can improve that situation.
| 42lux wrote:
| You usually delete silence before using something like whisper.
| re wrote:
| I've heard that, but that doesn't sound like a useful
| approach for videos where (1) non-speech segments can have
| plenty of other sound (music, noise) and (2) you want
| timestamps to match up with the original video, like for
| subtitles. But maybe there are known mitigations for both of
| those issues that I'm not aware of. And if they do exist
| maybe they can be included in the ffmpeg whisper integration.
| miki123211 wrote:
| By "delete", people mostly mean "detect", so that you can
| avoid processing such segments through Whisper. There's no
| reason to actually cut the silence out from the original
| audio file.
| hnlmorg wrote:
| This is designed for real time use too. And in such cases,
| you couldn't delete the silence before use.
| 42lux wrote:
| The ffmpeg implementation might be, but the example was not.
| franga2000 wrote:
| Whisper is supposed to be used with voice activity detection
| and all production implementations that I've seen do that. The
| raw model is known to make up nonsense for silence because, as
| I understand it, it was never trained not to do that, assuming
| everyone will use VAD
| bondarchuk wrote:
| Can whisper do multilingual yet? Last time I tried it on some
| mixed dutch/english text it would spit out english translations
| for some of the dutch text. Strange bug/feature since from all
| appearances it had understood the dutch text perfectly fine.
| ph4evers wrote:
| Whisper-v3 works well for multi-lingual. I tried it with Dutch,
| German and English
| jeroenhd wrote:
| I found that it works quite well for Dutch+English as long as
| you use one of the larger models. But that may just be luck, I
| imagine mixing Italian and Swedish will have very different
| results.
| guilamu wrote:
| Whisper has been multilingual for 5 years at least.
| bondarchuk wrote:
| I know it is ostensibly multilingual, it's less than a year
| since I tried, but it does this thing where it then
| translates everything (or only some things) into a single
| language regardless with no way to turn it off.
| guilamu wrote:
| Sorry, I've been using it for French audio files for 5
| years and never had this issue.
| woodson wrote:
| Except it's only been released in September 2022 (not even 3
| years ago).
| kwar13 wrote:
| Best for English, but I've found it pretty decent for Spanish.
| MaKey wrote:
| It's even better for some languages other than English (e.g.
| Spanish), see: https://github.com/openai/whisper?tab=readme-
| ov-file#availab...
| clarionbell wrote:
| I think Dutch/English is probably the worst combination for
| this. The languages are rather close.
| bondarchuk wrote:
| I don't understand how this would happen, though. It's not
| like it will mishear a dutch sentence as if it's english; it
| will correctly pick up the dutch sentence, but (since the
| language is auto-detected as english at the start of the
| segment), seemingly auto-translate that (correct and
| correctly heard) dutch text to english. All we need is a way
| to get the dutch text that's surely somewhere in there,
| before the translation happens.
|
| Unless it was trained end-to-end on dutch-subtitled english
| text?? Which might make the translation a somewhat
| inextricable part of the model..? Does anyone know?
| numpad0 wrote:
| Isn't that a bit much for ASR models? Humans can't handle
| simultaneous multilingual dictation tasks either; I have to stop
| and reinitialize ears before switching languages between
| English and my primary one.
| bondarchuk wrote:
| Seems like it already has the capability somewhere in the
| model though - see my reply to clarionbell.
| cenamus wrote:
| Isn't that exactly what intepreters do?
| numpad0 wrote:
| If they're like what I am, they seem to just coordinate
| constant staggered resets for sub-systems of the language
| processing pipeline while keeping internal representations
| of inputs in a half-text state so that input comes back out
| through the pipeline in the other configurations.
|
| That's how I anecdotally feel and interpret how my own
| brain appear to work, so it could be different from how
| interpreters work or how actual human brains work, but as
| far as I see it, professional simultaneous interpreters
| don't seem to be agnostic for relevant pairs of languages
| at all.
| abdullahkhalids wrote:
| In South Asia, it's quite common for people to speak a
| combination of their local language and English. Not just
| alternating sentences between the two languages, but in fact,
| constructing sentences using compound phrases from the two
| languages.
|
| "Madam, please believe me, maine homework kiya ha" [I did my
| homework].
| yewenjie wrote:
| I have recently found that parakeet from NVIDIA is way faster and
| pretty much as correct as Whisper, but it only works with
| English.
| instagraham wrote:
| Does this mean that any software which uses ffmpeg can now add a
| transcription option? Audacity, Chrome, OBS etc
| ks2048 wrote:
| If they want to support it out of the box, they'll still have
| to embed a model file (roughly 500 MB - 3GB, varying size and
| quality)
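|
| The usual source for those files is whisper.cpp's download
| script, roughly:
|
|     git clone https://github.com/ggml-org/whisper.cpp
|     sh whisper.cpp/models/download-ggml-model.sh base.en
|
| which fetches a ggml-base.en.bin you can then point the filter's
| model option at (exact output location may vary, check the
| script).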
| einpoklum wrote:
| Can't you point ffmpeg to a model file using some preferences
| dialog?
| Lio wrote:
| Once local transcription is in more places hopefully we can
| persuade content creators not to burn bouncing subtitles into
| their videos.
|
| I've seen professionally produced recordings on dry and technical
| subjects with good sound quality where they've decided to use
| distracting subtitles with no way to disable them.
|
| It seems so unnecessary if you're not making novelty videos about
| cats.
|
| Also local transcription allows for automatic translation and
| again, overlaying subtitles on top of an existing burnt-in set is
| a really poor reading experience.
| HPsquared wrote:
| The other problem with burned-in subtitles is you can't change
| the language.
| rkomorn wrote:
| True, but (as someone who not infrequently has to rewind
| content on just about all streaming apps because it decided
| one particular subtitle only needed to be displayed for less
| than 200ms this time around) sometimes burned-in seems like a
| good idea.
|
| I don't understand why the problem seems so pervasive (I've
| seen it on Netflix, Viki, and Apple TV, at least) and so
| transient.
| t-3 wrote:
| It's a newer problem IME, so I'd guess it's caused by people
| using auto-transcription/translation tools to generate
| subtitles. For eg. Chinese content, I'll see stuff on Viki
| where the OG Mandarin subs are formatted sanely and the
| English is piecemeal follow-the-audio style. I can't
| imagine this happening in any other way than use of a
| transcription+translation tool without review.
| rkomorn wrote:
| I don't think it's an automation-related thing. It
| happens even on big name shows on big apps.
|
| I think it's a toolkit thing where some sort of event or
| timer goes off at the wrong time and the subtitles get
| cleared when they shouldn't. And then if you rewind and
| replay, it doesn't happen again (because spurious
| event/timer issue).
| t-3 wrote:
| At least with vtt and srt, the chunk of text displayed is
| explicitly associated with a chunk of time, so something
| like that really shouldn't be happening. Maybe there is
| some sort of subtitle-writing on the fly like what is
| sometimes done with transcoding video, but that would be
| really strange for a plaintext format that is so light
| compared to the video and audio coming with it.
| rkomorn wrote:
| > so something like that really shouldn't be happening
|
| I don't disagree, yet here we are. It's got race
| condition vibes.
|
| I don't know if it's related to the TV OS (LG WebOS in
| our case) but I guess that would be the common factor
| since it happens across multiple apps and languages.
|
| Anyway, it's quirky and occasionally annoying, but that's
| about it. :)
| LorenDB wrote:
| The other other problem with burned-in subtitles is that they
| normally have horrible formatting. Who wants to try to read
| single words that only flash on-screen while they are being
| spoken?
| preisschild wrote:
| They could also just upload those transcriptions as normal
| closed-captioning srt subtitles...
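|
| Muxing them as a separate track instead of burning them in is a
| one-liner with ffmpeg, e.g. into an mkv (untested sketch):
|
|     ffmpeg -i video.mp4 -i subs.srt -map 0 -map 1 \
|         -c copy -c:s srt video-with-subs.mkv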
| jimkleiber wrote:
| not all social media will show subtitles/captions tho, which
| is the challenge. YouTube Shorts, TikTok videos, IG reels, FB
| reels, Whatsapp statuses, and more. I think some allow cc but
| some don't, and if someone reshares to another platform, it
| may not be there, so some of us burn them in begrudgingly :-)
| ambicapter wrote:
| They do that because it increases "engagement", not because
| they care about the user's experience with the subtitles.
| iAMkenough wrote:
| Also some social media platforms don't offer subtitle
| functionality, so burned-in is the only way if you want to
| serve your content to people that require subtitles or refuse
| to unmute their phones while they watch from their toilet.
| dzhiurgis wrote:
| It's just so annoying how someone like Netflix offers like 3-4
| languages for most of its content when you can basically get it
| for free via browser extensions (if you watch on browser).
|
| Must be union thing.
| dewey wrote:
| That Netflix, which would need to pay more to license more
| subtitles, can't compete with pirated or unlicensed auto-
| generated subtitles shouldn't really be a surprise.
|
| It's also annoying that you have to pay for Netflix when you
| can get the same movies for free with less restrictions on a
| pirate site.
| whywhywhywhy wrote:
| The algorithm boosts it; that's why they do it. Even if every
| device had real-time, 100% accurate subtitling built in, they'd
| still do it if the video performs better with it.
| absoflutely wrote:
| I think this trend is partially driven by the silent autoplay
| that happens on YouTube. Baked-in subtitles help draw people
| into the video.
| jiehong wrote:
| Those burned in subtitles still aren't as cool as theme-matched
| anime subtitles during intro music sequences from fansubs 15
| years ago.
|
| Those are still cool IMO
| trenchpilgrim wrote:
| Or how the fansubbers will create masks to translate diegetic
| text like signage and written notes
| zoobab wrote:
| Not sure it will be packaged in Debian, with an external binary
| model produced god knows how...
| majewsky wrote:
| It looks like the model file needs to be supplied at invocation
| time, so the binary blob would not be required for packaging.
| zoobab wrote:
| so 'apt install ffmpeg' won't be enough to have the feature?
| SahAssar wrote:
| You'd have the feature, but you also need to supply the
| model. The feature seems to just be that ffmpeg has the
| ability to run the model, it does not include the model.
| martzoukos wrote:
| I guess that there is no streaming option for sending generated
| tokens to, say, an LLM service to process the text in real-time.
| nomad_horse wrote:
| Whisper has an encoder-decoder architecture, so it's hard to
| run in a streaming fashion efficiently, though whisper-streaming
| is a thing.
|
| https://kyutai.org/next/stt is natively streaming STT.
| woodson wrote:
| There are many streaming ASR models based on CTC or RNNT.
| Look for example at sherpa (https://github.com/k2-fsa/sherpa-
| onnx), which can run streaming ASR, VAD, diarization, and
| many more.
| donatj wrote:
| I know nothing about Whisper, is this usable for automated
| translation?
|
| I own a couple very old and as far as I'm aware never translated
| Japanese movies. I don't speak Japanese but I'd love to watch
| them.
|
| A couple years ago I had been negotiating with a guy on Fiver to
| translate them. At his usual rate-per-minute of footage it would
| have cost thousands of dollars but I'd negotiated him down to a
| couple hundred before he presumably got sick of me and ghosted
| me.
| poglet wrote:
| Yep, whisper can do that. You can also try whisperx
| (https://github.com/m-bain/whisperX) for a possibly better
| experience with aligning of subtitles to spoken words.
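|
| A rough invocation (untested; flag names are whisperX's as I
| recall them, so check whisperx --help):
|
|     whisperx movie.mkv --model large-v2 --language ja \
|         --task translate --output_format srt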
| _def wrote:
| May I ask which movies? I'm just curious
| trenchpilgrim wrote:
| Whisper has quite bad issues with hallucination. It will inject
| sentences that were never said in the audio.
|
| It's decent for classification but poor at transcription.
| neckro23 wrote:
| Pre-processing with a vocal extraction model (bs-rofomer or
| similar) helps a lot with the hallucinations, especially with
| poor quality sources.
| trenchpilgrim wrote:
| I'm working with fairly "clean" audio (voice only) and
| still see ridiculous hallucinations.
| prmoustache wrote:
| My personal experience trying to transcribe (not translate)
| was a complete failure. The thing would invent stuff. It would
| also be completely lost when more than one language is used.
|
| It also doesn't understand context, so it makes a lot of the
| errors you see in automatic translations of YouTube videos, for
| example.
| okdood64 wrote:
| It's curious how YouTube's is so bad still given the current
| state of the art; but it has got a lot better in the last 6
| months.
| ethan_smith wrote:
| Whisper can indeed transcribe Japanese and translate it to
| English, though quality varies by dialect and audio clarity.
| You'll need the "large-v3" model for best results, and you can
| use ffmpeg's new integration with a command like `ffmpeg -i
| movie.mp4 -af whisper=model=large-v3:task=translate
| output.srt`.
| waltbosz wrote:
| I wonder how the results of AI Japanese-audio-to-English
| subtitles would compare to a fansubbed anime. I'm guessing it
| would be a more literal translation vs. contextual or
| cultural.
|
| I found an interesting article about trollsubs, which I guess
| are fansubs made with a contemptuous flair.
| https://neemblog.home.blog/2020/08/19/the-lost-art-of-fan-
| ma...
|
| Tangent: I'm one of those people who watch movies with closed
| captions. Anime is difficult because the subtitle track is
| often the original Japanese-to-English subtitles and not
| closed captions, so the text does not match the English
| audio.
| chazeon wrote:
| I do japanese transcription + gemini translations. It's
| worse than fansub, but its much much better than nothing.
| First thing that could struggle is actually the vad, then
| is special names and places, prompting can help but not
| always. Finally it's uniformity (or style). I still feel
| that I can't control the punctuation well.
| numpad0 wrote:
| I was recently just playing around with Google Cloud ASR as
| well as smaller Whisper models, and I can say it hasn't
| gotten to that point: Japanese ASRs/STTs all generate final
| kanji-kana mixed text, and since kanji:pronunciation is n:n
| maps, it's non-trivial enough that it currently needs hands
| from human native speakers to fix misheard texts in a lot
| of cases. LLMs should be theoretically good at this type of
| tasks, but they're somehow clueless about how Japanese
| pronunciation works, and they just rubber-stamp inputs as
| written.
|
| The conversion process from pronunciation to intended text
| is not deterministic either, so it probably can't be solved
| by "simply" generating all-pronunciation outputs. Maybe a
| multimodal LLM as ASR/STT, or a novel dual input as-
| spoken+estimated-text validation model could be made? I
| wouldn't know, though. It seemed like a semi-open question.
| neckro23 wrote:
| In my experience it works ok. The "English" model actually
| knows a lot of languages and will translate directly to
| English.
|
| You can also transcribe it to Japanese and use a translator to
| convert to English. This can sometimes help for more
| semantically complex dialogue.
|
| For example, using faster-whisper-xxl [1]:
|
| Direct translation:
|
|     faster-whisper-xxl.exe --language English --model large-v2
|         --ff_vocal_extract mdx_kim2 --vad_method pyannote_v3
|         --standard <input>
|
| Use Japanese, then translate:
|
|     faster-whisper-xxl.exe --language Japanese --task translate
|         --model large-v2 --ff_vocal_extract mdx_kim2
|         --vad_method pyannote_v3 --standard <input>
|
| 1. https://github.com/Purfview/whisper-standalone-win
| BetterWhisper wrote:
| Hey, indeed Whisper can do the transcription of Japanese and
| even the translation (but only to English). For the best
| results you need to use the largest model which depending on
| your hardware might be slow or fast.
|
| Another option is to use something like VideoToTextAI which
| allows you to transcribe it fast and then translate it into
| 100+ languages, which you can then export as a subtitle (SRT)
| file
| mockingloris wrote:
| How could one in theory, use this to train on a new language? Say
| for a hubby project; I have recordings of some old folks stories
| in my local dialect.
|
| |
|
| +-- Dey well; Be well
| notpublic wrote:
| https://huggingface.co/blog/fine-tune-whisper
| dncornholio wrote:
| I was expecting a lot more comments on whether this is a
| necessary feature or whether it even belongs in a library like
| ffmpeg. I think this is bloat, especially when the feature
| doesn't work flawlessly; whisper is very limited.
| MrGilbert wrote:
| The only item that was discussed was that the subtitle workflow
| does not seem to be that good, afaict:
|
| https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20022#issuecomme...
| kmfrk wrote:
| Whisper is genuinely amazing - with the right nudging. It's the
| one AI thing that has genuinely turned my life upside-down in an
| unambiguously good way.
|
| People should check out Subtitle Edit (and throw the dev some
| money) which is a great interface for experimenting with Whisper
| transcription. It's basically Aegisub 2.0, if you're old, like
| me.
|
| HOWTO:
|
| Drop a video or audio file to the right window, then go to Video
| > Audio to text (Whisper). I get the best results with Faster-
| Whisper-XXL. Use large-v2 if you can (v3 has some regressions),
| and you've got an easy transcription and translation workflow.
| The results aren't perfect, but Subtitle Edit is for cleaning up
| imperfect transcripts with features like Tools > Fix common
| errors.
|
| EDIT: Oh, and if you're on the current gen of Nvidia card, you
| might have to add "--compute_type float32" to make the
| transcription run correctly. I think the error is about an empty
| file, output or something like that.
|
| EDIT2: And if you get another error, possibly about whisper.exe,
| iirc I had to reinstall the Torch libs from a specific index like
| something along these lines (depending on whether you use pip or
| uv):
|
|     pip3 install torch torchvision torchaudio \
|         --index-url https://download.pytorch.org/whl/cu118
|
|     uv pip install --system torch torchvision torchaudio \
|         --index-url https://download.pytorch.org/whl/cu118
|
| If you get the errors and the above fixes work, please type your
| error message in a reply with what worked to help those who come
| after. Or at least the web crawlers for those searching for help.
|
| https://www.nikse.dk/subtitleedit
|
| https://www.nikse.dk/donate
|
| https://github.com/SubtitleEdit/subtitleedit/releases
| tossit444 wrote:
| Aegisub is still actively developed (forked), and imo the two
| can't really be compared to one another. They can
| complement each other, but SE is much better for actual
| transcription. Aegisub still does the heavy lifting for
| typesetting and the like.
| pawelduda wrote:
| Can you give an example why it made your life that much better?
| shrx wrote:
| As a hard of hearing person, I can now download any video
| from the internet (e.g. youtube) and generate subtitles on
| the fly, not having to struggle to understand badly recorded
| or unintelligible speech.
| dylan604 wrote:
| If the dialog is badly recorded or the speech unintelligible,
| how would a transcription process get it correct?
| gregoryl wrote:
| Because it can use the full set of information in the
| audio - people with hearing difficulties cannot. Also
| interesting: people with perfectly functional hearing
| but who have "software" bugs (i.e. I find it extremely
| hard to process voices with significant background noise)
| can also benefit :)
| spauldo wrote:
| I have that issue as well - I can hear faint noises OK
| but if there's background noise I can't understand what
| people say. But I'm pretty sure there's a physical issue
| at the root of it in my case. The problem showed up after
| several practice sessions with a band whose guitarist
| insisted on always playing at full volume.
| dylan604 wrote:
| > I have that issue as well
|
| You say issue, I say feature. It's a great way to just
| ignore boring babbling at parties or other social
| engagements where you're just not that engaged. Sort of
| like selective hearing in relationships, but used on a
| wider audience
| mschuster91 wrote:
| The definition of "unintelligible" varies by person,
| especially by accent. Like, I got no problem with
| understanding the average person from Germany... but
| someone from the deep backwaters of Saxony, forget about
| that.
| 3036e4 wrote:
| I did this as recently as today, for that reason, using
| ffmpeg and whisper.cpp. But not on the fly. I ran it on a
| few videos to generate VTT files.
| kmfrk wrote:
| Aside from accessibility as mentioned, you can catch up on
| videos that are hours long. Orders of magnitude faster than
| watching on 3-4x playback speed. If you catch up through
| something like Subtitle Edit, you can also click on relevant
| parts of the transcript and replay it.
|
| But transcribing and passably translating everything goes a
| long way too. Even if you can hear what's being said, it's
| still less straining to hear when there's captions for it.
|
| Obviously one important factor to the convenience is how fast
| your computer is at transcription or translation. I don't use
| the features in real-time personally currently, although I'd
| like to if a great UX comes along through other software.
|
| There's also a great podcast app opportunity here I hope
| someone seizes.
| 3036e4 wrote:
| I used it like sibling commenter to get subtitles for
| downloaded videos. My hearing is bad. Whisper seems much
| better that YouTube's built-in auto-subtitles, so sometimes
| it is worth the extra trouble for me to download a video just
| to generate good subtitles and then watch it offline.
|
| I also used whisper.cpp to transcribe all my hoarded podcast
| episodes. Took days of my poor old CPU working at 100% on all
| cores (and then a few shorter runs to transcribe new episodes
| I have downloaded since). Worked as well as I could possibly
| hope. Of course it gets the spelling of names wrong, but I
| don't expect anything (or anyone) to do much better. It is
| great to be able to run ripgrep to find old episodes on some
| topic and sometimes now I read an episode instead of listen,
| or listen to it with mpv with subtitles.
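|
| For reference, the whole loop is only a few commands (a sketch,
| assuming a built whisper.cpp checkout; whisper.cpp wants 16 kHz
| mono WAV input):
|
|     ffmpeg -i episode.mp3 -ar 16000 -ac 1 episode.wav
|     ./build/bin/whisper-cli -m models/ggml-base.en.bin \
|         -f episode.wav --output-vtt
|     mpv episode.mp3 --sub-file=episode.wav.vtt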
| joshvm wrote:
| I don't know about _much_ better, but I like Whisper's
| ability to subtitle foreign language content on YouTube that
| (somehow) doesn't have auto-generated subs. For example some
| relatively obscure comedy sketches from Germany where I'm not
| quite fluent enough to go by ear.
|
| 10 years ago you'd be searching through random databases to
| see if someone had synchronized subtitles for the exact copy
| of the video that you had. Or older lecture videos that don't
| have transcripts. Many courses had to, in order to comply
| with federal funding, but not all. And lots of international
| courses don't have this requirement at all (for example some
| great introductory CS/maths courses from German + Swiss
| institutions). Also think about taking this auto generated
| output and then generating summaries for lecture notes,
| reading recommendations - this sort of stuff is what LLMs are
| great at.
|
| You can do some clever things like take the foreign sub, have
| Whisper also transcribe it and then ask a big model like
| Gemini to go line by line and check the translation to
| English. This can include accounting for common transcription
| errors or idiomatic difference between langauges. I do it in
| Cursor to keep track of what the model has changed and for
| easy rollback. It's often good enough to correct mis-heard
| words that would be garbled through a cheaper model. _And_
| you can even query the model to ask about why a particular
| translation was made and what would be a more natural way to
| say the same thing. Sometimes it even figures out jokes. It's
| not a fast or fully automatic process, but the quality can
| be extremely good if you put some time into reviewing.
|
| Having 90% of this be possible offline/open access is also
| very impressive. I've not tried newer OSS models like Qwen3
| but I imagine it'd do a decent job of the cleanup.
| notatallshaw wrote:
| > uv pip install --system torch torchvision torchaudio --index-
| url https://download.pytorch.org/whl/cu118
|
| uv has a feature to get the correct version of torch based on
| your available cuda (and some non-cuda) drivers (though I
| suggest using a venv not the system Python):
|
| > uv pip install torch torchvision torchaudio --torch-
| backend=auto
|
| More details:
| https://docs.astral.sh/uv/guides/integration/pytorch/#automa...
|
| This also means you can safely mix torch requirements with non-
| torch requirements as it will only pull the torch related
| things from the torch index and everything else from PyPI.
| xrd wrote:
| I love uv and really feel like I only need to know "uv add"
| and "uv sync" to be effective using it with python. That's an
| incredible feat.
|
| But, when I hear about these kinds of extras, it makes me
| even more excited. Getting cuda and torch to work together is
| something I have struggled with countless times.
|
| The team at Astral should be nominated for a Nobel Peace
| Prize.
| eigenvalue wrote:
| They've definitely saved me many hours of wasted time
| between uv and ruff.
| danudey wrote:
| > "uv add"
|
| One life-changing thing I've been using `uv` for:
|
| System python version is 3.12:
|
|     $ python3 --version
|     Python 3.12.3
|
| A script that requires a library we don't have, and won't
| work on our local python:
|
|     $ cat test.py
|     #!/usr/bin/env python3
|     import sys
|     from rich import print
|     if sys.version_info < (3, 13):
|         print("This script will not work on Python 3.12")
|     else:
|         print(f"Hello world, this is python {sys.version}")
|
| It fails:
|
|     $ python3 test.py
|     Traceback (most recent call last):
|       File "/tmp/tmp/test.py", line 10, in <module>
|         from rich import print
|     ModuleNotFoundError: No module named 'rich'
|
| Tell `uv` what our requirements are:
|
|     $ uv add --script=test.py --python '3.13' rich
|     Updated `test.py`
|
| `uv` updates the script:
|
|     $ cat test.py
|     #!/usr/bin/env python3
|     # /// script
|     # requires-python = ">=3.13"
|     # dependencies = [
|     #     "rich",
|     # ]
|     # ///
|     import sys
|     from rich import print
|     if sys.version_info < (3, 13):
|         print("This script will not work on Python 3.12")
|     else:
|         print(f"Hello world, this is python {sys.version}")
|
| `uv` runs the script, after installing packages and
| fetching Python 3.13:
|
|     $ uv run test.py
|     Downloading cpython-3.13.5-linux-x86_64-gnu (download) (33.8MiB)
|     Downloading cpython-3.13.5-linux-x86_64-gnu (download)
|     Installed 4 packages in 7ms
|     Hello world, this is python 3.13.5 (main, Jun 12 2025, 12:40:22) [Clang 20.1.4 ]
|
| And if we run it with Python 3.12, we can see that errors:
|
|     $ uv run --python 3.12 test.py
|     warning: The requested interpreter resolved to Python 3.12.3,
|     which is incompatible with the script's Python requirement: `>=3.13`
|     Installed 4 packages in 7ms
|     This script will not work on Python 3.12
|
| Works for any Python you're likely to want:
|
|     $ uv python list
|     cpython-3.14.0b2-linux-x86_64-gnu                <download available>
|     cpython-3.14.0b2+freethreaded-linux-x86_64-gnu   <download available>
|     cpython-3.13.5-linux-x86_64-gnu                  /home/dan/.local/share/uv/python/cpython-3.13.5-linux-x86_64-gnu/bin/python3.13
|     cpython-3.13.5+freethreaded-linux-x86_64-gnu     <download available>
|     cpython-3.12.11-linux-x86_64-gnu                 <download available>
|     cpython-3.12.3-linux-x86_64-gnu                  /usr/bin/python3.12
|     cpython-3.12.3-linux-x86_64-gnu                  /usr/bin/python3 -> python3.12
|     cpython-3.11.13-linux-x86_64-gnu                 /home/dan/.local/share/uv/python/cpython-3.11.13-linux-x86_64-gnu/bin/python3.11
|     cpython-3.10.18-linux-x86_64-gnu                 /home/dan/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/bin/python3.10
|     cpython-3.9.23-linux-x86_64-gnu                  <download available>
|     cpython-3.8.20-linux-x86_64-gnu                  <download available>
|     pypy-3.11.11-linux-x86_64-gnu                    <download available>
|     pypy-3.10.16-linux-x86_64-gnu                    <download available>
|     pypy-3.9.19-linux-x86_64-gnu                     <download available>
|     pypy-3.8.16-linux-x86_64-gnu                     <download available>
|     graalpy-3.11.0-linux-x86_64-gnu                  <download available>
|     graalpy-3.10.0-linux-x86_64-gnu                  <download available>
|     graalpy-3.8.5-linux-x86_64-gnu                   <download available>
| taminka wrote:
| whisper is great, i wonder why youtube's auto-generated subs
| are still so bad? even the smallest whisper is way better than
| google's solution. is it a licensing issue? harder to deploy at
| scale?
| briansm wrote:
| I believe youtube still uses 40 mel-scale vectors as feature
| data, while whisper uses 80 (which provides finer spectral
| detail but is naturally more computationally intensive to
| process, though modern hardware allows for that)
| ec109685 wrote:
| You'd think they'd use the better model at least for videos
| that have large view counts (they already do that when
| deciding compression optimizations).
| jokethrowaway wrote:
| whisper is definitely nice, but it's a bit too slow. Having
| subtitles and transcription for everything is great - but Nemo
| Parakeet (pretty much whisper by nvidia) completely changed how
| I interact with the computer.
|
| It enables dictation that actually works and it's as fast as
| you can think. I also have a set of scripts which just wait for
| voice commands and do things. I can pipe the results to an LLM,
| run commands, synthesize a voice with F5-TTS back and it's like
| having a local Jarvis.
|
| The main limitation is being english only.
| threecheese wrote:
| Would you share the scripts?
| ec109685 wrote:
| Or at least more details. Very cool!
| throwoutway wrote:
| I found this online demo of it:
| https://www.nikse.dk/subtitleedit/online
| codedokode wrote:
| Kdenlive also supports auto-generating subtitles, which need
| some editing, but it is faster than creating them from scratch.
| Actually I would be happy even with a simple voice detector so
| that I don't have to set the timings manually.
| hart_russell wrote:
| Is there a way to use it to generate a srt subtitle file given
| a video file?
| prurigro wrote:
| It generates a few formats by default including srt
| guluarte wrote:
| you can install using winget or chocolatey:
| winget install --id=Nikse.SubtitleEdit -e
| Morizero wrote:
| You don't happen to know a whisper solution that combines
| diarization with live audio transcription, do you?
| kmfrk wrote:
| Proper diarization still remains a white whale for me,
| unfortunately.
|
| Last I looked into it, the main options required API access
| to external services, which put me off. I think it was
| pyannote.audio[1].
|
| [1]: https://github.com/pyannote/pyannote-audio
| jduckles wrote:
| WhisperX's diarization is great imo:
|
|     whisperx input.mp3 --language en --diarize --output_format vtt --model large-v2
|
| Works a treat for Zoom interviews. Diarization is sometimes a
| bit off, but generally it's correct.
| Morizero wrote:
| > input.mp3
|
| Thanks but I'm looking for live diarization.
| BrunoJo wrote:
| Subtitle Edit is great if you have the hardware to run it. If
| you don't have GPUs available or don't want to manage the
| servers, I built a simple-to-use and affordable API that you can
| use: https://lemonfox.ai/
| kanemcgrath wrote:
| Subtitle edit is great, and their subtitle library libse was
| exactly what I needed for a project I did.
| JohnKemeny wrote:
| Related, a blog article by the author of the patch:
|
| _Run Whisper audio transcriptions with one FFmpeg command_
|
| https://medium.com/@vpalmisano/run-whisper-audio-transcripti...
|
| Posted here, with 0 comments:
| https://news.ycombinator.com/item?id=44869254
| eXpl0it3r wrote:
| Link is broken, full link: https://medium.com/@vpalmisano/run-
| whisper-audio-transcripti...
| NiekvdMaas wrote:
| Correct URL: https://medium.com/@vpalmisano/run-whisper-audio-
| transcripti...
| webinar wrote:
| I've been using FFmpeg and Whisper to record and transcribe live
| police scanner audio for my city, and update it in real-time to a
| live website. It works great, with the expected transcription
| errors and hallucinations.
| Xunjin wrote:
| Is this website open? Would love to see your work :P
| webinar wrote:
| somerville.votolab.com
| mkayokay wrote:
| Looks like this is a nice case where the LLM thinks that
| silence is "thanks for watching" which was discussed on
| here a few days ago.
| jaster wrote:
| All the "Thanks for watching!" gave me a good chuckle.
|
| Reminds me of one of my own experiences with one of the
| Whisper models, where some random noise in the middle of the
| conversation was translated into "Don't forget to like and
| subscribe".
|
| Really illustrates where the training data is coming from.
| waltbosz wrote:
| I wanted to do this for my local county council meetings. I
| think in this context speaker recognition would be important.
| thedangler wrote:
| Does this whisper also do text-to-speech?
| dotancohen wrote:
| No
| porridgeraisin wrote:
| I had a small bash pipeline for doing this until now.
|     ffmpeg -f pulse -i "$(pactl get-default-source)" -t 5 -f wav -ar 16000 -ac 1 -c:a pcm_s16le - \
|       | ./main - \
|       | head -2 \
|       | tail -1 \
|       | cut -d] -f2 \
|       | awk '{$1=$1};1'
|
| The reading-from-mic part (-f pulse, pactl...) is Linux-specific;
| the rest of it should be cross-platform. The `main` executable is the
| whisper.cpp executable (see whisper.cpp github readme, it's just
| the output of `make base.en` from that).
|
| Edit: -t 5 controls recording duration.
|
| Oh and add 2>/dev/null to silence the debug output. I copied this
| from a pipe that further sends it into an LLM that then looks at
| the meaning and turns it into a variety of structured data
| (reminders, todo items, etc) which I then....
| dotancohen wrote:
| > which I then....
|
| Yes, please, go on...
| porridgeraisin wrote:
| The LLM turns my unstructured command into a structured command
| (a limited set of commands hardcoded in the prompt) and a
| script takes that and executes it. I have it do stuff like
| interact with Google Keep/Google Calendar using the CLI.
| Those are the most used actions, but there are a few others. Of
| course all actions can be scheduled.
|
| The LLM can screw up now and then and output absolute
| garbage. But I've got a knack now for figuring out what
| prompts it's gonna be hopeless on and I manually enter those.
|
| Example:
|
| Saying
|
| Remove makhana from shopping list
|
| Ends up running the command
|
| gkeep items edit shopping_list --check makhana
|
| There is a direct text interface too that skips the voice
| transcription.
|
| The main thing is it all happens in a background window without
| interrupting my screen or me needing to wait for whatever
| slow webpage to load. I had it do a few things on GitHub, like
| remind me when checks pass on PRs. You could potentially
| connect it to various things like your Amazon account to
| check on your orders, etc. As I write this I now realise I
| did what basically amounts to what folks do with MCP today.
| Maybe I should update it to use the protocol.
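|
| In sketch form it's basically this (not my exact script; the
| `llm` CLI is just a stand-in for whatever model call you use,
| and piping straight into sh is obviously a shortcut):
|
|     #!/bin/sh
|     # voice request -> structured command -> execute
|     cmd=$(
|       ffmpeg -f pulse -i "$(pactl get-default-source)" -t 5 -f wav -ar 16000 -ac 1 -c:a pcm_s16le - 2>/dev/null \
|         | ./main - 2>/dev/null \
|         | llm -s "Map the request to one allowed command, e.g. gkeep items edit shopping_list --check <item>. Reply with the command only, or NOOP."
|     )
|     [ "$cmd" != "NOOP" ] && echo "$cmd" | sh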
|
| These days I have a little more idle time as a grad student
| than I did in a tech company, and I don't really need to
| manage home/cooking/... so I don't really use some of the
| more complicated features. I mostly just use it to schedule
| 1on1s with my guide and add reminders about assignments and
| TA work and talks and my music class.
| dotancohen wrote:
| That is fascinating, thank you very much for sharing. Good
| luck with the grad work.
| porridgeraisin wrote:
| Thank you:)
| MaxikCZ wrote:
| I tried to use whisper to generate non-English subs from English
| audio, but wasn't able to figure it out. I know it can do English
| subs from non-English audio, and that earlier (less precise)
| versions could do any-language audio -> any-language subs, but
| the latest whisper only translates to English subs.
|
| Anyone found a way?
| abdusco wrote:
| I solved it by generating English subtitles, then passing those
| to an LLM in chunks that are ~20 entries in size. Include
| preceding and following subtitles as context for better
| translation. Make sure to replace the timestamps with simple
| integer ids, because LLMs like to mangle those, no matter how
| hard you prompt.
|
| I could share a python script that is working pretty reliably
| for me.
| vevoe wrote:
| I'd love to see that script, do you have a link?
| abdusco wrote:
| https://gist.github.com/abdusco/5bd5c909547f5f9b935dbd2fb2f
| e...
| realxrobau wrote:
| Annoyingly, something is broken with their anti-bot stuff, as it
| keeps refusing to let me see the page.
| correa_brian wrote:
| hell yeah
| pmarreck wrote:
| Now if only it did separate speaker identification (diarization)
| shmerl wrote:
| Did ffmpeg move their bug tracker to Forgejo?
|
| https://code.ffmpeg.org/FFmpeg/FFmpeg/issues
|
| I still see their old one too, but Forgejo one is nice.
| de6u99er wrote:
| That's great. How does Whisper compare to Google Gemini's
| transcription capabilities?
| mkbkn wrote:
| How can I run Whisper or this software on Linux or Android as a
| non-technical user?
|
| Basically a simple audio-to-text for personal use?
| 3036e4 wrote:
| I don't think installing (i.e. compiling) whisper.cpp and using
| it to do audio-to-text is very difficult. If the documentation
| is too technical I am sure you can ask some LLM to walk you
| through it. I have used it on Android in termux and on my
| FreeBSD desktop computer. Would not expect any difficulties on
| any modern Linux.
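|
| For reference, the whole flow is roughly this (a sketch assuming
| the classic Makefile build and a 16 kHz WAV input; newer releases
| use CMake and call the binary whisper-cli instead of main):
|
|     git clone https://github.com/ggerganov/whisper.cpp
|     cd whisper.cpp
|     make base.en        # builds the tool and downloads the base.en model
|     ./main -m models/ggml-base.en.bin -f audio.wav -osrt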
| iambvk wrote:
| Is anyone able to get streaming audio to text conversion working
| with whisper.cpp?
|
| I tried several times to get this into a reasonable shape, but
| all attempts have been failures. If anyone has pointers I'd
| really appreciate them.
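|
| (The obvious route seems to be the stream example bundled with
| whisper.cpp, built with SDL2 support -- the flags below are from
| its README and may have changed -- but I haven't gotten it into a
| usable state.)
|
|     ./stream -m models/ggml-base.en.bin -t 4 --step 500 --length 5000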
| dotancohen wrote:
| Why would one use FFmpeg with Whisper support, instead of using
| Whisper directly?
| lbrito wrote:
| I run a service that does transcriptions as part of the
| pipeline, and I use ffmpeg for other parts (such as speeding up
| audio). Having it all on a single command might make sense for
| some people if the costs work out.
| dotancohen wrote:
| Terrific, thank you.
| 3036e4 wrote:
| At least whisper.cpp only supports a few input formats like WAV
| and MP3. To get subtitles for videos I always have to first run
| ffmpeg to get an audio file and then run whisper.cpp. Guess
| this new feature may mean that I can do it in just one step, so
| slightly more convenient?
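|
| Presumably something like this, going by the filter's documented
| model/destination/format options (untested, so treat it as a
| sketch):
|
|     ffmpeg -i video.mkv -vn -af "whisper=model=ggml-base.en.bin:destination=subs.srt:format=srt" -f null -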
| miladyincontrol wrote:
| As an aside, my favorite whisper 'hack' is that you can just speed
| up the audio 10x to process it 10x faster, then adjust the timings
| afterwards
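|
| Something along these lines (sketch only; older ffmpeg caps atempo
| at 2.0 per instance, hence the chained filters multiplying out to
| 10x):
|
|     ffmpeg -i input.mp4 -vn -ac 1 -ar 16000 -af "atempo=2.0,atempo=2.0,atempo=2.5" fast.wav
|     # transcribe fast.wav with whisper, then multiply each timestamp by 10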
| yieldcrv wrote:
| Labeling multiple people talking is something I found lacking
| with whisper; is it better now?
| WanderPanda wrote:
| Is Whisper still SOTA 3 years later? It does not seem there is a
| clearly better open model. Alec Radford really is a genius!
| jiehong wrote:
| NVIDIA Nemo Parakeet for English. Mistral's recent Voxtral is
| supposed to be nice and open source
| generalizations wrote:
| Looks like there's a leaderboard:
| https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
| vitorgrs wrote:
| 3 years later and Youtube CCs are still horrible lol
| jhatemyjob wrote:
| I wish they worked with the mpv folks instead of shoehorning this
| in. Based on the docs it looks like getting live transcription
| for a video will involve running the demuxer/decoder on one
| thread, and this whisper filter on another thread, using ffmpeg's
| AVIO (or a REST API [1]... _shudders_) to synchronize those
| two parallel jobs. It could have been way simpler.
|
| Other than for the "live transcription" use case (which they made
| unnecessarily complicated), I don't see how this is any better
| than running Whisper.cpp directly. Other people in this thread
| are basically saying "ffmpeg's interface is better understood"
| [2] but LLMs make that point moot since you can just ask them to
| do the drudgery for you.
|
| [1] https://medium.com/@vpalmisano/run-whisper-audio-
| transcripti...
|
| [2] https://news.ycombinator.com/item?id=44890067
| superkuh wrote:
| "Making sure you're not a bot!" with no way to get to the actual
| document that is supposed to be at the URL. Anubis can be
| configured to be accessible for people without the latest
| computers by using the meta-refresh proof of work, but very few
| people take the time to configure it and just deploy the
| defaults. Just like with Cloudflare.
|
| That said, I suppose I'm glad they're concentrating on making the
| ffmpeg code better rather than fixing bugs in the web interface
| for the development tracker. Having whisper integrated will be
| really useful. I'm already imagining automatic subtitle
| generation... imagining because I can't read the page or the code
| to know what it is.
| sorenjan wrote:
| I hope this is the start of more ML filters in ffmpeg. They added
| the sr (super resolution) filter years ago, but it's old and it's
| difficult to get the weights so you can run it, since they're not
| included. They have added support for multiple inference
| libraries like libtorch, but again, it's difficult to even get
| started. Hopefully they can get behind a consistent ML strategy,
| ideally with a "models" directory with ready to use models for
| upscaling, temporal upscaling, noise cancelling, etc. A lot of
| audio and video filter research use ML now, new codecs will
| probably also use it soon.
| manca wrote:
| The only problem with this PR/diff is that it creates just an
| avfilter wrapper around the whisper.cpp library and requires the
| user to manage the dependencies on their own. This is not helpful
| for novice users, who will first need to:
|
| 1. git clone whisper.cpp
|
| 2. Make sure they have all dependencies for `that` library
|
| 3. Hope the build passes
|
| 4. Download the actual model
|
| AND only then be able to use `-af "whisper=model...` filter.
|
| If they try to use the filter without all the prereqs they'll
| fail and it'll create frustration.
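|
| For reference, the prereq dance is roughly this (a sketch from
| memory; the exact cmake and configure flags may differ):
|
|     git clone https://github.com/ggerganov/whisper.cpp
|     cmake -S whisper.cpp -B whisper.cpp/build && cmake --build whisper.cpp/build -j
|     sudo cmake --install whisper.cpp/build
|     ./whisper.cpp/models/download-ggml-model.sh base.en
|     # then rebuild ffmpeg with the filter enabled, e.g. ./configure --enable-whisper && make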
|
| It'd be better to natively create a Whisper avfilter and only
| require the user to download the model -- I feel like this would
| streamline the whole process and actually make people use it much
| more.
| slhck wrote:
| While that would be nicer from an end-user perspective, it's
| something hard to maintain for FFmpeg itself. Consider the
| velocity of the whisper-cpp project. I'm sure that - just like
| with filters such as vmaf, which also require building a
| dependency and downloading a model - precompiled versions will
| become available for novice users to directly download.
| Especially considering whisper-cpp is MIT-licensed.
| cheerioty wrote:
| OH: "New changelog entries go to the bottom, @vpalmisano ..
| Didn't I tell you this once?"
| igorguerrero wrote:
| Aww, I literally just implemented this using whisper.cpp and
| ffmpeg lib, code is even similar...
| jd3 wrote:
| took me longer than i'd care to admit to figure out how to
| install whisper as a user/system package on macOS w/o brew (which
| pulls in all of llvm@16 during install):
|
|     brew install uv
|     uv tool install openai-whisper
|
| then add ~/.local/bin/ to $PATH
| hbn wrote:
| I wonder if Apple's upcoming speech APIs can be added too. Would
| be cool to have it just work out of the box on Macs, without
| needing to source a model.
|
| https://developer.apple.com/documentation/speech/speechtrans...
|
| https://developer.apple.com/documentation/speech/speechanaly...
|
| https://www.macstories.net/stories/hands-on-how-apples-new-s...
| XCSme wrote:
| Unrelated, but can I use Whisper in DaVinci Resolve to
| automatically transcribe my videos and add subs?
| cadamsdotcom wrote:
| Unrelated, but why isn't Europe a country already. It's been
| ages!
___________________________________________________________________
(page generated 2025-08-13 23:00 UTC)