[HN Gopher] FFmpeg 8.0 adds Whisper support
       ___________________________________________________________________
        
       FFmpeg 8.0 adds Whisper support
        
       Author : rilawa
       Score  : 766 points
       Date   : 2025-08-13 10:19 UTC (12 hours ago)
        
 (HTM) web link (code.ffmpeg.org)
 (TXT) w3m dump (code.ffmpeg.org)
        
       | ggap wrote:
       | Very interesting to see this!
        
       | zzsshh wrote:
       | Does this finally enable dynamically generating subtitles for
       | movies with AI?
        
         | diggan wrote:
         | Finally? I think VLC demo'd this a while ago at some conference
         | where they had a table, if I remember correctly.
        
           | SSLy wrote:
           | VLC and ffmpeg are unrelated projects
        
             | demurgos wrote:
             | I'm not very familiar with them, but I always assumed that
             | there is a lot of overlap between the maintainers of both
             | projects.
        
               | SSLy wrote:
               | Well, they are just unrelated. VLC has a plugin to access
               | ffmpeg codecs via libav*, that's about it.
        
               | guipsp wrote:
                | They are not completely unrelated. There is significant
                | overlap. FFmpeg also uses libs from VLC.
        
           | mmmpetrichor wrote:
            | I've been waiting a while now for automatically translated
            | subtitles in VLC. I thought it would be here by now. I'm
            | probably underestimating the difficulty, but I'm surprised
            | some video player hasn't done it yet (as far as I know).
        
             | jeroenhd wrote:
             | A lot of subtitles from commercial media use a subtitle
             | format that's essentially a bitmap that the video player
             | overlays on top of the video. There are tools to decode
             | this using OCR, but it's not something I'd enable by
             | default.
             | 
             | For text/srt subtitles, translation would probably be
             | easier. There's a plugin for that already if you're okay
             | with online translation services:
             | https://github.com/nopium/vlc-trans-lua
        
         | jeroenhd wrote:
          | Docs say:
          | 
          |   destination
          |     If set, the transcription output will be sent to the
          |     specified file or URL (use one of the FFmpeg AVIO
          |     protocols); otherwise, the output will be logged as info
          |     messages. The output will also be set in the
          |     "lavfi.whisper.text" frame metadata. If the destination
          |     is a file and it already exists, it will be overwritten.
          | 
          |   format
          |     The destination format string; it could be "text" (only
          |     the transcribed text will be sent to the destination),
          |     "srt" (subtitle format) or "json". Default value: "text"
         | 
         | I don't know if this can embed the subtitles, but it does
         | support generating accompanying srt files.
         | 
         | Of course, you could already do that by just manually calling
         | whisper on files, but now you don't need to export parts or
         | transformed media files to feed into whisper.
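          | 
          | For example, something like this should write an srt next to
          | the source (a sketch based on the docs; ggml-base.en.bin
          | stands in for whatever whisper.cpp model file you have
          | downloaded):
          | 
          |   FILTER="aformat=sample_rates=16000:channel_layouts=mono"
          |   FILTER="$FILTER,whisper=model=ggml-base.en.bin"
          |   FILTER="$FILTER:destination=out.srt:format=srt"
          |   ffmpeg -i input.mp4 -vn -af "$FILTER" -f null -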
        
         | regularfry wrote:
         | If you have enough processing power. Without a GPU it's going
         | to lag.
        
           | KeplerBoy wrote:
           | Whisper is pretty fast.
        
           | jeroenhd wrote:
           | In my experience, a small/tiny whisper model has pretty okay
           | English decoding speed on something relatively modern even
           | without GPU support. There's a bunch of latency in the
           | process (because of technological limitations) but the
           | optimised C++ version shouldn't pose too much of a problem
           | unless you're running in power saving mode. Battery life may
           | be a problem on older laptops, though.
        
       | boutell wrote:
       | Shut off the broken bot filter so we can read it please
        
         | diggan wrote:
          | Took my iPhone 12 Mini all of 0.1 seconds to pass it. What
         | hardware/OS are you using?
        
           | johnisgood wrote:
           | Took me 8 seconds on my shitty desktop.
        
           | londons_explore wrote:
           | Took about 30 secs for me (5 yr old intel cpu). Looked like
           | there was a progress bar, but it didn't progress. Maybe the
           | difficulty varies depending on IP address?
        
             | jeroenhd wrote:
             | Anubis has config for that:
             | https://anubis.techaro.lol/docs/admin/policies#request-
             | weigh...
             | 
             | It's up to the site admin to configure it that way, but
             | it's possible some IP ranges/user agents are more often
             | used by bots and therefore have an increased weight.
             | 
             | For old browsers there's also an option to use meta refresh
             | instead of JS (https://anubis.techaro.lol/docs/admin/config
             | uration/challeng...) but that's quite a recent addition and
             | not enabled by default.
        
             | diggan wrote:
             | > Maybe the difficulty varies depending on IP address?
             | 
             | I'm currently roaming in Finland with a Spanish SIM so
             | would have expected the opposite in that case.
        
             | ta1243 wrote:
              | My i5-6200U with Firefox/Linux is about 10 years old. I
              | use a variety of ad-blocking and fingerprint-blocking
              | techniques. Cloudflare often complains and blocks me.
              | 
              | This page loaded pretty much instantly (certainly in the
              | time it took to switch to the background tab I loaded it
              | in). But then ffmpeg is written by old-school engineers
              | with old-school ways of working. Their social media
              | accounts are a hilarity of trolling worthy of Slashdot at
              | its peak.
        
           | politelemon wrote:
            | Took me zero seconds to be blocked with an "invalid
            | response".
        
             | miloignis wrote:
             | It also instantly blocks me on GrapheneOS, both Firefox and
             | Vanadium. Very odd, as I've never had an issue with Anubis
             | before.
        
               | shaky-carrousel wrote:
               | GrapheneOS here, with Vanadium in incognito, it doesn't
               | block me, both in wifi and in mobile. Maybe it was a
               | temporary hiccup.
        
               | miloignis wrote:
                | Thanks for checking! Incognito blocks me too; no idea
                | what's up. Maybe I'm getting tripped up by IP reputation
                | or something (though I shouldn't be, it's a normal
                | residential connection).
        
           | blahyawnblah wrote:
            | The stock Chrome browser that Google News uses.
        
         | jeroenhd wrote:
         | Check out commit 13ce36fef98a3f4e6d8360c24d6b8434cbb8869b from
         | https://git.ffmpeg.org/ffmpeg.git if your web browser doesn't
         | support Javascript. The linked page is just a git viewer for
         | that specific commit.
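          | 
          | For example:
          | 
          |   git clone https://git.ffmpeg.org/ffmpeg.git
          |   cd ffmpeg
          |   git show 13ce36fef98a3f4e6d8360c24d6b8434cbb8869b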
        
           | yorwba wrote:
           | Or read the documentation for the new whisper filter:
           | https://ffmpeg.org/ffmpeg-filters.html#whisper-1
        
             | jeroenhd wrote:
             | That also works, I assumed the ffmpeg website would also be
             | behind Anubis if the git server is, but it doesn't actually
             | seem to be.
        
               | majewsky wrote:
               | Anubis is not all that useful for static websites since
               | serving them does not generate high load (unlike when a
               | bot traverses a Git server UI).
        
         | QuantumNomad_ wrote:
         | Archived snapshots of the linked page:
         | 
         | https://web.archive.org/web/20250813104007/https://code.ffmp...
         | 
         | https://archive.is/dmj17
         | 
         | You can read it on one of these without having to pass that
         | specific bot check
        
         | majewsky wrote:
         | From experience, these bot filters are usually installed
         | because the site would be down entirely without rejecting AI
         | scrapers, so the argument to shut it off to improve usability
         | is rather silly.
        
         | superkuh wrote:
         | They don't need to shut off Anubis, they just need to configure
         | it beyond the defaults. If they turned on the meta-refresh
         | based challenge then all browsers could access it while still
         | keeping most of the bots away. But few people ever configure
         | these things and just accept the broken defaults.
         | 
         | With the current broken default config my browser can't even
         | run the JS challenge due to it using unsupported bleeding edge
         | JS features.
        
           | xena wrote:
           | Hi, can you please paste the error message you get? This
           | should be using features that are supported widely as of 2022
           | and I regularly test on Firefox LTS.
        
       | kwar13 wrote:
       | Fantastic! I am working on a speech-to-text GNOME extension that
       | would immensely benefit from this.
       | 
       | https://github.com/kavehtehrani/gnome-speech2text
        
         | dotancohen wrote:
         | Why is this a Gnome extension? I would love to use this in KDE.
        
           | guipsp wrote:
           | Likely because they are a GNOME user and the APIs are DE
           | specific.
        
       | lawik wrote:
       | I wonder if they'll be satisfied there or add a chunk of others
       | now that they've started. Parakeet is supposed to be good?
       | 
       | Should they add Voice Activity Detection? Are these separate
       | filters or just making the whisper filter more fancy?
        
         | shrx wrote:
         | Voice Activity Detection support is already included.
        
       | voxadam wrote:
       | Am I correct in understanding that Whisper is a speech
       | recognition AI model originally created by OpenAI?
       | 
       | https://en.wikipedia.org/wiki/Whisper_(speech_recognition_sy...
        
         | acidburnNSA wrote:
         | Yes, according to the comments in the patch, you are correct.
        
         | kwar13 wrote:
         | yes.
        
         | johnisgood wrote:
         | Yes.
         | 
         | From the documentation:
         | 
         | > It runs automatic speech recognition using the OpenAI's
         | Whisper model.
        
           | voxadam wrote:
            | Thanks, I was being tripped up by DDoS protection on
            | code.ffmpeg.org for a minute and couldn't read the patch.
            | The combo of Firefox and the fact that
            | Quantum/Lumen/CenturyLink seems to get off on rotating my
            | dynamic IP for no reason occasionally triggers various DDoS
            | protection schemes.
        
             | johnisgood wrote:
             | No problem. :) Yeah, it took me 8 seconds to get through.
             | It seems your issue was worse.
        
         | Maxious wrote:
          | Yep, there's a C++ implementation to run it:
         | https://github.com/ggml-org/whisper.cpp
        
           | oezi wrote:
           | Isn't WhisperX the canonical choice for running Whisper?
        
             | sampullman wrote:
             | Maybe for running locally? whisper.cpp is nice because you
             | can embed it pretty easily in apps for various targets like
             | iOS, OSX, Android, wasm, etc.
        
             | 0points wrote:
              | While whisper and whisperX are Python implementations,
              | whisper.cpp wins the benchmarks.
        
         | AlienRobot wrote:
         | I think so, if I remember correctly PotPlayer also supports it
         | for automatic subtitling.
        
         | cess11 wrote:
         | Kind of, it's a family of audio transcription models.
         | 
         | https://huggingface.co/search/full-text?q=whisper
        
       | londons_explore wrote:
       | Does this have the ability to edit historic words as more info
       | becomes available?
       | 
        | E.g. if I say "I scream", it sounds phonetically identical to
        | "Ice cream".
       | 
       | Yet the transcription of "I scream is the best dessert" makes a
       | lot less sense than "Ice cream is the best dessert".
       | 
        | Doing this seems necessary to have both low latency _and_ high
        | accuracy. Transcription on Android does this; you can see the
        | guesses adjust as you talk.
        
         | ph4evers wrote:
          | Whisper works on 30-second chunks. So yes, it can do that,
          | and that's also why it can hallucinate quite a bit.
        
           | jeroenhd wrote:
            | The ffmpeg code seems to default to three-second chunks
            | (https://ffmpeg.org/ffmpeg-filters.html#whisper-1):
            | 
            |   queue
            |     The maximum size that will be queued into the filter
            |     before processing the audio with whisper. Using a small
            |     value the audio stream will be processed more often,
            |     but the transcription quality will be lower and the
            |     required processing power will be higher. Using a large
            |     value (e.g. 10-20s) will produce more accurate results
            |     using less CPU (as using the whisper-cli tool), but the
            |     transcription latency will be higher, thus not useful
            |     to process real-time streams. Consider using the
            |     vad_model option associated with a large queue value.
            |     Default value: "3"
        
             | londons_explore wrote:
             | so if "I scream" is in one chunk, and "is the best dessert"
             | is in the next, then there is no way to edit the first
             | chunk to correct the mistake? That seems... suboptimal!
             | 
             | I don't think other streaming transcription services have
             | this issue since, whilst they do chunk up the input, past
             | chunks can still be edited. They tend to use "best of N"
             | decoding, so there are always N possible outputs, each with
             | a probability assigned, and as soon as one word is the same
             | in all N outputs then it becomes fixed.
             | 
             | The internal state of the decoder needs to be duplicated N
             | times, but that typically isn't more than a few kilobytes
             | of state so N can be hundreds to cover many combinations of
             | ambiguities many words back.
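              | 
              | In Python-ish pseudocode, the "fix a word once all N
              | outputs agree" step is roughly this (a sketch, not any
              | particular service's API):
              | 
              |   def committed_prefix(hypotheses):
              |       # hypotheses: N candidate transcripts, as lists of
              |       # words, one per live decoder state
              |       fixed = []
              |       for words in zip(*hypotheses):
              |           if all(w == words[0] for w in words):
              |               fixed.append(words[0])  # unanimous: emit it
              |           else:
              |               break  # still ambiguous from here onwards
              |       return fixed
              | 
              |   committed_prefix([
              |       ["ice", "cream", "is", "the", "best"],
              |       ["I", "scream", "is", "the", "best"],
              |   ])  # -> [], nothing is fixed until audio disambiguates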
        
               | miki123211 wrote:
               | The right way to do this would be to use longer,
               | overlapping chunks.
               | 
                | E.g. do transcription every 3 seconds, but transcribe
               | the most recent 15s of audio (or less if it's the
               | beginning of the recording).
               | 
               | This would increase processing requirements
               | significantly, though. You could probably get around some
               | of that with clever use of caching, but I don't think any
               | (open) implementation actually does that.
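                | 
                | Roughly (a sketch; read_chunk and transcribe are
                | hypothetical stand-ins for an audio source and a
                | whisper binding):
                | 
                |   from collections import deque
                | 
                |   def live_captions(read_chunk, transcribe):
                |       window = deque(maxlen=5)  # 5 x 3s = 15s of audio
                |       while True:
                |           window.append(read_chunk())  # next 3s chunk
                |           # re-decode the whole window; later passes
                |           # can revise words emitted by earlier ones
                |           yield transcribe(b"".join(window))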
        
               | superluserdo wrote:
               | I basically implemented exactly this on top of whisper
               | since I couldn't find any implementation that allowed for
               | live transcription.
               | 
               | https://tomwh.uk/git/whisper-chunk.git/
               | 
               | I need to get around to cleaning it up but you can
               | essentially alter the number of simultaneous overlapping
               | whisper processes, the chunk length, and the chunk
               | overlap fraction. I found that the `tiny.en` model is
               | good enough with multiple simultaneous listeners to be
               | able to have highly accurate live English transcription
               | with 2-3s latency on a mid-range modern consumer CPU.
        
               | dylan604 wrote:
                | If real-time transcription is so bad, why force it to
                | be real-time? What happens if you give it a 2-3 second
                | delay? That's pretty standard in live captioning. I get
                | real-time being the ultimate goal, but we're not there
                | yet. So, working within the current limitations: is
                | piss-poor transcription in real time really more
                | desirable than better transcription with a 2-3 second
                | delay?
        
               | llarsson wrote:
               | Attention is all you need, as the transformative paper
               | (pun definitely intended) put it.
               | 
               | Unfortunately, you're only getting attention in 3 second
               | chunks.
        
               | no_wizard wrote:
               | That's because at the end of the day this technology
               | doesn't "think". It simply holds context until the next
               | thing without regard for the previous information
        
               | abdullahkhalids wrote:
               | Which other streaming transcription services are you
               | referring to?
        
               | londons_explore wrote:
               | Googles speech to text API:
               | https://cloud.google.com/speech-to-text/docs/speech-to-
               | text-...
               | 
               | The "alternatives" and "confidence" field is the result
               | of the N-best decodings described elsewhere in the
               | thread.
        
               | jeroenhd wrote:
                | I don't know of an LLM that does context-based
                | rewriting of interpreted text.
               | 
               | That said, I haven't run into the icecream problem with
               | Whisper. Plenty of other systems fail but Whisper just
               | seems to get lucky and guess the right words more than
               | anything else.
               | 
               | The Google Meet/Android speech recognition is cool but
               | terribly slow in my experience. It also has a tendency to
               | over-correct for some reason, probably because of the
               | "best of N" system you mention.
        
           | 0points wrote:
           | So, yes, and also no.
        
           | anonymousiam wrote:
           | Whisper is excellent, but not perfect.
           | 
           | I used Whisper last week to transcribe a phone call. In the
           | transcript, the name of the person I was speaking with (Gem)
           | was alternately transcribed as either "Jim" or "Jem", but
           | never "Gem."
        
             | JohnKemeny wrote:
                | Whisper supports adding context, and if you're
                | transcribing a phone call, you should probably add
                | _"Transcribe this phone call with Gem"_ as the prompt,
                | in which case it would probably transcribe the name
                | correctly.
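                | 
                | E.g. with the reference openai-whisper CLI (the model
                | size here is an arbitrary choice):
                | 
                |   whisper call.wav --model small \
                |       --initial_prompt "A phone call with Gem."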
        
               | ctxc wrote:
               | Thanks John Key Many!
        
             | t-3 wrote:
             | That's at least as good as a human, though. Getting to
             | "better-than-human" in that situation would probably
             | require lots of potentially-invasive integration to allow
             | the software to make correct inferences about who the
             | speakers are in order to spell their names correctly, or
             | manually supplying context as another respondent mentioned.
        
               | anonymousiam wrote:
                | When she told me her name, I didn't ask her to repeat
                | it, and I got it right through the rest of the call.
                | Whisper didn't, so how is this "at least as good as a
                | human"?
        
               | t-3 wrote:
                | I wouldn't expect any transcriber to know that the
                | correct spelling in your case used a G rather than a J;
                | the J is far more common in my experience. "Jim" would
                | be an aberration that could be improved, but
                | substituting "Jem" for "Gem" without any context to
                | suggest the latter would be just fine IMO.
        
         | shaunpud wrote:
         | I Scream in the Sun
         | https://carmageddon.fandom.com/wiki/I_Scream_in_the_Sun
        
         | DiogenesKynikos wrote:
         | This is what your brain does when it processes language.
         | 
         | I find that in languages I don't speak well, my ability to
         | understand degrades much more quickly as the audio quality goes
         | down. But in my native language, even with piss poor audio
         | quality, my brain fills in the garbled words with its prior
         | expectation of what those words should be, based on context.
        
           | mockingloris wrote:
            | A slight segue to this: I was made aware of the phenomenon
            | that the language you think in sets the constraints on how
            | expansively your brain can think and parse information.
            | 
            | Fortunately I think in English, an ever-evolving language
            | that expands as the world does. That is compared to the
            | majority of people where I'm from; English was a second
            | language they had to learn, and the people that taught
            | them weren't well equipped with the resources to do a good
            | job.
            | 
            | |
            | 
            | +-- Dey well; Be well
        
             | cyphar wrote:
                | This is called linguistic relativity (née the
                | Sapir-Whorf hypothesis), and the strong form you
                | describe has fallen out of favour in modern
                | linguistics.
             | 
             | A surprising number of monolingual people think their own
             | language is the most adaptable and modern language, but
             | this is obviously untrue. All languages evolve to fit the
             | needs of speakers.
             | 
             | Also, the idea that people "think in language X" is heavily
             | disputed. One obvious counterargument is that most people
             | have experienced the feeling of being unable to express
             | what they are thinking into words -- if you truly did think
             | in the language you speak, how could this situation happen?
                | My personal experience is that I do not actively hear
                | any language in my head unless I actively try to think
                | about it (at least, not since I was a teenager).
             | 
             | (This is all ignoring the comments about ESL speakers that
             | I struggle to read as anything but racism. As someone who
             | speaks multiple languages, it astounds me how many people
             | seem to think that struggling to express something in your
             | non-native language means that you're struggling to think
             | and are therefore stupid.)
        
               | codedokode wrote:
                | My experience is that sometimes, for example, when I
                | watch a lecture in a foreign language, there are some
                | terms for which I don't know the correct translation,
                | so I cannot think about or mention them in my native
                | language, even though I understand what they mean.
        
               | numpad0 wrote:
               | > if you truly did think in the language you speak, how
               | could this situation happen?
               | 
                | As far as how it happens to me is concerned, either
                | something closer to speech than raw thoughts reports
                | back that the data in shared memory is invalid for the
                | selected language, or I find there's no text
                | representation for what I am trying to say.
                | 
                | The "raw" thoughts work with the currently active
                | language, for me, so at least for me, I just know the
                | strong Sapir-Whorf hypothesis is not even a hypothesis,
                | but a reasonable verbalization closely matching my own
                | observations.
                | 
                | I don't get why people can't accept it, even in the age
                | of LLMs. It is what it is, and that old guy is just
                | never correct, even once.
        
               | sigbottle wrote:
               | I think it's more like, you have a thought X, that has so
               | many dimensions to it, but the way you serialize it to
               | something that's actually discussable and comparable to
               | other thoughts is language. And sometimes that language
               | naturally loves slicing one part of that thought one way
               | or the other.
               | 
               | (then there's also a feedback loop type of argument, that
               | always happens when discussing any sort of perception-
               | reality distinction, but let's ignore that for now)
               | 
               | At least for me, my brain is so bad and it's hard for me
               | to truly hold a single thought in my head for a long
               | time. Maybe it eventually settles into my subconscious
               | but I don't really have a way to verify that.
        
         | lgessler wrote:
         | I recommend having a look at 16.3 onward here if you're curious
         | about this: https://web.stanford.edu/~jurafsky/slp3/16.pdf
         | 
         | I'm not familiar with Whisper in particular, but typically what
         | happens in an ASR model is that the decoder, speaking loosely,
         | sees "the future" (i.e. the audio after the chunk it's trying
         | to decode) in a sentence like this, and also has the benefit of
         | a language model guiding its decoding so that grammatical
         | productions like "I like ice cream" are favored over "I like I
         | scream".
        
         | didacusc wrote:
         | what would it make of this?
         | https://www.youtube.com/watch?v=zyvZUxnIC3k
        
         | yvdriess wrote:
         | A good opportunity to point people to the paper with my
         | favorite title of all time:
         | 
         | "How to wreck a nice beach you sing calm incense"
         | 
         | https://dl.acm.org/doi/10.1145/1040830.1040898
        
           | abound wrote:
           | For folks like me puzzling over what the correct
           | transcription of the title should be, I think it's "How to
           | recognize speech using common sense"
        
             | fiatjaf wrote:
             | Thank you very much!
        
             | strken wrote:
             | Thank you! "Calm incense" makes very little sense when said
             | in an accent where calm isn't pronounced like com.
        
               | solardev wrote:
               | How is calm pronounced in those accents?
        
               | drited wrote:
               | Cahm
        
               | solardev wrote:
               | Like the "cam" in "camera"?
        
               | strken wrote:
               | In Australian English, calm rhymes with farm and uses a
               | long vowel, while com uses a short vowel and would rhyme
               | with prom. (I know this doesn't help much because some
               | American accents also rhyme prom with farm).
               | 
               | Consider the way "Commonwealth Bank" is pronounced in
               | this news story: https://youtube.com/watch?v=MhkuHGRAAbg.
               | An Australian English speaker would consider (most)
               | Americans to be saying something like "Carmenwealth"
               | rather "Commonwealth". See also the pronunciation of dog
               | vs father in
               | https://www.goalsenglish.com/lessons/2020/5/4/australian-
               | eng....
               | 
               | It really ruins some poetry.
        
             | efilife wrote:
             | Thanks. Now I know that I'm not that stupid and this
             | actually makes no sense
        
               | chipsrafferty wrote:
               | It actually does make sense. Not saying you're stupid,
               | but in standard English, if you say it quickly, the two
               | sentences are nearly identical.
        
               | mjw_byrne wrote:
               | They're pretty different in British English, I struggled
               | to figure it out until I started thinking about how it
               | would sound with an American accent.
        
               | codedokode wrote:
               | But in "you sing", "s" is pronounced as "s", not as "z"
               | from "using", right?
        
               | squeaky-clean wrote:
               | I pronounce using with an S unless I'm saying it very
               | slowly
        
             | codedokode wrote:
             | I only got the "How to recognize" part. Also I think
             | "using" should sound more like "you zinc" than "you sing".
        
             | wdaher wrote:
             | This is the correct parsing of it. (I can't take credit for
             | coming up with the title, but I worked on the project.)
        
           | fmx wrote:
           | The paper: https://sci-
           | hub.st/https://dl.acm.org/doi/10.1145/1040830.10...
           | 
           | (Agree that the title is awesome, by the way!)
        
           | brcmthrowaway wrote:
            | Does AI voice recognition still use Markov models for this?
        
             | sva_ wrote:
             | Whisper uses an encoder-decoder transformer.
        
           | xyse53 wrote:
           | My favorite is:
           | 
           | "Threesomes, with and without blame"
           | 
           | https://dl.acm.org/doi/10.1145/1570506.1570511
           | 
           | (From a professor I worked with a bit in grad school)
        
           | ThinkingGuy wrote:
           | Also relevant: The Two Ronnies - "Four Candles"
           | 
           | https://www.youtube.com/watch?v=gi_6SaqVQSw
        
         | Fluorescence wrote:
         | It makes me curious about how human subtitlers or even
         | scriptwriters choose to transcribe intentionally ambiguous
         | speech, puns and narratively important mishearings. It's like
         | you need to subtitle what is heard not what is said.
         | 
         | Do those born profoundly deaf specifically study word sounds in
         | order to understand/create puns, rhymes and such so they don't
         | need assistance understanding narrative mishearings?
         | 
          | It must feel like a form of abstract mathematics without the
          | experiential component... but then I suspect mathematicians
          | manufacture an experiential phenomenon with their
          | abstractions, with their claims of a beauty like music...
          | hmm!
        
           | dylan604 wrote:
           | I had similar thoughts when reading Huck Finn. It's not just
           | phonetically spelled, it's much different. Almost like Twain
           | came up with a list of words, and then had a bunch of 2nd
           | graders tell him the spelling of words they had seen. I guess
           | at some point, you just get good at bad spelling?
        
             | spauldo wrote:
             | Writing in the vernacular, I believe it's called. I do
             | something like that if I'm texting.
             | 
             | The book "Feersum Endjinn" by Iain M. Banks uses something
             | like this for one of its characters to quite good effect.
        
               | dylan604 wrote:
                | Except it forces me to slow down to "decipher" the text
                | and makes the reading labored. I understand the point,
                | as it is part of the character, but it is easier to
                | understand someone speaking in that vernacular than to
                | read the forced misspellings. I definitely don't want
                | to get to the point of being good at reading it,
                | though. I wonder if this is how second-grade teachers
                | feel reading the class's schoolwork?
        
               | spauldo wrote:
               | That's true. I'm sure Twain and Banks were aware of this,
               | though. Apparently they considered the immersion to be
               | worth a little extra work on the part of the reader.
               | Whether the reader agrees is a different story.
               | 
               | I try to limit my use of it to just enough for my accent
               | and way of talking to bleed through. I don't go for full-
               | on phonetics, but I'm often "droppin' my g's and usin'
               | lotsa regional sayin's." It probably helps that the
               | people I text have the same accent I do, though.
        
           | 0cf8612b2e1e wrote:
           | The quality of subtitles implies that almost no effort is
           | being put into their creation. Watch even a high budget
           | movie/TV show and be aghast at how frequently they diverge.
        
             | smallpipe wrote:
             | A good subtitle isn't a perfect copy of what was said.
        
               | herbcso wrote:
               | Tom Scott would agree with you.
               | https://m.youtube.com/watch?v=pU9sHwNKc2c
        
               | kstrauser wrote:
               | Hard disagree. When I'm reading a transcript, I want
               | word-for-word what the people said, not a creative edit.
               | I want the speakers' voice, not the transcriptionist's.
               | 
               | And when I'm watching subtitles in my own language (say
               | because I want the volume low so I'm not disturbing
               | others), I hate when the words I see don't match the
               | words I hear. It's the quickest way I can imagine to get
               | sucked out of the content and into awareness of the
               | delivery of the content.
        
               | stavros wrote:
               | But then what about deliberate mishearings and ambiguous
               | speech, like the GP said?
        
               | crazygringo wrote:
               | I mean, subtitles are _mostly_ the same.
               | 
               | Sometimes they're edited down simply for space, because
               | there wouldn't be time to easily read all the dialog
               | otherwise. And sometimes repetition of words or phrases
               | is removed, because it's clearer, and the emphasis is
               | obvious from watching the moving image. And filler words
               | like "uh" or "um" generally aren't included unless they
               | were in the original script.
               | 
               | Most interestingly, swearing is sometimes toned down,
               | just by skipping it -- removing an f-word in a sentence
               | or similar. Not out of any kind of puritanism, but
               | because swear words genuinely come across as more
               | powerful in print than they do in speech. What sounds
               | right when spoken can sometimes look like too much in
               | print.
               | 
               | Subtitles are an art. Determining when to best time them,
               | how to split up long sentences, how to handle different
               | speakers, how to handle repetition, how to handle limited
               | space. I used to want subtitles that were perfectly
               | faithful to what was spoken. Then I actually got involved
               | in making subtitles at one point, and was very surprised
               | to discover that perfectly faithful subtitles didn't
               | actually do the best job of communicating meaning.
               | 
               | Fictional subtitles aren't court transcripts. They serve
               | the purpose of storytelling, which is the combination of
               | a visible moving image full of emotion and action, and
               | the subtitles. Their interplay is complex.
        
               | creesch wrote:
               | > When I'm reading a transcript
               | 
                | That's the thing though, subtitles _aren't intended as
                | full transcripts_. They are intended to allow a wide
               | variety of people to follow the content.
               | 
               | A lot of people read slower than they would hear speech.
               | So subtitles often need to condense or rephrase speech to
               | keep pace with the video. The goal is usually to convey
               | meaning clearly within the time available on screen. Not
               | to capture every single word.
               | 
               | If they tried to be fully verbatim, you'd either have
               | subtitles disappearing before most viewers could finish
               | reading them or large blocks of text covering the screen.
               | Subtitlers also have to account for things like
               | overlapping dialogue, filler words, and false starts,
               | which can make exact transcriptions harder to read and
               | more distracting in a visual medium.
               | 
               | I mean, yeah in your own native language I agree it sort
               | of sucks if you can still hear the spoken words as well.
               | But, to be frank, you are also the minority group here as
               | far as subtitle target audiences go.
               | 
               | And to be honest, if they were fully verbatim, I'd wager
               | you quickly would be annoyed as well. Simply because you
               | will notice how much attention they then draw, making you
               | less able to actually view the content.
        
               | iczero wrote:
               | I regularly enable YouTube subtitles. Almost always, they
               | are a 100% verbatim transcription, excluding errors from
               | auto-transcription. I am not annoyed in the slightest,
               | and in fact I very much prefer that they are verbatim.
               | 
               | If you are too slow at reading subtitles, you can either
               | slow down the video or train yourself to read faster. Or
               | you can just disable the subtitles.
        
               | creesch wrote:
               | > If you are too slow at reading subtitles, you can
               | either slow down the video or train yourself to read
               | faster. Or you can just disable the subtitles.
               | 
                | That's just tone deaf, plain and simple. I was not
                | talking about myself, or just YouTube. You are not
                | everyone else; your use case is not everyone else's
                | use case. It really isn't that difficult.
        
               | numpad0 wrote:
               | Aren't same-language subtitles supposed to be perfect
               | literal transcripts, while cross-language subtitling is
               | supposed to be compressed creative interpretations?
        
       | re wrote:
        | I've been playing with whisper to try to do local transcription
        | of long videos, but one issue I've found is that long (>15
        | second) spans without any speech tend to send it into
        | hallucination loops that it often can't recover from. I wonder
        | if, with direct integration into ffmpeg, they will be able to
        | configure it in a way that improves that situation.
        
         | 42lux wrote:
         | You usually delete silence before using something like whisper.
        
           | re wrote:
           | I've heard that, but that doesn't sound like a useful
           | approach for videos where (1) non-speech segments can have
           | plenty of other sound (music, noise) and (2) you want
           | timestamps to match up with the original video, like for
           | subtitles. But maybe there are known mitigations for both of
           | those issues that I'm not aware of. And if they do exist
           | maybe they can be included in the ffmpeg whisper integration.
        
             | miki123211 wrote:
             | By "delete", people mostly mean "detect", so that you can
             | avoid processing such segments through Whisper. There's no
             | reason to actually cut the silence out from the original
             | audio file.
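              | 
              | E.g. ffmpeg's own silencedetect filter can produce the
              | timestamps for the segments to skip (a sketch; the
              | thresholds need tuning per source):
              | 
              |   ffmpeg -i in.wav -af silencedetect=noise=-35dB:d=1.5 \
              |     -f null - 2>&1 | grep silence_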
        
           | hnlmorg wrote:
           | This is designed for real time use too. And in such cases,
           | you couldn't delete the silence before use.
        
             | 42lux wrote:
              | The ffmpeg implementation might be; the example was not.
        
         | franga2000 wrote:
          | Whisper is supposed to be used with voice activity detection,
          | and all production implementations that I've seen do that.
          | The raw model is known to make up nonsense for silence
          | because, as I understand it, it was never trained not to do
          | that, on the assumption that everyone will use VAD.
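          | 
          | The new ffmpeg filter exposes this as the vad_model option
          | (a sketch; the silero VAD ggml file is the one distributed
          | with whisper.cpp, and both file names here are assumptions):
          | 
          |   FILTER="aformat=sample_rates=16000:channel_layouts=mono"
          |   FILTER="$FILTER,whisper=model=ggml-base.en.bin:queue=10"
          |   FILTER="$FILTER:vad_model=ggml-silero-v5.1.2.bin"
          |   FILTER="$FILTER:destination=out.srt:format=srt"
          |   ffmpeg -i in.mp4 -vn -af "$FILTER" -f null -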
        
       | bondarchuk wrote:
        | Can whisper do multilingual yet? Last time I tried it on some
        | mixed Dutch/English audio, it would spit out English
        | translations for some of the Dutch text. Strange bug/feature,
        | since from all appearances it had understood the Dutch text
        | perfectly fine.
        
         | ph4evers wrote:
          | Whisper-v3 works well for multilingual audio. I tried it
          | with Dutch, German and English.
        
         | jeroenhd wrote:
         | I found that it works quite well for Dutch+English as long as
         | you use one of the larger models. But that may just be luck, I
         | imagine mixing Italian and Swedish will have very different
         | results.
        
         | guilamu wrote:
         | Whisper has been multilingual for 5 years at least.
        
           | bondarchuk wrote:
            | I know it is ostensibly multilingual; it's less than a
            | year since I tried it. But it does this thing where it
            | translates everything (or only some things) into a single
            | language regardless, with no way to turn it off.
        
             | guilamu wrote:
              | Sorry, I've been using it for French audio files for 5
              | years and never had this issue.
        
           | woodson wrote:
            | Except it was only released in September 2022 (not even 3
            | years ago).
        
         | kwar13 wrote:
         | Best for English, but I've found it pretty decent for Spanish.
        
           | MaKey wrote:
           | It's even better for some languages other than English (e. g.
           | Spanish), see: https://github.com/openai/whisper?tab=readme-
           | ov-file#availab...
        
         | clarionbell wrote:
          | I think Dutch/English is probably the worst combination for
          | this. The languages are rather close.
        
           | bondarchuk wrote:
            | I don't understand how this would happen, though. It's not
            | like it mishears a Dutch sentence as if it's English; it
            | will correctly pick up the Dutch sentence, but (since the
            | language is auto-detected as English at the start of the
            | segment) seemingly auto-translate that (correct and
            | correctly heard) Dutch text to English. All we need is a
            | way to get the Dutch text that's surely somewhere in
            | there, before the translation happens.
            | 
            | Unless it was trained end-to-end on Dutch-subtitled
            | English text?? Which might make the translation a somewhat
            | inextricable part of the model..? Does anyone know?
        
         | numpad0 wrote:
          | Isn't that a bit much for ASR models? Humans can't handle
          | simultaneous multilingual dictation tasks either; I have to
          | stop and reinitialize my ears before switching between
          | English and my primary language.
        
           | bondarchuk wrote:
           | Seems like it already has the capability somewhere in the
           | model though - see my reply to clarionbell.
        
           | cenamus wrote:
            | Isn't that exactly what interpreters do?
        
             | numpad0 wrote:
              | If they're like me, they seem to coordinate constant
              | staggered resets for sub-systems of the language
              | processing pipeline while keeping internal
              | representations of inputs in a half-text state, so that
              | input comes back out through the pipeline in the other
              | configuration.
              | 
              | That's how I anecdotally feel and interpret how my own
              | brain appears to work, so it could be different from how
              | interpreters work or how actual human brains work, but
              | as far as I can see, professional simultaneous
              | interpreters don't seem to be agnostic to the relevant
              | pairs of languages at all.
        
           | abdullahkhalids wrote:
           | In South Asia, it's quite common for people to speak a
           | combination of their local language and English. Not just
           | alternating sentences between the two languages, but in fact,
           | constructing sentences using compound phrases from the two
           | languages.
           | 
           | "Madam, please believe me, maine homework kiya ha" [I did my
           | homework].
        
       | yewenjie wrote:
       | I have recently found that parakeet from NVIDIA is way faster and
       | pretty much as correct as Whisper, but it only works with
       | English.
        
       | instagraham wrote:
       | Does this mean that any software which uses ffmpeg can now add a
       | transcription option? Audacity, Chrome, OBS etc
        
         | ks2048 wrote:
         | If they want to support it out-of-the box, they'll still have
         | to embed a model file (roughly 500 MB - 3GB, varying size and
         | quality)
        
           | einpoklum wrote:
           | Can't you point ffmpeg to a model file using some preferences
           | dialog?
        
       | Lio wrote:
        | Once local transcription is in more places, hopefully we can
        | persuade content creators not to burn bouncing subtitles into
        | their videos.
        | 
        | I've seen professionally produced recordings on dry and
        | technical subjects with good sound quality where they've
        | decided to use distracting subtitles with no way to disable
        | them.
        | 
        | It seems so unnecessary if you're not making novelty videos
        | about cats.
        | 
        | Also, local transcription allows for automatic translation,
        | and overlaying subtitles on top of an existing burnt-in set is
        | a really poor reading experience.
        
         | HPsquared wrote:
         | The other problem with burned-in subtitles is you can't change
         | the language.
        
           | rkomorn wrote:
            | True, but (as someone who not infrequently has to rewind
            | content on just about all streaming apps because it
            | decided one particular subtitle only needed to be
            | displayed for less than 200ms this time around) sometimes
            | burned-in seems like a good idea.
            | 
            | I don't understand why the problem is so pervasive (I've
            | seen it on Netflix, Viki, and Apple TV, at least) and so
            | transient.
        
             | t-3 wrote:
              | It's a newer problem IME, so I'd guess it's caused by
              | people using auto-transcription/translation tools to
              | generate subtitles. For e.g. Chinese content, I'll see
              | stuff on Viki where the original Mandarin subs are
              | formatted sanely and the English is piecemeal,
              | follow-the-audio style. I can't imagine this happening
              | any way other than use of a transcription+translation
              | tool without review.
        
               | rkomorn wrote:
               | I don't think it's an automation-related thing. It
               | happens even on big name shows on big apps.
               | 
               | I think it's a toolkit thing where some sort of event or
               | timer goes off at the wrong time and the subtitles get
               | cleared when they shouldn't. And then if you rewind and
               | replay, it doesn't happen again (because spurious
               | event/timer issue).
        
               | t-3 wrote:
               | At least with vtt and srt, the chunk of text displayed is
               | explicitly associated with a chunk of time, so something
               | like that really shouldn't be happening. Maybe there is
               | some sort of subtitle-writing on the fly like what is
               | sometimes done with transcoding video, but that would be
               | really strange for a plaintext format that is so light
               | compared to the video and audio coming with it.
        
               | rkomorn wrote:
               | > so something like that really shouldn't be happening
               | 
               | I don't disagree, yet here we are. It's got race
               | condition vibes.
               | 
               | I don't know if it's related to the TV OS (LG WebOS in
               | our case) but I guess that would be the common factor
               | since it happens across multiple apps and languages.
               | 
               | Anyway, it's quirky and occasionally annoying, but that's
               | about it. :)
        
           | LorenDB wrote:
           | The other other problem with burned-in subtitles is that they
           | normally have horrible formatting. Who wants to try to read
           | single words that only flash on-screen while they are being
           | spoken?
        
         | preisschild wrote:
         | They could also just upload those transcriptions as normal
         | closed-captioning srt subtitles...
        
           | jimkleiber wrote:
           | not all social media will show subtitles/captions tho, which
           | is the challenge. YouTube Shorts, TikTok videos, IG reels, FB
           | reels, Whatsapp statuses, and more. I think some allow cc but
           | some don't, and if someone reshares to another platform, it
           | may not be there, so some of us burn them in begrudgingly :-)
        
         | ambicapter wrote:
         | They do that because it increases "engagement", not because
         | they care about the user's experience with the subtitles.
        
           | iAMkenough wrote:
           | Also some social media platforms don't offer subtitle
           | functionality, so burned-in is the only way if you want to
           | serve your content to people that require subtitles or refuse
           | to unmute their phones while they watch from their toilet.
        
         | dzhiurgis wrote:
          | It's just so annoying how someone like Netflix offers only
          | 3-4 languages for most of its content when you can basically
          | get more for free via browser extensions (if you watch in a
          | browser).
          | 
          | Must be a union thing.
        
           | dewey wrote:
            | That Netflix, which would need to pay to license more
            | subtitles, can't compete with pirated or unlicensed auto-
            | generated subtitles shouldn't really be a surprise.
           | 
           | It's also annoying that you have to pay for Netflix when you
           | can get the same movies for free with less restrictions on a
           | pirate site.
        
         | whywhywhywhy wrote:
          | The algorithm boosts it; that's why they do it. Even if
          | every device had real-time, 100% accurate subtitling built
          | in, they'd still do it if the video performs better with it.
        
         | absoflutely wrote:
         | I think this trend is partially driven by the silent auto play
         | that happens on YouTube. Baked in subtitles help draw people
         | into the video.
        
         | jiehong wrote:
          | Those burned-in subtitles still aren't as cool as the
          | theme-matched anime subtitles during intro music sequences
          | from fansubs 15 years ago.
          | 
          | Those are still cool, IMO.
        
           | trenchpilgrim wrote:
           | Or how the fansubbers will create masks to translate diegetic
           | text like signage and written notes
        
       | zoobab wrote:
        | Not sure it will be packaged in Debian, with an external binary
        | model produced God knows how...
        
         | majewsky wrote:
         | It looks like the model file needs to be supplied at invocation
         | time, so the binary blob would not be required for packaging.
        
           | zoobab wrote:
           | so 'apt install ffmpeg' won't be enough to have the feature?
        
             | SahAssar wrote:
             | You'd have the feature, but you also need to supply the
             | model. The feature seems to just be that ffmpeg has the
             | ability to run the model, it does not include the model.
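              | 
              | Something like this, then (a sketch; assumes a build with
              | the whisper filter enabled, and the model URL is the
              | usual whisper.cpp download location):
              | 
              |   apt install ffmpeg
              |   wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin
              |   FILTER="whisper=model=ggml-base.en.bin"
              |   FILTER="$FILTER:destination=out.srt:format=srt"
              |   ffmpeg -i in.mp4 -vn -af "$FILTER" -f null -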
        
       | martzoukos wrote:
       | I guess that there is no streaming option for sending generated
       | tokens to, say, an LLM service to process the text in real-time.
        
         | nomad_horse wrote:
          | Whisper has an encoder-decoder architecture, so it's hard to
          | run it streaming efficiently, though whisper-streaming is a
          | thing.
         | 
         | https://kyutai.org/next/stt is natively streaming STT.
        
           | woodson wrote:
           | There are many streaming ASR models based on CTC or RNNT.
           | Look for example at sherpa (https://github.com/k2-fsa/sherpa-
           | onnx), which can run streaming ASR, VAD, diarization, and
           | many more.
        
       | donatj wrote:
       | I know nothing about Whisper, is this usable for automated
       | translation?
       | 
       | I own a couple very old and as far as I'm aware never translated
       | Japanese movies. I don't speak Japanese but I'd love to watch
       | them.
       | 
        | A couple years ago I was negotiating with a guy on Fiverr to
        | translate them. At his usual rate per minute of footage it
        | would have cost thousands of dollars, but I'd negotiated him
        | down to a couple hundred before he presumably got sick of me
        | and ghosted me.
        
         | poglet wrote:
         | Yep, whisper can do that. You can also try whisperx
         | (https://github.com/m-bain/whisperX) for a possibly better
         | experience with aligning of subtitles to spoken words.
        
         | _def wrote:
         | May I ask which movies? I'm just curious
        
         | trenchpilgrim wrote:
         | Whisper has quite bad issues with hallucination. It will inject
         | sentences that were never said in the audio.
         | 
         | It's decent for classification but poor at transcription.
        
           | neckro23 wrote:
            | Pre-processing with a vocal extraction model (BS-RoFormer
            | or similar) helps a lot with the hallucinations, especially
            | with poor quality sources.
        
             | trenchpilgrim wrote:
             | I'm working with fairly "clean" audio (voice only) and
             | still see ridiculous hallucinations.
        
         | prmoustache wrote:
          | My personal experience trying to transcribe (not translate)
          | was a complete failure. The thing would invent stuff. It would
          | also be completely lost when more than one language is used.
          | 
          | It also doesn't understand context, so it makes a lot of the
          | errors you see in automatic translations of youtube videos,
          | for example.
        
           | okdood64 wrote:
            | It's curious that YouTube's is still so bad given the
            | current state of the art, though it has gotten a lot better
            | in the last 6 months.
        
         | ethan_smith wrote:
         | Whisper can indeed transcribe Japanese and translate it to
         | English, though quality varies by dialect and audio clarity.
          | You'll need the "large-v3" model for best results. With
          | ffmpeg's new integration, the command looks something like
          | `ffmpeg -i movie.mp4 -vn -af whisper=model=ggml-large-v3.bin:
          | destination=movie.srt:format=srt -f null -` (the model option
          | takes the path to a whisper.cpp ggml model file).
        
           | waltbosz wrote:
           | I wonder how the results of an AI Japanese-audio-to-English-
           | subtitles would compare to a fansub-ed anime. I'm guessing it
           | would be a more literal translation vs. contextual or
           | cultural.
           | 
           | I found an interesting article about trollsubs, which I guess
            | are fansubs made with a contemptuous flair.
           | https://neemblog.home.blog/2020/08/19/the-lost-art-of-fan-
           | ma...
           | 
           | Tangent: I'm one of those people who watch movies with closed
           | captions. Anime is difficult because the subtitle track is
           | often the original Japanese-to-English subtitles and not
           | closed captions, so the text does not match the English
           | audio.
        
             | chazeon wrote:
              | I do Japanese transcription + Gemini translation. It's
              | worse than a fansub, but it's much, much better than
              | nothing. The first thing that can struggle is actually the
              | VAD; then special names and places, where prompting can
              | help but not always. Finally there's uniformity (or style):
              | I still feel that I can't control the punctuation well.
        
             | numpad0 wrote:
             | I was recently just playing around with Google Cloud ASR as
             | well as smaller Whisper models, and I can say it hasn't
             | gotten to that point: Japanese ASRs/STTs all generate final
             | kanji-kana mixed text, and since kanji:pronunciation is n:n
              | maps, it's non-trivial enough that it currently needs hands
             | from human native speakers to fix misheard texts in a lot
             | of cases. LLMs should be theoretically good at this type of
             | tasks, but they're somehow clueless about how Japanese
             | pronunciation works, and they just rubber-stamp inputs as
             | written.
             | 
             | The conversion process from pronunciation to intended text
             | is not deterministic either, so it probably can't be solved
             | by "simply" generating all-pronunciation outputs. Maybe a
             | multimodal LLM as ASR/STT, or a novel dual input as-
             | spoken+estimated-text validation model could be made? I
             | wouldn't know, though. It seemed like a semi-open question.
        
         | neckro23 wrote:
         | In my experience it works ok. The "English" model actually
         | knows a lot of languages and will translate directly to
         | English.
         | 
         | You can also transcribe it to Japanese and use a translator to
         | convert to English. This can sometimes help for more
         | semantically complex dialogue.
         | 
         | For example, using faster-whisper-xxl [1]:
         | 
          | Direct translation:
          | 
          |     faster-whisper-xxl.exe --language English --model large-v2 --ff_vocal_extract mdx_kim2 --vad_method pyannote_v3 --standard <input>
          | 
          | Use Japanese, then translate:
          | 
          |     faster-whisper-xxl.exe --language Japanese --task translate --model large-v2 --ff_vocal_extract mdx_kim2 --vad_method pyannote_v3 --standard <input>
         | 
         | 1. https://github.com/Purfview/whisper-standalone-win
        
         | BetterWhisper wrote:
         | Hey, indeed Whisper can do the transcription of Japanese and
         | even the translation (but only to English). For the best
         | results you need to use the largest model which depending on
         | your hardware might be slow or fast.
         | 
          | Another option is to use something like VideoToTextAI, which
          | allows you to transcribe it fast, translate it into 100+
          | languages, and then export the subtitle (SRT) file.
        
       | mockingloris wrote:
        | How could one, in theory, use this to train on a new language?
        | Say for a hobby project; I have recordings of some old folks'
        | stories in my local dialect.
       | 
       | |
       | 
       | +-- Dey well; Be well
        
         | notpublic wrote:
         | https://huggingface.co/blog/fine-tune-whisper
        
       | dncornholio wrote:
       | I was expecting a lot more comments on if this is a necessary
       | feature or if this even belongs in a library like ffmpeg. I think
       | this is bloat, especially when the feature doesn't work flawless,
       | whisper is very limited.
        
         | MrGilbert wrote:
         | The only item that was discussed was that the subtitle workflow
         | does not seem to be that good, afaict:
         | 
         | https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20022#issuecomme...
        
       | kmfrk wrote:
       | Whisper is genuinely amazing - with the right nudging. It's the
       | one AI thing that has genuinely turned my life upside-down in an
       | unambiguously good way.
       | 
       | People should check out Subtitle Edit (and throw the dev some
       | money) which is a great interface for experimenting with Whisper
       | transcription. It's basically Aegisub 2.0, if you're old, like
       | me.
       | 
       | HOWTO:
       | 
       | Drop a video or audio file to the right window, then go to Video
       | > Audio to text (Whisper). I get the best results with Faster-
       | Whisper-XXL. Use large-v2 if you can (v3 has some regressions),
       | and you've got an easy transcription and translation workflow.
       | The results aren't perfect, but Subtitle Edit is for cleaning up
       | imperfect transcripts with features like Tools > Fix common
       | errors.
       | 
       | EDIT: Oh, and if you're on the current gen of Nvidia card, you
       | might have to add "--compute_type float32" to make the
       | transcription run correctly. I think the error is about an empty
       | file, output or something like that.
       | 
       | EDIT2: And if you get another error, possibly about whisper.exe,
       | iirc I had to reinstall the Torch libs from a specific index like
       | something along these lines (depending on whether you use pip or
        | uv):
        | 
        |     pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
        |     uv pip install --system torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
       | 
       | If you get the errors and the above fixes work, please type your
       | error message in a reply with what worked to help those who come
       | after. Or at least the web crawlers for those searching for help.
       | 
       | https://www.nikse.dk/subtitleedit
       | 
       | https://www.nikse.dk/donate
       | 
       | https://github.com/SubtitleEdit/subtitleedit/releases
        
         | tossit444 wrote:
          | Aegisub is still actively developed (forked), and imo the two
          | can't really be compared to one another. They can complement
          | each other, but SE is much better for actual transcription.
          | Aegisub still does the heavy lifting for typesetting and the
          | like.
        
         | pawelduda wrote:
         | Can you give an example why it made your life that much better?
        
           | shrx wrote:
           | As a hard of hearing person, I can now download any video
           | from the internet (e.g. youtube) and generate subtitles on
           | the fly, not having to struggle to understand badly recorded
           | or unintelligible speech.
        
             | dylan604 wrote:
             | IF the dialog is badly recorded or unintelligible speech,
             | how would a transcription process get it correct?
        
               | gregoryl wrote:
                | Because it can use the full set of information in the
                | audio - people with hearing difficulties cannot. Also
                | interesting: people with perfectly functional hearing,
                | but who have "software" bugs (e.g. I find it extremely
                | hard to process voices with significant background
                | noise), can also benefit :)
        
               | spauldo wrote:
               | I have that issue as well - I can hear faint noises OK
               | but if there's background noise I can't understand what
               | people say. But I'm pretty sure there's a physical issue
               | at the root of it in my case. The problem showed up after
               | several practice sessions with a band whose guitarist
               | insisted on always playing at full volume.
        
               | dylan604 wrote:
               | > I have that issue as well
               | 
               | You say issue, I say feature. It's a great way to just
               | ignore boring babbling at parties or other social
               | engagements where you're just not that engaged. Sort of
               | like selective hearing in relationships, but used on a
               | wider audience
        
               | mschuster91 wrote:
               | The definition of "unintelligible" varies by person,
               | especially by accent. Like, I got no problem with
               | understanding the average person from Germany... but
               | someone from the deep backwaters of Saxony, forget about
               | that.
        
             | 3036e4 wrote:
             | I did this as recently as today, for that reason, using
             | ffmpeg and whisper.cpp. But not on the fly. I ran it on a
             | few videos to generate VTT files.
        
           | kmfrk wrote:
           | Aside from accessibility as mentioned, you can catch up on
           | videos that are hours long. Orders of magnitude faster than
           | watching on 3-4x playback speed. If you catch up through
           | something like Subtitle Edit, you can also click on relevant
           | parts of the transcript and replay it.
           | 
           | But transcribing and passably translating everything goes a
           | long way too. Even if you can hear what's being said, it's
           | still less straining to hear when there's captions for it.
           | 
           | Obviously one important factor to the convenience is how fast
            | your computer is at transcription or translation. Personally
            | I don't currently use the features in real time, although
            | I'd like to if a great UX comes along in other software.
           | 
           | There's also a great podcast app opportunity here I hope
           | someone seizes.
        
           | 3036e4 wrote:
           | I used it like sibling commenter to get subtitles for
           | downloaded videos. My hearing is bad. Whisper seems much
            | better than YouTube's built-in auto-subtitles, so sometimes
           | it is worth the extra trouble for me to download a video just
           | to generate good subtitles and then watch it offline.
           | 
           | I also used whisper.cpp to transcribe all my hoarded podcast
           | episodes. Took days of my poor old CPU working at 100% on all
           | cores (and then a few shorter runs to transcribe new episodes
            | I have downloaded since). Worked as well as I could possibly
           | hope. Of course it gets the spelling of names wrong, but I
           | don't expect anything (or anyone) to do much better. It is
           | great to be able to run ripgrep to find old episodes on some
           | topic and sometimes now I read an episode instead of listen,
           | or listen to it with mpv with subtitles.
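            | 
            | For anyone wanting to do the same, the loop can be as simple
            | as this (a sketch; binary and model paths depend on how you
            | built whisper.cpp):
            | 
            |     for f in podcasts/*.mp3; do
            |         # whisper.cpp wants 16 kHz mono wav
            |         ffmpeg -i "$f" -ar 16000 -ac 1 -c:a pcm_s16le "${f%.mp3}.wav"
            |         # -osrt writes subtitles next to the input
            |         ./build/bin/whisper-cli -m models/ggml-base.en.bin -f "${f%.mp3}.wav" -osrt
            |     done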
        
           | joshvm wrote:
            | I don't know about _much_ better, but I like Whisper's
           | ability to subtitle foreign language content on YouTube that
           | (somehow) doesn't have auto-generated subs. For example some
           | relatively obscure comedy sketches from Germany where I'm not
           | quite fluent enough to go by ear.
           | 
           | 10 years ago you'd be searching through random databases to
           | see if someone had synchronized subtitles for the exact copy
           | of the video that you had. Or older lecture videos that don't
           | have transcripts. Many courses had to, in order to comply
           | with federal funding, but not all. And lots of international
           | courses don't have this requirement at all (for example some
           | great introductory CS/maths courses from German + Swiss
           | institutions). Also think about taking this auto generated
           | output and then generating summaries for lecture notes,
           | reading recommendations - this sort of stuff is what LLMs are
           | great at.
           | 
           | You can do some clever things like take the foreign sub, have
           | Whisper also transcribe it and then ask a big model like
           | Gemini to go line by line and check the translation to
           | English. This can include accounting for common transcription
            | errors or idiomatic differences between languages. I do it in
           | Cursor to keep track of what the model has changed and for
           | easy rollback. It's often good enough to correct mis-heard
           | words that would be garbled through a cheaper model. _And_
           | you can even query the model to ask about why a particular
           | translation was made and what would be a more natural way to
            | say the same thing. Sometimes it even figures out jokes. It's
            | not a fast or fully automatic process, but the quality can
           | be extremely good if you put some time into reviewing.
           | 
           | Having 90% of this be possible offline/open access is also
           | very impressive. I've not tried newer OSS models like Qwen3
           | but I imagine it'd do a decent job of the cleanup.
        
         | notatallshaw wrote:
         | > uv pip install --system torch torchvision torchaudio --index-
         | url https://download.pytorch.org/whl/cu118
         | 
         | uv has a feature to get the correct version of torch based on
         | your available cuda (and some non-cuda) drivers (though I
         | suggest using a venv not the system Python):
         | 
         | > uv pip install torch torchvision torchaudio --torch-
         | backend=auto
         | 
         | More details:
         | https://docs.astral.sh/uv/guides/integration/pytorch/#automa...
         | 
         | This also means you can safely mix torch requirements with non-
         | torch requirements as it will only pull the torch related
         | things from the torch index and everything else from PyPI.
        
           | xrd wrote:
           | I love uv and really feel like I only need to know "uv add"
           | and "uv sync" to be effective using it with python. That's an
           | incredible feat.
           | 
           | But, when I hear about these kinds of extras, it makes me
           | even more excited. Getting cuda and torch to work together is
            | something I have struggled with countless times.
           | 
           | The team at Astral should be nominated for a Nobel Peace
           | Prize.
        
             | eigenvalue wrote:
             | They've definitely saved me many hours of wasted time
             | between uv and ruff.
        
             | danudey wrote:
             | > "uv add"
             | 
             | One life-changing thing I've been using `uv` for:
             | 
              | System python version is 3.12:
              | 
              |     $ python3 --version
              |     Python 3.12.3
              | 
              | A script that requires a library we don't have, and won't
              | work on our local python:
              | 
              |     $ cat test.py
              |     #!/usr/bin/env python3
              | 
              |     import sys
              |     from rich import print
              | 
              |     if sys.version_info < (3, 13):
              |         print("This script will not work on Python 3.12")
              |     else:
              |         print(f"Hello world, this is python {sys.version}")
              | 
              | It fails:
              | 
              |     $ python3 test.py
              |     Traceback (most recent call last):
              |       File "/tmp/tmp/test.py", line 10, in <module>
              |         from rich import print
              |     ModuleNotFoundError: No module named 'rich'
              | 
              | Tell `uv` what our requirements are:
              | 
              |     $ uv add --script=test.py --python '3.13' rich
              |     Updated `test.py`
              | 
              | `uv` updates the script:
              | 
              |     $ cat test.py
              |     #!/usr/bin/env python3
              |     # /// script
              |     # requires-python = ">=3.13"
              |     # dependencies = [
              |     #     "rich",
              |     # ]
              |     # ///
              | 
              |     import sys
              |     from rich import print
              | 
              |     if sys.version_info < (3, 13):
              |         print("This script will not work on Python 3.12")
              |     else:
              |         print(f"Hello world, this is python {sys.version}")
              | 
              | `uv` runs the script, after installing packages and
              | fetching Python 3.13:
              | 
              |     $ uv run test.py
              |     Downloading cpython-3.13.5-linux-x86_64-gnu (download) (33.8MiB)
              |     Installed 4 packages in 7ms
              |     Hello world, this is python 3.13.5 (main, Jun 12 2025, 12:40:22) [Clang 20.1.4 ]
              | 
              | And if we run it with Python 3.12, we can see that it
              | errors:
              | 
              |     $ uv run --python 3.12 test.py
              |     warning: The requested interpreter resolved to Python 3.12.3, which is incompatible with the script's Python requirement: `>=3.13`
              |     Installed 4 packages in 7ms
              |     This script will not work on Python 3.12
              | 
              | Works for any Python you're likely to want:
              | 
              |     $ uv python list
              |     cpython-3.14.0b2-linux-x86_64-gnu                 <download available>
              |     cpython-3.14.0b2+freethreaded-linux-x86_64-gnu    <download available>
              |     cpython-3.13.5-linux-x86_64-gnu                   /home/dan/.local/share/uv/python/cpython-3.13.5-linux-x86_64-gnu/bin/python3.13
              |     cpython-3.13.5+freethreaded-linux-x86_64-gnu      <download available>
              |     cpython-3.12.11-linux-x86_64-gnu                  <download available>
              |     cpython-3.12.3-linux-x86_64-gnu                   /usr/bin/python3.12
              |     cpython-3.12.3-linux-x86_64-gnu                   /usr/bin/python3 -> python3.12
              |     cpython-3.11.13-linux-x86_64-gnu                  /home/dan/.local/share/uv/python/cpython-3.11.13-linux-x86_64-gnu/bin/python3.11
              |     cpython-3.10.18-linux-x86_64-gnu                  /home/dan/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/bin/python3.10
              |     cpython-3.9.23-linux-x86_64-gnu                   <download available>
              |     cpython-3.8.20-linux-x86_64-gnu                   <download available>
              |     pypy-3.11.11-linux-x86_64-gnu                     <download available>
              |     pypy-3.10.16-linux-x86_64-gnu                     <download available>
              |     pypy-3.9.19-linux-x86_64-gnu                      <download available>
              |     pypy-3.8.16-linux-x86_64-gnu                      <download available>
              |     graalpy-3.11.0-linux-x86_64-gnu                   <download available>
              |     graalpy-3.10.0-linux-x86_64-gnu                   <download available>
              |     graalpy-3.8.5-linux-x86_64-gnu                    <download available>
        
         | taminka wrote:
         | whisper is great, i wonder why youtube's auto generated subs
         | are still so bad? even the smallest whisper is way better than
          | google's solution? is it a licensing issue? harder to deploy at
         | scale?
        
           | briansm wrote:
            | I believe youtube still uses 40 mel-scale vectors as feature
            | data; whisper uses 80, which provides finer spectral detail
            | but is naturally more computationally intensive to process
            | (modern hardware allows for that).
        
           | ec109685 wrote:
           | You'd think they'd use the better model for at least videos
            | that have large view counts (they already do that when
           | deciding compression optimizations).
        
         | jokethrowaway wrote:
         | whisper is definitely nice, but it's a bit too slow. Having
         | subtitles and transcription for everything is great - but Nemo
         | Parakeet (pretty much whisper by nvidia) completely changed how
         | I interact with the computer.
         | 
         | It enables dictation that actually works and it's as fast as
         | you can think. I also have a set of scripts which just wait for
         | voice commands and do things. I can pipe the results to an LLM,
         | run commands, synthesize a voice with F5-TTS back and it's like
         | having a local Jarvis.
         | 
         | The main limitation is being english only.
        
           | threecheese wrote:
           | Would you share the scripts?
        
             | ec109685 wrote:
             | Or at least more details. Very cool!
        
         | throwoutway wrote:
         | I found this online demo of it:
         | https://www.nikse.dk/subtitleedit/online
        
         | codedokode wrote:
          | Kdenlive also supports auto-generating subtitles, which need
          | some editing, but it is still faster than creating them from
          | scratch. Actually I would be happy even with a simple voice
          | detector so that I don't have to set the timings manually.
        
         | hart_russell wrote:
          | Is there a way to use it to generate an srt subtitle file given
         | a video file?
        
           | prurigro wrote:
           | It generates a few formats by default including srt
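              | 
              | And with the new ffmpeg filter it should be a one-liner
              | along these lines (a sketch based on the documented
              | destination/format options; the model path is up to you):
              | 
              |     ffmpeg -i video.mp4 -vn -af whisper=model=ggml-base.en.bin:destination=video.srt:format=srt -f null -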
        
         | guluarte wrote:
          | you can install using winget or chocolatey:
          | 
          |     winget install --id=Nikse.SubtitleEdit -e
        
         | Morizero wrote:
         | You don't happen to know a whisper solution that combines
         | diarization with live audio transcription, do you?
        
           | kmfrk wrote:
           | Proper diarization still remains a white whale for me,
           | unfortunately.
           | 
           | Last I looked into it, the main options required API access
           | to external services, which put me off. I think it was
            | pyannote.audio[1].
           | 
           | [1]: https://github.com/pyannote/pyannote-audio
        
           | jduckles wrote:
            | WhisperX's diarization is great imo:
            | 
            |     whisperx input.mp3 --language en --diarize --output_format vtt --model large-v2
            | 
            | Works a treat for Zoom interviews. Diarization is sometimes
            | a bit off, but generally it's correct.
        
             | Morizero wrote:
             | > input.mp3
             | 
             | Thanks but I'm looking for live diarization.
        
         | BrunoJo wrote:
         | Subtitle Edit is great if you have the hardware to run it. If
         | you don't have GPUs available or don't want to manage the
          | servers, I built a simple-to-use and affordable API that you can
         | use: https://lemonfox.ai/
        
         | kanemcgrath wrote:
         | Subtitle edit is great, and their subtitle library libse was
         | exactly what I needed for a project I did.
        
       | JohnKemeny wrote:
       | Related, a blog article by the author of the patch:
       | 
       |  _Run Whisper audio transcriptions with one FFmpeg command_
       | 
       | https://medium.com/@vpalmisano/run-whisper-audio-transcripti...
       | 
       | Posted here, with 0 comments:
       | https://news.ycombinator.com/item?id=44869254
        
         | eXpl0it3r wrote:
         | Link is broken, full link: https://medium.com/@vpalmisano/run-
         | whisper-audio-transcripti...
        
         | NiekvdMaas wrote:
         | Correct URL: https://medium.com/@vpalmisano/run-whisper-audio-
         | transcripti...
        
       | webinar wrote:
       | I've been using FFmpeg and Whisper to record and transcribe live
       | police scanner audio for my city, and update it in real-time to a
       | live website. It works great, with the expected transcription
       | errors and hallucinations.
        
         | Xunjin wrote:
         | Is this website open? Would love to see your work :P
        
           | webinar wrote:
           | somerville.votolab.com
        
             | mkayokay wrote:
              | Looks like this is a nice case where the LLM thinks that
             | silence is "thanks for watching" which was discussed on
             | here a few days ago.
        
             | jaster wrote:
             | All the "Thanks for watching!" gave me a good chuckle.
             | 
              | Reminds me of one of my own experiences with one of the
              | Whisper models, where some random noise in the middle of
              | the conversation was translated into "Don't forget to like
              | and subscribe".
              | 
              | Really illustrates where the training data is coming from.
        
         | waltbosz wrote:
         | I wanted to do this for my local county council meetings. I
         | think in this context speaker recognition would be important.
        
       | thedangler wrote:
       | Does this whisper also do text-to-speech?
        
         | dotancohen wrote:
         | No
        
       | porridgeraisin wrote:
        | I had a small bash pipeline for doing this until now:
        | 
        |     ffmpeg -f pulse -i "$(pactl get-default-source)" -t 5 -f wav -ar 16000 -ac 1 -c:a pcm_s16le - \
        |       | ./main - \
        |       | head -2 \
        |       | tail -1 \
        |       | cut -d] -f2 \
        |       | awk '{$1=$1};1'
        | 
        | The reading-from-mic part (-f pulse, pactl...) is linux-specific;
        | the rest of it should be cross-platform. The `main` executable is
        | the whisper.cpp executable (see the whisper.cpp github readme,
        | it's just the output of `make base.en` from that).
       | 
       | Edit: -t 5 controls recording duration.
       | 
       | Oh and add 2>/dev/null to silence the debug output. I copied this
       | from a pipe that further sends it into an LLM that then looks at
       | the meaning and turns it into a variety of structured data
       | (reminders, todo items, etc) which I then....
        
         | dotancohen wrote:
         | > which I then....
         | 
         | Yes, please, go on...
        
           | porridgeraisin wrote:
            | The LLM turns my unstructured command into a structured command
           | (a limited set of commands hardcoded in the prompt) and a
           | script takes that and executes it. I have it do stuff like
           | interact with google keep/google calendar using the CLI.
            | Those are the most used actions but there's a few others. Of
           | course all actions can be scheduled.
           | 
           | The LLM can screw up now and then and output absolute
           | garbage. But I've got a knack now for figuring out what
           | prompts it's gonna be hopeless on and I manually enter those.
           | 
           | Example:
           | 
           | Saying
           | 
           | Remove makhana from shopping list
           | 
           | Ends up running the command
           | 
           | gkeep items edit shopping_list --check makhana
           | 
           | There is a direct text interface too that skips the voice
           | transcription.
           | 
            | The main thing is it does this in a background window without
           | interrupting my screen or me needing to wait for whatever
           | slow webpage to load. I had it do a few things on GitHub like
           | remind me when checks pass on PRs. You could potentially
           | connect it to various things like your amazon account to
           | check on your order, etc,.. as I write this I now realise I
           | did what basically amounts to what folks do with MCP today.
           | Maybe I should update it to use the protocol.
           | 
           | These days I have a little more idle time as a grad student
           | than I did in a tech company, and I don't really need to
           | manage home/cooking/... so I don't really use some of the
           | more complicated features. I mostly just use it to schedule
           | 1on1s with my guide and add reminders about assignments and
           | TA work and talks and my music class.
        
             | dotancohen wrote:
             | That is fascinating, thank you very much for sharing. Good
             | luck with the grad work.
        
               | porridgeraisin wrote:
               | Thank you:)
        
       | MaxikCZ wrote:
        | I tried to use whisper to generate non-english subs from english
        | audio, but wasn't able to figure it out. I know it can do english
        | subs from non-english audio, and that earlier (less precise)
        | versions could do any-language audio -> any-language subs, but
        | the latest whisper only translates to english subs.
       | 
       | Anyone found a way?
        
         | abdusco wrote:
         | I solved it by generating English subtitles, then passing those
         | to an LLM in chunks that are ~20 entries in size. Include
         | preceding and following subtitles as context for better
         | translation. Make sure to replace the timestamps with simple
         | integer ids, because LLMs like to mangle those, no matter how
         | hard you prompt.
         | 
         | I could share a python script that is working pretty reliably
         | for me.
        
           | vevoe wrote:
           | I'd love to see that script, do you have a link?
        
             | abdusco wrote:
             | https://gist.github.com/abdusco/5bd5c909547f5f9b935dbd2fb2f
             | e...
        
       | realxrobau wrote:
        | Annoyingly, something is broken with their anti-bot stuff, as it
        | keeps refusing to let me see the page.
        
       | correa_brian wrote:
       | hell yeah
        
       | pmarreck wrote:
        | Now if only it did separate speaker identification (diarization)
        
       | shmerl wrote:
       | Did ffmpeg move their bug tracker to Forgejo?
       | 
       | https://code.ffmpeg.org/FFmpeg/FFmpeg/issues
       | 
        | I still see their old one too, but the Forgejo one is nice.
        
       | de6u99er wrote:
       | That's great. How does Whisper compare to Google Gemini's
       | transcription capabilities?
        
       | mkbkn wrote:
       | How can I run Whisper or this software in Linux or Android as a
       | non-technical user?
       | 
       | Basically a simple audio-to-text for personal use?
        
         | 3036e4 wrote:
         | I don't think installing (i.e. compiling) whisper.cpp and using
         | it to do audio-to-text is very difficult. If the documentation
         | is too technical I am sure you can ask some LLM to walk you
         | through it. I have used it on Android in termux and on my
         | FreeBSD desktop computer. Would not expect any difficulties on
         | any modern Linux.
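          | 
          | Roughly (a sketch; script and binary names have moved around
          | between whisper.cpp releases):
          | 
          |     git clone https://github.com/ggml-org/whisper.cpp
          |     cd whisper.cpp
          |     cmake -B build && cmake --build build -j
          |     ./models/download-ggml-model.sh base.en
          |     ./build/bin/whisper-cli -m models/ggml-base.en.bin -f audio.wav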
        
       | iambvk wrote:
       | Is anyone able to get streaming audio to text conversion working
       | with whisper.cpp?
       | 
       | I tried several times to get this into a reasonable shape, but
        | all have been failures. If anyone has pointers, I'd really
       | appreciate it.
        
       | dotancohen wrote:
       | Why would one use FFmpeg with Whisper support, instead of using
       | Whisper directly?
        
         | lbrito wrote:
         | I run a service that does transcriptions as part of the
         | pipeline, and I use ffmpeg for other parts (such as speeding up
         | audio). Having it all on a single command might make sense for
         | some people if the costs work out.
        
           | dotancohen wrote:
           | Terrific, thank you.
        
         | 3036e4 wrote:
         | At least whisper.cpp only supports a few input formats like WAV
         | and MP3. To get subtitles for videos I always have to first run
         | ffmpeg to get an audio file and then run whisper.cpp. Guess
         | this new feature may mean that I can do it in just one step, so
         | slightly more convenient?
        
       | miladyincontrol wrote:
        | as an aside, my favorite whisper 'hack' is you can just speed up
       | audio 10x to process it 10x faster, then adjust the timings after
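        | 
        | something like this (a sketch; older ffmpeg caps atempo at 2.0
        | per instance, so chain it, and remember to multiply the
        | resulting subtitle timestamps back up by the same factor):
        | 
        |     # 2.0 * 2.0 * 2.5 = 10x speedup, resampled to whisper's 16 kHz
        |     ffmpeg -i in.mp3 -af atempo=2.0,atempo=2.0,atempo=2.5 -ar 16000 fast.wav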
        
       | yieldcrv wrote:
       | Labeling multiple people talking is something i found lacking
       | with whisper, is it better now?
        
       | WanderPanda wrote:
       | Is Whisper still SOTA 3 years later? It does not seem there is a
       | clearly better open model. Alec Radford really is a genius!
        
         | jiehong wrote:
         | NVIDIA Nemo Parakeet for English. Mistral's recent Voxtral is
         | supposed to be nice and open source
        
         | generalizations wrote:
         | Looks like there's a leaderboard:
         | https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
        
         | vitorgrs wrote:
         | 3 years later and Youtube CCs is still horrible lol
        
       | jhatemyjob wrote:
       | I wish they worked with the mpv folks instead of shoehorning this
       | in. Based on the docs it looks like getting live transcription
       | for a video will involve running the demuxer/decoder on one
       | thread, and this whisper filter on another thread, using ffmpeg's
       | AVIO (or to a REST API [1].... _shudders_ ) to synchronize those
       | two parallel jobs. It could have been way simpler.
       | 
       | Other than for the "live transcription" usecase (that they made
       | unnecessarily complicated), I don't see how this is any better
       | than running Whisper.cpp directly. Other people in this thread
       | are basically saying "ffmpeg's interface is better understood"
       | [2] but LLMs make that point moot since you can just ask them to
       | do the drudgery for you.
       | 
       | [1] https://medium.com/@vpalmisano/run-whisper-audio-
       | transcripti...
       | 
       | [2] https://news.ycombinator.com/item?id=44890067
        
       | superkuh wrote:
       | "Making sure you're not a bot!" with no way to get to the actual
       | document that is supposed to be at the URL. Anubis can be
       | configured to be accessible for people without the latest
        | computers by using the meta-refresh proof of work, but very few
        | people take the time to configure it; they just deploy the
        | defaults. Just like with cloudflare.
       | 
       | That said, I suppose I'm glad they're concentrating on making the
       | ffmpeg code better rather than fixing bugs in the web interface
       | for the development tracker. Having whisper integrated will be
       | really useful. I'm already imagining automatic subtitle
       | generation... imagining because I can't read the page or the code
       | to know what it is.
        
       | sorenjan wrote:
       | I hope this is the start of more ML filters in ffmpeg. They added
       | the sr (super resolution) filter years ago, but it's old and it's
       | difficult to get the weights so you can run it, since they're not
       | included. They have added support for multiple inference
       | libraries like libtorch, but again, it's difficult to even get
       | started. Hopefully they can get behind a consistent ML strategy,
       | ideally with a "models" directory with ready to use models for
       | upscaling, temporal upscaling, noise cancelling, etc. A lot of
       | audio and video filter research use ML now, new codecs will
       | probably also use it soon.
        
       | manca wrote:
        | The only problem with this PR/diff is that it creates just an
        | avfilter wrapper around the whisper.cpp library and requires the
        | user to manage the dependencies on their own. This is not
        | helpful for novice users, who will first need to:
       | 
       | 1. git clone whisper.cpp
       | 
       | 2. Make sure they have all dependencies for `that` library
       | 
       | 3. Hope the build passes
       | 
       | 4. Download the actual model
       | 
       | AND only then be able to use `-af "whisper=model...` filter.
       | 
       | If they try to use the filter without all the prereqs they'll
       | fail and it'll create frustration.
       | 
       | It'd be better to natively create a Whisper avfilter and only
       | require the user to download the model -- I feel like this would
       | streamline the whole process and actually make people use it much
       | more.
        
         | slhck wrote:
          | While that would be nicer from an end-user perspective, it
          | would be hard for FFmpeg itself to maintain. Consider the
         | velocity of the whisper-cpp project. I'm sure that - just like
         | with filters such as vmaf, which also require building a
         | dependency and downloading a model - precompiled versions will
         | become available for novice users to directly download.
         | Especially considering whisper-cpp is MIT-licensed.
        
       | cheerioty wrote:
       | OH: "New changelog entries go to the bottom, @vpalmisano ..
       | Didn't I tell you this once?"
        
       | igorguerrero wrote:
       | Aww, I literally just implemented this using whisper.cpp and
       | ffmpeg lib, code is even similar...
        
       | jd3 wrote:
       | took me longer than i'd care to admit to figure out how to
       | install whisper as a user/system package on macOS w/o brew (which
        | pulls in all of llvm@16 during install):
        | 
        |     brew install uv
        |     uv tool install openai-whisper
        | 
        | then add ~/.local/bin/ to $PATH
        
       | hbn wrote:
       | I wonder if Apple's upcoming speech APIs can be added too. Would
       | be cool to have it just work out of the box on Macs, without
       | needing to source a model.
       | 
       | https://developer.apple.com/documentation/speech/speechtrans...
       | 
       | https://developer.apple.com/documentation/speech/speechanaly...
       | 
       | https://www.macstories.net/stories/hands-on-how-apples-new-s...
        
       | XCSme wrote:
       | Unrelated, but can I use Whisper in DaVinci resolve to
       | automatically transcribe my videos and add subs?
        
         | cadamsdotcom wrote:
         | Unrelated, but why isn't Europe a country already. It's been
         | ages!
        
       ___________________________________________________________________
       (page generated 2025-08-13 23:00 UTC)