[HN Gopher] OpenAI charges by the minute, so speed up your audio
       ___________________________________________________________________
        
       OpenAI charges by the minute, so speed up your audio
        
       Author : georgemandis
       Score  : 393 points
       Date   : 2025-06-25 13:17 UTC (9 hours ago)
        
 (HTM) web link (george.mand.is)
 (TXT) w3m dump (george.mand.is)
        
       | georgemandis wrote:
       | I was trying to summarize a 40-minute talk with OpenAI's
       | transcription API, but it was too long. So I sped it up with
        | ffmpeg to fit within the 25-minute cap. It worked quite well (up
        | to 3x speed) and was cheaper and faster, so I wrote about it.
       | 
       | Felt like a fun trick worth sharing. There's a full script and
       | cost breakdown.
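        | 
        | A minimal sketch of the trick in Python (assuming ffmpeg is on
        | your PATH; filenames are placeholders):
        | 
        |     import subprocess
        |     
        |     def speed_up(src: str, dst: str, factor: float = 2.0) -> None:
        |         """Re-encode audio at `factor`x speed without changing pitch."""
        |         # older ffmpeg builds cap atempo at 2.0 per instance, so
        |         # chain two filters for factors above 2 (e.g. 3x = 2.0 * 1.5)
        |         af = f"atempo=2.0,atempo={factor / 2}" if factor > 2 else f"atempo={factor}"
        |         subprocess.run(
        |             ["ffmpeg", "-y", "-i", src, "-af", af,
        |              "-c:a", "aac", "-b:a", "64k", dst],
        |             check=True,
        |         )
        |     
        |     speed_up("talk.m4a", "talk-2x.m4a", 2.0)  # ~half the billable minutes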
        
         | bravesoul2 wrote:
         | You could have kept quiet and started a cheaper than openai
         | transcription business :)
        
           | behnamoh wrote:
           | Sure, but now the world is a better place because he shared
           | something useful!
        
           | 4b11b4 wrote:
            | Pre-processing the audio is still a valid biz; multiple
            | types of pre-processing might be valid.
        
           | hn8726 wrote:
           | Or openai will do it themselves for transcription tasks
        
           | ilyakaminsky wrote:
           | I've already done that [1]. A fraction of the price, 24-hour
           | limit per file, and speedup tricks like the OP's are welcome.
           | :)
           | 
           | [1] https://speechischeap.com
        
             | bravesoul2 wrote:
             | Nice. Don't expect you to spill the beans but is it doing
             | OK (some customers?)
             | 
              | Just wondering if I can build a retirement out of APIs :)
        
       | ada1981 wrote:
       | We discovered this last month.
       | 
        | There is also prob a way to send a smaller sample of audio at
        | diff speeds and compare them to get a speed optimization with no
        | quality loss, unique to each clip.
        
         | moralestapia wrote:
         | >We discovered this last month.
         | 
         | Nice. Any blog post, twitter comment or anything pointing to
         | that?
        
         | babuloseo wrote:
         | source?
        
       | brendanfinan wrote:
       | would this also work for my video consisting of 10,000 PDFs?
       | 
       | https://news.ycombinator.com/item?id=44125598
        
         | jasonjmcghee wrote:
         | I can't tell if this is a meme or not.
         | 
         | And if someone had this idea and pitched it to Claude (the
         | model this project was vibe coded with) it would be like "what
         | a great idea!"
        
       | mcc1ane wrote:
       | Longer*
        
       | simonw wrote:
       | There was a similar trick which worked with Gemini versions prior
       | to Gemini 2.0: they charged a flat rate of 258 tokens for an
       | image, and it turns out you could fit more than 258 tokens of
       | text in an image of text and use that for a discount!
        
       | heeton wrote:
       | A point on skimming vs taking the time to read something
       | properly.
       | 
        | I read a transcript + summary of that exact talk. I thought it
        | was fine but uninteresting, and I moved on.
       | 
       | Later I saw it had been put on youtube and I was on the train, so
       | I watched the whole thing at normal speed. I had a huge number of
       | different ideas, thoughts and decisions, sparked by watching the
       | whole thing.
       | 
       | This happens to me in other areas too. Watching a conference talk
       | in person is far more useful to me than watching it online with
       | other distractions. Watching it online is more useful again than
       | reading a summary.
       | 
       | Going for a walk to think about something deeply beats a 10
       | minute session to "solve" the problem and forget it.
       | 
       | Slower is usually better for thinking.
        
         | pluc wrote:
         | Seriously this is bonkers to me. I, like many hackers, hated
         | school because they just threw one-size-fits-all knowledge at
         | you and here we are, paying for the privilege to have that in
         | every facet of our lives.
         | 
         | Reading is a pleasure. Watching a lecture or a talk and feeling
         | the pieces fall into place is great. Having your brain work out
         | the meaning of things is surely something that defines us as a
         | species. We're willingly heading for such stupidity, I don't
         | get it. I don't get how we can all be so blind at what this is
         | going to create.
        
           | hooverd wrote:
           | If you're not listening to summaries of different audiobooks
           | at 2x speed in each ear you're not contentmaxing.
        
             | lovestory wrote:
             | Or just use notebookLM to convert your books into an hour
             | long podcasts /s
        
           | isaacremuant wrote:
           | > We're willingly heading for such stupidity, I don't get it.
           | I don't get how we can all be so blind at what this is going
           | to create.
           | 
            | Your doomerism and superiority don't follow from your
            | initial "I, like many hackers, don't like one-size-fits-
            | all".
            | 
            | This is literally offering you MANY sizes, and you have the
            | freedom to choose. Somehow you're pretending it's uniformity
            | pushed down on you.
           | 
           | Consume it however you want and come up with actual
           | criticisms next time?
        
           | colechristensen wrote:
           | University didn't agree with me mostly because I can't pay
           | attention to the average lecturer. Getting bored in between
           | words or while waiting for them to write means I absorbed
           | very little and had to teach myself nearly everything.
           | 
            | Audiobooks before speed tools were the worst (are they
            | _trying_ to speak extra slow?). But when I can speed things
            | up, comprehension is just fine.
        
           | bisby wrote:
           | > I, like many hackers, hated school because they just threw
           | one-size-fits-all knowledge at you
           | 
           | "This specific knowledge format doesnt work for me, so I'm
           | asking OpenAI to convert this knowledge into a format that is
           | easier for me to digest" is exactly what this is about.
           | 
           | I'm not quite sure what you're upset about? Unless you're
           | referring to "one size fits all knowledge" as simplified
           | topics, so you can tackle things at a surface level? I love
           | having surface level knowledge about a LOT of things. I
            | certainly don't have time to go deep on every topic out
           | there. But if this is a topic I find I am interested in, the
           | full talk is still available.
           | 
           | Breadth and depth are both important, and well summarized
           | talks are important for breadth, but not helpful at all for
           | depth, and that's ok.
        
         | georgemandis wrote:
         | For what it's worth, I completely agree with you, for all the
         | reasons you're saying. With talks in particular I think it's
         | seldom about the raw content and ideas presented and more about
         | the ancillary ideas they provoke and inspire, like you're
         | describing.
         | 
         | There is just _so_ much content out there. And context is
         | everything. If the person sharing it had led with some specific
         | ideas or thoughts I might have taken the time to watch and
         | looked for those ideas. But in the context it was received--a
         | quick link with no additional context--I really just wanted the
         | "gist" to know what I was even potentially responding to.
         | 
         | In this case, for me, it was worth it. I can go back and decide
         | if I want to watch it. Your comment has intrigued me so I very
         | well might!
         | 
         | ++ to "Slower is usually better for thinking"
        
         | mutagen wrote:
         | Not to discount slower speeds for thinking but I wonder if
         | there is also value in dipping into a talk or a subject and
         | then revisiting (re-watching) with the time to ponder on the
         | thoughts a little more deeply.
        
           | tass wrote:
           | This is similar to strategies in "how to read a book"
           | (Adler).
           | 
           | By understanding the outline and themes of a book (or
           | lecture, I suppose), it makes it easier to piece together
           | thoughts as you delve deeper into the full content.
        
         | conradev wrote:
         | Was it the speed or the additional information vended by the
         | audio and video? If someone is a compelling speaker, the same
         | message will be way more effective in an audiovisual format.
         | The audio has emphasis on certain parts of the content, for
         | example, which is missing from the transcript or summary
         | entirely. Video has gestural and facial cues, also often
         | utilized to make a point.
        
         | bongodongobob wrote:
          | You'd love where I work. Everything is needlessly long,
          | bloviating PowerPoint meetings that could easily be ingested
          | in a 5-minute email.
        
         | itsoktocry wrote:
         | > _Slower is usually better for thinking._
         | 
         | Yeah, I see people talking about listening to podcasts or
         | audiobooks on 2x or 3x.
         | 
         | Sometimes I set mine to 0.8x. I find you get time to absorb and
         | think. Am I an outlier?
        
       | b0a04gl wrote:
        | it's still decoding every frame and matching phonemes either way,
        | but speeding it up reduces how many seconds they bill you for. so
        | you're hacking their billing logic more than the model itself.
        | 
        | also means the longer you talk, the more you pay even if the
        | actual info density is the same. so if your voice has longer
        | pauses or you speak slowly, you may be subsidizing inefficiency.
        | 
        | makes me think maybe the next big compression is in delivery
        | cadence. just auto-optimize voice tone and pacing before sending
        | it to the LLM. feed it synthetic fast speech with no emotion,
        | just high-density words. you lose human warmth but gain 40% cost
        | savings.
        
       | timerol wrote:
       | > Is It Accurate?
       | 
       | > I don't know--I didn't watch it, lol. That was the whole point.
       | And if that answer makes you uncomfortable, buckle-up for this
       | future we're hurtling toward. Boy, howdy.
       | 
       | This is a great bit of work, and the author accurately summarizes
       | my discomfort
        
         | BHSPitMonkey wrote:
         | As if human-generated transcriptions of audio ever came with
         | guarantees of accuracy?
         | 
         | This kind of transformation has always come with flaws, and I
         | think that will continue to be expected implicitly. Far more
         | worrying is the public's trust in _interpretations_ and claims
         | of _fact_ produced by gen AI services, or at least the popular
         | idea that "AI" is more trustworthy/unbiased than humans,
         | journalists, experts, etc.
        
       | jasonjmcghee wrote:
       | Heads up, the token cost breakdown tables look white on white to
       | me. I'm in dark mode on iOS using Brave.
        
         | georgemandis wrote:
         | Should be fixed now. Thank you!
        
       | w-m wrote:
       | With transcribing a talk by Andrej, you already picked the most
       | challenging case possible, speed-wise. His natural talking speed
        | is already >=1.5x that of a normal human. He's one of those
        | people for whom you absolutely have to set your YouTube speed
        | back down to 1x to follow what's going on.
       | 
       | In the idea of making more of an OpenAI minute, don't send it any
       | silence.
       | 
        | E.g.
        | 
        |     ffmpeg -i video-audio.m4a \
        |       -af "silenceremove=start_periods=1:start_duration=0:start_threshold=-50dB:stop_periods=-1:stop_duration=0.02:stop_threshold=-50dB,apad=pad_dur=0.02" \
        |       -c:a aac -b:a 128k output_minpause.m4a -y
       | 
        | will cut the talk down from 39m31s to 31m34s, by replacing any
        | silence (with a -50dB threshold) longer than 20ms with a 20ms
        | pause. And in keeping with the spirit of your post, I only
        | measured that the input file got shorter; I didn't look at all at
        | the quality of the transcription by feeding it the shorter
        | version.
        
         | georgemandis wrote:
         | Oooh fun! I had a feeling there was more ffmpeg wizardry I
         | could be leaning into here. I'll have to try this later--thanks
         | for the idea!
        
           | w-m wrote:
           | In the meantime I realized that the apad part is nonsensical
           | - it pads the end of the stream, not at each silence-removed
           | cut. I wanted to get angry at o3 for proposing this, but then
           | I had a look at the silenceremove= documentation myself:
           | https://ffmpeg.org/ffmpeg-filters.html#silenceremove
           | 
           | Good god. You couldn't make that any more convoluted and
           | hard-to-grasp if you wanted to. You gotta love ffmpeg!
           | 
            | I now _think_ this might be a good solution:
            | 
            |     ffmpeg -i video-audio.m4a \
            |       -af "silenceremove=start_periods=1:stop_periods=-1:stop_duration=0.15:stop_threshold=-40dB:detection=rms" \
            |       -c:a aac -b:a 128k output.m4a -y
        
             | snickerdoodle12 wrote:
             | I love ffmpeg but the documentation is often close to
             | incomprehensible.
        
             | squigz wrote:
             | Out of curiosity, how might you improve those docs? They
             | seem fairly reasonable to me
        
               | w-m wrote:
               | The documentation reads like it was written by a
               | programmer who documented the different parameters to
               | their implementation of a specific algorithm. Now when
               | you as the user come along and want to use silenceremove,
               | you'll have to carefully read through this, and build
               | your own mental model of that algorithm, and then you'll
               | be able to set these parameters accordingly. That takes a
               | lot of time and energy, in this case multiple read-
               | throughs and I'd say > 5 minutes.
               | 
               | Good documentation should do this work for you. It should
               | explain somewhat atomic concepts to you, that you can
               | immediately adapt, and compose. Where it already works is
               | for the "detection" and "window" parameters, which are
               | straightforward. But the actions of trimming in the
               | start/middle/end, and how to configure how long the
               | silence lasts before trimming, whether to ignore short
               | bursts of noise, whether to skip every nth silence
               | period, these are all ideas and concepts that get mushed
               | together in 10 parameters which are called start/stop-
               | duration/threshold/silence/mode/periods.
               | 
               | If you want to apply this filter, it takes a long time to
               | build mental models for these 10 parameters. You do have
               | some example calls, which is great, but which doesn't
               | help if you need to adjust any of these - then you
               | probably need to understand them all.
               | 
               | Some stuff I stumbled over when reading it:
               | 
               | "To remove silence from the middle of a file, specify a
               | stop_periods that is negative. This value is then treated
               | as a positive value [...]" - what? Why is this parameter
               | so heavily overloaded?
               | 
               | "start_duration: Specify the amount of time that non-
               | silence must be detected before it stops trimming audio"
               | - parameter is named start_something, but it's about
               | stopping? Why?
               | 
               | "start_periods: [...] Normally, [...] start_periods will
               | be 1 [...]. Default value is 0."
               | 
               | "start_mode: Specify mode of detection of silence end at
               | start": start_mode end at start?
               | 
               | It's very clunky. Every parameter has multiple modes of
               | operation. Why is it start and stop for beginning and
               | end, and why is "do stuff in the middle" part of the end?
               | Why is there no global mode?
               | 
               | You could nitpick this stuff to death. In the end, naming
               | things is famously one of the two hard problems in
               | computer science (the others being cache invalidation and
               | off-by-one errors). And writing good documentation is
               | also very, very hard work. Just exposing the internals of
               | the algorithm is often not great UX, because then every
               | user has to learn how the thing works internally before
               | they can start using it (hey, looking at you, git).
               | 
               | So while it's easy to point out where these docs fail, it
               | would be a lot of work to rewrite this documentation from
               | the top down, explaining the concepts first. Or even
               | rewriting the interface to make this more approachable,
               | and the parameters less overloaded. But since it's hard
               | work, and not sexy to programmers, it won't get done, and
               | many people will come after, having to spend time on
               | reading and re-reading this current mess.
        
         | pragmatic wrote:
         | No not really? The talk where he babbles about OSes and
         | everyone is somehow impressed?
        
         | behnamoh wrote:
          | > His natural talking speed is already >=1.5x that of a normal
          | human. He's one of those people for whom you absolutely have
          | to set your YouTube speed back down to 1x to follow what's
          | going on.
         | 
         | I wonder if there's a way to automatically detect how "fast" a
         | person talks in an audio file. I know it's subjective and
          | different people talk at different paces, but it'd be cool to
          | kinda know when OP's trick fails (they mention 4x ruined the
          | output; maybe for karpathy that would happen at 2x).
        
           | echelon wrote:
           | > I wonder if there's a way to automatically detect how
           | "fast" a person talks in an audio file.
           | 
           | Stupid heuristic: take a segment of video, transcribe text,
           | count number of words per utterance duration. If you need
           | speaker diarization, handle speaker utterance durations
           | independently. You can further slice, such as syllable count,
           | etc.
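            | 
            | A sketch of that heuristic in Python (the segment shape here
            | is hypothetical; adapt it to whatever your transcription API
            | returns):
            | 
            |     from types import SimpleNamespace
            |     
            |     # stand-in transcript segments with .text/.start/.end in seconds
            |     segments = [SimpleNamespace(text="hello there", start=0.0, end=1.0),
            |                 SimpleNamespace(text="general kenobi", start=1.2, end=2.0)]
            |     
            |     def words_per_second(segs) -> float:
            |         words = sum(len(s.text.split()) for s in segs)
            |         speech = sum(s.end - s.start for s in segs)  # ignores gaps
            |         return words / speech if speech else 0.0
            |     
            |     rate = words_per_second(segments)         # ~2.2 words/s here
            |     speedup = min(3.0, 5.0 / max(rate, 0.1))  # aim for ~5 words/s after speedup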
        
             | nand4011 wrote:
             | https://www.science.org/doi/10.1126/sciadv.aaw2594
             | 
             | Apparently human language conveys information at around 39
             | bits/s. You could use a similar technique as that paper to
             | determine the information rate of a speaker and then
             | correct it to 39 bits/s by changing the speed of the video.
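              | 
              | A back-of-the-envelope version (the numbers are
              | illustrative, not taken from the paper):
              | 
              |     SYLLABLES_PER_SEC = 5.2    # measured from the audio, say
              |     BITS_PER_SYLLABLE = 6.0    # language-dependent info density
              |     TARGET_BITS_PER_SEC = 39.0
              |     
              |     info_rate = SYLLABLES_PER_SEC * BITS_PER_SYLLABLE  # ~31 bits/s
              |     speed_factor = TARGET_BITS_PER_SEC / info_rate     # ~1.25x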
        
           | varispeed wrote:
           | It's a shame platforms don't generally support speeds greater
           | than 2x. One of my "superpowers" or a curse is that I cannot
           | stand normal speaking pace. When I watch lectures, I always
           | go for maximum speed and that still is too slow for me. I
              | wish platforms included 4x, done properly (with minimal
              | artefacts).
        
             | lofaszvanitt wrote:
             | Robot in a human body identified :D.
        
             | mrmuagi wrote:
             | All audiobooks are like this for me. I tried it for
             | lectures but if I'm taking handwritten notes, I can't keep
             | up my writing.
             | 
                | I wonder if there are negative side effects to this,
                | though: do you notice that interacting with people who
                | speak slower requires a greater deal of patience?
        
               | colechristensen wrote:
               | No but a little. I struggle with people who repeat every
               | point of what they're saying to you several times or when
               | you say "you told me exactly this the last time we spoke"
               | they cannot be stopped from retelling the whole thing
                | verbatim. Usually in those situations, though, there
                | are potential cognitive issues, so you can only be
                | understanding.
        
               | hamburglar wrote:
               | I once attended a live talk by Leslie Lamport and as he
               | talked, I had the overwhelming feeling that something was
               | wrong, and was thinking "did he have a stroke or
               | something?" but then I realized I had just always watched
               | his lectures online and had become accustomed to
               | listening to him at 2x.
        
             | dpcx wrote:
             | https://github.com/codebicycle/videospeed has been a
             | wonderful addition for me.
        
             | narratives1 wrote:
             | I use a Chrome extension that lets you take any video
             | player (including embedded) to 10x speed. Turn most things
             | to 3-4x. It works on ads too
        
               | munch117 wrote:
               | I use a bookmarklet:
               | 
                |     javascript:void%20function(){document.querySelector(%22video,audio%22).playbackRate=parseFloat(prompt(%22Set%20the%20playback%20rate%22))}();
        
             | cookingrobot wrote:
              | There are fonts designed to be legible at really small
              | sizes. I wonder if there are voices that are especially
              | understandable at extreme speeds.
             | 
             | Could use an "auctioneer" voice to playback text at 10x
             | speed.
        
               | bbatha wrote:
               | I'm also a fast listener. I find audio quality is the
               | main differentiator in my ability to listen quickly or
               | not. A podcast recorded at high quality I can listen to
               | at 3-4x (with silence trimmed) comfortably, the second
               | someone calls in from their phone I'm getting every 4th
                | word and often need to go down to 2x or less. Mumbly
                | accents are also a driver of quality, but not as much;
                | then again, I rarely have trouble understanding difficult
                | accents IRL and almost never use subtitles on TV
                | shows/youtube to better understand the speaker. Your
                | mileage may vary.
               | 
               | I understand 4-6x speakers fairly well but don't enjoy
               | listening at that pace. If I lose focus for a couple of
               | seconds I effectively miss a paragraph of context and my
               | brain can't fill in the missing details.
        
             | seabass wrote:
             | I made a super simplistic chrome extension for this.
             | Doesn't work on all websites, but YouTube and most online
             | video courses are covered.
             | 
             | https://github.com/sebastiansandqvist/video-speed-extension
        
             | JadeNB wrote:
             | Can't you use VLC to watch almost anything streamable, and
             | then play at your desired speed?
        
           | btown wrote:
           | Even a last-decade transcription model could be used to
           | detect a rough number of syllables per unit time, and the
           | accuracy of that model could be used to guide speed-up and
           | dead-time detection before sending to a more expensive model.
           | As with all things, it's a question of whether the cost
           | savings justify the engineering work.
        
           | janalsncm wrote:
           | > I wonder if there's a way to automatically detect how
           | "fast" a person talks in an audio file
           | 
           | Transcribe it locally using whisper and output tokens/sec?
        
             | maxall4 wrote:
             | Just count syllables per second by doing an FFT plus some
             | basic analysis.
        
           | mrstone wrote:
           | > I wonder if there's a way to automatically detect how
           | "fast" a person talks in an audio file.
           | 
           | Hilbert transform and FFT to get phoneme rate would work.
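            | 
            | One way that might look (a rough sketch; thresholds untuned,
            | assumes mono float samples):
            | 
            |     import numpy as np
            |     from scipy.signal import hilbert, find_peaks
            |     
            |     def syllable_rate(audio: np.ndarray, sr: int) -> float:
            |         envelope = np.abs(hilbert(audio))  # amplitude envelope
            |         win = int(sr / 20)                 # smooth over ~50ms
            |         envelope = np.convolve(envelope, np.ones(win) / win, mode="same")
            |         peaks, _ = find_peaks(envelope,
            |                               height=0.3 * envelope.max(),  # skip quiet bumps
            |                               distance=int(0.1 * sr))       # peaks >=100ms apart
            |         return len(peaks) / (len(audio) / sr)  # syllable-ish peaks per second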
        
         | brunoborges wrote:
         | The interesting thing here is that OpenAI likely has a layer
         | that trims down videos exactly how you suggest, so they can
         | still charge by the full length while costing less for them to
         | actually process the content.
        
         | cbsmith wrote:
         | That's an amusing perspective. I really struggle with watching
         | any video at double speed, but I've never had trouble listening
         | to any of his talks at 1x. To me, he seems to speak at a
         | perfectly reasonable pace.
        
         | swyx wrote:
         | > I didn't look at all at the quality of the transcription by
         | feeding it the shorter version.
         | 
          | guys how hard is it to toss both versions into like diffchecker
          | or something haha, you're just comparing text
        
           | TimorousBestie wrote:
           | Why use diffchecker when there's a perfectly good LLM you
           | could ask right there? lol
        
       | babuloseo wrote:
        | I use the youtube trick, will share it here: upload to youtube
        | and use their built-in transcription service to turn it into
        | text for you, and then use gemini pro 2.5 to rebuild the
        | transcript.
       | 
        |     ffmpeg \
        |       -f lavfi \
        |       -i color=c=black:s=1920x1080:r=5 \
        |       -i file_you_want_transcripted.wav \
        |       -c:v libx264 \
        |       -preset medium \
        |       -tune stillimage \
        |       -crf 28 \
        |       -c:a aac \
        |       -b:a 192k \
        |       -pix_fmt yuv420p \
        |       -shortest \
        |       file_you_upload_to_youtube_for_free_transcripts.mp4
       | 
       | This works VERY well for my needs.
        
       | KTibow wrote:
       | This is really interesting, although the cheapest route is still
       | to use an alternative audio-compatible LLM (Gemini 2.0 Flash
       | Lite, Phi 4 Multimodal) or an alternative host for Whisper
       | (Deepinfra, Fal).
        
       | fallinditch wrote:
       | When extracting transcripts from YouTube videos, can anyone give
       | advice on the best (cost effective, quick, accurate) way to do
       | this?
       | 
       | I'm confused because I read in various places that the YouTube
       | API doesn't provide access to transcripts ... so how do all these
       | YouTube transcript extractor services do it?
       | 
       | I want to build my own YouTube summarizer app. Any advice and
       | info on this topic greatly appreciated!
        
         | vjerancrnjak wrote:
          | If YouTube has placed autogenerated captions, you can download
          | them free of charge with yt-dlp.
        
         | rob wrote:
         | There's a tool that uses YouTube's unofficial APIs to get them
         | if they're available:
         | 
         | https://github.com/jdepoix/youtube-transcript-api
         | 
         | For our internal tool that transcribes local city council
         | meetings on YouTube (often 1-3 hours long), we found that these
         | automatic ones were never available though.
         | 
          | (Our tool usually 'processes' the videos within ~5-30 mins of
          | being uploaded, so that's probably also why none are available
          | 'officially' yet.)
         | 
         | So we use yt-dlp to download the highest quality audio and then
         | process them with whisper via Groq, which is way cheaper
         | (~$0.02-0.04/hr with Groq compared to $0.36/hr via OpenAI's
         | API.) Sometimes groq errors out so there's built-in support for
         | Replicate and Deepgram as well.
         | 
         | We run yt-dlp on our remote Linode server and I have a Python
         | script I created that will automatically login to YouTube with
         | a "clean" account and extract the proper cookies.txt file, and
         | we also generate a 'po token' using another tool:
         | 
         | https://github.com/iv-org/youtube-trusted-session-generator
         | 
         | Both cookies.txt and the "po token" get passed to yt-dlp when
         | running on the Linode server and I haven't had to re-generate
         | anything in over a month. Runs smoothly every day.
         | 
         | (Note that I don't use cookies/po_token when running locally at
         | home, it usually works fine there.)
        
           | fallinditch wrote:
           | Very useful, thanks. So does this mean that every month or so
           | you have to create a new 'clean' YouTube account and use that
           | to create new po_token/cookies?
           | 
           | It's frustrating to have to jump through all these hoops just
           | to extract transcripts when the YouTube Data API already
           | gives reasonable limits to free API calls ... would be nice
           | if they allowed transcripts too.
           | 
           | Do you think the various YouTube transcript extractor
           | services all follow a similar method as yours?
        
         | banana_giraffe wrote:
          | You can use yt-dlp to get the transcripts. For instance, to
          | grab just the transcript of a video:
          | 
          |     ./yt-dlp --skip-download --write-sub --write-auto-sub \
          |       --sub-lang en --sub-format json3 <youtube video URL>
         | 
         | You can also feed the same command a playlist or channel URL
         | and it'll run through and grab all the transcripts for each
         | video in the playlist or channel.
        
           | fallinditch wrote:
           | That's cool, thanks for the info. But do you also have to use
           | a rotating proxy to prevent YouTube from blocking your IP
           | address?
        
             | banana_giraffe wrote:
             | Last time I ran this at scale was a couple of months ago,
             | so my information is no doubt out of date, but in my
             | experience, YouTube seems less concerned about this than
             | they are when you're grabbing lots of videos.
             | 
             | But that was a few months ago, so for all I know they've
             | tightened down more hatches since then.
        
       | topaz0 wrote:
       | I have a way that is (all but) free -- just watch the video if
       | you care about it, or decide not to if you don't, and move on
       | with your life.
        
       | Tepix wrote:
        | Why would you give up your privacy by sending what interests you
        | to OpenAI when whisper doesn't need that much compute in the
        | first place?
        | 
        | With faster-whisper (int8, batch=8) you can transcribe 13 minutes
        | of audio in 51 seconds _on CPU_.
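        | 
        | Roughly, assuming a recent faster-whisper release (the batched
        | pipeline is what the batch=8 numbers refer to):
        | 
        |     from faster_whisper import BatchedInferencePipeline, WhisperModel
        |     
        |     # int8 quantization keeps this fast on CPU
        |     model = WhisperModel("large-v3", device="cpu", compute_type="int8")
        |     batched = BatchedInferencePipeline(model=model)
        |     
        |     segments, info = batched.transcribe("talk.m4a", batch_size=8)
        |     for seg in segments:
        |         print(f"[{seg.start:.1f}s -> {seg.end:.1f}s] {seg.text}")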
        
         | anigbrowl wrote:
         | I came here to ask the same question. This is a well-solved
         | problem, red queen racing it seems utterly pointless, a symptom
         | of reflexive adversarialism.
        
           | poly2it wrote:
           | > symptom of reflexive adversarialism
           | 
            | Is there a definition for this expression? I don't follow you.
           | 
           | > ... using corporate technology for the solved problem is a
           | symptom of self-directed skepticism by the user against the
           | corporate institutions ...
           | 
           | Eh?
        
       | pimlottc wrote:
       | Appreciated the concise summary + code snippet upfront, followed
       | by more detail and background for those interested. More articles
       | should be written this way!
        
       | rob wrote:
       | For anybody trying to do this in bulk, instead of using OpenAI's
       | whisper via their API, you can also use Groq [0] which is much
       | cheaper:
       | 
       | [0] https://groq.com/pricing/
       | 
       | Groq is ~$0.02/hr with distil-large-v3, or ~$0.04/hr with
       | whisper-large-v3-turbo. I believe OpenAI comes out to like
       | ~$0.36/hr.
       | 
       | We do this internally with our tool that automatically
       | transcribes local government council meetings right when they get
       | uploaded to YouTube. It uses Groq by default, but I also added
       | support for Replicate and Deepgram as backups because sometimes
       | Groq errors out.
        
         | georgemandis wrote:
         | Interesting! At $0.02 to $0.04 an hour I don't suspect you've
         | been hunting for optimizations, but I wonder if this "speed up
         | the audio" trick would save you even more.
         | 
         | > We do this internally with our tool that automatically
         | transcribes local government council meetings right when they
         | get uploaded to YouTube
         | 
         | Doesn't YouTube do this for you automatically these days within
         | a day or so?
        
           | rob wrote:
           | > Doesn't YouTube do this for you automatically these days
           | within a day or so?
           | 
           | Oh yeah, we do a check first and use youtube-transcript-api
           | if there's an automatic one available:
           | 
           | https://github.com/jdepoix/youtube-transcript-api
           | 
           | The tool usually detects them within like ~5 mins of being
           | uploaded though, so usually none are available yet. Then
           | it'll send the summaries to our internal Slack channel for
           | our editors, in case there's anything interesting to 'follow
           | up on' from the meeting.
           | 
           | Probably would be a good idea to add a delay to it and wait
           | for the automatic ones though :)
        
           | jerjerjer wrote:
           | > I wonder if this "speed up the audio" trick would save you
           | even more.
           | 
           | At this point you'll need to at least check how much running
           | ffmpeg costs. Probably less than $0.01 per hour of audio
           | (approximate savings) but still.
        
           | ks2048 wrote:
           | > Doesn't YouTube do this for you automatically these days
           | within a day or so?
           | 
           | Last time I checked, I think the Google auto-captions were
           | noticeably worse quality than whisper, but maybe that has
           | changed.
        
         | colechristensen wrote:
         | If you have a recent macbook you can run the same whisper model
            | locally for free. People are really sleeping on how cheap
            | the compute on hardware they already own is.
        
           | rob wrote:
           | I don't. I have a MacBook Pro from 2019 with an Intel chip
           | and 16 GB of memory. Pretty sure when I tried the large
           | whisper model it took like 30 minutes to an hour to do
           | something that took hardly any time via Groq. It's been a
           | while though so maybe my times are off.
        
             | colechristensen wrote:
             | Ah, no, Apple silicon Mac required with a decent amount of
             | memory. But this kind of machine has been very common (a
             | mid to high range recent macbook) at all of my employers
             | for a long time.
        
             | fragmede wrote:
             | It's been roughly six years since that MacBook was top of
             | the line, so your times are definitely off.
        
         | pzo wrote:
          | There is also Cloudflare Workers AI where you can run whisper-
          | large-v3-turbo for around $0.03 per hour:
         | 
         | https://developers.cloudflare.com/workers-ai/models/whisper-...
        
         | abidlabs wrote:
          | You could use Hugging Face's Inference API (which supports all
          | of these API providers) directly, making it easier to switch
          | between them; e.g. look at the panel on the right on:
         | https://huggingface.co/openai/whisper-large-v3
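          | 
          | Something like this (a sketch; which providers are enabled
          | varies by model, and the provider name here is an assumption):
          | 
          |     from huggingface_hub import InferenceClient
          |     
          |     # swap the provider string without touching the rest of the code
          |     client = InferenceClient(provider="fal-ai", api_key="hf_...")
          |     result = client.automatic_speech_recognition(
          |         "talk.m4a", model="openai/whisper-large-v3"
          |     )
          |     print(result.text)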
        
       | stogot wrote:
        | Love this idea, but the accuracy section is lacking. Couldn't
        | you do a simple diff of the outputs and see how many differences
        | there are? 0.5% or 5%?
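        | 
        | E.g. a quick-and-dirty check (not a proper WER; filenames are
        | hypothetical):
        | 
        |     import difflib
        |     
        |     def percent_diff(a: str, b: str) -> float:
        |         """Crude word-level difference between two transcripts."""
        |         sm = difflib.SequenceMatcher(None, a.lower().split(), b.lower().split())
        |         return (1.0 - sm.ratio()) * 100
        |     
        |     normal = open("transcript-1x.txt").read()
        |     fast = open("transcript-2x.txt").read()
        |     print(f"~{percent_diff(normal, fast):.1f}% of words differ")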
        
         | georgemandis wrote:
         | Yeah, I'd like to do a more formal analysis of the outputs if I
         | can carve out the time.
         | 
         | I don't think a simple diff is the way to go, at least for what
         | I'm interested in. What I care about more is the overall
         | accuracy of the summary--not the word-for-word transcription.
         | 
          | The test I want to set up is using LLMs to evaluate the
          | summarized output and see if the primary themes/topics persist.
          | That's more interesting and useful to me for this exercise.
        
       | tmaly wrote:
       | The whisper model weights are free. You could save even more by
       | just using them locally.
        
         | pzo wrote:
          | but this is still a great trick if you want to reduce latency
          | or inference time, even with local models, e.g. in a realtime
          | chatbot
        
       | 55555 wrote:
       | This seems like a good place for me to complain about the fact
       | that the automatically generated subtitle files Youtube creates
       | are horribly malformed. Every sentence is repeated twice. In many
       | subtitle files, the subtitle timestamp ranges overlap one another
       | while also repeating every sentence twice in two different
       | ranges. It's absolutely bizarre and has been like this for years
       | or possibly forever. Here's an example - I apologize that it's
       | not in English. I don't know if this issue affects English.
       | https://pastebin.com/raw/LTBps80F
        
       | amelius wrote:
       | Solution: charge by number of characters generated.
        
       | dataviz1000 wrote:
       | I built a Chrome extension with one feature that transcribes
       | audio to text in the browser using huggingface/transformers.js
        | running the OpenAI Whisper model with WebGPU. It works perfectly!
       | Here is a list of examples of all the things you can do in the
       | browser with webgpu for free. [0]
       | 
        | The last thing in the world I want to do is listen to or watch
        | presidential social media posts, but, on the other hand,
        | sometimes enormously stupid things are said which move the S&P
        | 500 up or down $60 in a session. So this feature queries for new
        | posts every minute, does OCR image-to-text and transcribes video
        | audio to text locally, and sends the post with text for
        | analysis, all in the background inside a Chrome extension,
        | before notifying me of anything economically significant.
       | 
       | [0]
       | https://github.com/huggingface/transformers.js/tree/main/exa...
       | 
       | [1] https://github.com/adam-s/doomberg-terminal
        
         | kgc wrote:
         | Impressive
        
       | karpathy wrote:
       | Omg long post. TLDR from an LLM for anyone interested
       | 
       | Speed your audio up 2-3x with ffmpeg before sending it to
        | OpenAI's gpt-4o-transcribe: the shorter file uses fewer input
        | tokens, cuts costs by roughly a third, and processes faster with
        | little quality loss (4x is too fast). A sample yt-dlp -> ffmpeg ->
        | curl script shows the workflow.
       | 
       | ;)
        
         | bravesoul2 wrote:
         | This is the sort of content I want to see in Tweets and
         | LinkedIn posts.
         | 
          | I have been thinking for a while about how to make good use of
          | the short space in those places.
         | 
         | LLM did well here.
        
         | georgemandis wrote:
         | Hahaha. Okay, okay... I will watch it now ;)
         | 
         | (Thanks for your good sense of humor)
        
           | karpathy wrote:
           | I like that your post deliberately gets to the point first
           | and then (optionally) expands later, I think it's a good and
           | generally underutilized format. I often advise people to
           | structure their emails in the same way, e.g. first just
           | cutting to the chase with the specific ask, then giving more
           | context optionally below.
           | 
           | It's not my intention to bloat information or delivery but I
           | also don't super know how to follow this format especially in
           | this kind of talk. Because it's not so much about relaying
           | specific information (like your final script here), but more
           | as a collection of prompts back to the audience as things to
           | think about.
           | 
           | My companion tweet to this video on X had a brief
           | TLDR/Summary included where I tried, but I didn't super think
            | it was very reflective of the talk; it was more about topics
            | covered.
           | 
           | Anyway, I am overall a big fan of doing more compute at the
           | "creation time" to compress other people's time during
           | "consumption time" and I think it's the respectful and kind
           | thing to do.
        
             | georgemandis wrote:
             | I watched your talk. There are so many more interesting
             | ideas in there that resonated with me that the summary
             | (unsurprisingly) skipped over. I'm glad I watched it!
             | 
             | LLMs as the operating system, the way you interface with
             | vibe-coding (smaller chunks) and the idea that maybe we
             | haven't found the "GUI for AI" yet are all things I've
             | pondered and discussed with people. You articulated them
             | well.
             | 
             | I think some formats, like a talk, don't lend themselves
             | easily to meaningful summaries. It's about giving the
              | audience things to think about, to your point. It's the
              | whole of storytelling being more than the sum of its
              | parts, and why we still do it.
             | 
             | My post is, at the end of the day, really more about a neat
             | trick to optimize transcriptions. This particular video
             | might be a great example of why you may not always want to
             | do that :)
             | 
             | Anyway, thanks for the time and thanks for the talk!
        
       | xg15 wrote:
       | That's really cool! Also, isn't this effectively the same as
       | supplying audio with a sampling rate of 8kHz instead of the 16kHz
       | that the model is supposed to work with?
        
       | anshumankmr wrote:
       | Someone should try transcribing Eminem's Rap god with this trick.
        
       | alok-g wrote:
       | >> by jumping straight to the point ...
       | 
        | Love this! I wish more authors followed this approach. So many
        | articles keep going all over the place before 'the point'
        | appears.
        | 
        | If they tried, perhaps some 50% of authors would realize that
        | they don't _have_ a point.
        
       | pknerd wrote:
        | I guess it'd work even if you make it 2.5x or even 3x.
        
       | donkey_brains wrote:
       | Hmm...doesn't this technique effectively make the minute longer,
       | not shorter? Because you can pack more speech into a minute of
       | recording? Seems like making a minute shorter would be
       | counterproductive.
        
         | StochasticLi wrote:
         | No. You're paying for a minute of audio, which will be more
         | packed with speech, not for how long it's being computed.
        
       | impossiblefork wrote:
       | Make the minutes longer, you mean.
        
       | pbbakkum wrote:
       | This is great, thank you for sharing. I work on these APIs at
        | OpenAI, it's a surprise to me that it still works reasonably well
        | at 2-3x speed, but on the other hand for phone channels we get
       | 8khz audio that is upsampled to 24khz for the model and it still
       | works well. Note there's probably a measurable decrease in
       | transcription accuracy that worsens as you deviate from 1x speed.
       | Also we really need to support bigger/longer file uploads :)
        
         | nerder92 wrote:
         | Quick Feedback: Would it be cool to research this internally
         | and maybe find a sweet spot in speed multiplier where the loss
         | is minimal. This pre-processing is quite cheap and could bring
         | down the API price eventually.
        
       | celltalk wrote:
        | With this logic, you should also be able to trim the parts that
        | don't have words. Just add a dB cut-off and trim the video
        | before transcription.
       | 
       | Possibly another 10-20% gain?
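        | 
        | A sketch of that, reusing the silenceremove filter shown
        | upthread (the threshold and duration are starting points, not
        | tuned values):
        | 
        |     import subprocess
        |     
        |     def trim_silence(src: str, dst: str, threshold_db: int = -40) -> None:
        |         af = (f"silenceremove=start_periods=1:stop_periods=-1:"
        |               f"stop_duration=0.15:stop_threshold={threshold_db}dB:detection=rms")
        |         subprocess.run(["ffmpeg", "-y", "-i", src, "-af", af,
        |                         "-c:a", "aac", "-b:a", "64k", dst], check=True)
        |     
        |     trim_silence("talk.m4a", "talk-trimmed.m4a")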
        
       | isubkhankulov wrote:
       | Transcripts get much more valuable when one diarizes the audio
       | beforehand to determine which speaker said what.
       | 
        | I use this free tool to extract those and dump the transcripts
        | into an LLM with basic prompts: https://contentflow.megalabs.co
        
       ___________________________________________________________________
       (page generated 2025-06-25 23:00 UTC)