[HN Gopher] OpenAI charges by the minute, so speed up your audio
___________________________________________________________________
OpenAI charges by the minute, so speed up your audio
Author : georgemandis
Score : 393 points
Date : 2025-06-25 13:17 UTC (9 hours ago)
(HTM) web link (george.mand.is)
(TXT) w3m dump (george.mand.is)
| georgemandis wrote:
| I was trying to summarize a 40-minute talk with OpenAI's
| transcription API, but it was too long. So I sped it up with
| ffmpeg to fit within the 25-minute cap. It worked quite well (up
| to 3x speed) and was cheaper and faster, so I wrote about it.
|
| Felt like a fun trick worth sharing. There's a full script and
| cost breakdown.
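|
| The heart of it is just ffmpeg's atempo filter plus a curl call
| (a minimal sketch; file names are illustrative, and the full
| script in the post also handles downloading and summarizing):
|
|     # speed the audio up 2x before transcription
|     # (older ffmpeg builds cap atempo at 2.0 per filter; chain
|     # filters, e.g. "atempo=1.5,atempo=2.0", to go faster)
|     ffmpeg -i talk.m4a -filter:a "atempo=2.0" -c:a aac talk-2x.m4a
|
|     # send the shorter file to OpenAI's transcription endpoint
|     curl https://api.openai.com/v1/audio/transcriptions \
|       -H "Authorization: Bearer $OPENAI_API_KEY" \
|       -F file=@talk-2x.m4a \
|       -F model=gpt-4o-transcribe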
| bravesoul2 wrote:
| You could have kept quiet and started a cheaper-than-OpenAI
| transcription business :)
| behnamoh wrote:
| Sure, but now the world is a better place because he shared
| something useful!
| 4b11b4 wrote:
| Pre-processing the audio is still a valid biz; multiple types
| of pre-processing might be valid.
| hn8726 wrote:
| Or openai will do it themselves for transcription tasks
| ilyakaminsky wrote:
| I've already done that [1]. A fraction of the price, 24-hour
| limit per file, and speedup tricks like the OP's are welcome.
| :)
|
| [1] https://speechischeap.com
| bravesoul2 wrote:
| Nice. Don't expect you to spill the beans but is it doing
| OK (some customers?)
|
| Just wondering if I can build a retirement out of APIs :)
| ada1981 wrote:
| We discovered this last month.
|
| There is also probably a way to send a smaller sample of audio
| at different speeds and compare the results, to get a speed
| optimization with no quality loss unique to each clip.
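|
| A rough sketch of that calibration idea (hypothetical file and
| speed values; note older ffmpeg builds cap atempo at 2.0 per
| filter, so higher speeds need chained filters):
|
|     # clip the first minute and render it at several speeds
|     for speed in 1.5 2.0 2.5 3.0; do
|       ffmpeg -y -i talk.m4a -t 60 -filter:a "atempo=$speed" \
|         -c:a aac "sample-${speed}x.m4a"
|     done
|     # transcribe each sample, diff against the 1x transcript,
|     # and pick the fastest speed whose output still matches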
| moralestapia wrote:
| >We discovered this last month.
|
| Nice. Any blog post, twitter comment or anything pointing to
| that?
| babuloseo wrote:
| source?
| brendanfinan wrote:
| would this also work for my video consisting of 10,000 PDFs?
|
| https://news.ycombinator.com/item?id=44125598
| jasonjmcghee wrote:
| I can't tell if this is a meme or not.
|
| And if someone had this idea and pitched it to Claude (the
| model this project was vibe coded with) it would be like "what
| a great idea!"
| mcc1ane wrote:
| Longer*
| simonw wrote:
| There was a similar trick which worked with Gemini versions prior
| to Gemini 2.0: they charged a flat rate of 258 tokens for an
| image, and it turns out you could fit more than 258 tokens of
| text in an image of text and use that for a discount!
| heeton wrote:
| A point on skimming vs taking the time to read something
| properly.
|
| I read a transcript + summary of that exact talk. I thought it
| was fine but uninteresting, and I moved on.
|
| Later I saw it had been put on YouTube and I was on the train, so
| I watched the whole thing at normal speed. I had a huge number of
| different ideas, thoughts and decisions, sparked by watching the
| whole thing.
|
| This happens to me in other areas too. Watching a conference talk
| in person is far more useful to me than watching it online with
| other distractions. Watching it online is more useful again than
| reading a summary.
|
| Going for a walk to think about something deeply beats a 10
| minute session to "solve" the problem and forget it.
|
| Slower is usually better for thinking.
| pluc wrote:
| Seriously, this is bonkers to me. I, like many hackers, hated
| school because they just threw one-size-fits-all knowledge at
| you, and here we are, paying for the privilege of having that
| in every facet of our lives.
|
| Reading is a pleasure. Watching a lecture or a talk and feeling
| the pieces fall into place is great. Having your brain work out
| the meaning of things is surely something that defines us as a
| species. We're willingly heading for such stupidity, I don't
| get it. I don't get how we can all be so blind to what this is
| going to create.
| hooverd wrote:
| If you're not listening to summaries of different audiobooks
| at 2x speed in each ear you're not contentmaxing.
| lovestory wrote:
| Or just use notebookLM to convert your books into an hour
| long podcasts /s
| isaacremuant wrote:
| > We're willingly heading for such stupidity, I don't get it.
| I don't get how we can all be so blind to what this is going
| to create.
|
| Your doomerism and superiority don't follow from your initial
| "I, like many hackers, don't like one-size-fits-all".
|
| This is literally offering you MANY sizes, and you have the
| freedom to choose. Somehow you're pretending it's top-down
| uniformity.
|
| Consume it however you want and come up with actual
| criticisms next time?
| colechristensen wrote:
| University didn't agree with me, mostly because I can't pay
| attention to the average lecturer. Getting bored between
| words, or while waiting for them to write, meant I absorbed
| very little and had to teach myself nearly everything.
|
| Audiobooks before speed tools were the worst (are they
| _trying_ to speak extra slowly?). But when I can speed things
| up, comprehension is just fine.
| bisby wrote:
| > I, like many hackers, hated school because they just threw
| one-size-fits-all knowledge at you
|
| "This specific knowledge format doesnt work for me, so I'm
| asking OpenAI to convert this knowledge into a format that is
| easier for me to digest" is exactly what this is about.
|
| I'm not quite sure what you're upset about? Unless you're
| referring to "one size fits all knowledge" as simplified
| topics, so you can tackle things at a surface level? I love
| having surface level knowledge about a LOT of things. I
| certainly don't have time to go deep on every topic out
| there. But if this is a topic I find I'm interested in, the
| full talk is still available.
|
| Breadth and depth are both important, and well summarized
| talks are important for breadth, but not helpful at all for
| depth, and that's ok.
| georgemandis wrote:
| For what it's worth, I completely agree with you, for all the
| reasons you're saying. With talks in particular I think it's
| less about the raw content and ideas presented and more about
| the ancillary ideas they provoke and inspire, like you're
| describing.
|
| There is just _so_ much content out there. And context is
| everything. If the person sharing it had led with some specific
| ideas or thoughts I might have taken the time to watch and
| looked for those ideas. But in the context it was received--a
| quick link with no additional context--I really just wanted the
| "gist" to know what I was even potentially responding to.
|
| In this case, for me, it was worth it. I can go back and decide
| if I want to watch it. Your comment has intrigued me so I very
| well might!
|
| ++ to "Slower is usually better for thinking"
| mutagen wrote:
| Not to discount slower speeds for thinking but I wonder if
| there is also value in dipping into a talk or a subject and
| then revisiting (re-watching) with the time to ponder on the
| thoughts a little more deeply.
| tass wrote:
| This is similar to strategies in "how to read a book"
| (Adler).
|
| By understanding the outline and themes of a book (or
| lecture, I suppose), it makes it easier to piece together
| thoughts as you delve deeper into the full content.
| conradev wrote:
| Was it the speed or the additional information vended by the
| audio and video? If someone is a compelling speaker, the same
| message will be way more effective in an audiovisual format.
| The audio has emphasis on certain parts of the content, for
| example, which is missing from the transcript or summary
| entirely. Video has gestural and facial cues, also often
| utilized to make a point.
| bongodongobob wrote:
| You'd love where I work. Everything is needlessly long
| bloviating power point meetings that could easily be ingested
| in a 5 minute email.
| itsoktocry wrote:
| > _Slower is usually better for thinking._
|
| Yeah, I see people talking about listening to podcasts or
| audiobooks on 2x or 3x.
|
| Sometimes I set mine to 0.8x. I find you get time to absorb and
| think. Am I an outlier?
| b0a04gl wrote:
| it's still decoding every frame and matching phonemes either way,
| but speeding it up reduces how many seconds they bill you for. so
| you may hack their billing logic more than the model itself.
|
| also means the longer you talk, the more you pay, even if the
| actual info density is the same. so if your voice has longer
| pauses or you speak slowly, you may be subsidizing inefficiency.
|
| makes me think maybe the next big compression is in delivery
| cadence. just auto-optimize voice tone and pacing before sending
| it to LLM. feed it synthetic fast speech with no emotion, just
| high density words. you lose human warmth but gain 40% cost
| savings
| timerol wrote:
| > Is It Accurate?
|
| > I don't know--I didn't watch it, lol. That was the whole point.
| And if that answer makes you uncomfortable, buckle-up for this
| future we're hurtling toward. Boy, howdy.
|
| This is a great bit of work, and the author accurately summarizes
| my discomfort
| BHSPitMonkey wrote:
| As if human-generated transcriptions of audio ever came with
| guarantees of accuracy?
|
| This kind of transformation has always come with flaws, and I
| think that will continue to be expected implicitly. Far more
| worrying is the public's trust in _interpretations_ and claims
| of _fact_ produced by gen AI services, or at least the popular
| idea that "AI" is more trustworthy/unbiased than humans,
| journalists, experts, etc.
| jasonjmcghee wrote:
| Heads up, the token cost breakdown tables look white on white to
| me. I'm in dark mode on iOS using Brave.
| georgemandis wrote:
| Should be fixed now. Thank you!
| w-m wrote:
| By transcribing a talk by Andrej, you already picked the most
| challenging case possible, speed-wise. His natural talking speed
| is already >=1.5x that of a normal human. He's one of the people
| whose YouTube speed you absolutely have to set back down to 1x
| to follow what's going on.
|
| In the spirit of making more of an OpenAI minute, don't send it
| any silence.
|
| E.g.
|
|     ffmpeg -i video-audio.m4a \
|       -af "silenceremove=start_periods=1:start_duration=0:start_threshold=-50dB:stop_periods=-1:stop_duration=0.02:stop_threshold=-50dB,apad=pad_dur=0.02" \
|       -c:a aac -b:a 128k output_minpause.m4a -y
|
| will cut the talk down from 39m31s to 31m34s by replacing any
| silence (at a -50dB threshold) longer than 20ms with a 20ms
| pause. And to keep with the spirit of your post, I measured only
| that the input file got shorter, I didn't look at all at the
| quality of the transcription by feeding it the shorter version.
| georgemandis wrote:
| Oooh fun! I had a feeling there was more ffmpeg wizardry I
| could be leaning into here. I'll have to try this later--thanks
| for the idea!
| w-m wrote:
| In the meantime I realized that the apad part is nonsensical
| - it pads the end of the stream, not at each silence-removed
| cut. I wanted to get angry at o3 for proposing this, but then
| I had a look at the silenceremove= documentation myself:
| https://ffmpeg.org/ffmpeg-filters.html#silenceremove
|
| Good god. You couldn't make that any more convoluted and
| hard-to-grasp if you wanted to. You gotta love ffmpeg!
|
| I now _think_ this might be a good solution:
|
|     ffmpeg -i video-audio.m4a \
|       -af "silenceremove=start_periods=1:stop_periods=-1:stop_duration=0.15:stop_threshold=-40dB:detection=rms" \
|       -c:a aac -b:a 128k output.m4a -y
| snickerdoodle12 wrote:
| I love ffmpeg but the documentation is often close to
| incomprehensible.
| squigz wrote:
| Out of curiosity, how might you improve those docs? They
| seem fairly reasonable to me
| w-m wrote:
| The documentation reads like it was written by a
| programmer who documented the different parameters to
| their implementation of a specific algorithm. Now when
| you as the user come along and want to use silenceremove,
| you'll have to carefully read through this, and build
| your own mental model of that algorithm, and then you'll
| be able to set these parameters accordingly. That takes a
| lot of time and energy, in this case multiple read-
| throughs and I'd say > 5 minutes.
|
| Good documentation should do this work for you. It should
| explain somewhat atomic concepts to you, that you can
| immediately adapt, and compose. Where it already works is
| for the "detection" and "window" parameters, which are
| straightforward. But the actions of trimming in the
| start/middle/end, and how to configure how long the
| silence lasts before trimming, whether to ignore short
| bursts of noise, whether to skip every nth silence
| period, these are all ideas and concepts that get mushed
| together in 10 parameters which are called start/stop-
| duration/threshold/silence/mode/periods.
|
| If you want to apply this filter, it takes a long time to
| build mental models for these 10 parameters. You do have
| some example calls, which is great, but which doesn't
| help if you need to adjust any of these - then you
| probably need to understand them all.
|
| Some stuff I stumbled over when reading it:
|
| "To remove silence from the middle of a file, specify a
| stop_periods that is negative. This value is then treated
| as a positive value [...]" - what? Why is this parameter
| so heavily overloaded?
|
| "start_duration: Specify the amount of time that non-
| silence must be detected before it stops trimming audio"
| - parameter is named start_something, but it's about
| stopping? Why?
|
| "start_periods: [...] Normally, [...] start_periods will
| be 1 [...]. Default value is 0."
|
| "start_mode: Specify mode of detection of silence end at
| start": start_mode end at start?
|
| It's very clunky. Every parameter has multiple modes of
| operation. Why is it start and stop for beginning and
| end, and why is "do stuff in the middle" part of the end?
| Why is there no global mode?
|
| You could nitpick this stuff to death. In the end, naming
| things is famously one of the two hard problems in
| computer science (the others being cache invalidation and
| off-by-one errors). And writing good documentation is
| also very, very hard work. Just exposing the internals of
| the algorithm is often not great UX, because then every
| user has to learn how the thing works internally before
| they can start using it (hey, looking at you, git).
|
| So while it's easy to point out where these docs fail, it
| would be a lot of work to rewrite this documentation from
| the top down, explaining the concepts first. Or even
| rewriting the interface to make this more approachable,
| and the parameters less overloaded. But since it's hard
| work, and not sexy to programmers, it won't get done, and
| many people will come after, having to spend time on
| reading and re-reading this current mess.
| pragmatic wrote:
| No not really? The talk where he babbles about OSes and
| everyone is somehow impressed?
| behnamoh wrote:
| > His natural talking speed is already >=1.5x that of a normal
| human. One of the people you absolutely have to set your
| YouTube speed back down to 1x when listening to follow what's
| going on.
|
| I wonder if there's a way to automatically detect how "fast" a
| person talks in an audio file. I know it's subjective and
| different people talk at different paces, but it'd be cool to
| know when OP's trick fails (they mention 4x ruined the output;
| maybe for Karpathy that would happen at 2x).
| echelon wrote:
| > I wonder if there's a way to automatically detect how
| "fast" a person talks in an audio file.
|
| Stupid heuristic: take a segment of video, transcribe text,
| count number of words per utterance duration. If you need
| speaker diarization, handle speaker utterance durations
| independently. You can further slice, such as syllable count,
| etc.
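|
| A quick sketch of that heuristic, assuming the open-source
| whisper CLI is installed (file names are illustrative):
|
|     # transcribe a 60-second sample with a small model, then
|     # estimate words per second as a crude speech-rate measure
|     ffmpeg -y -i talk.m4a -t 60 -ac 1 -ar 16000 sample.wav
|     whisper sample.wav --model tiny --output_format txt --output_dir .
|     words=$(wc -w < sample.txt)
|     echo "scale=2; $words / 60" | bc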
| nand4011 wrote:
| https://www.science.org/doi/10.1126/sciadv.aaw2594
|
| Apparently human language conveys information at around 39
| bits/s. You could use a similar technique as that paper to
| determine the information rate of a speaker and then
| correct it to 39 bits/s by changing the speed of the video.
| varispeed wrote:
| It's a shame platforms don't generally support speeds greater
| than 2x. One of my "superpowers", or curses, is that I cannot
| stand a normal speaking pace. When I watch lectures, I always
| go for maximum speed, and even that is too slow for me. I wish
| platforms included 4x, done properly (with minimal artefacts).
| lofaszvanitt wrote:
| Robot in a human body identified :D.
| mrmuagi wrote:
| All audiobooks are like this for me. I tried it for
| lectures, but if I'm taking handwritten notes I can't keep
| up with my writing.
|
| I wonder if there are negative side effects to this, though:
| do you notice that interacting with people who speak slower
| requires a greater deal of patience?
| colechristensen wrote:
| No, but a little. I struggle with people who repeat every
| point of what they're saying several times, or who, when you
| say "you told me exactly this the last time we spoke", cannot
| be stopped from retelling the whole thing verbatim. Usually in
| those situations there are some potential cognitive issues, so
| you can only be understanding.
| hamburglar wrote:
| I once attended a live talk by Leslie Lamport and as he
| talked, I had the overwhelming feeling that something was
| wrong, and was thinking "did he have a stroke or
| something?" but then I realized I had just always watched
| his lectures online and had become accustomed to
| listening to him at 2x.
| dpcx wrote:
| https://github.com/codebicycle/videospeed has been a
| wonderful addition for me.
| narratives1 wrote:
| I use a Chrome extension that lets you take any video
| player (including embedded) to 10x speed. Turn most things
| to 3-4x. It works on ads too
| munch117 wrote:
| I use a bookmarklet:
|
|     javascript:void%20function(){document.querySelector(%22video,audio%22).playbackRate=parseFloat(prompt(%22Set%20the%20playback%20rate%22))}();
| cookingrobot wrote:
| There are fonts designed to be legible at really small
| size. I wonder if there are voices that are especially
| understandable at extreme speeds.
|
| Could use an "auctioneer" voice to play back text at 10x
| speed.
| bbatha wrote:
| I'm also a fast listener. I find audio quality is the main
| differentiator in whether I can listen quickly or not. A
| podcast recorded at high quality I can listen to comfortably
| at 3-4x (with silence trimmed); the second someone calls in
| from their phone I'm getting every 4th word and often need to
| go down to 2x or less. Mumbly accents also hurt, but not as
| much; then again, I rarely have trouble understanding
| difficult accents IRL and almost never use subtitles on TV
| shows/YouTube to better understand the speaker. Your mileage
| may vary.
|
| I understand 4-6x speakers fairly well but don't enjoy
| listening at that pace. If I lose focus for a couple of
| seconds I effectively miss a paragraph of context and my
| brain can't fill in the missing details.
| seabass wrote:
| I made a super simplistic chrome extension for this.
| Doesn't work on all websites, but YouTube and most online
| video courses are covered.
|
| https://github.com/sebastiansandqvist/video-speed-extension
| JadeNB wrote:
| Can't you use VLC to watch almost anything streamable, and
| then play at your desired speed?
| btown wrote:
| Even a last-decade transcription model could be used to
| detect a rough number of syllables per unit time, and the
| accuracy of that model could be used to guide speed-up and
| dead-time detection before sending to a more expensive model.
| As with all things, it's a question of whether the cost
| savings justify the engineering work.
| janalsncm wrote:
| > I wonder if there's a way to automatically detect how
| "fast" a person talks in an audio file
|
| Transcribe it locally using whisper and output tokens/sec?
| maxall4 wrote:
| Just count syllables per second by doing an FFT plus some
| basic analysis.
| mrstone wrote:
| > I wonder if there's a way to automatically detect how
| "fast" a person talks in an audio file.
|
| Hilbert transform and FFT to get phoneme rate would work.
| brunoborges wrote:
| The interesting thing here is that OpenAI likely has a layer
| that trims down audio exactly as you suggest, so they can
| still charge for the full length while it costs them less to
| actually process the content.
| cbsmith wrote:
| That's an amusing perspective. I really struggle with watching
| any video at double speed, but I've never had trouble listening
| to any of his talks at 1x. To me, he seems to speak at a
| perfectly reasonable pace.
| swyx wrote:
| > I didn't look at all at the quality of the transcription by
| feeding it the shorter version.
|
| guys how hard is it to toss both versions into like diffchecker
| or something haha you're just comparing text
| TimorousBestie wrote:
| Why use diffchecker when there's a perfectly good LLM you
| could ask right there? lol
| babuloseo wrote:
| I use the YouTube trick; will share it here: upload to YouTube
| and use their built-in transcription service to get the text,
| then use Gemini Pro 2.5 to rebuild the transcript.
|
|     ffmpeg \
|       -f lavfi \
|       -i color=c=black:s=1920x1080:r=5 \
|       -i file_you_want_transcripted.wav \
|       -c:v libx264 \
|       -preset medium \
|       -tune stillimage \
|       -crf 28 \
|       -c:a aac \
|       -b:a 192k \
|       -pix_fmt yuv420p \
|       -shortest \
|       file_you_upload_to_youtube_for_free_transcripts.mp4
|
| This works VERY well for my needs.
| KTibow wrote:
| This is really interesting, although the cheapest route is still
| to use an alternative audio-compatible LLM (Gemini 2.0 Flash
| Lite, Phi 4 Multimodal) or an alternative host for Whisper
| (Deepinfra, Fal).
| fallinditch wrote:
| When extracting transcripts from YouTube videos, can anyone give
| advice on the best (cost effective, quick, accurate) way to do
| this?
|
| I'm confused because I read in various places that the YouTube
| API doesn't provide access to transcripts ... so how do all these
| YouTube transcript extractor services do it?
|
| I want to build my own YouTube summarizer app. Any advice and
| info on this topic greatly appreciated!
| vjerancrnjak wrote:
| If YouTube has placed autogenerated captions on the video, you
| can download them free of charge with yt-dlp.
| rob wrote:
| There's a tool that uses YouTube's unofficial APIs to get them
| if they're available:
|
| https://github.com/jdepoix/youtube-transcript-api
|
| For our internal tool that transcribes local city council
| meetings on YouTube (often 1-3 hours long), we found that these
| automatic ones were never available though.
|
| (Our tool usually 'processes' the videos within ~5-30 mins of
| being uploaded, so that's also why none are probably available
| 'officially' yet.)
|
| So we use yt-dlp to download the highest quality audio and then
| process them with whisper via Groq, which is way cheaper
| (~$0.02-0.04/hr with Groq compared to $0.36/hr via OpenAI's
| API.) Sometimes groq errors out so there's built-in support for
| Replicate and Deepgram as well.
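|
| For anyone curious, Groq's transcription endpoint is OpenAI-
| compatible, so the switch is roughly a URL and model-name swap
| (a sketch, not our exact code):
|
|     curl https://api.groq.com/openai/v1/audio/transcriptions \
|       -H "Authorization: Bearer $GROQ_API_KEY" \
|       -F file=@meeting.m4a \
|       -F model=whisper-large-v3-turbo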
|
| We run yt-dlp on our remote Linode server, and I have a Python
| script I created that automatically logs in to YouTube with a
| "clean" account and extracts the proper cookies.txt file; we
| also generate a 'po token' using another tool:
|
| https://github.com/iv-org/youtube-trusted-session-generator
|
| Both cookies.txt and the "po token" get passed to yt-dlp when
| running on the Linode server and I haven't had to re-generate
| anything in over a month. Runs smoothly every day.
|
| (Note that I don't use cookies/po_token when running locally at
| home, it usually works fine there.)
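|
| The relevant yt-dlp flags look something like this (a sketch;
| the exact po_token syntax has changed between yt-dlp releases,
| so check the docs):
|
|     yt-dlp --cookies cookies.txt \
|       --extractor-args "youtube:po_token=web+$PO_TOKEN" \
|       -f bestaudio -x --audio-format m4a "$VIDEO_URL"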
| fallinditch wrote:
| Very useful, thanks. So does this mean that every month or so
| you have to create a new 'clean' YouTube account and use that
| to create new po_token/cookies?
|
| It's frustrating to have to jump through all these hoops just
| to extract transcripts when the YouTube Data API already
| gives reasonable limits to free API calls ... would be nice
| if they allowed transcripts too.
|
| Do you think the various YouTube transcript extractor
| services all follow a similar method as yours?
| banana_giraffe wrote:
| You can use yt-dlp to get the transcripts. For instance, to
| grab just the transcript of a video:
|
|     ./yt-dlp --skip-download --write-sub --write-auto-sub \
|       --sub-lang en --sub-format json3 <youtube video URL>
|
| You can also feed the same command a playlist or channel URL
| and it'll run through and grab all the transcripts for each
| video in the playlist or channel.
| fallinditch wrote:
| That's cool, thanks for the info. But do you also have to use
| a rotating proxy to prevent YouTube from blocking your IP
| address?
| banana_giraffe wrote:
| Last time I ran this at scale was a couple of months ago,
| so my information is no doubt out of date, but in my
| experience, YouTube seems less concerned about this than
| they are when you're grabbing lots of videos.
|
| But that was a few months ago, so for all I know they've
| tightened down more hatches since then.
| topaz0 wrote:
| I have a way that is (all but) free -- just watch the video if
| you care about it, or decide not to if you don't, and move on
| with your life.
| Tepix wrote:
| Why would you give up your privacy by sending what interests you
| to OpenAI when Whisper doesn't need that much compute in the
| first place?
|
| With faster-whisper (int8, batch=8) you can transcribe 13 minutes
| of audio in 51 seconds _on CPU_.
| anigbrowl wrote:
| I came here to ask the same question. This is a well-solved
| problem; red-queen racing it seems utterly pointless, a symptom
| of reflexive adversarialism.
| poly2it wrote:
| > symptom of reflexive adversarialism
|
| Is there a definition for this expression? I don't follow.
|
| > ... using corporate technology for the solved problem is a
| symptom of self-directed skepticism by the user against the
| corporate institutions ...
|
| Eh?
| pimlottc wrote:
| Appreciated the concise summary + code snippet upfront, followed
| by more detail and background for those interested. More articles
| should be written this way!
| rob wrote:
| For anybody trying to do this in bulk, instead of using OpenAI's
| whisper via their API, you can also use Groq [0] which is much
| cheaper:
|
| [0] https://groq.com/pricing/
|
| Groq is ~$0.02/hr with distil-large-v3, or ~$0.04/hr with
| whisper-large-v3-turbo. I believe OpenAI comes out to like
| ~$0.36/hr.
|
| We do this internally with our tool that automatically
| transcribes local government council meetings right when they get
| uploaded to YouTube. It uses Groq by default, but I also added
| support for Replicate and Deepgram as backups because sometimes
| Groq errors out.
| georgemandis wrote:
| Interesting! At $0.02 to $0.04 an hour I don't suspect you've
| been hunting for optimizations, but I wonder if this "speed up
| the audio" trick would save you even more.
|
| > We do this internally with our tool that automatically
| transcribes local government council meetings right when they
| get uploaded to YouTube
|
| Doesn't YouTube do this for you automatically these days within
| a day or so?
| rob wrote:
| > Doesn't YouTube do this for you automatically these days
| within a day or so?
|
| Oh yeah, we do a check first and use youtube-transcript-api
| if there's an automatic one available:
|
| https://github.com/jdepoix/youtube-transcript-api
|
| The tool usually detects them within like ~5 mins of being
| uploaded though, so usually none are available yet. Then
| it'll send the summaries to our internal Slack channel for
| our editors, in case there's anything interesting to 'follow
| up on' from the meeting.
|
| Probably would be a good idea to add a delay to it and wait
| for the automatic ones though :)
| jerjerjer wrote:
| > I wonder if this "speed up the audio" trick would save you
| even more.
|
| At this point you'll need to at least check how much running
| ffmpeg costs. Probably less than $0.01 per hour of audio
| (approximate savings) but still.
| ks2048 wrote:
| > Doesn't YouTube do this for you automatically these days
| within a day or so?
|
| Last time I checked, I think the Google auto-captions were
| noticeably worse quality than whisper, but maybe that has
| changed.
| colechristensen wrote:
| If you have a recent MacBook you can run the same Whisper model
| locally for free. People are really sleeping on how cheap the
| compute they already own is.
| rob wrote:
| I don't. I have a MacBook Pro from 2019 with an Intel chip
| and 16 GB of memory. Pretty sure when I tried the large
| whisper model it took like 30 minutes to an hour to do
| something that took hardly any time via Groq. It's been a
| while though so maybe my times are off.
| colechristensen wrote:
| Ah, no, an Apple silicon Mac with a decent amount of memory is
| required. But this kind of machine (a mid-to-high-range recent
| MacBook) has been very common at all of my employers for a
| long time.
| fragmede wrote:
| It's been roughly six years since that MacBook was top of
| the line, so your times are definitely off.
| pzo wrote:
| There is also Cloudflare Workers AI, where you can have
| whisper-large-v3-turbo for around $0.03 per hour:
|
| https://developers.cloudflare.com/workers-ai/models/whisper-...
| abidlabs wrote:
| You could use Hugging Face's Inference API (which supports all
| of these API providers) directly, making it easier to switch
| between them; e.g. look at the panel on the right at:
| https://huggingface.co/openai/whisper-large-v3
| stogot wrote:
| Love this idea, but the accuracy section is lacking. Couldn't
| you do a simple diff of the outputs and see how many
| differences there are? 0.5% or 5%?
| georgemandis wrote:
| Yeah, I'd like to do a more formal analysis of the outputs if I
| can carve out the time.
|
| I don't think a simple diff is the way to go, at least for what
| I'm interested in. What I care about more is the overall
| accuracy of the summary--not the word-for-word transcription.
|
| The test I want to setup is using LLMs to evaluate the
| summarized output and see if the primary themes/topics persist.
| That's more interesting and useful to me for this exercise.
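|
| Roughly what I have in mind, sketched against the chat
| completions API (prompt and file names are illustrative):
|
|     # build the request body with jq (1.6+ for --rawfile)
|     jq -n --rawfile a summary-1x.txt --rawfile b summary-3x.txt \
|       '{model: "gpt-4o", messages: [{role: "user", content:
|       ("Do the two summaries below cover the same primary themes and topics? List anything missing from the second.\n\nSummary A:\n" + $a + "\n\nSummary B:\n" + $b)}]}' \
|       > body.json
|
|     curl https://api.openai.com/v1/chat/completions \
|       -H "Authorization: Bearer $OPENAI_API_KEY" \
|       -H "Content-Type: application/json" \
|       -d @body.json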
| tmaly wrote:
| The whisper model weights are free. You could save even more by
| just using them locally.
| pzo wrote:
| But this is still a great trick if you want to reduce latency
| or inference time, even with local models, e.g. in a realtime
| chatbot.
| 55555 wrote:
| This seems like a good place for me to complain about the fact
| that the automatically generated subtitle files Youtube creates
| are horribly malformed. Every sentence is repeated twice. In many
| subtitle files, the subtitle timestamp ranges overlap one another
| while also repeating every sentence twice in two different
| ranges. It's absolutely bizarre and has been like this for years
| or possibly forever. Here's an example - I apologize that it's
| not in English. I don't know if this issue affects English.
| https://pastebin.com/raw/LTBps80F
| amelius wrote:
| Solution: charge by number of characters generated.
| dataviz1000 wrote:
| I built a Chrome extension with one feature that transcribes
| audio to text in the browser using huggingface/transformers.js
| running the OpenAI Whisper model with WebGPU. It works perfectly!
| Here is a list of examples of all the things you can do in the
| browser with webgpu for free. [0]
|
| The last thing in the world I want to do is listen or watch
| presidential social media posts, but, on the other hand,
| sometimes enormously stupid things are said which move the SP500
| up or down $60 in a session. So this feature queries for new
| posts every minute, does OCR image-to-text and transcribes video
| audio to text locally, and sends the post with text for analysis,
| all in the background inside a Chrome extension, before notifying
| me of anything economically significant.
|
| [0]
| https://github.com/huggingface/transformers.js/tree/main/exa...
|
| [1] https://github.com/adam-s/doomberg-terminal
| kgc wrote:
| Impressive
| karpathy wrote:
| Omg long post. TLDR from an LLM for anyone interested
|
| Speed your audio up 2-3x with ffmpeg before sending it to
| OpenAI's gpt-4o-transcribe: the shorter file uses fewer input
| tokens, cuts costs by roughly a third, and processes faster with
| little quality loss (4x is too fast). A sample yt-dlp -> ffmpeg
| -> curl script shows the workflow.
|
| ;)
| bravesoul2 wrote:
| This is the sort of content I want to see in Tweets and
| LinkedIn posts.
|
| I have been thinking for a while about how to make good use of
| the short space in those places.
|
| LLM did well here.
| georgemandis wrote:
| Hahaha. Okay, okay... I will watch it now ;)
|
| (Thanks for your good sense of humor)
| karpathy wrote:
| I like that your post deliberately gets to the point first
| and then (optionally) expands later, I think it's a good and
| generally underutilized format. I often advise people to
| structure their emails in the same way, e.g. first just
| cutting to the chase with the specific ask, then giving more
| context optionally below.
|
| It's not my intention to bloat information or delivery but I
| also don't super know how to follow this format especially in
| this kind of talk. Because it's not so much about relaying
| specific information (like your final script here), but more
| as a collection of prompts back to the audience as things to
| think about.
|
| My companion tweet to this video on X had a brief
| TLDR/Summary included where I tried, but I didn't super think
| it was very reflective of the talk, it was more about topics
| covered.
|
| Anyway, I am overall a big fan of doing more compute at the
| "creation time" to compress other people's time during
| "consumption time" and I think it's the respectful and kind
| thing to do.
| georgemandis wrote:
| I watched your talk. There are so many more interesting
| ideas in there that resonated with me that the summary
| (unsurprisingly) skipped over. I'm glad I watched it!
|
| LLMs as the operating system, the way you interface with
| vibe-coding (smaller chunks) and the idea that maybe we
| haven't found the "GUI for AI" yet are all things I've
| pondered and discussed with people. You articulated them
| well.
|
| I think some formats, like a talk, don't lend themselves
| easily to meaningful summaries. It's about giving the
| audience things to think about, to your point. It's the whole
| of storytelling being more than the sum of its parts, and why
| we still do it.
|
| My post is, at the end of the day, really more about a neat
| trick to optimize transcriptions. This particular video
| might be a great example of why you may not always want to
| do that :)
|
| Anyway, thanks for the time and thanks for the talk!
| xg15 wrote:
| That's really cool! Also, isn't this effectively the same as
| supplying audio with a sampling rate of 8kHz instead of the 16kHz
| that the model is supposed to work with?
| anshumankmr wrote:
| Someone should try transcribing Eminem's Rap God with this trick.
| alok-g wrote:
| >> by jumping straight to the point ...
|
| Love this! I wish more authors followed this approach. So many
| articles wander all over the place before 'the point' appears.
|
| If they tried, perhaps some 50% of authors would realize that
| they don't _have_ a point.
| pknerd wrote:
| I guess it'd work even if you made it 2.5x or even 3x.
| donkey_brains wrote:
| Hmm...doesn't this technique effectively make the minute longer,
| not shorter? Because you can pack more speech into a minute of
| recording? Seems like making a minute shorter would be
| counterproductive.
| StochasticLi wrote:
| No. You're paying for a minute of audio, which will be more
| packed with speech, not for how long it's being computed.
| impossiblefork wrote:
| Make the minutes longer, you mean.
| pbbakkum wrote:
| This is great, thank you for sharing. I work on these APIs at
| OpenAI, it's a surprise to me that it still works reasonably well
| at 2x-3x speed, but on the other hand, for phone channels we get
| 8khz audio that is upsampled to 24khz for the model and it still
| works well. Note there's probably a measurable decrease in
| transcription accuracy that worsens as you deviate from 1x speed.
| Also we really need to support bigger/longer file uploads :)
| nerder92 wrote:
| Quick feedback: it would be cool to research this internally
| and maybe find a sweet spot in the speed multiplier where the
| loss is minimal. This pre-processing is quite cheap and could
| eventually bring down the API price.
| celltalk wrote:
| With this logic, you should also be able to trim the parts that
| don't have words. Just add a dB cut-off and trim the audio
| before transcription.
|
| Possibly another 10-20% gain?
| isubkhankulov wrote:
| Transcripts get much more valuable when one diarizes the audio
| beforehand to determine which speaker said what.
|
| I use this free tool to extract those and dump the transcripts
| into an LLM with basic prompts: https://contentflow.megalabs.co
___________________________________________________________________
(page generated 2025-06-25 23:00 UTC)