[HN Gopher] Show HN: Gemini LLM corrects ASR YouTube transcripts
___________________________________________________________________
Show HN: Gemini LLM corrects ASR YouTube transcripts
Author : ldenoue
Score : 163 points
Date : 2024-11-25 18:44 UTC (1 day ago)
(HTM) web link (ldenoue.github.io)
(TXT) w3m dump (ldenoue.github.io)
| alsetmusic wrote:
| Seems like one of the places where LLMs make a lot of sense. I
| see some boneheaded transcriptions in videos pretty regularly.
| Comparing them against "more-likely" words or phrases seems like
| an ideal use case.
| petesergeant wrote:
| Also useful, I think, for checking human-entered transcriptions,
| which, even on expensively produced shows, can often be garbage
| or just wrong. One human + two separate LLMs, plus something to
| tie-break, and we could possibly finally get decent subtitles
| for stuff.
| leetharris wrote:
| A few problems with this approach:
|
| 1. It brings everything back to the "average." Any outliers get
| discarded. For example, someone who is a circus performer plays
| fetch with their frog. An LLM would think this is an obvious
| error and correct it to "dog."
|
| 2. LLMs want to format everything as internet text, which does
| not align well with natural human speech.
|
| 3. Hallucinations still happen at scale, regardless of model
| quality.
|
| We've done a lot of experiments on this at Rev and it's still
| useful for the right scenario, but not as reliable as you may
| think.
| falcor84 wrote:
| Regarding the frog, I would assume that the way to address
| this would be to feed the LLM screenshots from the video, if
| the budget allows.
| leetharris wrote:
| Generally yes. That being said, sometimes multimodal LLMs
| show decreased performance with extra modalities.
|
| The extra dimensions of analysis cause increased
| hallucination at times. So maybe it solves the frog
| problem, but now it's hallucinating in another section
| because it got confused by another frame's tokens.
|
| One thing we've wanted to explore lately is video-based
| diarization. If I have a video to accompany some audio, can I
| help with cross-talk and sound separation by matching lips
| with audio and assigning the correct speaker more accurately?
| There's likely something there.
| orion138 wrote:
| Google published Looking to Listen a while back.
|
| https://research.google/blog/looking-to-listen-audio-
| visual-...
| ldenoue wrote:
| Do you have something to read about your studies and
| experiments? Genuinely interested. Perhaps the prompts could be
| made to tell the LLM it's specifically handling human speech,
| not written text?
| devmor wrote:
| Those transcriptions are already done by LLMs in the first
| place - in fact, audio transcription was one of the very first
| large-scale commercial uses of the technology in its current
| iteration.
|
| This is just playing a game of Markov telephone, where the step
| in OP's solution likely costs more compute than the step YT
| uses, because YT is interested in minimizing costs.
| albertzeyer wrote:
| Probably just "regular" LMs, not large LMs, I assume: some LM
| with 10-100M params or so, which is cheap to use (and very
| standard for ASR).
| dylan604 wrote:
| What about the cases where the human speaking is actually using
| nonsense words during a meandering, off-topic bit of "weaving"?
| Replacing those nonsense words would be a disservice, as it
| would totally change the tone of the speech.
| dr_dshiv wrote:
| The first time I used Gemini, I gave it a youtube link and asked
| for a transcript. It told me how I could transcribe it myself.
| Honestly, I haven't used it since. Was that unfair of me?
| robrenaud wrote:
| Gemini is much worse as a product than 4o or Claude. I
| recommend using it from Google AI studio rather than the
| official consumer facing interface. But for tasks with large
| audio/visual input, it's better than 4o or Claude.
|
| Whether you want to deal with it being annoying is your call.
| Spooky23 wrote:
| The consumer Gemini is very prudish and optimized against risk
| to Google.
| andai wrote:
| GPT told me the same thing when I asked it to make an API call,
| or do an image search, or download a transcript of a YouTube
| video, or...
| jazzyjackson wrote:
| Thinking about that time Berkeley delisted thousands of
| recordings of course content as a result of a lawsuit complaining
| that they could not be utilized by deaf individuals. Can this be
| resolved with current technology? Google's auto-captioning has
| been abysmal up to this point; I've often wondered what it would
| cost Google to run modern tech over the entire backlog of
| YouTube. At least then they might have a new source of training
| data.
|
| https://news.berkeley.edu/2017/02/24/faq-on-legacy-public-co...
|
| Discussed at the time (2017)
| https://news.ycombinator.com/item?id=13768856
| andai wrote:
| Didn't YouTube have auto-captions at the time this was
| discussed? Yeah, they're a bit dodgy, but I often watch videos
| in public with the sound muted, and 90% of the time you can
| guess what word was meant from context. (And indeed, more
| recent models do way, way, way better on accuracy.)
| jazzyjackson wrote:
| Definitely depends on audio quality and on how closely a
| speaker's dialect matches the Mid-Atlantic accent, if you
| catch my drift.
|
| IME youtube transcripts are completely devoid of meaningful
| information, especially when domain-specific vocabulary is
| used.
| zehaeva wrote:
| I have a few Deaf/Hard of Hearing friends who find the auto-
| captions to be basically useless.
|
| Anything that's even remotely domain-specific becomes a
| garbled mess. Even documentaries on light
| engineering/archeology/history subjects are hilariously bad.
| Names of historical places and people are randomly correct
| and almost never consistent.
|
| The second anyone has a bit of an accent, it's completely
| useless.
|
| I keep them on partially because I'm of the "everything needs
| to have subtitles else I can't hear the words they're saying"
| cohort. So I can figure out what they really mean, but if you
| couldn't hear anything I can see it being hugely
| distracting/distressing/confusing/frustrating.
| hunter2_ wrote:
| With this context, it seems as though correction-by-LLM
| might be a net win for your Deaf/HoH friends even if it
| would be a net loss for you, since you can correct errors
| on the fly better than an LLM probably would, while the
| opposite is more often true for them, due to differences in
| experience with phonetics.
|
| Soundex [0] is a prevailing method of codifying phonetic
| similarity, but unfortunately it's focused on names
| exclusively. Any correction-by-LLM really ought to generate
| substitution probabilities weighted heavily on something
| like that, I would think.
|
| [0] https://en.wikipedia.org/wiki/Soundex
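|
| For concreteness, a minimal sketch of American Soundex in
| Python (my own illustration, not any production
| implementation):
|
|   def soundex(word: str) -> str:
|       # American Soundex: first letter plus three digits.
|       codes = {**dict.fromkeys("bfpv", "1"),
|                **dict.fromkeys("cgjkqsxz", "2"),
|                **dict.fromkeys("dt", "3"), "l": "4",
|                **dict.fromkeys("mn", "5"), "r": "6"}
|       word = word.lower()
|       out, prev = [word[0].upper()], codes.get(word[0], "")
|       for ch in word[1:]:
|           if ch in "hw":            # h/w are transparent
|               continue
|           code = codes.get(ch, "")  # vowels reset prev to ""
|           if code and code != prev:
|               out.append(code)
|           prev = code
|       return ("".join(out) + "000")[:4]
|
|   assert soundex("Robert") == soundex("Rupert") == "R163"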
| schrodinger wrote:
| I'd assume Soundex is too basic and English-centric to be
| a practical solution for an international company like
| Google. I was taught it and implemented it in a freshman-
| level CS course in 2004; it can't be anywhere near state
| of the art!
| shakna wrote:
| Soundex is fast but inaccurate. It only prevails because of
| the computational cost of things like Levenshtein distance.
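|
| For reference, the classic dynamic-programming version (my
| own sketch), quadratic in the string lengths, which is the
| cost in question:
|
|   def levenshtein(a: str, b: str) -> int:
|       # Classic DP over a rolling row: O(len(a) * len(b)) time.
|       prev = list(range(len(b) + 1))
|       for i, ca in enumerate(a, 1):
|           cur = [i]
|           for j, cb in enumerate(b, 1):
|               cur.append(min(prev[j] + 1,        # deletion
|                              cur[j - 1] + 1,     # insertion
|                              prev[j - 1] + (ca != cb)))  # substitution
|           prev = cur
|       return prev[-1]
|
|   assert levenshtein("kitten", "sitting") == 3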
| novok wrote:
| You can also download just the audio with yt-dlp and then
| remake the subs with Whisper or whatever other model you
| want. GPU-compute-wise it will probably cost less than
| asking an LLM to try to correct a garbled transcript.
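|
| A sketch of that in Python, assuming the yt-dlp and
| openai-whisper packages (the URL and model size are
| placeholders):
|
|   import yt_dlp
|   import whisper
|
|   url = "https://www.youtube.com/watch?v=..."  # placeholder
|
|   # Download only the audio track.
|   opts = {"format": "bestaudio", "outtmpl": "audio.%(ext)s"}
|   with yt_dlp.YoutubeDL(opts) as ydl:
|       info = ydl.extract_info(url, download=True)
|       audio_path = ydl.prepare_filename(info)
|
|   # Re-transcribe locally instead of correcting YT's transcript.
|   model = whisper.load_model("large-v3")
|   print(model.transcribe(audio_path)["text"])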
| HPsquared wrote:
| I suppose the gold standard would be a multimodal model
| that also looks at the screen (maybe only if the captions
| aren't making much sense).
| ldenoue wrote:
| The current Flash-8B model I use costs $1 per 500 hours
| of transcript.
| andai wrote:
| If I read OpenAI's pricing right, then Google's thing is
| _200 times_ cheaper?
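|
| Back-of-envelope, on one possible reading (assuming OpenAI's
| Whisper API at $0.006/min against the $1 per 500 hours quoted
| above; both figures are assumptions and may be stale):
|
|   whisper_api = 500 * 60 * 0.006  # 500 h of audio -> $180
|   gemini_flash = 1.0              # quoted Flash-8B cost
|   print(whisper_api / gemini_flash)  # 180.0, roughly "200 times"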
| creato wrote:
| I use youtube closed captions all the time when I don't
| want to have audio. The captions are almost always fine. I
| definitely am not watching videos that would have had
| professional/human edited captions either.
|
| There may be mistakes like the ones you mentioned (getting
| names wrong/inconsistent), but if I know what was intended,
| it's pretty easy to ignore that. I think expecting
| "textual" correctness is unreasonable. Usually when there
| are mistakes, they are "phonetic", i.e. if you spoke the
| caption out loud, it would sound pretty similar to what was
| spoken in the video.
| dqv wrote:
| > I think expecting "textual" correctness is
| unreasonable.
|
| Of course you think that, you don't have to rely solely
| on closed captions! It's usually not even posed as an
| expectation, but as a request to correct captions that
| don't make sense. Especially now that we have auto-
| captioning _and_ tools that auto-correct the captions,
| running through and tweaking them to near-perfect
| accuracy is not an undue burden.
|
| > if you spoke the caption out loud, it would sound
| pretty similar to what was spoken in the video.
|
| Yes, but most deaf people can't do that. Even if they
| can, they shouldn't have to.
| beeboobaa6 wrote:
| There's helping people and there's infantilizing them.
| Being deaf doesn't mean you're stupid. They can figure it
| out.
|
| Deleting thousands of hours of course material because
| you're worried they're not able to understand
| autogenerated captions just ensures everyone loses. Don't
| be so ridiculous.
| mst wrote:
| They continue to be the worst automated transcripts I
| encounter and personally I find them sufficiently
| terribad that every time I try them I end up filing them
| under "nope, still more trouble than it's worth, gonna
| find a different source for this information and give
| them another go in six months."
|
| Even mentally sounding them out (which is fine for me
| since I have no relevant disabilities, I just despise
| trying to take in any meaningful quantity of information
| from a video) when they look weird doesn't make them
| tolerable *for me*.
|
| It's still a good thing overall that they're tolerable
| for you, though, and I hope other people are on average
| finding the experience closer to how you find it than how
| I find it ... but I definitely don't, yet.
|
| Hopefully in a year or so I'll be in the same camp as you
| are, though, overall progress in the relevant class of
| tech seems to've hit a pretty decent velocity these days.
| ldenoue wrote:
| Definitely: and just giving the LLM context before
| correcting (in this case the title and description of the
| video, often written by a person) creates much better
| transcripts.
| GaggiX wrote:
| Youtube captions have improved massively in recent years;
| they are flawless in most cases, with occasional errors
| (almost entirely in reporting numbers).
|
| I think that the biggest problem is that the subtitles do
| not distinguish between the speakers.
| jonas21 wrote:
| Yes, but the DOJ determined that the auto-generated captions
| were "inaccurate and incomplete, making the content
| inaccessible to individuals with hearing disabilities." [1]
|
| If the automatically-generated captions are now of a similar
| quality as human-generated ones, then that changes things.
|
| [1] https://news.berkeley.edu/wp-
| content/uploads/2016/09/2016-08...
| cavisne wrote:
| What happened here is a specific scam where companies are
| targeted for ADA violations, which are so vague it's
| impossible to "comply".
| PeterStuer wrote:
| Youtube auto-captions are extremely poor compared to e.g.
| running the audio through Whisper.
| delusional wrote:
| That's a legal issue. If humans wanted that content to be up,
| we could just have agreed to keep it up. Legal issues don't get
| solved by technology.
| jazzyjackson wrote:
| Well. The legal complaint was that transcripts don't exist.
| The issue was that it was prohibitively expensive to resolve
| the complaint. Now that transcription is 0.1% of the cost it
| was 8 years ago, maybe the complaint could have been
| resolved.
|
| Is building a ramp to meet ADA requirements not using
| technology to solve a legal issue?
| delusional wrote:
| Nowhere on the linked page at least does it say that it was
| due to cost. It would seem more likely to me that it was a
| question of nobody wanting to bother standing up for the
| videos. If nobody wants to take the fight, the default
| judgement becomes to take it down.
|
| Building a ramp solves a problem. Pointing at a ramp 5
| blocks away 7 years later and asking "doesn't this solve
| this issue" doesn't.
| pests wrote:
| Yet this feels very Harrison Bergeron to me. To handicap
| those with ability so we all can be at the same level.
| fuzzy_biscuit wrote:
| Right. The judgment doesn't help people with disabilities
| at all. It only punishes the rest of the population.
| yard2010 wrote:
| Yet. Legal issues don't get solved by tech yet!
| hackernewds wrote:
| What a silly requirement. Since 1% cannot benefit, let's
| remove it for the 99%.
| 3abiton wrote:
| It's one of those "to motivate the horse to run 1% faster,
| you add a shit ton of weight on top of it" strategies.
| IanCal wrote:
| The problem is that having that rule results in those 1%s
| always being excluded. It's probably worth just going back
| and looking at the arguments for laws around accessibility.
| mst wrote:
| Yeah, every time I try and figure out an approach that
| could've avoided this being covered by the rules without
| making it easy for everybody to screw over deaf people
| entirely I end up coming to the conclusion that there
| probably isn't one.
|
| I'm somewhat tempted to think that whoever sued Berkeley
| and had the whole thing taken down in this specific case
| was just being a knob, but OTOH there's issues even with
| that POV in terms of letting precedents be set that will de
| facto still become "screw over deaf people entirely" even
| when everybody involved is doing their best to act in good
| faith.
|
| Hopefully speech-to-text and text-to-speech will make the
| question moot in the medium term.
| freedomben wrote:
| > _Hopefully speech-to-text and text-to-speech will make
| the question moot in the medium term._
|
| I really think this and other tech advances are going to
| be our saviors. It's still early days and it sometimes
| gets things wrong, but it's going to get good and it will
| basically allow us to have our cake and eat it too (as
| long as we can prevent having automated solutions
| banned).
| mst wrote:
| Yeah, my hopes have the caveat of "this requires
| regulations to catch up to where technology is at rather
| than making everything worse" and in addition to my
| generally low opinion of politicians (the ones I've voted
| for absolutely included) there's a serious risk of a
| "boomers versus technology" incident spannering it even
| if everything else goes right ... but I can still *hope*
| even if I can see a number of possible futures where said
| hopes will turn out to be in vain.
| Thorrez wrote:
| In the past, my university was publishing and mailing me a
| print magazine, and making it available in pdf form online.
| Then they stopped making the pdf available. I emailed them
| and asked why. They said it's because the pdf wasn't
| accessible.
|
| But the print form was even less accessible, and they kept
| publishing that...
| giancarlostoro wrote:
| ADA compliance will cost you.
| kleiba wrote:
| Note that Berkeley is in theory _not required_ to remove the
| video archive. It's just that, by law, they are required to
| add captions. So, if they want to keep it up, that's what
| they could do. Except that it's not really a choice - the
| costs of doing so would be prohibitive. So, really, Berkeley
| is left with no choice: "make the recordings accessible or
| don't offer them at all" means - in practice - "don't offer
| them at all".
|
| Clearly the result of a regulation that meant well. But the
| road to hell is paved with good intentions.
|
| It's a bit reminiscent of a law that prevents institutions
| from continually offering employees non-permanent work
| contracts. As in, after two fixed-term contracts, the third
| one must be permanent. The idea is to guarantee workers more
| stable and long-term perspectives. The result, however, is
| that the employee's contract won't get renewed at all after
| the second one, and instead someone else will be hired on a
| non-permanent contract.
| freedomben wrote:
| > _the road to hell is paved with good intentions_
|
| The longer I live the more the truth of this gets
| reinforced. We humans really are kind of bad at designing
| systems and/or solving problems (especially problems of our
| own making). Most of us are like Ralph Wiggum with a crayon
| sticking out of our noses saying, "I'm helping!"
| IanCal wrote:
| Probably quite expensive over the whole catalog, but the
| Berkeley content would be cheap to do.
|
| If it's, say, 5000 hours, then through the best model at
| assembly.ai with no discounts it'd cost less than $2000. I know
| someone could do Whisper for cheaper, and there would likely be
| discounts at this rate, but worst case it seems very doable
| even for an individual.
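|
| Checking that arithmetic (the hourly rate is an assumption,
| roughly assembly.ai's list price at the time):
|
|   hours = 5000
|   usd_per_hour = 0.37          # assumed "best model" list price
|   print(hours * usd_per_hour)  # 1850.0 -> under $2000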
| ldenoue wrote:
| My repo doesn't reprocess the audio track: instead it makes
| the raw ASR text transcript better by feeding the LLM
| additional info (title and description) and asking it to fix
| errors.
|
| It is not perfect (it sometimes replaces words with a
| synonym), but it is much faster and cheaper.
|
| Gemini 1.5 Flash-8B costs only $1 per 500 hours of
| transcript.
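|
| Roughly, the correction call looks like this (a sketch using
| the google-generativeai SDK; the repo's actual prompt may
| differ):
|
|   import google.generativeai as genai
|
|   genai.configure(api_key="...")  # user-supplied key
|   model = genai.GenerativeModel("gemini-1.5-flash-8b")
|
|   def correct_chunk(chunk: str, title: str, description: str) -> str:
|       prompt = (
|           "Fix speech recognition errors in this transcript of "
|           "spoken language. Do not paraphrase or summarize.\n"
|           f"Title: {title}\nDescription: {description}\n\n{chunk}"
|       )
|       return model.generate_content(prompt).text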
| ei23 wrote:
| With an RTX 4090 and insanely-fast-whisper on whisper-
| large-v3-turbo (see Whisper-WebUI for easy testing) you can
| transcribe 5000h on consumer hardware in about 50h with
| timestamps. So, yeah. I also know someone.
| IanCal wrote:
| I can also run this all locally; my point was more that, at
| worst, the most advanced model right now (afaik, I'm not
| personally benchmarking), paid for at the headline rates,
| costs such a reasonable amount for a huge content library
| that an individual can cover it. I've donated more to single
| charities than this would cost; while it's not an
| insignificant sum, it's a "find one person who cares enough"
| level problem.
|
| Grabbing the audio from thousands of hours of video, or
| even just managing getting the content from wherever it's
| stored, is probably more of an issue than actually creating
| the transcripts.
|
| If anyone reading this has access to the original
| recordings, this is a pretty great time to get
| transcriptions.
| georgecmu wrote:
| A bit of an aside, but the entire Berkeley collection has been
| saved by and is available at archive.org:
| https://archive.org/search?query=subject%3A%22webcast.berkel...
|
| It would be great if they were annotated and served in a more
| user-friendly fashion.
|
| As a bonus link, one of my favorite courses from the time:
| https://archive.org/details/ucberkeley_webcast_itunesu_35482...
| freedomben wrote:
| Neat, thanks!
| wood_spirit wrote:
| As an aside, has anyone else had some big hallucinations with the
| Gemini meet summaries? Have been using it a week or so and loving
| the quality of the grammar of the summary etc, but noticed two
| recurring problems: omitting what was actually the most important
| point raised, and hallucinating things like "person x suggested y
| do z" when, really, that is absolutely the last thing x would
| really suggest!
| hunter2_ wrote:
| It can simultaneously be [the last thing x would suggest] and
| [a conclusion that an uninvolved person tasked with summarizing
| might mistakenly draw, with slightly higher probability of
| making this mistake than not making it] and theoretically an
| LLM attempts to output the latter. The same exact principle
| applies to missing the most important point.
| leetharris wrote:
| The Google ASR is one of the worst on the internet. We run
| benchmarks of the entire industry regularly and the only
| hyperscaler with a good ASR is Azure. They acquired Nuance for
| $20b a while ago and they have a solid lead in the cloud space.
|
| And to run it on a "free" product they probably use a very
| tiny, heavily quantized version of their already weak ASR.
|
| There are lots and lots of better meeting bots if you don't
| mind paying, or if your usage is low enough for a free tier.
| At Rev we give away something like 300 minutes a month.
| baxtr wrote:
| Very interesting. Thanks for sharing.
|
| Since you have experience in this, I'd like to hear your
| thoughts on a common assumption.
|
| It goes like this: don't build anything that would be a
| feature for a hyperscaler, because ultimately they win.
|
| I guess a lot of it is a question of timing?
| leetharris wrote:
| I think it really depends on whether or not you can offer a
| competitive solution and what your end goals are. Do you
| want an indie hacker business, do you want a lifestyle
| business, do you want a big exit, do you want to go public,
| etc?
|
| It is hard to compete with these hyperscalers because they
| use pseudo anti-competitive tactics that honestly should be
| illegal.
|
| For example, I know some ASR providers have lost deals to
| GCP or AWS because those providers will basically throw in
| ASR for free if you sign up for X amount of EC2 or Y amount
| of S3, services that have absurd margins for the cloud
| providers.
|
| Still, stuff like Supabase, Twilio, etc show there is a
| market. But it's likely shrinking as consolidation
| continues, exits slow, and the DOJ turns a blind eye to all
| of this.
| hackernewds wrote:
| Counter argument: Zoom, DocuSign
|
| But you do have to be next to amazing at execution
| mst wrote:
| I think those are cases of successfully becoming *the*
| company for the thing in the minds of decision makers
| before the hyperscalers decide to try and turn your
| product into a bundleable feature.
|
| Which is not to disagree with you, only to "yes, and" to
| emphasise that it's a fairly narrow path and 'amazing at
| execution' is necessary but not sufficient.
| aftbit wrote:
| Are there any self-hosted options that are even remotely
| competitive? I have tried Whisper2 a fair bit, and it seems
| to work okay in very clean situations, like adding subtitles
| to movie dialog, but not so well when dealing with multiple
| speakers or poor audio quality.
| albertzeyer wrote:
| K2/Kaldi uses more traditional ASR technology. It's
| probably more difficult to set up, but you will get more
| reliable outputs (no hallucinations or the like).
| jll29 wrote:
| Interesting. Do you have any peer-reviewed scientific
| publications or technical reports regarding this work?
|
| We also compared Amazon, Google, Microsoft Azure as well as a
| bunch of smaller players (from Edinburgh and Cambridge) and -
| consistent with what you reported - we also found Google
| ranked worst - but that was a one-off study from 2019
| (unpublished) on financial news.
|
| Word Error Rate (WER), the standard metric for the task, is
| not everything. For some applications, the ability to upload
| custom lexicons is paramount: ASR systems that are word-based
| (almost all of them), as opposed to phoneme-based, require
| each word to be defined before they can recognize it.
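|
| For reference, with S substitutions, D deletions, and I
| insertions against an N-word reference:
|
|   \mathrm{WER} = \frac{S + D + I}{N}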
| depr wrote:
| Have you tested their new Chirp v2 model? Curious if there's
| any improvement there.
|
| >the only hyperscaler with a good ASR is Azure
|
| How would you say the non-hyperscalers compare? Speechmatics
| for example?
| leetharris wrote:
| The main challenge with using LLMs pretrained on internet text
| for transcript correction is that you reduce verbatimicity due to
| the nature of an LLM wanting to format every transcript as
| internet text.
|
| Talking has a lot of nuances to it. Just try to read a Donald
| Trump transcript. A professional author would never write a
| book's dialogue like that.
|
| Using a generic LLM on transcripts almost always reduces
| accuracy as a whole. We have endless benchmark data to
| demonstrate this at RevAI. It does, however, help with custom
| vocabulary, rare words, and proper nouns, and some people
| prefer the "readability" of an LLM-formatted transcript. It
| will read more like a Wikipedia page or a book, as opposed to
| the true nature of a transcript, which can be ugly, messy, and
| hard to parse at times.
| dylan604 wrote:
| > A professional author would never write a book's dialogue
| like that.
|
| That's a bit too far. Ever read Huck Finn?
| phrotoma wrote:
| I googled "verbatimicity" and all I could find was stuff
| published by rev.ai which didn't (at a quick glance) define the
| term. Can you clarify what this means?
| depr wrote:
| Most likely they mean the degree of being verbatim or exact
| in reproduction.
| icelancer wrote:
| Nice use of an LLM - we use Groq 70b models for this in our
| pipelines at work (after running WhisperX ASR on meeting files
| and such).
|
| One of the better reasons to use Cerebras/Groq that I've
| found: you can get huge amounts of clean text back fast for
| processing in other ways.
| ldenoue wrote:
| Although Gemini accepts very long input context, I found that
| sending more than 512 or so words at a time to the LLM for
| "cleaning up the text" yields hallucinations. That's why I
| chunk the raw transcript into 512-word chunks.
|
| Are you saying it works with 70B models on Groq? Mixtral,
| Llama? Other?
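|
| The chunking itself is simple (a sketch; the repo may split
| differently):
|
|   def chunk_words(text: str, size: int = 512) -> list[str]:
|       # Split the raw transcript into ~size-word pieces.
|       words = text.split()
|       return [" ".join(words[i:i + size])
|               for i in range(0, len(words), size)]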
| icelancer wrote:
| Yeah, I've had no issues sending tokens up to the context
| limit. I cut it off with a 10% buffer but that's just to
| ensure I don't run into tokenization miscounting between
| tiktoken and whatever tokenizer my actual LLM uses.
|
| I have had little success with Gemini and long videos. My
| pipeline is video -> ffmpeg strip audio -> whisperX ASR ->
| groq (L3-70b-specdec) -> gpt-4o/sonnet-3.5 for summarization.
| Works great.
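|
| The ffmpeg step, for instance, is just (filenames are
| placeholders):
|
|   import subprocess
|
|   # -vn drops the video stream; keep audio only for ASR.
|   subprocess.run(["ffmpeg", "-i", "meeting.mp4", "-vn",
|                   "-acodec", "libmp3lame", "meeting.mp3"],
|                  check=True)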
| tombh wrote:
| ASR: Automatic Speech Recognition
| joshdavham wrote:
| I was too afraid to ask!
| throwaway106382 wrote:
| Not to be confused with "Autonomous Sensory Meridian Response"
| (ASMR) - a popular category of video on Youtube.
| hackernewds wrote:
| How would they be confused?
| xanth wrote:
| This was a clever jape; a good example of ironic anti-
| humor. But I don't think you were confused by it either ;)
| djmips wrote:
| clever japes are not desired on HN - there's Reddit for
| that my friend.
| wodenokoto wrote:
| I can't explain the how, but I thought it was the ASMR
| thing the title referred to.
| throwaway106382 wrote:
| I think more people actually know what ASMR is as opposed
| to ASR. Lots of ASMR videos are people speaking/whispering
| at extremely low volume.
|
| I don't think it's quite out of the realm of possibility
| to have interpreted it as "Gemini LLM corrects ASMR YouTube
| transcripts". Because you know... they're whispering, so it
| might be hard to understand or transcribe.
| thaumasiotes wrote:
| Is that different from "speech-to-text"?
| sorenjan wrote:
| Using an LLM to correct text is a good idea, but the text
| transcript doesn't carry information about how confident the
| speech-to-text conversion is. Whisper can output a confidence
| for each word, so this would probably make for a better
| pipeline. It would surprise me if Google doesn't do something
| like this soon, although maybe a good speech-to-text model is
| too computationally expensive for Youtube at the moment.
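|
| With openai-whisper, per-word probabilities come back when
| word timestamps are enabled; a sketch of flagging
| low-confidence words for the LLM to reconsider:
|
|   import whisper
|
|   model = whisper.load_model("large-v3")
|   result = model.transcribe("audio.mp3", word_timestamps=True)
|
|   for seg in result["segments"]:
|       for w in seg["words"]:
|           if w["probability"] < 0.5:  # uncertain -> candidate fix
|               print(w["word"], round(w["probability"], 2))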
| dylan604 wrote:
| Depends on your purpose for the transcript. If you are
| expecting the exact words spoken in written form, then any
| deviation from that is no longer a transcription. At that
| point it is text loosely based on the spoken content.
|
| Once you accept that it's okay for the LLM to just replace
| words in a transcript, you might as well just let it make up a
| story based on character names you've provided.
| falcor84 wrote:
| > any deviation from that is no longer a transcription
|
| That's a wild exaggeration. Professional transcripts often
| have small (and not-so-small) mistakes, caused by typos,
| mishearing, or lack of familiarity with the subject matter.
| Depending on the case, these are then manually proofread, but
| even after proofreading some mistakes often remain, and
| occasionally new ones are even introduced.
| dylan604 wrote:
| Maybe, but typos are not the same thing as an LLM thinking
| of a better next choice of words rather than just
| transcribing what was heard.
| kelvinjps wrote:
| Google should have the tech needed for good AI transcription,
| so why don't they integrate it into their auto-captioning
| instead of offering those crappy auto subtitles?
| briga wrote:
| Are they crappy though? Most of the time they get things
| right, even if they aren't as accurate as a human. And sure,
| Google probably has better techniques for this, but are they
| cost-effective to run at YouTube scale? I think the current
| solution is good enough for most purposes, even if it isn't
| perfect.
| InsideOutSanta wrote:
| I'm watching YouTube videos with subtitles for my wife, who
| doesn't speak English. For videos on basic topics where
| people speak clear, unaccented English, they work fine (i.e.
| you usually get what people are saying). If the topic is in
| any way unusual, the recording quality is poor, or people
| have accents, the results very quickly turn into a garbled
| mess that is incomprehensible at best, and misleading (i.e.
| the subtitles seem coherent, but are wrong) at worst.
| wahnfrieden wrote:
| Japanese auto captions suck
| summerlight wrote:
| YT is using USM, which is supposed to be their SOTA ASR model.
| Gemini has much better linguistic knowledge, but it's likely
| prohibitively expensive to run on all the YT videos uploaded
| every day. This "correction" approach does seem to be a nice,
| cost-effective way to apply an LLM.
| Timwi wrote:
| Can I use this to generate subtitles for my own videos? I would
| love to have subtitles on them but I can't be bothered to do all
| the timing synchronization by hand. Surely there must be a way to
| automate that?
| geor9e wrote:
| That's called YouTube Automatic Speech Recognition
| (captioning), and it's what this tool uses as input. You can
| turn it on in YouTube Studio.
| sidcool wrote:
| This is pretty cool. But at the risk of a digression, I can't
| imagine sharing my API keys with a random website on HN. There
| has to be a safer approach to this: limited-use API keys,
| rate-limited API keys, deliberately "unsafe" API keys, etc.
| thomasahle wrote:
| Can't you just create a new API key with a limited budget?
| ldenoue wrote:
| I should do that, let me try.
| sidcool wrote:
| The risk of leakage is very high. If Anthropic, Google, and
| OpenAI could provide dispensable keys, that would be great.
| thomasahle wrote:
| Both OpenAI and Anthropic let you disable and delete keys.
| I'd be surprised if Google doesn't.
| mst wrote:
| I'm aware this isn't a *proper* solution, but "throw your
| current API key at it, then as soon as you're done playing
| around, execute a test of your API key rotation scripting"
| isn't a terrible workaround, especially if you're the sort of
| person who really *meant* to have tested said scripting
| recently but kept not getting around to it ("hi").
| pachico wrote:
| Hmm, so this is expecting me to upload a personal API Key...
___________________________________________________________________
(page generated 2024-11-26 23:01 UTC)