[HN Gopher] Show HN: Gemini LLM corrects ASR YouTube transcripts
___________________________________________________________________
Show HN: Gemini LLM corrects ASR YouTube transcripts
Author : ldenoue
Score : 55 points
Date : 2024-11-25 18:44 UTC (4 hours ago)
(HTM) web link (ldenoue.github.io)
(TXT) w3m dump (ldenoue.github.io)
| alsetmusic wrote:
| Seems like one of the places where LLMs make a lot of sense. I
| see some boneheaded transcriptions in videos pretty regularly.
| Comparing them against "more-likely" words or phrases seems like
| an ideal use case.
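  | A minimal sketch of what such a correction pass could look like,
  | assuming the google-generativeai Python SDK; the prompt, API key
  | placeholder, and caption text below are illustrative, not the
  | author's actual code:
  |
  |     # Sketch only: assumes `pip install google-generativeai`
  |     import google.generativeai as genai
  |
  |     genai.configure(api_key="GEMINI_API_KEY")  # placeholder key
  |     model = genai.GenerativeModel("gemini-1.5-flash")
  |
  |     raw_captions = "the mitochondria is the power house of the sell"
  |     prompt = (
  |         "Fix speech recognition errors in these captions. Keep the "
  |         "wording and word order; only replace misheard words:\n"
  |         + raw_captions
  |     )
  |     print(model.generate_content(prompt).text)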
| petesergeant wrote:
  | Also useful, I think, for checking human-entered transcriptions,
  | which, even on expensively produced shows, can often be garbage
  | or just wrong. One human + two separate LLMs, and something to
| tie-break, and we could possibly finally get decent subtitles
| for stuff.
| leetharris wrote:
| A few problems with this approach:
|
| 1. It brings everything back to the "average." Any outliers get
| discarded. For example, someone who is a circus performer plays
| fetch with their frog. An LLM would think this is an obvious
| error and correct it to "dog."
|
  | 2. LLMs want to format everything as internet text, which does
  | not align well with natural human speech.
|
| 3. Hallucinations still happen at scale, regardless of model
| quality.
|
| We've done a lot of experiments on this at Rev and it's still
| useful for the right scenario, but not as reliable as you may
| think.
| falcor84 wrote:
| Regarding the frog, I would assume that the way to address
| this would be to feed the LLM screenshots from the video, if
| the budget allows.
| leetharris wrote:
| Generally yes. That being said, sometimes multimodal LLMs
| show decreased performance with extra modalities.
|
| The extra dimensions of analysis cause increased
| hallucination at times. So maybe it solves the frog
| problem, but now it's hallucinating in another section
| because it got confused by another frame's tokens.
|
| One thing we've wanted to explore lately has been video
| based diarization. If I have a video to accompany some
  | audio, can I help with cross-talk and sound separation by
  | matching lips to the audio and assigning the correct speaker
  | more accurately? There's likely something there.
| devmor wrote:
| Those transcriptions are already done by LLMs in the first
| place - in fact, audio transcription was one of the very first
| large scale commercial uses of the technology in its current
| iteration.
|
  | This is just playing a game of Markov telephone, where the step
  | in OP's solution likely costs more compute than the step YT
  | uses, because YT is interested in minimizing costs.
| albertzeyer wrote:
  | Probably just "regular" LMs, not large LMs, I assume: some LM
  | with 10-100M params or so, which is cheap to use (and very
  | standard for ASR).
| dylan604 wrote:
| What about the cases where the human speaking is actually using
| nonsense words during a meandering off topic bit of "weaving"?
| Replacing those nonsense words would be a disservice as it
| would totally change the tone of the speech.
| dr_dshiv wrote:
| The first time I used Gemini, I gave it a youtube link and asked
| for a transcript. It told me how I could transcribe it myself.
| Honestly, I haven't used it since. Was that unfair of me?
| robrenaud wrote:
| Gemini is much worse as a product than 4o or Claude. I
  | recommend using it from Google AI Studio rather than the
  | official consumer-facing interface. But for tasks with large
| audio/visual input, it's better than 4o or Claude.
|
| Whether you want to deal with it being annoying is your call.
| Spooky23 wrote:
| The consumer Gemini is very prudish and optimized against risk
| to Google.
| andai wrote:
| GPT told me the same thing when I asked it to make an API call,
| or do an image search, or download a transcript of a YouTube
| video, or...
| jazzyjackson wrote:
| Thinking about that time Berkeley delisted thousands of
| recordings of course content as a result of a lawsuit complaining
| that they could not be utilized by deaf individuals. Can this be
  | resolved with current technology? Google's auto-captioning has
  | been abysmal up to this point; I've often wondered what it would
  | cost Google to run modern tech over the entire backlog of
  | YouTube. At least then they might have a new source of training
| data.
|
| https://news.berkeley.edu/2017/02/24/faq-on-legacy-public-co...
|
| Discussed at the time (2017)
| https://news.ycombinator.com/item?id=13768856
| andai wrote:
| Didn't YouTube have auto-captions at the time this was
  | discussed? Yeah, they're a bit dodgy, but I often watch videos in
  | public with the sound muted, and 90% of the time you can guess
  | what word was meant from context. (And indeed more recent
| models do way, way, way better on accuracy.)
| jazzyjackson wrote:
| Definitely depends on audio quality and how closely a
| speaker's dialect matches the mid-atlantic accent, if you
| catch my drift.
|
  | IME YouTube transcripts are completely devoid of meaningful
| information, especially when domain-specific vocabulary is
| used.
| zehaeva wrote:
| I have a few Deaf/Hard of Hearing friends who find the auto-
| captions to be basically useless.
|
  | Anything that's even remotely domain-specific becomes a
  | garbled mess. Even the captions on documentaries about light
  | engineering/archaeology/history subjects are hilariously bad.
  | Names of historical places and people are randomly correct
  | and almost never consistent.
  |
  | The second anyone has a bit of an accent, it's completely
  | useless.
|
| I keep them on partially because I'm of the "everything needs
| to have subtitles else I can't hear the words they're saying"
| cohort. So I can figure out what they really mean, but if you
| couldn't hear anything I can see it being hugely
| distracting/distressing/confusing/frustrating.
| hunter2_ wrote:
| With this context, it seems as though correction-by-LLM
| might be a net win among your Deaf/HoH friends even if it
| would be a net loss for you, since you're able to correct
| on the fly better than an LLM probably would, while the
| opposite is more often true for them, due to differences in
| experience with phonetics?
|
| Soundex [0] is a prevailing method of codifying phonetic
| similarity, but unfortunately it's focused on names
| exclusively. Any correction-by-LLM really ought to generate
| substitution probabilities weighted heavily on something
| like that, I would think.
|
| [0] https://en.wikipedia.org/wiki/Soundex
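  | For reference, classic Soundex is only a few lines; a rough
  | Python sketch of the original names-oriented algorithm, not a
  | production phonetic matcher:
  |
  |     def soundex(word: str) -> str:
  |         """Classic 4-character Soundex code, e.g. 'Robert' -> 'R163'."""
  |         if not word:
  |             return ""
  |         groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
  |                   "l": "4", "mn": "5", "r": "6"}
  |
  |         def code(ch):
  |             for letters, d in groups.items():
  |                 if ch in letters:
  |                     return d
  |             return ""  # vowels, y, h, w carry no digit
  |
  |         word = word.lower()
  |         out, prev = [], code(word[0])
  |         for ch in word[1:]:
  |             d = code(ch)
  |             if d and d != prev:      # collapse adjacent identical codes
  |                 out.append(d)
  |             if ch not in "hw":       # h/w do not break a run of codes
  |                 prev = d
  |         return (word[0].upper() + "".join(out) + "000")[:4]
  |
  | Under this scheme "frog" and "fraug" both map to F620, while
  | "frog" and "dog" differ, which is roughly the kind of phonetic
  | weighting a correction model would want.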
| jonas21 wrote:
| Yes, but the DOJ determined that the auto-generated captions
| were "inaccurate and incomplete, making the content
| inaccessible to individuals with hearing disabilities." [1]
|
| If the automatically-generated captions are now of a similar
| quality as human-generated ones, then that changes things.
|
| [1] https://news.berkeley.edu/wp-
| content/uploads/2016/09/2016-08...
| delusional wrote:
| That's a legal issue. If humans wanted that content to be up,
  | we could just have agreed to keep it up. Legal issues don't get
| solved by technology.
| jazzyjackson wrote:
| Well. The legal complaint was that transcripts don't exist.
| The issue was that it was prohibitively expensive to resolve
| the complaint. Now that transcription is 0.1% of the cost it
| was 8 years ago, maybe the complaint could have been
| resolved.
|
| Is building a ramp to meet ADA requirements not using
| technology to solve a legal issue?
| delusional wrote:
| Nowhere on the linked page at least does it say that it was
| due to cost. It would seem more likely to me that it was a
| question of nobody wanting to bother standing up for the
| videos. If nobody wants to take the fight, the default
| judgement becomes to take it down.
|
| Building a ramp solves a problem. Pointing at a ramp 5
| blocks away 7 years later and asking "doesn't this solve
| this issue" doesn't.
| pests wrote:
  | Yet this feels very Harrison Bergeron to me: handicapping
  | those with ability so we can all be at the same level.
| wood_spirit wrote:
| As an aside, has anyone else had some big hallucinations with the
  | Gemini Meet summaries? I've been using them for a week or so and
  | loving the quality of the grammar of the summaries, but I've
  | noticed two recurring problems: omitting what was actually the
  | most important point raised, and hallucinating things like
  | "person x suggested y do z" when, really, that is absolutely the
  | last thing x would suggest!
| hunter2_ wrote:
| It can simultaneously be [the last thing x would suggest] and
| [a conclusion that an uninvolved person tasked with summarizing
| might mistakenly draw, with slightly higher probability of
| making this mistake than not making it] and theoretically an
| LLM attempts to output the latter. The same exact principle
| applies to missing the most important point.
| leetharris wrote:
| The Google ASR is one of the worst on the internet. We run
| benchmarks of the entire industry regularly and the only
| hyperscaler with a good ASR is Azure. They acquired Nuance for
| $20b a while ago and they have a solid lead in the cloud space.
|
| And to run it on a "free" product they probably use a very
| tiny, heavily quantized version of their already weak ASR.
|
  | There are lots and lots of better meeting bots if you don't mind
  | paying, or if your usage is low enough for a free tier. At Rev we
  | give away something like 300 minutes a month.
| baxtr wrote:
| Very interesting. Thanks for sharing.
|
| Since you have experience in this, I'd like to hear your
| thoughts on a common assumption.
|
  | It goes like this: don't build anything that would be a feature
  | for a hyperscaler, because ultimately they win.
|
| I guess a lot of it is a question of timing?
| leetharris wrote:
| I think it really depends on whether or not you can offer a
| competitive solution and what your end goals are. Do you
| want an indie hacker business, do you want a lifestyle
| business, do you want a big exit, do you want to go public,
| etc?
|
| It is hard to compete with these hyperscalers because they
| use pseudo anti-competitive tactics that honestly should be
| illegal.
|
| For example, I know some ASR providers have lost deals to
| GCP or AWS because those providers will basically throw in
| ASR for free if you sign up for X amount of EC2 or Y amount
| of S3, services that have absurd margins for the cloud
| providers.
|
| Still, stuff like Supabase, Twilio, etc show there is a
| market. But it's likely shrinking as consolidation
| continues, exits slow, and the DOJ turns a blind eye to all
| of this.
| aftbit wrote:
| Are there any self-hosted options that are even remotely
| competitive? I have tried Whisper2 a fair bit, and it seems
| to work okay in very clean situations, like adding subtitles
| to movie dialog, but not so well when dealing with multiple
| speakers or poor audio quality.
| albertzeyer wrote:
  | K2/Kaldi uses more traditional ASR technology. It's
  | probably more difficult to set up, but you will get more
  | reliable outputs (no hallucinations and the like).
| leetharris wrote:
| The main challenge with using LLMs pretrained on internet text
| for transcript correction is that you reduce verbatimicity due to
| the nature of an LLM wanting to format every transcript as
| internet text.
|
| Talking has a lot of nuances to it. Just try to read a Donald
| Trump transcript. A professional author would never write a
| book's dialogue like that.
|
| Using a generic LLM on transcripts almost always reduces accuracy
| as a whole. We have endless benchmark data to demonstrate this at
| RevAI. It does, however, help with custom vocabulary, rare words,
| proper nouns, and some people prefer the "readability" of an LLM-
  | formatted transcript. It will read more like a Wikipedia page or
| a book as opposed to the true nature of a transcript, which can
| be ugly, messy, and hard to parse at times.
| dylan604 wrote:
| > A professional author would never write a book's dialogue
| like that.
|
| That's a bit too far. Ever read Huck Finn?
| icelancer wrote:
| Nice use of an LLM - we use Groq 70b models for this in our
| pipelines at work. (After using WhisperX ASR on meeting files and
| such)
|
  | One of the better reasons to use Cerebras/Groq that I've found:
  | you can get huge amounts of clean text back fast for
  | processing in other ways.
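  | A rough sketch of that kind of cleanup call (the model id,
  | prompt, and chunking are assumptions, not the commenter's actual
  | pipeline; Groq exposes an OpenAI-compatible endpoint, so the
  | standard openai client works):
  |
  |     from openai import OpenAI
  |
  |     client = OpenAI(
  |         base_url="https://api.groq.com/openai/v1",
  |         api_key="GROQ_API_KEY",  # placeholder
  |     )
  |
  |     def clean_chunk(raw_transcript: str) -> str:
  |         """Fix obvious ASR errors without rewriting the speaker's words."""
  |         resp = client.chat.completions.create(
  |             model="llama3-70b-8192",  # assumed id; use whatever is current
  |             temperature=0,
  |             messages=[
  |                 {"role": "system", "content":
  |                     "You correct ASR transcripts. Fix misrecognized words "
  |                     "and punctuation only; never paraphrase or summarize."},
  |                 {"role": "user", "content": raw_transcript},
  |             ],
  |         )
  |         return resp.choices[0].message.content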
| tombh wrote:
| ASR: Automatic Speech Recognition
| joshdavham wrote:
| I was too afraid to ask!
| throwaway106382 wrote:
| Not to be confused with "Autonomous Sensory Meridian Response"
  | (ASMR) - a popular category of video on YouTube.
| sorenjan wrote:
| Using an LLM to correct text is a good idea, but the text
| transcript doesn't have information about how confident the
  | speech-to-text conversion is. Whisper can output a confidence for
  | each word; this would probably make for a better pipeline. It
  | would surprise me if Google doesn't do something like this soon,
  | although maybe a good speech-to-text model is too computationally
  | expensive for YouTube at the moment.
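  | A rough sketch with the openai-whisper package, flagging
  | low-probability words that a correction LLM could then focus on
  | (the 0.5 threshold is arbitrary):
  |
  |     import whisper
  |
  |     model = whisper.load_model("small")
  |     # word_timestamps=True attaches per-word timing and probability
  |     result = model.transcribe("clip.mp3", word_timestamps=True)
  |
  |     for seg in result["segments"]:
  |         for w in seg.get("words", []):
  |             # each entry has "word", "start", "end", "probability"
  |             if w["probability"] < 0.5:
  |                 print(f'{w["start"]:6.1f}s  {w["word"]!r}  '
  |                       f'p={w["probability"]:.2f}')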
| dylan604 wrote:
  | Depends on your purpose for the transcript. If you are expecting
  | the exact words spoken, in written form, then any
  | deviation from that is no longer a transcription. At that point
| it is text loosely based on the spoken content.
|
  | Once you accept it's okay for the LLM to just replace words in a
| transcript, you might as well just let it make up a story based
| on character names you've provided.
| falcor84 wrote:
| > any deviation from that is no longer a transcription
|
| That's a wild exaggeration. Professional transcripts often
| have small (and not so small) mistakes, caused by typos,
| mishearing or lack of familiarity with the subject matter.
  | Depending on the case, these are then manually proofread, but
  | even after proofreading, some mistakes often remain, and
  | occasionally new ones are even introduced.
| kelvinjps wrote:
  | Google should have the tech needed for good AI transcription, so
  | why don't they integrate it into their auto-captioning, instead
  | of offering those crappy auto subtitles?
| briga wrote:
  | Are they crappy though? Most of the time they get things right,
  | even if they aren't as accurate as a human. And sure, Google
  | probably has better techniques for this, but are they cost-
  | effective to run at YouTube scale? I think the current
  | solution is good enough for most purposes, even if it isn't
  | perfect.
| InsideOutSanta wrote:
| I'm watching YouTube videos with subtitles for my wife, who
| doesn't speak English. For videos on basic topics where
| people speak clear, unaccented English, they work fine (i.e.
| you usually get what people are saying). If the topic is in
| any way unusual, the recording quality is poor, or people
| have accents, the results very quickly turn into a garbled
| mess that is incomprehensible at best, and misleading (i.e.
| the subtitles seem coherent, but are wrong) at worst.
| wahnfrieden wrote:
| Japanese auto captions suck
___________________________________________________________________
(page generated 2024-11-25 23:00 UTC)