[HN Gopher] Show HN: Gemini LLM corrects ASR YouTube transcripts
       ___________________________________________________________________
        
       Show HN: Gemini LLM corrects ASR YouTube transcripts
        
       Author : ldenoue
       Score  : 55 points
       Date   : 2024-11-25 18:44 UTC (4 hours ago)
        
 (HTM) web link (ldenoue.github.io)
 (TXT) w3m dump (ldenoue.github.io)
        
       | alsetmusic wrote:
       | Seems like one of the places where LLMs make a lot of sense. I
       | see some boneheaded transcriptions in videos pretty regularly.
       | Comparing them against "more-likely" words or phrases seems like
       | an ideal use case.
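        | 
        | A minimal sketch of the general idea with the google-
        | generativeai SDK (model name and prompt are illustrative, not
        | necessarily what the OP uses):
        | 
        |     import google.generativeai as genai
        | 
        |     genai.configure(api_key="YOUR_API_KEY")
        |     model = genai.GenerativeModel("gemini-1.5-flash")
        | 
        |     def correct_chunk(chunk: str) -> str:
        |         # Ask for minimal edits so the output stays a
        |         # transcript rather than a rewrite.
        |         prompt = (
        |             "Fix obvious speech-recognition errors in this "
        |             "transcript chunk. Change as few words as possible "
        |             "and keep the original phrasing:\n\n" + chunk
        |         )
        |         return model.generate_content(prompt).text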
        
         | petesergeant wrote:
          | Also useful, I think, for checking human-entered
          | transcriptions, which, even on expensively produced shows, can
          | often be garbage or just wrong. One human + two separate LLMs,
          | and something to tie-break, and we could possibly finally get
          | decent subtitles for stuff.
        
         | leetharris wrote:
         | A few problems with this approach:
         | 
         | 1. It brings everything back to the "average." Any outliers get
         | discarded. For example, someone who is a circus performer plays
         | fetch with their frog. An LLM would think this is an obvious
         | error and correct it to "dog."
         | 
         | 2. LLMs want to format everything as internet text which does
         | not align well to natural human speech.
         | 
         | 3. Hallucinations still happen at scale, regardless of model
         | quality.
         | 
         | We've done a lot of experiments on this at Rev and it's still
         | useful for the right scenario, but not as reliable as you may
         | think.
        
           | falcor84 wrote:
           | Regarding the frog, I would assume that the way to address
           | this would be to feed the LLM screenshots from the video, if
           | the budget allows.
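            | 
            | A sketch of that idea, assuming the same google-generativeai
            | SDK and one PIL image per sampled frame (frame extraction
            | and the example text are placeholders):
            | 
            |     import google.generativeai as genai
            |     from PIL import Image
            | 
            |     genai.configure(api_key="YOUR_API_KEY")
            |     model = genai.GenerativeModel("gemini-1.5-flash")
            | 
            |     chunk = "he plays fetch with his dog"   # ASR text
            |     frames = [Image.open(p) for p in ["f1.jpg", "f2.jpg"]]
            | 
            |     # Frames give visual context (a frog, not a dog)
            |     # alongside the ASR text to be corrected.
            |     resp = model.generate_content(
            |         ["Correct ASR errors in this transcript, using the "
            |          "frames for context:\n" + chunk] + frames
            |     )
            |     print(resp.text)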
        
             | leetharris wrote:
             | Generally yes. That being said, sometimes multimodal LLMs
             | show decreased performance with extra modalities.
             | 
             | The extra dimensions of analysis cause increased
             | hallucination at times. So maybe it solves the frog
             | problem, but now it's hallucinating in another section
             | because it got confused by another frame's tokens.
             | 
             | One thing we've wanted to explore lately has been video
             | based diarization. If I have a video to accompany some
              | audio, can I help with cross-talk and sound separation by
              | matching lips with audio and assigning the correct speaker
              | more accurately? There's likely something there.
        
         | devmor wrote:
         | Those transcriptions are already done by LLMs in the first
         | place - in fact, audio transcription was one of the very first
         | large scale commercial uses of the technology in its current
         | iteration.
         | 
          | This is just playing a game of Markov telephone, where the
          | step in OP's solution likely has a higher compute cost than
          | the step YT uses, because YT is interested in minimizing
          | costs.
        
           | albertzeyer wrote:
            | Probably just "regular" LMs, not large LMs, I assume: some
            | LM with 10-100M params or so, which is cheap to run (and
            | very standard for ASR).
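            | 
            | For reference, the classic n-best rescoring such LMs are
            | used for looks roughly like this (a sketch with the kenlm
            | bindings; the weight is a made-up tuning value):
            | 
            |     import kenlm
            | 
            |     lm = kenlm.Model("lm.arpa")   # small n-gram LM
            |     LM_WEIGHT = 0.5               # illustrative value
            | 
            |     def rescore(nbest):
            |         # nbest: list of (acoustic_score, hypothesis_text);
            |         # return the hypothesis with the best combined score.
            |         return max(
            |             nbest,
            |             key=lambda h: h[0] + LM_WEIGHT * lm.score(h[1]),
            |         )[1]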
        
         | dylan604 wrote:
         | What about the cases where the human speaking is actually using
         | nonsense words during a meandering off topic bit of "weaving"?
         | Replacing those nonsense words would be a disservice as it
         | would totally change the tone of the speech.
        
       | dr_dshiv wrote:
       | The first time I used Gemini, I gave it a youtube link and asked
       | for a transcript. It told me how I could transcribe it myself.
       | Honestly, I haven't used it since. Was that unfair of me?
        
         | robrenaud wrote:
         | Gemini is much worse as a product than 4o or Claude. I
         | recommend using it from Google AI studio rather than the
         | official consumer facing interface. But for tasks with large
         | audio/visual input, it's better than 4o or Claude.
         | 
         | Whether you want to deal with it being annoying is your call.
        
         | Spooky23 wrote:
         | The consumer Gemini is very prudish and optimized against risk
         | to Google.
        
         | andai wrote:
         | GPT told me the same thing when I asked it to make an API call,
         | or do an image search, or download a transcript of a YouTube
         | video, or...
        
       | jazzyjackson wrote:
       | Thinking about that time Berkeley delisted thousands of
       | recordings of course content as a result of a lawsuit complaining
       | that they could not be utilized by deaf individuals. Can this be
       | resolved with current technology? Google's auto captioning has
        | been abysmal up to this point; I've often wondered what it would
        | cost Google to run modern tech over the entire backlog of
        | YouTube. At least then they might have a new source of training
        | data.
       | 
       | https://news.berkeley.edu/2017/02/24/faq-on-legacy-public-co...
       | 
       | Discussed at the time (2017)
       | https://news.ycombinator.com/item?id=13768856
        
         | andai wrote:
         | Didn't YouTube have auto-captions at the time this was
         | discussed? Yeah they're a bit dodgy but I often watch videos in
          | public with sound muted, and 90% of the time you can guess what
          | the word was meant to be from context. (And indeed more recent
         | models do way, way, way better on accuracy.)
        
           | jazzyjackson wrote:
           | Definitely depends on audio quality and how closely a
           | speaker's dialect matches the mid-atlantic accent, if you
           | catch my drift.
           | 
           | IME youtube transcripts are completely devoid of meaningful
           | information, especially when domain-specific vocabulary is
           | used.
        
           | zehaeva wrote:
           | I have a few Deaf/Hard of Hearing friends who find the auto-
           | captions to be basically useless.
           | 
           | Anything that's even remotely domain specific becomes a
            | garbled mess. Even documentaries on light
            | engineering/archeology/history subjects are hilariously bad.
            | Names of historical places and people are randomly correct
            | and almost never consistent.
           | 
           | The second anyone has a bit of an accent then it's completely
           | useless.
           | 
           | I keep them on partially because I'm of the "everything needs
           | to have subtitles else I can't hear the words they're saying"
           | cohort. So I can figure out what they really mean, but if you
           | couldn't hear anything I can see it being hugely
           | distracting/distressing/confusing/frustrating.
        
             | hunter2_ wrote:
             | With this context, it seems as though correction-by-LLM
             | might be a net win among your Deaf/HoH friends even if it
             | would be a net loss for you, since you're able to correct
             | on the fly better than an LLM probably would, while the
             | opposite is more often true for them, due to differences in
             | experience with phonetics?
             | 
             | Soundex [0] is a prevailing method of codifying phonetic
             | similarity, but unfortunately it's focused on names
             | exclusively. Any correction-by-LLM really ought to generate
             | substitution probabilities weighted heavily on something
             | like that, I would think.
             | 
             | [0] https://en.wikipedia.org/wiki/Soundex
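              | 
              | A sketch of that kind of phonetic gate, using the
              | jellyfish library's Soundex implementation (the
              | accept/reject rule is purely illustrative):
              | 
              |     import jellyfish
              | 
              |     def plausible_fix(heard: str, proposed: str) -> bool:
              |         # Only accept an LLM substitution if it sounds
              |         # like what the ASR originally heard.
              |         return (jellyfish.soundex(heard)
              |                 == jellyfish.soundex(proposed))
              | 
              |     plausible_fix("frog", "dog")     # False: reject
              |     plausible_fix("there", "their")  # True: accept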
        
           | jonas21 wrote:
           | Yes, but the DOJ determined that the auto-generated captions
           | were "inaccurate and incomplete, making the content
           | inaccessible to individuals with hearing disabilities." [1]
           | 
           | If the automatically-generated captions are now of a similar
           | quality as human-generated ones, then that changes things.
           | 
           | [1] https://news.berkeley.edu/wp-
           | content/uploads/2016/09/2016-08...
        
         | delusional wrote:
         | That's a legal issue. If humans wanted that content to be up,
         | we just could have agreed to keep it up. Legal issues don't get
         | solved by technology.
        
           | jazzyjackson wrote:
           | Well. The legal complaint was that transcripts don't exist.
           | The issue was that it was prohibitively expensive to resolve
           | the complaint. Now that transcription is 0.1% of the cost it
           | was 8 years ago, maybe the complaint could have been
           | resolved.
           | 
           | Is building a ramp to meet ADA requirements not using
           | technology to solve a legal issue?
        
             | delusional wrote:
             | Nowhere on the linked page at least does it say that it was
             | due to cost. It would seem more likely to me that it was a
             | question of nobody wanting to bother standing up for the
             | videos. If nobody wants to take the fight, the default
             | judgement becomes to take it down.
             | 
             | Building a ramp solves a problem. Pointing at a ramp 5
             | blocks away 7 years later and asking "doesn't this solve
             | this issue" doesn't.
        
               | pests wrote:
                | Yet this feels very Harrison Bergeron to me: handicapping
                | those with ability so we can all be at the same level.
        
       | wood_spirit wrote:
       | As an aside, has anyone else had some big hallucinations with the
       | Gemini meet summaries? Have been using it a week or so and loving
       | the quality of the grammar of the summary etc, but noticed two
       | recurring problems: omitting what was actually the most important
       | point raised, and hallucinating things like "person x suggested y
       | do z" when, really, that is absolutely the last thing x would
        | suggest!
        
         | hunter2_ wrote:
         | It can simultaneously be [the last thing x would suggest] and
         | [a conclusion that an uninvolved person tasked with summarizing
         | might mistakenly draw, with slightly higher probability of
         | making this mistake than not making it] and theoretically an
         | LLM attempts to output the latter. The same exact principle
         | applies to missing the most important point.
        
         | leetharris wrote:
         | The Google ASR is one of the worst on the internet. We run
         | benchmarks of the entire industry regularly and the only
          | hyperscaler with a good ASR is Azure. Microsoft acquired Nuance
          | for ~$20b a while ago and has a solid lead in the cloud space.
         | 
         | And to run it on a "free" product they probably use a very
         | tiny, heavily quantized version of their already weak ASR.
         | 
          | There are lots and lots of better meeting bots if you don't
          | mind paying or have low enough usage to fit a free tier. At Rev
          | we give away something like 300 minutes a month.
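          | 
          | (For anyone curious how such benchmarks are typically scored:
          | word error rate, e.g. with the jiwer package; the sentences
          | below are made up.)
          | 
          |     import jiwer
          | 
          |     reference  = "the frog fetched the ball"
          |     hypothesis = "the dog fetched the ball"
          | 
          |     # WER = (substitutions + insertions + deletions) / ref words
          |     print(jiwer.wer(reference, hypothesis))  # 0.2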
        
           | baxtr wrote:
           | Very interesting. Thanks for sharing.
           | 
           | Since you have experience in this, I'd like to hear your
           | thoughts on a common assumption.
           | 
            | It goes like this: don't build anything that would be a
            | feature for a hyperscaler, because ultimately they win.
           | 
           | I guess a lot of it is a question of timing?
        
             | leetharris wrote:
             | I think it really depends on whether or not you can offer a
             | competitive solution and what your end goals are. Do you
             | want an indie hacker business, do you want a lifestyle
             | business, do you want a big exit, do you want to go public,
             | etc?
             | 
             | It is hard to compete with these hyperscalers because they
             | use pseudo anti-competitive tactics that honestly should be
             | illegal.
             | 
             | For example, I know some ASR providers have lost deals to
             | GCP or AWS because those providers will basically throw in
             | ASR for free if you sign up for X amount of EC2 or Y amount
             | of S3, services that have absurd margins for the cloud
             | providers.
             | 
             | Still, stuff like Supabase, Twilio, etc show there is a
             | market. But it's likely shrinking as consolidation
             | continues, exits slow, and the DOJ turns a blind eye to all
             | of this.
        
           | aftbit wrote:
           | Are there any self-hosted options that are even remotely
           | competitive? I have tried Whisper2 a fair bit, and it seems
           | to work okay in very clean situations, like adding subtitles
           | to movie dialog, but not so well when dealing with multiple
           | speakers or poor audio quality.
        
             | albertzeyer wrote:
              | K2/Kaldi uses more traditional ASR technology. It's
              | probably more difficult to set up, but you will get more
              | reliable outputs (no hallucinations and the like).
        
       | leetharris wrote:
       | The main challenge with using LLMs pretrained on internet text
       | for transcript correction is that you reduce verbatimicity due to
       | the nature of an LLM wanting to format every transcript as
       | internet text.
       | 
       | Talking has a lot of nuances to it. Just try to read a Donald
       | Trump transcript. A professional author would never write a
       | book's dialogue like that.
       | 
       | Using a generic LLM on transcripts almost always reduces accuracy
       | as a whole. We have endless benchmark data to demonstrate this at
       | RevAI. It does, however, help with custom vocabulary, rare words,
       | proper nouns, and some people prefer the "readability" of an LLM-
        | formatted transcript. It will read more like a Wikipedia page or
       | a book as opposed to the true nature of a transcript, which can
       | be ugly, messy, and hard to parse at times.
        
         | dylan604 wrote:
         | > A professional author would never write a book's dialogue
         | like that.
         | 
         | That's a bit too far. Ever read Huck Finn?
        
       | icelancer wrote:
       | Nice use of an LLM - we use Groq 70b models for this in our
       | pipelines at work. (After using WhisperX ASR on meeting files and
       | such)
       | 
        | One of the better reasons to use Cerebras/Groq that I've found:
        | you can return huge amounts of clean text fast for processing in
        | other ways.
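        | 
        | Roughly what that cleanup step looks like with the groq Python
        | SDK (an OpenAI-style chat call; the model name and prompt are
        | just examples, not our exact pipeline):
        | 
        |     from groq import Groq
        | 
        |     client = Groq(api_key="YOUR_API_KEY")
        | 
        |     # whisperx_text: raw transcript from the WhisperX ASR pass
        |     resp = client.chat.completions.create(
        |         model="llama3-70b-8192",
        |         messages=[
        |             {"role": "system",
        |              "content": "Clean up this ASR transcript. Fix "
        |                         "errors but keep the wording verbatim."},
        |             {"role": "user", "content": whisperx_text},
        |         ],
        |     )
        |     cleaned = resp.choices[0].message.content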
        
       | tombh wrote:
       | ASR: Automatic Speech Recognition
        
         | joshdavham wrote:
         | I was too afraid to ask!
        
         | throwaway106382 wrote:
         | Not to be confused with "Autonomous Sensory Meridian Response"
         | (ASMR) - a popular category of video on Youtube.
        
       | sorenjan wrote:
       | Using an LLM to correct text is a good idea, but the text
       | transcript doesn't have information about how confident the
       | speech to text conversion is. Whisper can output confidence for
       | each word, this would probably make for a better pipeline. It
       | would surprise me if Google doesn't do something like this soon,
       | although maybe a good speech to text model is too computationally
        | expensive for YouTube at the moment.
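        | 
        | With openai-whisper that per-word confidence is already
        | exposed; a sketch of surfacing it so a correction step could
        | focus on the low-confidence words (the 0.5 cutoff is
        | arbitrary):
        | 
        |     import whisper
        | 
        |     model = whisper.load_model("small")
        |     result = model.transcribe("talk.mp3", word_timestamps=True)
        | 
        |     for seg in result["segments"]:
        |         for w in seg["words"]:
        |             # Each word carries a decoder probability.
        |             if w["probability"] < 0.5:
        |                 print("uncertain:", w["word"])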
        
         | dylan604 wrote:
          | Depends on your purpose for the transcript. If you are expecting
         | the exact form of the words spoken in written form, then any
         | deviation from that is no longer a transcription. At that point
         | it is text loosely based on the spoken content.
         | 
          | Once you accept it's okay for the LLM to just replace words in
          | a transcript, you might as well let it make up a story based on
          | character names you've provided.
        
           | falcor84 wrote:
           | > any deviation from that is no longer a transcription
           | 
           | That's a wild exaggeration. Professional transcripts often
           | have small (and not so small) mistakes, caused by typos,
           | mishearing or lack of familiarity with the subject matter.
           | Depending on the case, these are then manually proofread, but
            | even after proofreading, some mistakes often remain, and new
            | ones are occasionally even introduced.
        
       | kelvinjps wrote:
        | Google should have the tech needed for good AI transcription.
        | Why don't they integrate it into their auto-captioning, instead
        | of offering those crappy auto subtitles?
        
         | briga wrote:
          | Are they crappy though? Most of the time they get things right,
          | even if they aren't as accurate as a human. And sure, Google
          | probably has better techniques for this, but are they cost-
          | effective to run at YouTube scale? I think the current solution
          | is good enough for most purposes, even if it isn't perfect.
        
           | InsideOutSanta wrote:
           | I'm watching YouTube videos with subtitles for my wife, who
           | doesn't speak English. For videos on basic topics where
           | people speak clear, unaccented English, they work fine (i.e.
           | you usually get what people are saying). If the topic is in
           | any way unusual, the recording quality is poor, or people
           | have accents, the results very quickly turn into a garbled
           | mess that is incomprehensible at best, and misleading (i.e.
           | the subtitles seem coherent, but are wrong) at worst.
        
           | wahnfrieden wrote:
           | Japanese auto captions suck
        
       ___________________________________________________________________
       (page generated 2024-11-25 23:00 UTC)