[HN Gopher] Show HN: Gemini LLM corrects ASR YouTube transcripts
       ___________________________________________________________________
        
       Show HN: Gemini LLM corrects ASR YouTube transcripts
        
       Author : ldenoue
       Score  : 163 points
       Date   : 2024-11-25 18:44 UTC (1 day ago)
        
 (HTM) web link (ldenoue.github.io)
 (TXT) w3m dump (ldenoue.github.io)
        
       | alsetmusic wrote:
       | Seems like one of the places where LLMs make a lot of sense. I
       | see some boneheaded transcriptions in videos pretty regularly.
       | Comparing them against "more-likely" words or phrases seems like
       | an ideal use case.
        
         | petesergeant wrote:
          | Also useful, I think, for checking human-entered transcriptions,
          | which, even on expensively produced shows, can often be garbage
         | or just wrong. One human + two separate LLMs, and something to
         | tie-break, and we could possibly finally get decent subtitles
         | for stuff.
        
         | leetharris wrote:
         | A few problems with this approach:
         | 
         | 1. It brings everything back to the "average." Any outliers get
         | discarded. For example, someone who is a circus performer plays
         | fetch with their frog. An LLM would think this is an obvious
         | error and correct it to "dog."
         | 
          | 2. LLMs want to format everything as internet text, which
          | does not align well with natural human speech.
         | 
         | 3. Hallucinations still happen at scale, regardless of model
         | quality.
         | 
         | We've done a lot of experiments on this at Rev and it's still
         | useful for the right scenario, but not as reliable as you may
         | think.
        
           | falcor84 wrote:
           | Regarding the frog, I would assume that the way to address
           | this would be to feed the LLM screenshots from the video, if
           | the budget allows.
        
             | leetharris wrote:
             | Generally yes. That being said, sometimes multimodal LLMs
             | show decreased performance with extra modalities.
             | 
             | The extra dimensions of analysis cause increased
             | hallucination at times. So maybe it solves the frog
             | problem, but now it's hallucinating in another section
             | because it got confused by another frame's tokens.
             | 
              | One thing we've wanted to explore lately has been video-
              | based diarization. If I have a video to accompany some
              | audio, can I help with cross-talk and sound separation by
              | matching lips with audio and assigning the correct
              | speaker more accurately? There's likely something there.
        
               | orion138 wrote:
               | Google published Looking to Listen a while back.
               | 
               | https://research.google/blog/looking-to-listen-audio-
               | visual-...
        
           | ldenoue wrote:
            | Do you have anything to read about your studies or
            | experiments? Genuinely interested. Perhaps the prompts
            | could be made to tell the LLM it's specifically handling
            | human speech, not written text?
        
         | devmor wrote:
         | Those transcriptions are already done by LLMs in the first
         | place - in fact, audio transcription was one of the very first
         | large scale commercial uses of the technology in its current
         | iteration.
         | 
          | This is just like playing a game of Markov telephone, where
          | the step in OP's solution likely has a higher compute cost
          | than the step YT uses, because YT is interested in
          | minimizing costs.
        
           | albertzeyer wrote:
            | Probably just "regular" LMs, not large LMs, I assume:
            | some LM with 10-100M params or so, which is cheap to use
            | (and very standard for ASR).
        
         | dylan604 wrote:
         | What about the cases where the human speaking is actually using
         | nonsense words during a meandering off topic bit of "weaving"?
         | Replacing those nonsense words would be a disservice as it
         | would totally change the tone of the speech.
        
       | dr_dshiv wrote:
       | The first time I used Gemini, I gave it a youtube link and asked
       | for a transcript. It told me how I could transcribe it myself.
       | Honestly, I haven't used it since. Was that unfair of me?
        
         | robrenaud wrote:
         | Gemini is much worse as a product than 4o or Claude. I
         | recommend using it from Google AI studio rather than the
         | official consumer facing interface. But for tasks with large
         | audio/visual input, it's better than 4o or Claude.
         | 
         | Whether you want to deal with it being annoying is your call.
        
         | Spooky23 wrote:
         | The consumer Gemini is very prudish and optimized against risk
         | to Google.
        
         | andai wrote:
         | GPT told me the same thing when I asked it to make an API call,
         | or do an image search, or download a transcript of a YouTube
         | video, or...
        
       | jazzyjackson wrote:
       | Thinking about that time Berkeley delisted thousands of
       | recordings of course content as a result of a lawsuit complaining
       | that they could not be utilized by deaf individuals. Can this be
       | resolved with current technology? Google's auto captioning has
       | been abysmal up to this point, I've often wondered what the cost
       | would be for google to run modern tech over the entire backlog of
       | youtube. At least then they might have a new source of training
       | data.
       | 
       | https://news.berkeley.edu/2017/02/24/faq-on-legacy-public-co...
       | 
       | Discussed at the time (2017)
       | https://news.ycombinator.com/item?id=13768856
        
         | andai wrote:
         | Didn't YouTube have auto-captions at the time this was
         | discussed? Yeah they're a bit dodgy but I often watch videos in
         | public with sound muted and 90% of the time you can guess what
         | word it was meant to be from context. (And indeed more recent
         | models do way, way, way better on accuracy.)
        
           | jazzyjackson wrote:
           | Definitely depends on audio quality and how closely a
           | speaker's dialect matches the mid-atlantic accent, if you
           | catch my drift.
           | 
           | IME youtube transcripts are completely devoid of meaningful
           | information, especially when domain-specific vocabulary is
           | used.
        
           | zehaeva wrote:
           | I have a few Deaf/Hard of Hearing friends who find the auto-
           | captions to be basically useless.
           | 
            | Anything that's even remotely domain-specific becomes a
            | garbled mess. Even documentaries about light
            | engineering/archeology/history subjects are hilariously
            | bad. Names of historical places and people are randomly
            | correct and almost never consistent.
           | 
            | The second anyone has a bit of an accent, it's completely
            | useless.
           | 
           | I keep them on partially because I'm of the "everything needs
           | to have subtitles else I can't hear the words they're saying"
           | cohort. So I can figure out what they really mean, but if you
           | couldn't hear anything I can see it being hugely
           | distracting/distressing/confusing/frustrating.
        
             | hunter2_ wrote:
             | With this context, it seems as though correction-by-LLM
             | might be a net win among your Deaf/HoH friends even if it
             | would be a net loss for you, since you're able to correct
             | on the fly better than an LLM probably would, while the
             | opposite is more often true for them, due to differences in
             | experience with phonetics?
             | 
             | Soundex [0] is a prevailing method of codifying phonetic
             | similarity, but unfortunately it's focused on names
             | exclusively. Any correction-by-LLM really ought to generate
             | substitution probabilities weighted heavily on something
             | like that, I would think.
             | 
             | [0] https://en.wikipedia.org/wiki/Soundex
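              | 
              | For a sense of how coarse the codes are, a minimal
              | sketch of classic Soundex in Python (assumes plain
              | ASCII input):
              | 
              |   def soundex(word):
              |       groups = {"bfpv": "1", "cgjkqsxz": "2",
              |                 "dt": "3", "l": "4",
              |                 "mn": "5", "r": "6"}
              |       code = lambda c: next(
              |           (d for k, d in groups.items()
              |            if c in k), "")
              |       word = word.lower()
              |       out = word[0].upper()
              |       prev = code(word[0])
              |       for ch in word[1:]:
              |           if ch in "hw":  # transparent letters
              |               continue
              |           d = code(ch)
              |           if d and d != prev:
              |               out += d
              |           prev = d
              |       return (out + "000")[:4]
              | 
              |   # homophones collapse to the same key:
              |   print(soundex("there"), soundex("their"))
              |   # -> T600 T600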
        
               | schrodinger wrote:
               | I'd assume Soundex is too basic and English-centric to be
               | a practical solution for an international company like
                | Google. I was taught it and implemented it in a
                | freshman-level CS course in 2004; it can't be anywhere
                | near state of the art!
        
               | shakna wrote:
                | Soundex is fast but inaccurate. It only prevails
                | because of the computational cost of things like
                | Levenshtein distance.
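                | 
                | For contrast, a standard dynamic-programming
                | Levenshtein sketch; the quadratic double loop
                | is the cost in question:
                | 
                |   def lev(a, b):
                |       # O(len(a)*len(b)) table, vs. a
                |       # single linear pass for Soundex
                |       prev = list(range(len(b) + 1))
                |       for i, ca in enumerate(a, 1):
                |           cur = [i]
                |           for j, cb in enumerate(b, 1):
                |               cur.append(min(
                |                   prev[j] + 1,      # delete
                |                   cur[j-1] + 1,     # insert
                |                   prev[j-1] + (ca != cb)))
                |           prev = cur
                |       return prev[-1]
                | 
                |   print(lev("frog", "fog"))  # 1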
        
               | novok wrote:
                | You can also download just the audio with yt-dlp and
                | then remake subs with Whisper or whatever other model
                | you want. GPU-compute-wise, it will probably be less
                | than asking an LLM to try to correct a garbled
                | transcript.
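                | 
                | A rough sketch of that pipeline, assuming
                | yt-dlp and openai-whisper are installed (URL
                | and model size are placeholders):
                | 
                |   import subprocess
                |   import whisper
                | 
                |   url = "https://www.youtube.com/watch?v=..."
                |   subprocess.run(
                |       ["yt-dlp", "-x", "--audio-format",
                |        "m4a", "-o", "audio.%(ext)s", url],
                |       check=True)
                | 
                |   model = whisper.load_model("large-v3")
                |   result = model.transcribe("audio.m4a")
                |   print(result["text"])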
        
               | HPsquared wrote:
               | I suppose the gold standard would be a multimodal model
               | that also looks at the screen (maybe only if the captions
               | aren't making much sense).
        
               | ldenoue wrote:
               | The current Flash-8B model I use costs $1 per 500 hours
               | of transcript.
        
               | andai wrote:
               | If I read OpenAI's pricing right, then Google's thing is
               | _200 times_ cheaper?
        
             | creato wrote:
             | I use youtube closed captions all the time when I don't
             | want to have audio. The captions are almost always fine. I
             | definitely am not watching videos that would have had
             | professional/human edited captions either.
             | 
             | There may be mistakes like the ones you mentioned (getting
             | names wrong/inconsistent), but if I know what was intended,
             | it's pretty easy to ignore that. I think expecting
             | "textual" correctness is unreasonable. Usually when there
             | are mistakes, they are "phonetic", i.e. if you spoke the
             | caption out loud, it would sound pretty similar to what was
             | spoken in the video.
        
               | dqv wrote:
               | > I think expecting "textual" correctness is
               | unreasonable.
               | 
               | Of course you think that, you don't have to rely solely
               | on closed captions! It's usually not even posed as an
               | expectation, but as a request to correct captions that
               | don't make sense. Especially now that we have auto-
               | captioning _and_ tools that auto-correct the captions,
               | running through and tweaking them to near-perfect
               | accuracy is not an undue burden.
               | 
               | > if you spoke the caption out loud, it would sound
               | pretty similar to what was spoken in the video.
               | 
               | Yes, but most deaf people can't do that. Even if they
               | can, they shouldn't have to.
        
               | beeboobaa6 wrote:
               | There's helping people and there's infantilizing them.
               | Being deaf doesn't mean you're stupid. They can figure it
               | out.
               | 
               | Deleting thousands of hours of course material because
               | you're worried they're not able to understand
               | autogenerated captions just ensures everyone loses. Don't
               | be so ridiculous.
        
               | mst wrote:
               | They continue to be the worst automated transcripts I
               | encounter and personally I find them sufficiently
               | terribad that every time I try them I end up filing them
               | under "nope, still more trouble than it's worth, gonna
               | find a different source for this information and give
               | them another go in six months."
               | 
               | Even mentally sounding them out (which is fine for me
               | since I have no relevant disabilities, I just despise
               | trying to take in any meaningful quantity of information
               | from a video) when they look weird doesn't make them
               | tolerable *for me*.
               | 
               | It's still a good thing overall that they're tolerable
               | for you, though, and I hope other people are on average
               | finding the experience closer to how you find it than how
               | I find it ... but I definitely don't, yet.
               | 
               | Hopefully in a year or so I'll be in the same camp as you
               | are, though, overall progress in the relevant class of
               | tech seems to've hit a pretty decent velocity these days.
        
             | ldenoue wrote:
             | Definitely: and just giving the LLM context before
             | correcting (in this case the title and description of the
             | video, often written by a person) creates much better
             | transcripts.
        
             | GaggiX wrote:
              | Youtube captions have improved massively in recent
              | years: they are flawless in most cases, with occasional
              | errors (almost entirely in reporting numbers).
             | 
             | I think that the biggest problem is that the subtitles do
             | not distinguish between the speakers.
        
           | jonas21 wrote:
           | Yes, but the DOJ determined that the auto-generated captions
           | were "inaccurate and incomplete, making the content
           | inaccessible to individuals with hearing disabilities." [1]
           | 
           | If the automatically-generated captions are now of a similar
           | quality as human-generated ones, then that changes things.
           | 
           | [1] https://news.berkeley.edu/wp-
           | content/uploads/2016/09/2016-08...
        
           | cavisne wrote:
           | What happened here is a specific scam where companies are
           | targeted for ADA violations, which are so vague it's
           | impossible to "comply".
        
           | PeterStuer wrote:
            | Youtube auto-captions are extremely poor compared to,
            | e.g., running the audio through Whisper.
        
         | delusional wrote:
          | That's a legal issue. If humans wanted that content to be
          | up, we could just have agreed to keep it up. Legal issues
          | don't get solved by technology.
        
           | jazzyjackson wrote:
           | Well. The legal complaint was that transcripts don't exist.
           | The issue was that it was prohibitively expensive to resolve
           | the complaint. Now that transcription is 0.1% of the cost it
           | was 8 years ago, maybe the complaint could have been
           | resolved.
           | 
           | Is building a ramp to meet ADA requirements not using
           | technology to solve a legal issue?
        
             | delusional wrote:
             | Nowhere on the linked page at least does it say that it was
             | due to cost. It would seem more likely to me that it was a
             | question of nobody wanting to bother standing up for the
             | videos. If nobody wants to take the fight, the default
             | judgement becomes to take it down.
             | 
             | Building a ramp solves a problem. Pointing at a ramp 5
             | blocks away 7 years later and asking "doesn't this solve
             | this issue" doesn't.
        
               | pests wrote:
                | Yet this feels very Harrison Bergeron to me: handicap
                | those with ability so we can all be at the same level.
        
               | fuzzy_biscuit wrote:
               | Right. The judgment doesn't help people with disabilities
               | at all. It only punishes the rest of the population.
        
           | yard2010 wrote:
           | Yet. Legal issues don't get solved by tech yet!
        
         | hackernewds wrote:
          | What a silly requirement. Since 1% cannot benefit, let's
          | remove it for the 99%.
        
           | 3abiton wrote:
            | It's one of those "to motivate the horse to run 1% faster,
            | you add a shit ton of weight on top of it" strategies.
        
           | IanCal wrote:
           | The problem is that having that rule results in those 1%s
           | always being excluded. It's probably worth just going back
           | and looking at the arguments for laws around accessibility.
        
             | mst wrote:
             | Yeah, every time I try and figure out an approach that
             | could've avoided this being covered by the rules without
             | making it easy for everybody to screw over deaf people
             | entirely I end up coming to the conclusion that there
             | probably isn't one.
             | 
              | I'm somewhat tempted to think that whoever sued Berkeley
             | and had the whole thing taken down in this specific case
             | was just being a knob, but OTOH there's issues even with
             | that POV in terms of letting precedents be set that will de
             | facto still become "screw over deaf people entirely" even
             | when everybody involved is doing their best to act in good
             | faith.
             | 
             | Hopefully speech-to-text and text-to-speech will make the
             | question moot in the medium term.
        
               | freedomben wrote:
               | > _Hopefully speech-to-text and text-to-speech will make
               | the question moot in the medium term._
               | 
               | I really think this and other tech advances are going to
               | be our saviors. It's still early days and it sometimes
               | gets things wrong, but it's going to get good and it will
               | basically allow us to have our cake and eat it too (as
               | long as we can prevent having automated solutions
               | banned).
        
               | mst wrote:
               | Yeah, my hopes have the caveat of "this requires
               | regulations to catch up to where technology is at rather
               | than making everything worse" and in addition to my
               | generally low opinion of politicians (the ones I've voted
               | for absolutely included) there's a serious risk of a
               | "boomers versus technology" incident spannering it even
               | if everything else goes right ... but I can still *hope*
               | even if I can see a number of possible futures where said
               | hopes will turn out to be in vain.
        
           | Thorrez wrote:
           | In the past, my university was publishing and mailing me a
           | print magazine, and making it available in pdf form online.
           | Then they stopped making the pdf available. I emailed them
           | and asked why. They said it's because the pdf wasn't
           | accessible.
           | 
           | But the print form was even less accessible, and they kept
           | publishing that...
        
             | giancarlostoro wrote:
             | ADA compliance will cost you.
        
           | kleiba wrote:
        | Note that Berkeley is in theory _not required_ to remove the
        | video archive. It's just that, by law, they are required to
        | add captions. So, if they want to keep it up, that's what
        | they could do. Except that it's not really a choice - the
        | costs of doing so would be prohibitive. So, really, Berkeley
        | is left with no choice: "make the recordings accessible or
        | don't offer them at all" means - in practice - "don't offer
        | them at all".
           | 
           | Clearly the result of a regulation that meant well. But the
           | road to hell is paved with good intentions.
           | 
           | It's a bit reminiscent of a law that prevents institutions
           | from continually offering employees non-permanent work
           | contracts. As in, after two fixed-term contracts, the third
           | one must be permanent. The idea is to guarantee workers more
           | stable and long-term perspectives. The result, however, is
           | that the employee's contract won't get renewed at all after
           | the second one, and instead someone else will be hired on a
           | non-permanent contract.
        
             | freedomben wrote:
             | > _the road to hell is paved with good intentions_
             | 
             | The longer I live the more the truth of this gets
             | reinforced. We humans really are kind of bad at designing
             | systems and/or solving problems (especially problems of our
             | own making). Most of us are like Ralph Wiggum with a crayon
              | sticking out of our noses saying, "I'm helping!"
        
         | IanCal wrote:
         | Probably quite expensive over the whole catalog but the Berkley
         | content would be cheap to do.
         | 
         | If it's, say, 5000 hours then through the best model at
          | assembly.ai with no discounts it'd cost less than $2000. I know
         | someone could do whisper for cheaper, and there likely would be
         | discounts at this rate but worst case it seems very doable even
         | for an individual.
        
           | ldenoue wrote:
            | My repo doesn't reprocess the audio track: instead it makes
            | the raw ASR text transcript better by feeding it additional
            | info (title and description) and asking the LLM to fix
            | errors.
            | 
            | It is not perfect - it sometimes replaces words with a
            | synonym - but it is much faster and cheaper.
            | 
            | Gemini 1.5 Flash-8B costs about $1 per 500 hours of
            | transcript.
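            | 
            | As a back-of-envelope check of that figure (assuming
            | ~150 spoken words per minute, ~1.3 tokens per word, and
            | Flash-8B list prices of about $0.0375/M input and
            | $0.15/M output tokens - assumptions, not quoted specs):
            | 
            |   toks = 500 * 60 * 150 * 1.3 / 1e6   # ~5.9M tokens
            |   print(toks * 0.0375 + toks * 0.15)  # ~1.10 (dollars)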
        
           | ei23 wrote:
            | With an RTX 4090 and insanely-fast-whisper on whisper-
            | large-v3-turbo (see Whisper-WebUI for easy testing) you can
            | transcribe 5000h on consumer hardware in about 50h with
            | timestamps. So, yeah. I also know someone.
        
             | IanCal wrote:
             | I can also run this all locally, my point was more that at
             | the worst right now the most advanced model (afaik, I'm not
             | personally benchmarking) paid for at the headline rates,
             | for a huge content library, costs such a reasonable amount
              | that an individual can do it. I've donated more to single
              | charities than this would cost; while it's not an
              | insignificant sum, it's a "find one person who cares
              | enough" level problem.
             | 
             | Grabbing the audio from thousands of hours of video, or
             | even just managing getting the content from wherever it's
             | stored, is probably more of an issue than actually creating
             | the transcripts.
             | 
             | If anyone reading this has access to the original
             | recordings, this is a pretty great time to get
             | transcriptions.
        
         | georgecmu wrote:
         | A bit of an aside, but the entire Berkeley collection has been
         | saved by and is available at archive.org:
         | https://archive.org/search?query=subject%3A%22webcast.berkel...
         | 
         | It would be great if they were annotated and served in a more
         | user-friendly fashion.
         | 
         | As a bonus link, one of my favorite courses from the time:
         | https://archive.org/details/ucberkeley_webcast_itunesu_35482...
        
           | freedomben wrote:
           | Neat, thanks!
        
       | wood_spirit wrote:
       | As an aside, has anyone else had some big hallucinations with the
       | Gemini meet summaries? Have been using it a week or so and loving
       | the quality of the grammar of the summary etc, but noticed two
       | recurring problems: omitting what was actually the most important
       | point raised, and hallucinating things like "person x suggested y
       | do z" when, really, that is absolutely the last thing x would
       | really suggest!
        
         | hunter2_ wrote:
         | It can simultaneously be [the last thing x would suggest] and
         | [a conclusion that an uninvolved person tasked with summarizing
         | might mistakenly draw, with slightly higher probability of
         | making this mistake than not making it] and theoretically an
         | LLM attempts to output the latter. The same exact principle
         | applies to missing the most important point.
        
         | leetharris wrote:
         | The Google ASR is one of the worst on the internet. We run
         | benchmarks of the entire industry regularly and the only
         | hyperscaler with a good ASR is Azure. They acquired Nuance for
         | $20b a while ago and they have a solid lead in the cloud space.
         | 
         | And to run it on a "free" product they probably use a very
         | tiny, heavily quantized version of their already weak ASR.
         | 
          | There are lots and lots of better meeting bots if you don't
          | mind paying, or if your usage is low enough for a free tier.
          | At Rev we give away something like 300 minutes a month.
        
           | baxtr wrote:
           | Very interesting. Thanks for sharing.
           | 
           | Since you have experience in this, I'd like to hear your
           | thoughts on a common assumption.
           | 
            | It goes like this: don't build anything that would be a
            | feature for a hyperscaler, because ultimately they win.
           | 
           | I guess a lot of it is a question of timing?
        
             | leetharris wrote:
             | I think it really depends on whether or not you can offer a
             | competitive solution and what your end goals are. Do you
             | want an indie hacker business, do you want a lifestyle
             | business, do you want a big exit, do you want to go public,
             | etc?
             | 
             | It is hard to compete with these hyperscalers because they
             | use pseudo anti-competitive tactics that honestly should be
             | illegal.
             | 
             | For example, I know some ASR providers have lost deals to
             | GCP or AWS because those providers will basically throw in
             | ASR for free if you sign up for X amount of EC2 or Y amount
             | of S3, services that have absurd margins for the cloud
             | providers.
             | 
             | Still, stuff like Supabase, Twilio, etc show there is a
             | market. But it's likely shrinking as consolidation
             | continues, exits slow, and the DOJ turns a blind eye to all
             | of this.
        
             | hackernewds wrote:
             | Counter argument: Zoom, DocuSign
             | 
             | But you do have to be next to amazing at execution
        
               | mst wrote:
               | I think those are cases of successfully becoming *the*
               | company for the thing in the minds of decision makers
               | before the hyperscalers decide to try and turn your
               | product into a bundleable feature.
               | 
               | Which is not to disagree with you, only to "yes, and" to
               | emphasise that it's a fairly narrow path and 'amazing at
               | execution' is necessary but not sufficient.
        
           | aftbit wrote:
           | Are there any self-hosted options that are even remotely
           | competitive? I have tried Whisper2 a fair bit, and it seems
           | to work okay in very clean situations, like adding subtitles
           | to movie dialog, but not so well when dealing with multiple
           | speakers or poor audio quality.
        
             | albertzeyer wrote:
              | K2/Kaldi uses more traditional ASR technology. It's
              | probably more difficult to set up, but you will get more
              | reliable outputs (no hallucinations or the like).
        
           | jll29 wrote:
            | Interesting. Do you have any peer-reviewed scientific
           | publications or technical reports regarding this work?
           | 
           | We also compared Amazon, Google, Microsoft Azure as well as a
           | bunch of smaller players (from Edinburgh and Cambridge) and -
           | consistent with what you reported - we also found Google
           | ranked worst - but that was a one-off study from 2019
           | (unpublished) on financial news.
           | 
            | Word Error Rate (WER), the standard metric for the task,
            | is not everything. For some applications, the ability to
            | upload custom lexicons is paramount (ASR systems that are
            | word-based (almost all), as opposed to phoneme-based,
            | require each word to be defined before they can recognize
            | it).
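            | 
            | For reference, WER is just word-level edit distance
            | divided by reference length. A minimal sketch (libraries
            | like jiwer package the same idea):
            | 
            |   def wer(ref, hyp):
            |       r, h = ref.split(), hyp.split()
            |       d = [[max(i, j) if 0 in (i, j) else 0
            |             for j in range(len(h) + 1)]
            |            for i in range(len(r) + 1)]
            |       for i in range(1, len(r) + 1):
            |           for j in range(1, len(h) + 1):
            |               d[i][j] = min(
            |                   d[i-1][j] + 1,    # deletion
            |                   d[i][j-1] + 1,    # insertion
            |                   d[i-1][j-1] + (r[i-1] != h[j-1]))
            |       return d[-1][-1] / len(r)
            | 
            |   print(wer("play fetch with their frog",
            |             "play fetch with their dog"))  # 0.2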
        
           | depr wrote:
           | Have you tested their new Chirp v2 model? Curious if there's
           | any improvement there.
           | 
           | >the only hyperscaler with a good ASR is Azure
           | 
           | How would you say the non-hyperscalers compare? Speechmatics
           | for example?
        
       | leetharris wrote:
       | The main challenge with using LLMs pretrained on internet text
       | for transcript correction is that you reduce verbatimicity due to
       | the nature of an LLM wanting to format every transcript as
       | internet text.
       | 
       | Talking has a lot of nuances to it. Just try to read a Donald
       | Trump transcript. A professional author would never write a
       | book's dialogue like that.
       | 
       | Using a generic LLM on transcripts almost always reduces accuracy
       | as a whole. We have endless benchmark data to demonstrate this at
       | RevAI. It does, however, help with custom vocabulary, rare words,
       | proper nouns, and some people prefer the "readability" of an LLM-
        | formatted transcript. It will read more like a Wikipedia page or
       | a book as opposed to the true nature of a transcript, which can
       | be ugly, messy, and hard to parse at times.
        
         | dylan604 wrote:
         | > A professional author would never write a book's dialogue
         | like that.
         | 
         | That's a bit too far. Ever read Huck Finn?
        
         | phrotoma wrote:
         | I googled "verbatimicity" and all I could find was stuff
         | published by rev.ai which didn't (at a quick glance) define the
         | term. Can you clarify what this means?
        
           | depr wrote:
           | Most likely they mean the degree of being verbatim or exact
           | in reproduction.
        
       | icelancer wrote:
       | Nice use of an LLM - we use Groq 70b models for this in our
       | pipelines at work. (After using WhisperX ASR on meeting files and
       | such)
       | 
        | One of the better reasons to use Cerebras/Groq that I've
        | found: you can return huge amounts of clean text back fast for
        | processing in other ways.
        
         | ldenoue wrote:
         | Although Gemini accepts very long input context, I found that
         | sending more than 512 or so words at a time to the LLM for
         | "cleaning up the text" yields hallucinations. That's why I
         | chunk the raw transcript into 512-word chunks.
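          | 
          | In outline, the chunk-and-correct loop looks something like
          | this (a simplified sketch, not the exact code - the prompt
          | wording here is illustrative):
          | 
          |   import google.generativeai as genai
          | 
          |   genai.configure(api_key="...")
          |   model = genai.GenerativeModel("gemini-1.5-flash-8b")
          | 
          |   def correct(words, title, desc, size=512):
          |       out = []
          |       for i in range(0, len(words), size):
          |           chunk = " ".join(words[i:i + size])
          |           prompt = (f"Video title: {title}\n"
          |                     f"Description: {desc}\n"
          |                     "Fix ASR errors in this spoken "
          |                     "transcript without rewording it:\n"
          |                     + chunk)
          |           out.append(model.generate_content(prompt).text)
          |       return " ".join(out)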
         | 
         | Are you saying it works with 70B models on Groq? Mixtral,
         | Llama? Other?
        
           | icelancer wrote:
           | Yeah, I've had no issues sending tokens up to the context
           | limit. I cut it off with a 10% buffer but that's just to
           | ensure I don't run into tokenization miscounting between
           | tiktoken and whatever tokenizer my actual LLM uses.
           | 
           | I have had little success with Gemini and long videos. My
           | pipeline is video -> ffmpeg strip audio -> whisperX ASR ->
           | groq (L3-70b-specdec) -> gpt-4o/sonnet-3.5 for summarization.
           | Works great.
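            | 
            | Roughly, the stages look like this (a sketch: file names
            | and model ids are placeholders, not my exact setup):
            | 
            |   import subprocess, whisperx
            |   from groq import Groq
            | 
            |   subprocess.run(["ffmpeg", "-i", "talk.mp4",
            |                   "-vn", "audio.wav"], check=True)
            | 
            |   asr = whisperx.load_model("large-v2", "cuda")
            |   segs = asr.transcribe(
            |       whisperx.load_audio("audio.wav"))["segments"]
            |   text = " ".join(s["text"] for s in segs)
            | 
            |   groq = Groq()  # GROQ_API_KEY in environment
            |   resp = groq.chat.completions.create(
            |       model="llama3-70b-8192",
            |       messages=[{"role": "user", "content":
            |                  "Clean up this transcript:\n" + text}])
            |   print(resp.choices[0].message.content)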
        
       | tombh wrote:
       | ASR: Automatic Speech Recognition
        
         | joshdavham wrote:
         | I was too afraid to ask!
        
         | throwaway106382 wrote:
         | Not to be confused with "Autonomous Sensory Meridian Response"
         | (ASMR) - a popular category of video on Youtube.
        
           | hackernewds wrote:
           | How would they be confused?
        
             | xanth wrote:
                | This was a clever jape; a good example of ironic anti-
                | humor. But I don't think you were confused by that
                | either ;)
        
               | djmips wrote:
                | clever japes are not desired on HN - there's Reddit
                | for that, my friend.
        
             | wodenokoto wrote:
             | I can't explain the how, but I thought it was the ASMR
             | thing the title referred to.
        
             | throwaway106382 wrote:
             | I think more people actually know what ASMR is as opposed
             | to ASR. Lots of ASMR videos are people speaking/whispering
             | at extremely low volume.
             | 
              | I don't think it's quite out of the realm of possibility
              | to have interpreted it as "Gemini LLM corrects ASMR
              | YouTube transcripts". Because, you know, they're
              | whispering, so it might be hard to understand or
              | transcribe.
        
         | thaumasiotes wrote:
         | Is that different from "speech-to-text"?
        
       | sorenjan wrote:
        | Using an LLM to correct text is a good idea, but the text
        | transcript doesn't have information about how confident the
        | speech-to-text conversion is. Whisper can output confidence
        | for each word; this would probably make for a better pipeline.
        | It would surprise me if Google doesn't do something like this
        | soon, although maybe a good speech-to-text model is too
        | computationally expensive for Youtube at the moment.
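        | 
        | A sketch of what surfacing that confidence could look like
        | with openai-whisper (the 0.5 cutoff is arbitrary and the file
        | name is a placeholder):
        | 
        |   import whisper
        | 
        |   model = whisper.load_model("base")
        |   result = model.transcribe("audio.m4a",
        |                             word_timestamps=True)
        |   for seg in result["segments"]:
        |       for w in seg.get("words", []):
        |           if w["probability"] < 0.5:  # shaky words
        |               print(w["word"],
        |                     round(w["probability"], 2))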
        
         | dylan604 wrote:
          | Depends on your purpose for the transcript. If you are expecting
         | the exact form of the words spoken in written form, then any
         | deviation from that is no longer a transcription. At that point
         | it is text loosely based on the spoken content.
         | 
          | Once you accept that it's okay for the LLM to just replace
          | words in a transcript, you might as well let it make up a
          | story based on character names you've provided.
        
           | falcor84 wrote:
           | > any deviation from that is no longer a transcription
           | 
           | That's a wild exaggeration. Professional transcripts often
           | have small (and not so small) mistakes, caused by typos,
           | mishearing or lack of familiarity with the subject matter.
            | Depending on the case, these are then manually proofread,
            | but even after proofreading, some mistakes often remain,
            | and occasionally new ones are introduced.
        
             | dylan604 wrote:
              | maybe, but typos are not even the same thing as an LLM
              | thinking of a better next choice of words instead of
              | just transcribing what was heard.
        
       | kelvinjps wrote:
        | Google should have the tech needed for good AI transcription,
        | so why don't they integrate it into their auto-captioning
        | instead of offering those crappy auto subtitles?
        
         | briga wrote:
         | Are they crappy though? Most of the time it gets things right,
         | even if they aren't as accurate as a human. And sure, they
         | probably have better techniques for this, but are they cost-
         | effective to run at YouTube-scale? I think their current
         | solution is good enough for most purposes, even if it isn't
         | perfect
        
           | InsideOutSanta wrote:
           | I'm watching YouTube videos with subtitles for my wife, who
           | doesn't speak English. For videos on basic topics where
           | people speak clear, unaccented English, they work fine (i.e.
           | you usually get what people are saying). If the topic is in
           | any way unusual, the recording quality is poor, or people
           | have accents, the results very quickly turn into a garbled
           | mess that is incomprehensible at best, and misleading (i.e.
           | the subtitles seem coherent, but are wrong) at worst.
        
           | wahnfrieden wrote:
           | Japanese auto captions suck
        
         | summerlight wrote:
          | YT is using USM, which is supposed to be their SOTA ASR
          | model. Gemini has much better linguistic knowledge, but it's
          | likely prohibitively expensive to run on all the YT videos
          | uploaded every day. This "correction" approach does seem to
          | be a nice cost-effective way to apply an LLM, though.
        
       | Timwi wrote:
       | Can I use this to generate subtitles for my own videos? I would
       | love to have subtitles on them but I can't be bothered to do all
       | the timing synchronization by hand. Surely there must be a way to
       | automate that?
        
         | geor9e wrote:
          | That's called YouTube Automatic Speech Recognition
          | (captioning), and it's what this tool uses as input. You can
          | turn it on in YouTube Studio.
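          | 
          | If you'd rather do it locally, a self-contained sketch with
          | openai-whisper that writes timed .srt subtitles (file names
          | are placeholders):
          | 
          |   import whisper
          | 
          |   def ts(t):  # seconds -> "HH:MM:SS,mmm"
          |       ms = int(t * 1000)
          |       h, ms = divmod(ms, 3600000)
          |       m, ms = divmod(ms, 60000)
          |       s, ms = divmod(ms, 1000)
          |       return f"{h:02}:{m:02}:{s:02},{ms:03}"
          | 
          |   model = whisper.load_model("small")
          |   segs = model.transcribe("talk.mp4")["segments"]
          |   with open("talk.srt", "w") as f:
          |       for i, s in enumerate(segs, 1):
          |           f.write(f"{i}\n{ts(s['start'])} --> "
          |                   f"{ts(s['end'])}\n"
          |                   f"{s['text'].strip()}\n\n")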
        
       | sidcool wrote:
       | This is pretty cool. But at the risk of a digression, I can't
       | imagine sharing my API keys with a random website on HN. There
        | has to be a safer approach to this: limited-use API keys,
        | rate-limited API keys, disposable API keys, etc.
        
         | thomasahle wrote:
         | Can't you just create a new API key with a limited budget?
        
           | ldenoue wrote:
           | I should do that, let me try.
        
           | sidcool wrote:
            | The risk of leakage is very high. If Anthropic, Google,
            | OpenAI could provide disposable keys, it would be great.
        
             | thomasahle wrote:
             | Both OpenAI and Anthropic let you disable and delete keys.
             | I'd be surprised if Google doesn't.
        
         | mst wrote:
         | I'm aware this isn't a *proper* solution, but "throw your
         | current API key at it, then as soon as you're done playing
         | around, execute a test of your API key rotation scripting"
         | isn't a terrible workaround, especially if you're the sort of
         | person who really *meant* to have tested said scripting
         | recently but kept not getting around to it ("hi").
        
       | pachico wrote:
       | Hmm, so this is expecting me to upload a personal API Key...
        
       ___________________________________________________________________
       (page generated 2024-11-26 23:01 UTC)