[HN Gopher] Self-hosted offline transcription and diarization se...
___________________________________________________________________
Self-hosted offline transcription and diarization service with LLM
summary
Author : indigodaddy
Score : 74 points
Date : 2024-05-26 17:30 UTC (5 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| bitshaker wrote:
| Amazing. I'll see if I can get this working on Mac too. I have so
| many use cases for this.
|
| 30 years of audio that needs transcribing, summaries, and
| worksheets made out of them.
| toomuchtodo wrote:
| I would love to hear more about your use case!
| lbrito wrote:
| What is the cost compared with something like Whisper API?
| Assuming one would use commodity cloud GPUs for self hosting
| seligman99 wrote:
| WhisperX, together with whisper-diarization, runs at around 20x
| real time on audio with a modern GPU, so for that part you're
| looking at around $1 per twenty hours of content on a
| g5.xlarge, not counting the time to build up a node (or around
| half that at Spot prices, assuming you're much luckier than I
| am at getting stable spot instances these days).
|
| You can short circuit that time to build up a node a bit with a
| prebaked AMI on AWS, but there's still some amount of time
| before a new node can start running at speed, around 10 minutes
| in my experience.
|
| I haven't looked at this particular solution yet, but I really
| find the LLMs to be hit or miss at summarizing transcripts.
| Sometimes it's impressive, sometimes it's literally "informal
| conversation between multiple people about various topics"
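The cost estimate above works out as simple arithmetic. A back-of-the-envelope sketch, assuming a g5.xlarge on-demand rate of about $1.01/hour (the thread quotes only "$1 per twenty hours") and the stated ~20x real-time throughput:

```python
# Back-of-the-envelope transcription cost on a cloud GPU.
# Assumptions (not exact figures from the thread): g5.xlarge
# on-demand at ~$1.01/hr; WhisperX + whisper-diarization runs
# at ~20x real time, i.e. 20 audio hours per GPU hour.

GPU_RATE_USD_PER_HR = 1.01   # assumed on-demand price
SPEEDUP = 20                 # audio hours processed per GPU hour

def cost_per_audio_hour(rate=GPU_RATE_USD_PER_HR, speedup=SPEEDUP):
    """USD to transcribe one hour of audio."""
    return rate / speedup

def cost_for_audio(hours, spot_discount=0.0):
    """Total USD for `hours` of audio; spot_discount=0.5
    models the roughly-halved Spot pricing mentioned above."""
    return hours * cost_per_audio_hour() * (1 - spot_discount)
```

Twenty hours of audio then costs about $1 on-demand, or about $0.50 at a ~50% Spot discount, matching the figures in the comment; node spin-up time (~10 minutes per fresh node) is extra.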
| ranger_danger wrote:
| I thought local LLMs were unable to summarize large documents due
| to limited token counts or something like that? Can someone ELI5?
| icelancer wrote:
| You batch them. If the token limit is 32k, for example, you
| summarize in batches of 32k tokens (including output), then
| summarize all the partial summaries.
|
| It's what we were doing at our company until Anthropic and
| others released larger-context-window LLMs. We do the STT
| locally (whisperX) and the summarization via API, though we've
| tried with local LLMs, too.
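The batching approach described above is a map-reduce over the transcript: summarize each context-sized chunk, then summarize the summaries. A minimal sketch, where `llm_summarize` is a hypothetical stand-in for any LLM call (local or API) and the token ≈ word heuristic is an assumption:

```python
# Batched ("map-reduce") summarization for transcripts longer
# than the model's context window. `llm_summarize` is a
# placeholder; tokens are approximated as whitespace words.

CONTEXT_LIMIT = 32_000        # model context window, in tokens
RESERVED_FOR_OUTPUT = 2_000   # leave room for the summary itself

def llm_summarize(text: str) -> str:
    """Placeholder: swap in a real LLM call. Here it just keeps
    the first ~10% of the words so the sketch is runnable."""
    words = text.split()
    return " ".join(words[: max(1, len(words) // 10)])

def chunk_by_tokens(words, limit):
    """Split a word list into chunks that fit the token budget."""
    for i in range(0, len(words), limit):
        yield " ".join(words[i : i + limit])

def summarize_long(transcript: str) -> str:
    budget = CONTEXT_LIMIT - RESERVED_FOR_OUTPUT
    words = transcript.split()
    if len(words) <= budget:
        return llm_summarize(transcript)
    # Map: summarize each chunk; reduce: summarize the summaries,
    # recursing in case the partials are still too long.
    partials = [llm_summarize(c) for c in chunk_by_tokens(words, budget)]
    return summarize_long(" ".join(partials))
```

The recursion handles very long inputs whose partial summaries still exceed the window; in practice each level loses some detail, which is part of why single-pass summarization with a larger context window is preferred when available.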
phh wrote:
| Well, it'll always depend on the length of the meeting being
| summarized. But they're using Mistral, which has a 32k context
| window. At an average of 150 spoken words per minute and ~1
| token per word (which is rather pessimistic), that's about 3.5
| hours of meeting. So I guess that's okay?
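The arithmetic above can be checked directly; the 150 words/minute rate and the 1-token-per-word ratio are the comment's own assumptions (English usually runs closer to ~1.3 tokens per word):

```python
# How long a meeting fits in one 32k-token context window,
# given ~150 spoken words/minute and ~1 token per word
# (a pessimistic ratio, per the comment above).

CONTEXT_TOKENS = 32_000
WORDS_PER_MINUTE = 150
TOKENS_PER_WORD = 1.0

minutes = CONTEXT_TOKENS / (WORDS_PER_MINUTE * TOKENS_PER_WORD)
hours, rem = divmod(minutes, 60)
# ~213 minutes, i.e. roughly 3.5 hours of meeting per window.
```

At a more realistic ~1.3 tokens per word the figure drops to roughly 2.7 hours, which is still ample for most meetings.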
| lloydatkinson wrote:
| Can this translate too? As in transcribe audio and then give
| output in two languages?
| BeefySwain wrote:
| I was able to build something that does all this, more or less,
| in a couple weeks. It works really well.
|
| I wanted to be able to transcribe and diarize in realtime though,
| which is much harder. Didn't manage to make that happen.
| siruva07 wrote:
| Built something similar for podcasts
|
| https://www.podsnacks.org/
| rimple wrote:
| That's cool. I've created a website (https://papertube.site)
| that essentially transcribes video conversations for reading on
| Kindle. Right now, I'm relying on third-party APIs, but I was
| thinking about self-hosting to reduce costs.
| 3abiton wrote:
| It's like a reverse audiobook, but how do you tackle issues
| related to video content, given that the visual medium carries
| more dimensions of information than sound alone?
___________________________________________________________________
(page generated 2024-05-26 23:00 UTC)