[HN Gopher] Self-hosted offline transcription and diarization se...
___________________________________________________________________
Self-hosted offline transcription and diarization service with LLM
summary
Author : indigodaddy
Score : 74 points
Date : 2024-05-26 17:30 UTC (5 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| bitshaker wrote:
| Amazing. I'll see if I can get this working on Mac too. I have so
| many use cases for this.
|
| 30 years of audio that needs transcribing, summaries, and
| worksheets made out of them.
| toomuchtodo wrote:
| I would love to hear more about your use case!
| lbrito wrote:
| What is the cost compared with something like Whisper API?
| Assuming one would use commodity cloud GPUs for self hosting
| seligman99 wrote:
| WhisperX, together with whisper-diarization, runs at around 20x
| real time on audio with a modern GPU, so for that part you're
| looking at around $1 per twenty hours of content on a
| g5.xlarge, not counting the time to build up a node (or around
| half that at Spot prices, assuming you're much luckier than I
| am at getting stable spot instances these days).
|
| You can short circuit that time to build up a node a bit with a
| prebaked AMI on AWS, but there's still some amount of time
| before a new node can start running at speed, around 10 minutes
| in my experience.
|
| I haven't looked at this particular solution yet, but I really
| find the LLMs to be hit or miss at summarizing transcripts.
| Sometimes it's impressive, sometimes it's literally "informal
| conversation between multiple people about various topics"
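The cost estimate above works out as simple arithmetic. A back-of-the-envelope sketch, assuming a g5.xlarge on-demand rate of about $1.01/hour (the thread quotes only "$1 per twenty hours") and the stated ~20x real-time throughput:

```python
# Back-of-the-envelope transcription cost on a cloud GPU.
# Assumptions (not exact figures from the thread): g5.xlarge
# on-demand at ~$1.01/hr; WhisperX + whisper-diarization runs
# at ~20x real time, i.e. 20 audio hours per GPU hour.

GPU_RATE_USD_PER_HR = 1.01   # assumed on-demand price
SPEEDUP = 20                 # audio hours processed per GPU hour

def cost_per_audio_hour(rate=GPU_RATE_USD_PER_HR, speedup=SPEEDUP):
    """USD to transcribe one hour of audio."""
    return rate / speedup

def cost_for_audio(hours, spot_discount=0.0):
    """Total USD for `hours` of audio; spot_discount=0.5
    models the roughly-halved Spot pricing mentioned above."""
    return hours * cost_per_audio_hour() * (1 - spot_discount)
```

Twenty hours of audio then costs about $1 on-demand, or about $0.50 at a ~50% Spot discount, matching the figures in the comment; node spin-up time (~10 minutes per fresh node) is extra.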
| ranger_danger wrote:
| I thought local LLMs were unable to summarize large documents due
| to limited token counts or something like that? Can someone ELI5?
| icelancer wrote:
| You batch them. If the token limit is 32k, for example, you
| summarize in batches of 32k tokens (including output), then
| summarize all the partial summaries.
|
| It's what we were doing at our company until Anthropic and
| others released larger-context-window LLMs. We do the STT
| locally (whisperX) and the summarization via API, though we've
| tried with local LLMs, too.
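The batching approach described above is a map-reduce over the transcript: summarize each context-sized chunk, then summarize the summaries. A minimal sketch, where `llm_summarize` is a hypothetical stand-in for any LLM call (local or API) and the token ≈ word heuristic is an assumption:

```python
# Batched ("map-reduce") summarization for transcripts longer
# than the model's context window. `llm_summarize` is a
# placeholder; tokens are approximated as whitespace words.

CONTEXT_LIMIT = 32_000        # model context window, in tokens
RESERVED_FOR_OUTPUT = 2_000   # leave room for the summary itself

def llm_summarize(text: str) -> str:
    """Placeholder: swap in a real LLM call. Here it just keeps
    the first ~10% of the words so the sketch is runnable."""
    words = text.split()
    return " ".join(words[: max(1, len(words) // 10)])

def chunk_by_tokens(words, limit):
    """Split a word list into chunks that fit the token budget."""
    for i in range(0, len(words), limit):
        yield " ".join(words[i : i + limit])

def summarize_long(transcript: str) -> str:
    budget = CONTEXT_LIMIT - RESERVED_FOR_OUTPUT
    words = transcript.split()
    if len(words) <= budget:
        return llm_summarize(transcript)
    # Map: summarize each chunk; reduce: summarize the summaries,
    # recursing in case the partials are still too long.
    partials = [llm_summarize(c) for c in chunk_by_tokens(words, budget)]
    return summarize_long(" ".join(partials))
```

The recursion handles very long inputs whose partial summaries still exceed the window; in practice each level loses some detail, which is part of why single-pass summarization with a larger context window is preferred when available.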
phh wrote:
| Well, it'll always depend on the length of the meeting being
| summarized. But they're using Mistral, which has a 32k context
| window. At an average of 150 spoken words per minute and ~1
| token per word (which is rather pessimistic), that's about 3.5
| hours of meeting. So I guess that's okay?
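The arithmetic above can be checked directly; the 150 words/minute rate and the 1-token-per-word ratio are the comment's own assumptions (English usually runs closer to ~1.3 tokens per word):

```python
# How long a meeting fits in one 32k-token context window,
# given ~150 spoken words/minute and ~1 token per word
# (a pessimistic ratio, per the comment above).

CONTEXT_TOKENS = 32_000
WORDS_PER_MINUTE = 150
TOKENS_PER_WORD = 1.0

minutes = CONTEXT_TOKENS / (WORDS_PER_MINUTE * TOKENS_PER_WORD)
hours, rem = divmod(minutes, 60)
# ~213 minutes, i.e. roughly 3.5 hours of meeting per window.
```

At a more realistic ~1.3 tokens per word the figure drops to roughly 2.7 hours, which is still ample for most meetings.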
| lloydatkinson wrote:
| Can this translate too? As in transcribe audio and then give
| output in two languages?
| BeefySwain wrote:
| I was able to build something that does all this, more or less,
| in a couple weeks. It works really well.
|
| I wanted to be able to transcribe and diarize in realtime though,
| which is much harder. Didn't manage to make that happen.
| siruva07 wrote:
| Built something similar for podcasts
|
| https://www.podsnacks.org/
| rimple wrote:
| That's cool. I've created a website (https://papertube.site)
| that essentially transcribes video conversations for reading on
| Kindle. Right now, I'm relying on third-party APIs, but I was
| thinking about self-hosting to reduce costs.
| 3abiton wrote:
| It's like a reverse audiobook, but how do you tackle issues
| related to video content, given that the visual medium carries
| more dimensions of information than sound alone?
___________________________________________________________________
(page generated 2024-05-26 23:00 UTC)