[HN Gopher] Llasa: Llama-Based Speech Synthesis
___________________________________________________________________
Llasa: Llama-Based Speech Synthesis
Author : CalmStorm
Score : 101 points
Date : 2025-05-01 16:43 UTC (6 hours ago)
(HTM) web link (llasatts.github.io)
(TXT) w3m dump (llasatts.github.io)
| CalmStorm wrote:
| LLaSA is a simple framework for speech synthesis that employs a
| single-layer vector quantizer (VQ) codec and a single Transformer
| architecture to fully align with standard LLMs such as LLaMA.
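For readers unfamiliar with the term, here is a minimal sketch of what a single-codebook vector quantizer does: each audio frame is replaced by the index of its nearest codebook entry, so speech becomes one flat sequence of token ids that a standard language model can consume. The codebook, the frame values, and the function names are made up for illustration; this is not the LLaSA codec.

```python
# Toy single-codebook vector quantizer: every frame maps to ONE token id.
# CODEBOOK and FRAMES are hypothetical values chosen for illustration only.

def quantize(frame, codebook):
    """Return the index of the codebook entry nearest to `frame` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))

def encode(frames, codebook):
    """Map a sequence of frame vectors to a flat list of token ids."""
    return [quantize(f, codebook) for f in frames]

def decode(token_ids, codebook):
    """Map token ids back to their (lossy) codebook vectors."""
    return [codebook[i] for i in token_ids]

CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
FRAMES = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]

TOKENS = encode(FRAMES, CODEBOOK)
print(TOKENS)                    # one id per frame
print(decode(TOKENS, CODEBOOK))  # nearest codebook vectors
```

With a single codebook there is exactly one id per frame, which is what lets the token stream be treated like ordinary text tokens.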
| WastedCucumber wrote:
| Probably the title should have the correct capitalization then.
| Cause I was fully expecting a speech synthesis tool that
| sounded like llamas talking human language and now I'm bummed
| out!
| StevenNunez wrote:
| I can't wait to see this integrated into Open WebUI! These sound
| amazing.
| mring33621 wrote:
| the long 'uuuuhhhhhhh' from some of the lesser models is killing
| me.
| jszymborski wrote:
| Based on the samples, it really seems like anything smaller
| than 3B is pretty useless.
| hadlock wrote:
| If you're doing a home lab voice assistant, 1B is nice,
| because on a 12 GB GPU you can run a moderately competent 7B
| LLM and two 1B models, one for speech-to-text and one for
| text-to-speech, plus leave some memory for the wake-word
| monitor. Maybe in a couple of years we can combine all this
| into a single ~8B model that runs efficiently on a 12 GB GPU.
| Nvidia doesn't seem very incentivized right now to sell
| consumer GPUs that can run all this on a single consumer-grade
| chip when they're making so much money selling
| commercial-grade 48 GB cards.
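A rough back-of-the-envelope check of that memory budget, counting weights only. The precisions are assumptions (the comment does not state them): the 7B LLM is taken as 4-bit quantized and the two 1B models as 16-bit. KV cache, activations, and runtime overhead are not included, so the leftover figure is an upper bound on headroom.

```python
# Weights-only VRAM estimate. Bit widths below are assumed, not from the thread.

def weight_gib(params_billion, bits_per_weight):
    """Approximate size of the model weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

BUDGET_GIB = 12
MODELS = {
    "7B LLM (4-bit, assumed)": weight_gib(7, 4),
    "1B speech-to-text (16-bit, assumed)": weight_gib(1, 16),
    "1B text-to-speech (16-bit, assumed)": weight_gib(1, 16),
}

TOTAL_GIB = sum(MODELS.values())
for name, size in MODELS.items():
    print(f"{name:<38}{size:5.2f} GiB")
print(f"{'total weights':<38}{TOTAL_GIB:5.2f} GiB")
print(f"{'left for cache, wake word, overhead':<38}{BUDGET_GIB - TOTAL_GIB:5.2f} GiB")
```

Under these assumptions the weights come to roughly 7 GiB, which is consistent with the claim that everything fits on a 12 GB card; running the 7B model at 16-bit instead would not fit.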
| dheera wrote:
| > employs a single-layer vector quantizer (VQ) codec and a single
| Transformer architecture to fully align
|
| I really wish that when new models were released, the authors
| would draw a diagram of all the layers and the tensor input and
| output sizes at each layer, with zoom in/out if needed, using
| D3.js or whatever visualization framework. Every single layer
| should be on there with its input and output sizes.
|
| These one-sentence descriptions, and approximate block diagrams
| with arrows pointing at each other are never enough to understand
| how something is actually implemented.
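As a small illustration of the kind of per-layer listing being asked for, here is a sketch that traces tensor shapes through a stack of layers and prints the input and output shape of each. The layer names and sizes are hypothetical and have nothing to do with LLaSA's real architecture; with a real PyTorch model, the same table could presumably be collected with forward hooks (`torch.nn.Module.register_forward_hook`) rather than hand-written shape functions.

```python
# Toy shape tracer. Layers are modeled as functions from input shape to
# output shape; all names and dimensions here are made up for illustration.

def embedding(dim):
    """Token ids of shape (batch, seq) -> vectors of shape (batch, seq, dim)."""
    return lambda shape: shape + (dim,)

def linear(in_dim, out_dim):
    """Replace the last dimension, checking that it matches `in_dim`."""
    def apply(shape):
        if shape[-1] != in_dim:
            raise ValueError(f"expected last dim {in_dim}, got {shape[-1]}")
        return shape[:-1] + (out_dim,)
    return apply

def trace_shapes(layers, input_shape):
    """Return one (name, input_shape, output_shape) row per layer."""
    rows, shape = [], input_shape
    for name, layer in layers:
        out = layer(shape)
        rows.append((name, shape, out))
        shape = out
    return rows

TOY_MODEL = [
    ("embed", embedding(64)),
    ("ffn_up", linear(64, 256)),
    ("ffn_down", linear(256, 64)),
    ("lm_head", linear(64, 1000)),
]

ROWS = trace_shapes(TOY_MODEL, (2, 16))  # batch of 2, sequence length 16
for name, in_shape, out_shape in ROWS:
    print(f"{name:<10}{str(in_shape):<16}-> {out_shape}")
```

A table like this is the raw data a zoomable diagram would be drawn from; the visualization layer is a separate concern.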
| exe34 wrote:
| Sounds like a solid SaaS business plan!
| ks2048 wrote:
| Odd that the page doesn't seem to link to either:
|
| paper: https://arxiv.org/abs/2502.04128
|
| github: https://github.com/zhenye234/LLaSA_training
___________________________________________________________________
(page generated 2025-05-01 23:00 UTC)