[HN Gopher] Llasa: Llama-Based Speech Synthesis
       ___________________________________________________________________
        
       Llasa: Llama-Based Speech Synthesis
        
       Author : CalmStorm
       Score  : 101 points
       Date   : 2025-05-01 16:43 UTC (6 hours ago)
        
 (HTM) web link (llasatts.github.io)
 (TXT) w3m dump (llasatts.github.io)
        
       | CalmStorm wrote:
       | LLaSA is a simple framework for speech synthesis that employs a
       | single-layer vector quantizer (VQ) codec and a single Transformer
       | architecture to fully align with standard LLMs such as LLaMA.
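        | 
        | A minimal sketch of what that means in practice (illustrative
        | only, not the authors' code; all sizes below are placeholders):
        | 
        |   # Text tokens and single-codebook speech codes share one
        |   # vocabulary and one causal transformer, like a standard
        |   # decoder-only LLM.
        |   import torch
        |   import torch.nn as nn
        | 
        |   TEXT_VOCAB, CODEBOOK = 32000, 1024   # illustrative sizes
        |   VOCAB = TEXT_VOCAB + CODEBOOK        # speech codes appended after text ids
        |   D_MODEL, N_LAYERS, N_HEADS = 512, 8, 8
        | 
        |   class TinyLlasaLM(nn.Module):
        |       def __init__(self):
        |           super().__init__()
        |           self.embed = nn.Embedding(VOCAB, D_MODEL)
        |           layer = nn.TransformerEncoderLayer(
        |               D_MODEL, N_HEADS, 4 * D_MODEL,
        |               batch_first=True, norm_first=True)
        |           self.blocks = nn.TransformerEncoder(layer, N_LAYERS)
        |           self.lm_head = nn.Linear(D_MODEL, VOCAB)
        | 
        |       def forward(self, ids):          # ids: (batch, seq)
        |           mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        |           x = self.blocks(self.embed(ids), mask=mask)
        |           return self.lm_head(x)       # logits over text+speech vocab
        | 
        |   # a text prompt followed by speech codes from a single-layer VQ codec
        |   text_ids   = torch.randint(0, TEXT_VOCAB, (1, 16))
        |   speech_ids = torch.randint(TEXT_VOCAB, VOCAB, (1, 64))
        |   logits = TinyLlasaLM()(torch.cat([text_ids, speech_ids], dim=1))
        |   print(logits.shape)                  # torch.Size([1, 80, 33024])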
        
         | WastedCucumber wrote:
         | Probably the title should have the correct capitalization then.
         | Cause I was fully expecting a speech synthesis tool that
         | sounded like llamas talking human language and now I'm bummed
         | out!
        
       | StevenNunez wrote:
        | I can't wait to see this integrated into Open WebUI! These sound
       | amazing.
        
       | mring33621 wrote:
       | the long 'uuuuhhhhhhh' from some of the lesser models is killing
       | me.
        
         | jszymborski wrote:
          | Based on the samples, it really seems like anything smaller
          | than 3B is pretty useless.
        
           | hadlock wrote:
            | If you're doing a home-lab voice assistant, 1B is nice,
            | because on a 12 GB GPU you can run a moderately competent
            | 7B LLM plus two 1B models, one for speech-to-text and one
            | for text-to-speech, plus something small for the wake-word
            | monitor. Maybe in a couple of years we can combine all of
            | this into a single ~8B model that runs efficiently on a
            | 12 GB GPU. Nvidia doesn't seem very incentivized right now
            | to sell consumer GPUs that can run all of this on a single
            | consumer-grade chip when they're making so much money
            | selling commercial-grade 48 GB cards.
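            | 
            | Rough back-of-the-envelope numbers for that setup (4-bit
            | weights and the overhead factor are assumptions, not
            | something stated above):
            | 
            |   # params in billions * bits/8 ~= GB of weights
            |   def vram_gb(params_b, bits=4, overhead=1.2):
            |       return params_b * bits / 8 * overhead
            | 
            |   # 7B LLM + 1B STT + 1B TTS + ~0.5 GB for the wake-word model
            |   total = vram_gb(7) + 2 * vram_gb(1) + 0.5
            |   print(f"{total:.1f} GB of 12 GB")   # ~5.9 GB, leaving headroom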
        
       | dheera wrote:
       | > employs a single-layer vector quantizer (VQ) codec and a single
       | Transformer architecture to fully align
       | 
        | I really wish that when new models were released, they came
        | with a diagram of all the layers and the tensor input and
        | output sizes at each layer, with zoom in/out capabilities
        | using D3.js or whatever visualization framework if needed.
        | Every single layer should be on there with its input and
        | output sizes.
       | 
        | These one-sentence descriptions and approximate block diagrams
        | with arrows pointing at each other are never enough to
        | understand how something is actually implemented.
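        | 
        | For PyTorch models you can get most of that view yourself with
        | forward hooks; a small sketch (the model here is just a
        | stand-in):
        | 
        |   import torch
        |   import torch.nn as nn
        | 
        |   def shape(t):
        |       return tuple(t.shape) if torch.is_tensor(t) else type(t).__name__
        | 
        |   def trace(model, *inputs):
        |       # print every layer's name, type, and input/output shapes
        |       hooks = []
        |       for name, mod in model.named_modules():
        |           if name:  # skip the root module
        |               def hook(m, inp, out, name=name):
        |                   print(f"{name:10s} {type(m).__name__:12s} "
        |                         f"in={[shape(i) for i in inp]} out={shape(out)}")
        |               hooks.append(mod.register_forward_hook(hook))
        |       with torch.no_grad():
        |           model(*inputs)
        |       for h in hooks:
        |           h.remove()
        | 
        |   model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 128), nn.ReLU())
        |   trace(model, torch.randint(0, 1000, (1, 16)))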
        
         | exe34 wrote:
         | Sounds like a solid SaaS business plan!
        
       | ks2048 wrote:
        | Odd that the page doesn't seem to link to either of these:
       | 
       | paper: https://arxiv.org/abs/2502.04128
       | 
       | github: https://github.com/zhenye234/LLaSA_training
        
       ___________________________________________________________________
       (page generated 2025-05-01 23:00 UTC)