[HN Gopher] StabilityAI new audio generation - better than Audio...
       ___________________________________________________________________
        
       StabilityAI new audio generation - better than AudioBox?
        
       Author : EGreg
       Score  : 21 points
       Date   : 2024-02-06 16:36 UTC (6 hours ago)
        
 (HTM) web link (www.text-description-to-speech.com)
        
       | smusamashah wrote:
       | The speech quality is very good. It's not clear from the page
       | whether they are just adding styling to already generated audio
       | or whether the audio is all generated using their own model.
        
         | vagabund wrote:
         | They're generating the audio. They use a series of techniques
         | to automatically generate metadata for speech samples in
         | LibriSpeech for things like accent, recording quality, pitch,
         | speed, gender, then use an LLM to format these tags into
         | comprehensive natural language descriptions, leading to a more
         | tunable model at inference time. This metadata generation
         | pipeline is the key insight, and it is what speech datasets
         | were missing compared with e.g. image datasets, which have
         | obviously seen more rapid success.
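
A rough sketch of the tagging-then-describing idea from the comment above, not the project's actual code. It assumes per-utterance measurements are already available; the `UtteranceStats` fields, the bin thresholds, and the prompt wording are all hypothetical placeholders. No LLM is called here; the sketch only builds the prompt text that one could be given.

```python
from dataclasses import dataclass


@dataclass
class UtteranceStats:
    # Hypothetical per-utterance measurements; field names are illustrative.
    pitch_hz: float       # mean fundamental frequency
    words_per_sec: float  # speaking rate
    snr_db: float         # signal-to-noise ratio, a proxy for recording quality
    gender: str           # e.g. taken from dataset speaker metadata
    accent: str           # e.g. predicted by an accent classifier


def bin_value(value, edges, labels):
    """Return the label of the first bin whose upper edge exceeds value."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]  # value is at or above the last edge


def to_tags(s: UtteranceStats) -> dict:
    # Thresholds are made-up placeholders, not values from the project.
    return {
        "gender": s.gender,
        "accent": s.accent,
        "pitch": bin_value(s.pitch_hz, [140.0, 220.0], ["low", "medium", "high"]),
        "speed": bin_value(s.words_per_sec, [2.0, 3.5], ["slow", "moderate", "fast"]),
        "recording quality": bin_value(s.snr_db, [15.0, 30.0], ["noisy", "fair", "clean"]),
    }


def build_llm_prompt(tags: dict) -> str:
    """Format the tags into an instruction an LLM could expand into a description."""
    tag_text = ", ".join(f"{key}: {value}" for key, value in tags.items())
    return (
        "Write one sentence describing a speaker's voice and recording "
        f"conditions using these attributes: {tag_text}."
    )


stats = UtteranceStats(pitch_hz=205.0, words_per_sec=3.8, snr_db=34.0,
                       gender="female", accent="Irish")
tags = to_tags(stats)
prompt = build_llm_prompt(tags)
print(tags)
print(prompt)
```

Binning the continuous measurements into a few coarse labels first keeps the resulting descriptions consistent across the dataset, which is presumably what makes the attributes controllable at inference time.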
        
       | turnsout wrote:
       | Do they have plans to release this model?
        
       | miohtama wrote:
       | Sounds like this is on the verge of revolutionising video game
       | characters and YouTube dubs.
        
       ___________________________________________________________________
       (page generated 2024-02-06 23:01 UTC)