[HN Gopher] StabilityAI new audio generation - better than Audio...
___________________________________________________________________
StabilityAI new audio generation - better than AudioBox?
Author : EGreg
Score : 21 points
Date : 2024-02-06 16:36 UTC (6 hours ago)
(HTM) web link (www.text-description-to-speech.com)
(TXT) w3m dump (www.text-description-to-speech.com)
| smusamashah wrote:
| The speech quality is very good. It's not clear from the page
| whether they are just adding styling to already generated audio
| or whether all of the audio is generated by their own model.
| vagabund wrote:
| They're generating the audio. They use a series of techniques
| to automatically generate metadata for the speech samples in
| LibriSpeech (accent, recording quality, pitch, speed, gender),
| then use an LLM to turn these tags into comprehensive
| natural-language descriptions, which makes the model more
| tunable at inference time. This metadata generation pipeline
| is the key insight: it supplies what speech datasets were
| missing compared with, e.g., image datasets, which have
| obviously seen more rapid success.
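| A rough sketch of that tag-then-describe idea, not the
| authors' actual pipeline: the field names, bin thresholds and
| labels below are made up for illustration, and a fixed
| template stands in for the LLM formatting step.

```python
from dataclasses import dataclass


@dataclass
class SpeechSample:
    # Hypothetical per-utterance measurements; names are illustrative only.
    mean_pitch_hz: float
    words_per_second: float
    snr_db: float
    gender: str
    accent: str


def bucket(value, edges, labels):
    """Return the label of the first bin whose upper edge exceeds value."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]  # value is at or above the last edge


def tag_sample(s):
    """Turn raw measurements into discrete tags (thresholds are assumptions)."""
    return {
        "pitch": bucket(s.mean_pitch_hz, [140.0, 200.0],
                        ["low-pitched", "medium-pitched", "high-pitched"]),
        "speed": bucket(s.words_per_second, [2.0, 3.5],
                        ["slowly", "at a moderate pace", "quickly"]),
        "quality": bucket(s.snr_db, [15.0, 30.0],
                          ["noisy", "fairly clean", "very clean"]),
        "gender": s.gender,
        "accent": s.accent,
    }


def describe(tags):
    """Template stand-in for the LLM step that writes the description."""
    return (
        f"A {tags['gender']} speaker with a {tags['accent']} accent talks "
        f"{tags['speed']} in a {tags['pitch']} voice; "
        f"the recording is {tags['quality']}."
    )


sample = SpeechSample(220.0, 1.5, 35.0, "female", "British")
print(describe(tag_sample(sample)))
```

| The resulting description would be paired with the transcript
| as the conditioning text for each training sample.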
| turnsout wrote:
| Do they have plans to release this model?
| miohtama wrote:
| Sounds like this is on the verge of revolutionising video game
| characters and YouTube dubs.
___________________________________________________________________
(page generated 2024-02-06 23:01 UTC)