[HN Gopher] Navigating the Challenges and Opportunities of Synth...
___________________________________________________________________
Navigating the Challenges and Opportunities of Synthetic Voices
Author : Josely
Score : 56 points
Date : 2024-03-29 17:13 UTC (5 hours ago)
(HTM) web link (openai.com)
(TXT) w3m dump (openai.com)
| makin wrote:
| > We recognize that generating speech that resembles people's
| voices has serious risks, which are especially top of mind in an
| election year. We are engaging with U.S. and international
| partners from across government, media, entertainment, education,
| civil society and beyond to ensure we are incorporating their
| feedback as we build.
|
| People had been theorizing that the slowdown in releases was because
| of this, but I didn't expect they would come out and admit it.
|
| I've found the quality of Elevenlabs' voice generation is at
| least twice as good as this, which is surprising since OpenAI has
| had more time and money to work on this. There seemed to be an
| issue with inconsistent accents in the multilingual samples.
| singularity2001 wrote:
| They might have been forced to give a signal after this rose on
| HN today:
|
| https://research.myshell.ai/open-voice
|
| https://news.ycombinator.com/item?id=39861578
| lelandfe wrote:
| The quality of these samples would be more than sufficient to
| fool anyone I know.
| zaptrem wrote:
| I thought everyone (11Labs, open source, OpenAI even) already has
| human-level TTS models? Is there still an open challenge
| somewhere (e.g., is there a use case where a better model would
| make any difference?).
| kenjackson wrote:
| I haven't seen any TTS take 15-second audio samples and capture
| the speech so well.
|
| It's getting clearer -- in the future, if you didn't see it
| happen in real life then it didn't happen.
| belter wrote:
| Even that...Everybody better get their own Certificate
| Authority and a family password.
| zaptrem wrote:
| Idk training a model like this is kinda trivial for most
| shops. Afaik there are open source ones that do the same
| thing. We trained one with similar audio quality to this by
| accident while working on music generation.
| the_snooze wrote:
| >It's getting clearer -- in the future, if you didn't see it
| happen in real life then it didn't happen.
|
| I'm convinced that all these AI technologies have legitimate
| use cases. But it's far too easy to use them to generate
| endless streams of noise, spam, and mistrust and drown out
| anything valuable.
| zachthewf wrote:
| Expressivity and naturalness are still not good enough. It's
| getting there, but today you'd never have a voice conversation
| with an AI and come away thinking "boy, what an engaging,
| charismatic personality that AI had!" This is partly due to
| latency, partly due to imperfect speech-to-text which is too
| slow and doesn't understand conversational dynamics or
| nonverbal signals, but yes partly due to text-to-speech.
|
| When it comes to naturalness for dialog, OpenAI actually has
| the best voices I've heard--significantly better than Azure or
| 11 Labs which are second and third best. But at least the
| voices they expose through ChatGPT are fairly muted and
| inexpressive.
| TillE wrote:
| > Voice Engine preserves the native accent of the original
| speaker
|
| I sort of understand this as a goal, but the American-accented
| German is weird. Like they nail some difficult "r" sounds but
| botch the easy ones. An odd hybrid between a fluent speaker and a
| total beginner.
| bugglebeetle wrote:
| The Japanese has the same problem. It sounds like someone who
| is fluent, but maintained a beginner-level accent, with a
| slight tinge of robot.
| ibaikov wrote:
| Companies doing speech synthesis where one can clone a voice
| using a small recording should add digital watermarks into audio
| itself, akin to steganography. Companies that allow speech
| communication can check for these marks to make sure the voice
| was not generated. There are many problems, including resistance to
| quality drops, but this can be done and would be a valid defence for
| a year or maybe a few years. Not sure if carriers or phone
| manufacturers can check for these marks in default phone calls.
| spott wrote:
| I think most of them do... but they are too easy to remove.
|
| It isn't just quality drop resistance; it is also filtering and
| post-processing resistance. It is kinda like antivirus
| signatures: those who want to abuse it just filter and tweak it
| till it no longer triggers the watermark detection.
___________________________________________________________________
(page generated 2024-03-29 23:02 UTC)