hngopher.com

       [HN Gopher] Navigating the Challenges and Opportunities of Synth...
       ___________________________________________________________________
        
       Navigating the Challenges and Opportunities of Synthetic Voices
        
       Author : Josely
       Score  : 56 points
       Date   : 2024-03-29 17:13 UTC (5 hours ago)
        
 (HTM) web link (openai.com)
        
       | makin wrote:
       | > We recognize that generating speech that resembles people's
       | voices has serious risks, which are especially top of mind in an
       | election year. We are engaging with U.S. and international
       | partners from across government, media, entertainment, education,
       | civil society and beyond to ensure we are incorporating their
       | feedback as we build.
       | 
       | People had been theorizing that the slowdown in releases was
       | because of this, but I didn't expect they would come out and
       | admit it.
       | 
       | I've found the quality of Elevenlabs' voice generation is at
       | least twice as good as this, which is surprising since OpenAI has
       | had more time and money to work on this. There seemed to be an
       | issue with inconsistent accents in the multilingual samples.
        
         | singularity2001 wrote:
         | They might have been forced to give a signal after this rose on
         | HN today:
         | 
         | https://research.myshell.ai/open-voice
         | 
         | https://news.ycombinator.com/item?id=39861578
        
       | lelandfe wrote:
       | The quality of these samples would be more than sufficient to
       | fool anyone I know.
        
       | zaptrem wrote:
       | I thought everyone (11Labs, open source, OpenAI even) already has
       | human-level TTS models? Is there still an open challenge
       | somewhere (e.g., is there a use case where a better model would
       | make any difference?).
        
         | kenjackson wrote:
         | I haven't seen any TTS take 15 second audio samples and capture
         | the speech so well.
         | 
         | It's getting clearer -- in the future, if you didn't see it
         | happen in real life then it didn't happen.
        
           | belter wrote:
           | Even that... Everybody better get their own Certificate
           | Authority and a family password.
        
           | zaptrem wrote:
           | Idk, training a model like this is kinda trivial for most
           | shops. Afaik there are open source ones that do the same
           | thing. We trained one with similar audio quality to this by
           | accident while working on music generation.
        
           | the_snooze wrote:
           | >It's getting clearer -- in the future, if you didn't see it
           | happen in real life then it didn't happen.
           | 
           | I'm convinced that all these AI technologies have legitimate
           | use cases. But it's far too easy to use them to generate
           | endless streams of noise, spam, and mistrust and drown out
           | anything valuable.
        
         | zachthewf wrote:
         | Expressivity and naturalness are still not good enough. It's
         | getting there, but today you'd never have a voice conversation
         | with an AI and come away thinking "boy, what an engaging,
         | charismatic personality that AI had!" This is partly due to
         | latency, partly due to imperfect speech-to-text which is too
         | slow and doesn't understand conversational dynamics or
         | nonverbal signals, but yes partly due to text-to-speech.
         | 
         | When it comes to naturalness for dialog, OpenAI actually has
         | the best voices I've heard--significantly better than Azure or
         | 11 Labs, which are second and third best. But at least the
         | voices they expose through ChatGPT are fairly muted and
         | inexpressive.
        
       | TillE wrote:
       | > Voice Engine preserves the native accent of the original
       | speaker
       | 
       | I sort of understand this as a goal, but the American-accented
       | German is weird. Like they nail some difficult "r" sounds but
       | botch the easy ones. An odd hybrid between a fluent speaker and a
       | total beginner.
        
         | bugglebeetle wrote:
         | The Japanese has the same problem. It sounds like someone who
         | is fluent but has kept a beginner-level accent, with a
         | slight tinge of robot.
        
       | ibaikov wrote:
       | Companies doing speech synthesis where one can clone a voice
       | from a small recording should add digital watermarks into the
       | audio itself, akin to steganography. Companies that allow speech
       | communication can then check for these marks to make sure the
       | voice was not generated. There are many problems, including
       | making the marks survive quality loss, but this can be done and
       | would be a valid defence for a year or maybe a few years. Not
       | sure if carriers or phone manufacturers can check for these
       | marks in default phone calls.
        
         | spott wrote:
         | I think most of them do... but they are too easy to remove.
         | 
         | It isn't just quality drop resistance; it is also filtering and
         | post-processing resistance. It is kinda like antivirus
         | signatures: those who want to abuse it just filter and tweak it
         | until it no longer triggers the watermark detection.
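
The watermark-and-evade dynamic in this subthread can be illustrated with a toy sketch. It is not how any vendor named in the thread marks its audio; it assumes a naive spread-spectrum scheme (a keyed +1/-1 sequence added at low amplitude, detected by correlation) with no synchronization, and every name in it (`keyed_sequence`, `embed`, `score`, `is_watermarked`, the key `1234`, the 16 kHz rate) is hypothetical. Real schemes are more robust than this, but the example shows why a detector tied to exact sample positions is easy to sidestep: dropping a single sample misaligns the sequence.

```python
import math
import random

SAMPLE_RATE = 16_000  # assumed sample rate for the toy signal


def keyed_sequence(key, n):
    """Pseudo-random +1/-1 sequence derived from a secret integer key."""
    rng = random.Random(key)
    return [1.0 if rng.random() < 0.5 else -1.0 for _ in range(n)]


def embed(samples, key, strength=0.02):
    """Add the keyed sequence to the audio at low amplitude."""
    seq = keyed_sequence(key, len(samples))
    return [s + strength * c for s, c in zip(samples, seq)]


def score(samples, key):
    """Mean correlation between the audio and the keyed sequence."""
    seq = keyed_sequence(key, len(samples))
    return sum(s * c for s, c in zip(samples, seq)) / len(samples)


def is_watermarked(samples, key, strength=0.02):
    """Flag the audio when the correlation exceeds half the embed strength."""
    return score(samples, key) > strength / 2


# Toy stand-in for speech: five seconds of a 220 Hz tone.
n = 5 * SAMPLE_RATE
clean = [0.5 * math.sin(2 * math.pi * 220 * i / SAMPLE_RATE) for i in range(n)]
marked = embed(clean, key=1234)

# The "tweak": drop the first sample, shifting the audio by 1/16000 s.
tweaked = marked[1:]

print("clean flagged:  ", is_watermarked(clean, 1234))
print("marked flagged: ", is_watermarked(marked, 1234))
print("tweaked flagged:", is_watermarked(tweaked, 1234))
```

In this sketch the detector should flag only `marked`. The one-sample shift leaves the audio sounding the same but decorrelates it from the keyed sequence, which is the "filter and tweak until it no longer triggers" problem in its simplest form.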
        
       ___________________________________________________________________
       (page generated 2024-03-29 23:02 UTC)