[HN Gopher] Play 3.0 mini - A lightweight, reliable, cost-effici...
       ___________________________________________________________________
        
       Play 3.0 mini - A lightweight, reliable, cost-efficient
       Multilingual TTS model
        
       Author : amrrs
       Score  : 73 points
       Date   : 2024-10-14 19:16 UTC (3 hours ago)
        
 (HTM) web link (play.ht)
 (TXT) w3m dump (play.ht)
        
       | Asjad wrote:
       | Play 3.0 mini sounds like a game-changer for real-time
       | multilingual TTS with its speed and voice cloning capabilities.
        
       | treesciencebot wrote:
       | Much faster than OpenAI's real-time mode, wow! Quality seems to
       | be on par if not better as well.
        
         | samsepi0l121 wrote:
         | Did we watch the same video? OpenAI's model is faster, and the
         | quality is far better.
        
       | dulldata wrote:
       | demo video if you don't want to go through the announcement -
       | https://www.youtube.com/watch?v=DusTj5NLC9w
       | 
       | Good with numbers mostly!
        
       | nickthegreek wrote:
       | The live test on https://play.ai/ didn't work for me in Firefox.
       | Swapped to Chrome and it worked quickly. I cloned my voice in 30s
       | and was instantly talking to myself. This would easily fool most
       | people who know me. Wild stuff.
        
       | phkahler wrote:
       | Sounds quite good, but this prompt is NOT what I'd expect an
       | automated system to feed into it:
       | 
       | "I've successfully processed your order and I'd like to confirm
       | your product ID. It is A as in Alpha, 1, 2, 3, B as in Bravo, 5,
       | 6, 7, Z as in Zulu, 8, 9, 0, X as in X-ray."
       | 
       | Phone numbers and others were read nicely, but apparently a
       | string of alphanumerics for an order number isn't handled well
       | yet.
        
         | amrrs wrote:
         | Sorry, do you mean the audio for this text is not good?
         | 
         | "I've successfully processed your order and I'd like to confirm
         | your product ID. It is A as in Alpha, 1, 2, 3, B as in Bravo,
         | 5, 6, 7, Z as in Zulu, 8, 9, 0, X as in X-ray."
         | 
         | I thought this was included in the demo; it seemed okay!
        
         | BoorishBears wrote:
         | Most of these prompts come from LLMs, so it's trivial to
         | instruct them to provide a string that's broken out like that.
         | 
         | Also not the end of the world to process stuff like this with a
         | regex.
         | 
         | Most of these newer TTS models require this type of formatting
         | to _reliably_ state long strings of numbers and IDs.
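         | 
         | For example, a rough sketch of that kind of preprocessing in
         | Python (the NATO map, the spell_out_id helper, and the "6+
         | mixed characters" heuristic are just illustrative, nothing
         | Play-specific):
         | 
         |   import re
         | 
         |   # Minimal NATO map covering the letters in the demo ID; a
         |   # real preprocessor would cover A-Z.
         |   NATO = {"A": "Alpha", "B": "Bravo", "X": "X-ray", "Z": "Zulu"}
         | 
         |   def spell_out_id(text: str) -> str:
         |       """Expand alphanumeric IDs into a TTS-friendly
         |       'A as in Alpha, 1, 2, 3, ...' form."""
         |       def expand(m: re.Match) -> str:
         |           parts = []
         |           for ch in m.group(0):
         |               if ch.isalpha():
         |                   parts.append(f"{ch} as in {NATO.get(ch, ch)}")
         |               else:
         |                   parts.append(ch)
         |           return ", ".join(parts)
         |       # Heuristic: runs of 6+ characters mixing letters and
         |       # digits are treated as IDs.
         |       return re.sub(
         |           r"\b(?=[A-Z0-9]*[A-Z])(?=[A-Z0-9]*\d)[A-Z0-9]{6,}\b",
         |           expand, text)
         | 
         |   print(spell_out_id("Your product ID is A123B567Z890X."))
         |   # -> "Your product ID is A as in Alpha, 1, 2, 3, B as in
         |   #     Bravo, 5, 6, 7, Z as in Zulu, 8, 9, 0, X as in X-ray."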
        
         | diggan wrote:
         | > Phone numbers and others were read nicely
         | 
         | The phone numbers were not read naturally at all. A human would
         | have read a grouping of 123-456-789 like "123", "456", "789",
         | but instead the model generated something like "123", "45",
         | "6789". Listen to the RSVP example again and you'll know what I
         | mean. The pacing is generally off for normal text too, but
         | extra noticeable for the numbers.
         | 
         | My hunch would be that it's because of tokenization, but I
         | wouldn't be able to say that's the issue for sure. Sounds like
         | it though :)
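         | 
         | If it is tokenization, one workaround would be to pre-group the
         | digits yourself before sending the text to the model. A minimal
         | sketch (the regex and grouping here are purely illustrative):
         | 
         |   import re
         | 
         |   def group_phone_digits(text: str) -> str:
         |       """Rewrite dash/space-separated phone numbers so each
         |       group is read as its own chunk, e.g.
         |       '123-456-789' -> '1 2 3, 4 5 6, 7 8 9'."""
         |       def expand(m: re.Match) -> str:
         |           groups = re.split(r"[-\s]", m.group(0))
         |           return ", ".join(" ".join(g) for g in groups)
         |       return re.sub(r"\b\d{3}[-\s]\d{3}[-\s]\d{3,4}\b",
         |                     expand, text)
         | 
         |   print(group_phone_digits("Call 123-456-789 to RSVP."))
         |   # -> "Call 1 2 3, 4 5 6, 7 8 9 to RSVP."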
        
       | gorkemyurt wrote:
       | wow! latency is insane
        
         | c0brac0bra wrote:
         | This is similar: https://deepgram.com/agent
        
       | codetrotter wrote:
       | Hey Alexa, Google "Play"!
        
       | DevX101 wrote:
       | Has anyone done a comparison of combined speech-to-text and TTS
       | vs speech-to-speech for creating audio-only interfaces?
       | Particularly curious about latency and quality of audio output.
        
         | amrrs wrote:
         | Hugging Face has a TTS leaderboard (an arena, like LMSYS) -
         | https://huggingface.co/spaces/TTS-AGI/TTS-Arena
        
       | lostmsu wrote:
       | Is this one open in any way? If not, why would anyone use it over
       | OpenAI?
        
       | gyre007 wrote:
       | This is awesome! Over the summer I wrote API clients for both Go
       | [1] and Rust [2], as we were using Play at my job at the time but
       | there were only Python and Node SDKs.
       | 
       | [1] https://github.com/milosgajdos/go-playht [2]
       | https://github.com/milosgajdos/playht_rs
        
       | Mizza wrote:
       | What's SOTA for open source or on-device right now?
       | 
       | I tried building a babelfish with o1, but the transcription in
       | languages other than English is useless. When it gets it
       | correct, the translations are pretty perfect and the voice
       | responses are super fast, but without good transcription it's
       | kind of useless. So close!
        
         | amrrs wrote:
         | Have you tried Moshi -
         | https://huggingface.co/collections/kyutai/moshi-v01-release-...
        
         | refulgentis wrote:
         | I'm not fully sure what you mean: this is TTS, but it sounds
         | like you're expecting an answer about transcription.
         | 
         | So it's both hard to know which category you'd like to hear
         | about and, if you do mean transcription, what your baseline
         | is.
         | 
         | Whisper is widely regarded as the best in the free camp, but I
         | wouldn't be surprised to see a paper on a model claiming better
         | WER, or a much bigger model.
         | 
         | If you meant you tried realtime 4o from OpenAI, and not o1*, it
         | uses Whisper for transcription server-side, so I don't think
         | you'll see much gain from trying Whisper. My next try would be
         | the Google Cloud APIs, but they're paid, and with regard to
         | your question re: open-source SOTA, the underlying model isn't
         | open.
         | 
         | But also, if you did mean 4o, the transcription shouldn't
         | matter for the quality of the output; the model takes in voice
         | directly (I verified their claim by noticing that when there
         | are errors in the transcription, it still answers correctly).
         | 
         | * I keep mixing these two up, and it seems unlikely you meant
         | o1 because it has a long synchronous delay before any part of
         | the answer is available, and it doesn't take in audio.
         | 
         | If you did mean o1, then I'd use realtime 4o for TTS and have
         | it do the translation natively, as it will be unaffected by the
         | transcription errors you're facing now.
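         | 
         | If you want a baseline for the free camp, running Whisper
         | locally is roughly this (a minimal sketch using the
         | openai-whisper package; the model size, filename, and language
         | are placeholders):
         | 
         |   import whisper  # pip install openai-whisper
         | 
         |   # Larger models generally give better non-English WER, at the
         |   # cost of speed.
         |   model = whisper.load_model("medium")
         | 
         |   # Transcribe in the source language...
         |   result = model.transcribe("clip.wav", language="es")
         |   print(result["text"])
         | 
         |   # ...or have Whisper translate to English directly.
         |   translated = model.transcribe("clip.wav", task="translate")
         |   print(translated["text"])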
        
         | diggan wrote:
         | I was literally just looking at that today, and the best one I
         | came across was F5-TTS: https://swivid.github.io/F5-TTS/
         | 
         | The only thing missing (for me) is "emotion tokens" instead of
         | forcing the entire generation to use a single emotion, as the
         | generated voice is a bit too robotic otherwise.
        
           | moffkalast wrote:
           | > based on flow matching with Diffusion Transformer
           | 
           | Yeah, that's not gonna be realtime. It's really odd that we
           | currently have two options: VITS/Piper, which runs at a
           | ludicrous speed on a CPU and is kinda ok, and these slightly
           | more natural versions a la StyleTTS2 that take 2 minutes to
           | generate a sentence with CUDA acceleration.
           | 
           | Like, is there a middle ground? Maybe inverting one of the
           | smaller whispers or something.
        
             | modeless wrote:
             | StyleTTS2 is faster than realtime
        
             | gunalx wrote:
             | Bark?
        
       | Yenrabbit wrote:
       | Quite disconcerting to have a low-latency chat with something
       | that sounds like you! Can recommend the experience, very thought-
       | provoking.
        
       | lyjackal wrote:
       | Is there any way to use the TTS on its own? I maintain an
       | Obsidian TTS plug-in and am starting to add new TTS providers
       | (it's just been OpenAI thus far). From the documentation at
       | https://docs.play.ai/documentation/get-started/introduction, it
       | looks like their API couples it to an LLM for building
       | conversational agents. It seems like it might be nice to use it
       | standalone as just TTS.
        
         | amrrs wrote:
         | You can use Play HT (the TTS powering Play AI) on its own -
         | https://docs.play.ht/reference/api-getting-started
         | 
         | Do you have a link to your Obsidian TTS plugin?
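         | 
         | For a standalone call it's roughly a single HTTP request. The
         | sketch below is from memory of the v2 docs, so double-check the
         | endpoint, header names, and body fields against the link above;
         | the keys and voice ID are placeholders:
         | 
         |   import requests  # pip install requests
         | 
         |   API_KEY = "..."  # Play HT secret key (placeholder)
         |   USER_ID = "..."  # Play HT user id (placeholder)
         | 
         |   resp = requests.post(
         |       "https://api.play.ht/api/v2/tts/stream",
         |       headers={
         |           "Authorization": f"Bearer {API_KEY}",
         |           "X-User-Id": USER_ID,
         |           "Content-Type": "application/json",
         |           "Accept": "audio/mpeg",
         |       },
         |       json={
         |           "text": "Hello from a standalone TTS call.",
         |           "voice": "<voice-id-from-the-voices-endpoint>",
         |           "output_format": "mp3",
         |       },
         |   )
         |   resp.raise_for_status()
         |   with open("out.mp3", "wb") as f:
         |       f.write(resp.content)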
        
       | BoppreH wrote:
       | In the video demo, Play 3.0 mini (on the left) incorrectly claims
       | that the other AI missed a word.
       | 
       | How does that end up in an announcement? Do people not notice, or
       | not care? Or are they trying to show realistic mistakes?
        
       | KaoruAoiShiho wrote:
       | Is this better than 11labs?
        
       | siscia wrote:
       | I honestly wanted to try to use it, but their pricing was quite
       | off-putting.
        
         | c0brac0bra wrote:
         | Yes. I think $0.05/min is a high multiple of what other agent-
         | oriented realtime TTS products are charging.
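         | 
         | For a back-of-the-envelope sense of scale at that rate (the
         | usage figures below are made-up examples, not from the thread):
         | 
         |   # Cost at the quoted $0.05 per minute of generated audio.
         |   PRICE_PER_MIN = 0.05
         | 
         |   for hours_per_month in (10, 100, 1000):
         |       cost = hours_per_month * 60 * PRICE_PER_MIN
         |       print(f"{hours_per_month:>5} h/mo -> ${cost:,.2f}/mo")
         |   # 10 h -> $30, 100 h -> $300, 1,000 h -> $3,000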
        
       | Aeolun wrote:
       | That's 12 times cheaper than the OpenAI models though. Those are
       | already very good, so I can't really see myself using this.
       | 
       | I really want a good on-device model though.
        
       | CommanderData wrote:
       | Is there a way to train this on common AI voices from video
       | games/movies? I'd very much like a voice assistant that sounds
       | like Father/Mother from Alien or Dead Space.
        
       ___________________________________________________________________
       (page generated 2024-10-14 23:00 UTC)