[HN Gopher] Play 3.0 mini - A lightweight, reliable, cost-effici...
___________________________________________________________________
Play 3.0 mini - A lightweight, reliable, cost-efficient
Multilingual TTS model
Author : amrrs
Score : 73 points
Date : 2024-10-14 19:16 UTC (3 hours ago)
(HTM) web link (play.ht)
(TXT) w3m dump (play.ht)
| Asjad wrote:
| Play 3.0 mini sounds like a game-changer for real-time
| multilingual TTS with its speed and voice cloning capabilities
| treesciencebot wrote:
| Much faster than OpenAI's real-time mode, wow! Quality seems to
| be on par if not better as well.
| samsepi0l121 wrote:
| Did we watch the same video? OpenAI's model is faster, and the
| quality is far better.
| dulldata wrote:
| demo video if you don't want to go through the announcement -
| https://www.youtube.com/watch?v=DusTj5NLC9w
|
| Good with numbers mostly!
| nickthegreek wrote:
| The live test on https://play.ai/ didn't work for me in Firefox.
| Swapped to Chrome and it worked quickly. I cloned my voice in 30s
| and was instantly talking to myself. This would easily fool most
| people who know me. Wild stuff.
| phkahler wrote:
| Sounds quite good, but this prompt is NOT what I'd expect an
| automated system to feed into it:
|
| "I've successfully processed your order and I'd like to confirm
| your product ID. It is A as in Alpha, 1, 2, 3, B as in Bravo, 5,
| 6, 7, Z as in Zulu, 8, 9, 0, X as in X-ray."
|
| Phone numbers and others were read nicely, but apparently a
| string of alphanumerics for an order number isn't handled well
| yet.
| amrrs wrote:
| Sorry, do you mean the audio for this text is not good?
|
| "I've successfully processed your order and I'd like to confirm
| your product ID. It is A as in Alpha, 1, 2, 3, B as in Bravo,
| 5, 6, 7, Z as in Zulu, 8, 9, 0, X as in X-ray."
|
| I thought this was included in the demo, it seemed okay!
| BoorishBears wrote:
| Most of these prompts come from LLMs, so it's trivial to
| instruct them to provide a string that's broken out like that.
|
| Also not the end of the world to process stuff like this with a
| regex.
|
| Most of these newer TTS models require this type of formatting
| to _reliably_ state long strings of numbers and IDs
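The regex preprocessing mentioned above could look roughly like the following. This is a minimal sketch, not anything Play documents: the rule for what counts as an "ID" (six or more uppercase letters and digits, with at least one of each) and the names `ID_PATTERN`, `spell_out`, and `expand_ids` are assumptions made for illustration.

```python
import re

# NATO phonetic alphabet, used to disambiguate letters when read aloud.
NATO = {
    "A": "Alpha", "B": "Bravo", "C": "Charlie", "D": "Delta", "E": "Echo",
    "F": "Foxtrot", "G": "Golf", "H": "Hotel", "I": "India", "J": "Juliett",
    "K": "Kilo", "L": "Lima", "M": "Mike", "N": "November", "O": "Oscar",
    "P": "Papa", "Q": "Quebec", "R": "Romeo", "S": "Sierra", "T": "Tango",
    "U": "Uniform", "V": "Victor", "W": "Whiskey", "X": "X-ray",
    "Y": "Yankee", "Z": "Zulu",
}

# Assumed heuristic: an ID is a run of 6+ uppercase letters/digits that
# contains at least one letter and at least one digit.
ID_PATTERN = re.compile(
    r"\b(?=[A-Z0-9]*[A-Z])(?=[A-Z0-9]*[0-9])[A-Z0-9]{6,}\b"
)


def spell_out(token: str) -> str:
    """Turn 'A1' into 'A as in Alpha, 1'."""
    parts = []
    for ch in token:
        if ch.isdigit():
            parts.append(ch)
        else:
            parts.append(f"{ch} as in {NATO[ch]}")
    return ", ".join(parts)


def expand_ids(text: str) -> str:
    """Replace every ID-like token in the text with its spelled-out form."""
    return ID_PATTERN.sub(lambda m: spell_out(m.group(0)), text)


print(expand_ids("Your product ID is A123B567Z890X."))
```

Run on the product ID from the demo, this reproduces the broken-out wording quoted earlier in the thread. A real pipeline would need a pattern tuned to its own ID formats.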
| diggan wrote:
| > Phone numbers and others were read nicely
|
| The phone numbers were not naturally read at all. A human would
| have read a grouping of 123-456-789 like "123", "456", "789",
| but instead the model generated something like "123", "45",
| "6789". Listen to the RSVP example again and you'll know what I
| mean. The pacing is generally off for normal text too, but
| extra noticeable for the numbers.
|
| My hunch would be that it's because of tokenization, but I
| wouldn't be able to say that's the issue for sure. Sounds like
| it though :)
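One way to experiment with the grouping issue described above is to make the groups explicit in the text before it reaches the TTS model. The sketch below is only an illustration: the function names are hypothetical, and whether commas and spaces actually change Play 3.0 mini's pacing is an untested assumption.

```python
import re


def group_digits(digits: str, size: int = 3) -> list[str]:
    """Split a digit string into left-to-right groups of `size` digits."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def speakable_number(raw: str, size: int = 3) -> str:
    """Drop separators, space out digits, and put commas between groups."""
    digits = re.sub(r"\D", "", raw)  # keep digits only
    return ", ".join(" ".join(group) for group in group_digits(digits, size))


print(speakable_number("123-456-789"))  # 1 2 3, 4 5 6, 7 8 9
```

A trailing group shorter than `size` is kept as is, so a seven-digit number yields groups of 3, 3, and 1.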
| gorkemyurt wrote:
| wow! latency is insane
| c0brac0bra wrote:
| This is similar: https://deepgram.com/agent
| codetrotter wrote:
| Hey Alexa, Google "Play"!
| DevX101 wrote:
| Has anyone done a comparison of combined speech-to-text and TTS
| vs speech-to-speech for creating audio-only interfaces?
| Particularly curious around latency, and quality of audio output.
| amrrs wrote:
| Hugging Face has got a TTS leaderboard (arena like lmsys) -
| https://huggingface.co/spaces/TTS-AGI/TTS-Arena
| lostmsu wrote:
| Is this one open in any way? If not, why would anyone use it over
| OpenAI?
| gyre007 wrote:
| This is awesome! Over the summer I wrote API clients for both Go
| [1] and Rust [2] as we were using Play in my job at the time but
| there were only Python and Node SDKs.
|
| [1] https://github.com/milosgajdos/go-playht
| [2] https://github.com/milosgajdos/playht_rs
| Mizza wrote:
| What's SOTA for open source or on-device right now?
|
| I tried building a babelfish with o1, but the transcription in
| languages other than English is useless. When it gets it
| correct, the translations are pretty perfect and the voice
| responses are super fast, but without good transcription it's
| kind of useless. So close!
| amrrs wrote:
| Have you tried Moshi -
| https://huggingface.co/collections/kyutai/moshi-v01-release-...
| refulgentis wrote:
| I'm not sure I fully understand what you mean: this is TTS, but it
| sounds like you're expecting an answer about transcription.
|
| So it's hard to know both which category you'd like to hear
| about and, if you do mean transcription, what your
| baseline is.
|
| Whisper is widely regarded as the best in the free camp, but I
| wouldn't be surprised to see a paper of a model claiming better
| WER, or a much bigger model.
|
| If you meant you tried realtime 4o from OpenAI, and not o1*, it
| uses Whisper for transcription on the server, so I don't think
| you'll see much gain from trying Whisper. My next try would be
| the Google Cloud APIs, but they're paid and, with regard to your
| question re: open source SOTA, the underlying model isn't open.
|
| But also if you did mean 4o, the transcription shouldn't matter
| for output transcription quality, since the model is taking in
| voice (I verified their claim by noticing that when there are
| errors in the transcription, it answers correctly).
|
| * I keep messing these two up when talking about it, and it
| seems unlikely you meant o1 because it has a long synchronous
| delay before any part of the answer is available, and doesn't
| take in audio.
|
| If you did mean o1, then, I'd use realtime 4o for TTS, and have
| it natively do the translation, as it will be unaffected by
| errors in transcription like you're facing now
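For readers comparing transcription baselines, WER (word error rate) as mentioned above is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The following is a minimal sketch of that standard definition. It assumes whitespace tokenization and does no text normalization, which published benchmarks usually apply, so its numbers will not match reported figures exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words (rolling-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Scoring the same non-English audio samples from each candidate system with a function like this would give a concrete baseline to compare against.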
| diggan wrote:
| I was literally just looking at that today, and the best one I
| came across was F5-TTS: https://swivid.github.io/F5-TTS/
|
| Only thing missing (for me) is "emotion tokens" instead of
| forcing the entire generation to be with a specific emotion, as
| the generated voice is a bit too robotic otherwise.
| moffkalast wrote:
| > based on flow matching with Diffusion Transformer
|
| Yeah that's not gonna be realtime. It's really odd that we
| currently have two options: VITS/Piper, which runs at a
| ludicrous speed on a CPU and is kinda ok, and these slightly
| more natural versions a la StyleTTS2 that take 2 minutes to
| generate a sentence with CUDA acceleration.
|
| Like, is there a middle ground? Maybe inverting one of the
| smaller whispers or something.
| modeless wrote:
| StyleTTS2 is faster than realtime
| gunalx wrote:
| Bark?
| Yenrabbit wrote:
| Quite disconcerting to have a low-latency chat with something
| that sounds like you! Can recommend the experience, very thought-
| provoking.
| lyjackal wrote:
| Is there any way to use the TTS on its own? I maintain an
| Obsidian TTS plug-in, and am starting to add new TTS providers
| (it's just been OpenAI thus far). From the documentation at
| https://docs.play.ai/documentation/get-started/introduction, it
| looks like their API couples it to an LLM for building
| conversational agents. Seems like it might be nice to use
| standalone as just TTS.
| amrrs wrote:
| You can use Play HT (the TTS powering Play AI) on its own -
| https://docs.play.ht/reference/api-getting-started
|
| Do you have a link to your Obsidian TTS plugin?
| BoppreH wrote:
| In the video demo, Play 3.0 mini (on the left) incorrectly claims
| that the other AI missed a word.
|
| How does that end up in an announcement? Do people not notice, or
| not care? Or are they trying to show realistic mistakes?
| KaoruAoiShiho wrote:
| Is this better than 11labs?
| siscia wrote:
| I honestly wanted to try to use it, but their pricing was quite
| off-putting.
| c0brac0bra wrote:
| Yes. I think $0.05/min is a high multiple of what other agent-
| oriented realtime TTS products are charging.
| Aeolun wrote:
| That's 12 times cheaper than the OpenAI models though. Those are
| already very good, so I can't really see myself using this.
|
| I really want a good on-device model though.
| CommanderData wrote:
| Is there a way to train this on common AI voices from video
| games/movies? I'd very much like a voice assistant to sound like
| Father/Mother from Alien or Dead Space.
___________________________________________________________________
(page generated 2024-10-14 23:00 UTC)