[HN Gopher] Show HN: Dia, an open-weights TTS model for generati...
___________________________________________________________________
Show HN: Dia, an open-weights TTS model for generating realistic
dialogue
Author : toebee
Score : 298 points
Date : 2025-04-21 17:07 UTC (5 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| toebee wrote:
| Hey HN! We're Toby and Jay, creators of Dia. Dia is a 1.6B-
| parameter open-weights model that generates dialogue directly
| from a transcript.
|
| Unlike TTS models that generate each speaker turn and stitch them
| together, Dia generates the entire conversation in a single pass.
| This makes it faster, more natural, and easier to use for
| dialogue generation.
|
| It also supports audio prompts -- you can condition the output on
| a specific voice/emotion and it will continue in that style.
|
| Demo page comparing it to ElevenLabs and Sesame-1B https://yummy-
| fir-7a4.notion.site/dia
|
| We started this project after falling in love with NotebookLM's
| podcast feature. But over time, the voices and content started to
| feel repetitive. We tried to replicate the podcast feel with
| existing APIs, but the results did not sound like human
| conversation.
|
| So we decided to train a model ourselves. We had no prior
| experience with speech models and had to learn everything from
| scratch -- from large-scale training, to audio tokenization. It
| took us a bit over 3 months.
|
| Our work is heavily inspired by SoundStorm and Parakeet. We plan
| to release a lightweight technical report to share what we
| learned and accelerate research.
|
| We'd love to hear what you think! We are a tiny team, so open
| source contributions are extra welcome. Please feel free to
| check out the code, and share any thoughts or suggestions with
| us.
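|
| If you just want a feel for the workflow, here is a minimal
| sketch of the intended usage (exact class and argument names
| may differ slightly from the current README, so treat it as
| illustrative):
|
|     from dia.model import Dia
|     import soundfile as sf
|
|     # load the 1.6B open-weights checkpoint from Hugging Face
|     model = Dia.from_pretrained("nari-labs/Dia-1.6B")
|
|     # one transcript, two speakers; the whole dialogue is
|     # generated in a single pass, including non-verbals
|     script = ("[S1] Have you tried Dia yet? "
|               "[S2] Not yet, does it really laugh? "
|               "[S1] (laughs) It does.")
|     audio = model.generate(script)
|
|     # the model outputs a 44.1 kHz waveform
|     sf.write("dialogue.wav", audio, 44100)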
| new_user_final wrote:
| Easily 10 times better than the recent OpenAI voice model. I
| don't like robotic voices.
|
| The example voices do seem overly loud and over-excited,
| though, like Andrew Tate, Speed, or an advertisement. It's
| lacking calm, normal conversation or normal podcast-like
| interaction.
| gfaure wrote:
| Amazing that you developed this over the course of three
| months! Can you drop any insight into how you pulled together
| the audio data?
| isoprophlex wrote:
| +1 to this, amazing how you managed to deliver this, and if
| you're willing to share, I'd be most interested in learning
| what you did in terms of training data!
| nickthegreek wrote:
| Are there any examples of the audio differences between this
| and the larger model?
| heystefan wrote:
| Could one use case be generating an audiobook from existing
| books? I wonder if I could fine-tune the "characters" that
| speak these lines, since you said it generates the whole convo
| in a single pass. Wonder if that's a limitation for this kind
| of use case (where speed is not imperative).
| bzuker wrote:
| hey, this looks (or rather, sounds) amazing! Does it work with
| different languages or is it English only?
| llm_nerd wrote:
| This is a pretty incredible three month creation for a couple
| of people who had no experience with speech models.
| smusamashah wrote:
| Hi! This is awesome for size and quality. I want to see a book
| reading example or try it myself.
|
| This is a tangential point, but it would have been nicer if it
| weren't a Notion site. You could put the same page on GitHub
| Pages and it would be much lighter to open, navigate, and link
| to (e.g. for people trying to link to a specific audio sample).
| strobe wrote:
| Just in case: another open-source project uses the same name.
| https://wiki.gnome.org/Apps/Dia/
|
| https://gitlab.gnome.org/GNOME/dia
| toebee wrote:
| Thanks for the heads-up! We weren't aware of the GNOME Dia
| project. Since we focus on speech AI, we'll make sure to
| clarify that distinction.
| aclark wrote:
| Ditto this! Dia diagram tool user here just noticing the name
| clash. Good luck with your Dia!! Assuming both can exist in
| harmony. :-)
| mrandish wrote:
| > Assuming both can exist in harmony.
|
| I'm sure they can... _talk it over._
|
| I'll show myself out.
| Magma7404 wrote:
| I know it's a bit ridiculous to see that as some kind of
| conspiracy, but I have seen a very long list of AI-related
| projects that got the same name as a famous open-source
| project, as if they wanted to hijack the popularity of those
| projects, and Dia is yet another example. The original Dia was
| relatively famous a few years ago, and you cannot have
| forgotten it if you used Linux for more than a few weeks. It
| almost seems done on purpose.
| teddyh wrote:
| The _generous_ interpretation is that the AI hype people just
| _didn't know_ about those other projects, i.e. that they are
| neither open source developers, nor users.
| gapan wrote:
| Of course, how could they have known? Doing a basic web
| search before deciding on a name is so last year.
| SoKamil wrote:
| And another one, not open source but in the AI sphere:
| https://www.diabrowser.com/
| freedomben wrote:
| Fun, I can't get to it because I can't get past the "Making
| sure you're not a bot!" page. It's just stuck at
| "calculating...". I understand the desire to slow down AI
| bots, but this also ends up blocking actual humans. If all the
| GNOME apps are now behind this, they've just completely shut
| out a small-time contributor. I love to play with GNOME apps
| and help out with things here and there, but I'm not going to
| fight with this damn thing to do so.
| stuartjohnson12 wrote:
| Impressive project! We'd love to use something like this over at
| Delfa (https://delfa.ai). How does this hold up from the
| perspective of stability? I've spoken to various folks working on
| voice models, and one thing that has consistently held Eleven
| Labs ahead of the pack from my experience is that their models
| seem to mostly avoid (albeit not being immune to) accent
| shifts and distortions when confronted with unfamiliar medical
| terminology.
|
| A high quality, affordable TTS model that can consistently nail
| medical terminology while maintaining an American accent has been
| frustratingly elusive.
| toebee wrote:
| Interesting. I hadn't thought of that problem before. I'm
| guessing a large enough audio dataset for medical terminology
| does not exist publicly.
|
| But AFAIK, even if you have just a few hours of audio
| containing specific terminology (and correct pronunciation),
| fine-tuning on that data will significantly improve
| performance.
| IshKebab wrote:
| Why does it say "join waitlist" if it's already available?
|
| Also, you don't need to explicitly create and activate a venv if
| you're using uv - it deals with that nonsense itself. Just `uv
| sync`.
| flakiness wrote:
| Look back a few tens of bytes, to where it says "Play with a
| larger version of Dia".
| toebee wrote:
| We're envisioning a platform with a social aspect, so that is
| the biggest difference. Also, bigger models!
|
| We're aware that you don't need to create a venv when uv is
| already installed; we just added it for people spinning up new
| GPU instances in the cloud. But I'll update the README to make
| that a bit clearer. Thanks for the feedback :)
| ivape wrote:
| Darn, don't have the appropriate hardware.
|
| _The full version of Dia requires around 10GB of VRAM to run._
|
| If you have 16GB of VRAM, I guess you could pair this with a 3B-
| param model alongside it, or realistically probably only a 1B-
| param model with a reasonable context window.
| toebee wrote:
| We will work on a quantized version of the model, so hopefully
| you will be able to run it soon!
|
| We've seen Bark from Suno go from 16GB requirement -> 4GB
| requirement + running on CPUs. Won't be too hard, just need
| some time to work on it.
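|
| For a rough sense of the headroom: weight memory alone scales
| with bytes per parameter (runtime VRAM is higher because of
| activations and the audio codec), as this back-of-envelope
| sketch shows:
|
|     # weight-only memory for a 1.6B-parameter model by dtype
|     params = 1.6e9
|     for dtype, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2),
|                                    ("int8", 1), ("int4", 0.5)]:
|         print(f"{dtype:10s} ~{params * bytes_per_param / 1e9:.1f} GB")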
| ivape wrote:
| No doubt, local TTS models like these are what I'm looking for,
| because I'm so done with typing and reading :)
| sarangzambare wrote:
| Impressive demo! We'd love to use this at https://useponder.ai
|
| Time to first audio is crucial for us to reduce latency -
| wondering if Dia works with output streaming?
|
| The Python code snippet seems to imply that the entire audio is
| generated in one go?
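|
| To be concrete, something like the sketch below is what we'd
| need on our end -- a purely hypothetical chunked interface,
| since the current snippet appears to return the full waveform
| at once:
|
|     import sounddevice as sd  # any streaming consumer would do
|
|     SAMPLE_RATE = 44100
|
|     def play_streaming(chunks):
|         # consume audio chunks as they arrive instead of
|         # waiting for the full clip; time-to-first-audio is
|         # then just the time to the first chunk
|         with sd.OutputStream(samplerate=SAMPLE_RATE, channels=1) as out:
|             for chunk in chunks:  # float32 numpy arrays
|                 out.write(chunk)
|
|     # hypothetical -- no such generator exists in Dia today:
|     # play_streaming(model.generate_stream(script, chunk_ms=500))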
| toebee wrote:
| Sounds awesome! I think it won't be very hard to run it using
| output streaming, although that might require beefier GPUs.
| Give us an email and we can talk more - nari.ai.contact at
| gmail dot com.
|
| It's way past bedtime where I live, so will be able to get back
| to you after a few hours. Thanks for the interest :)
| sarangzambare wrote:
| No worries, I will email you.
| xienze wrote:
| How do you declare which voice should be used for a particular
| speaker? And can it create a cloned speaker voice from a sample?
| toebee wrote:
| You can add an audio prompt and prepend text corresponding to
| it in the script. You can get a feel for it by trying the
| second example in the Gradio interface!
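|
| Roughly, the flow looks like this -- a sketch only, and the
| audio-prompt argument name is illustrative rather than the
| exact one in the code:
|
|     from dia.model import Dia
|
|     model = Dia.from_pretrained("nari-labs/Dia-1.6B")
|
|     # 1) supply the reference audio, 2) prepend that audio's
|     #    transcript so the model continues in the same voice
|     prompt_text = "[S1] This is the reference voice and style."
|     new_text = "[S1] And this line is generated to match it."
|
|     audio = model.generate(
|         prompt_text + " " + new_text,
|         audio_prompt_path="reference_voice.wav",  # illustrative name
|     )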
| brumar wrote:
| Impressive! Is it english only at the moment?
| toebee wrote:
| Unfortunately yes at the moment
| toebee wrote:
| It is way past bedtime here, will be getting back to comments
| after a few hours of sleep! Thanks for all the kind words and
| feedback
| pzo wrote:
| Sounds great. Hoping for more language support in the future.
| In comparison, Sesame CSM-1B sounds like it was trained on
| stoned people.
| film42 wrote:
| Very very impressive.
| Versipelle wrote:
| This is really impressive; we're getting close to a dream of
| mine: the ability to generate proper audiobooks from EPUBs. Not
| just a robotic single voice for everything, but different,
| consistent voices for each protagonist, with the LLM analyzing
| the text to guess which voice to use and add an appropriate tone,
| much like a voice actor would do.
|
| I've tried "EPUB to audiobook" tools, but they are really miles
| behind what a real narrator accomplishes and make the audiobook
| impossible to engage with
| mclau157 wrote:
| Realistic voice acting for audio books, realistic images for
| each page, realistic videos for each page, oh wait I just
| created a movie, maybe I can change the plot? Oh wait I just
| created a video game
| azinman2 wrote:
| Wouldn't it be more desirable to hear an actual human on an
| audiobook? Ideally the author?
| senordevnyc wrote:
| Honestly, I'd say that's true _only_ for the author. Anyone
| else is just going to be interpreting the words to understand
| how to best convey the character / emotion / situation /
| etc., just like an AI will have to do. If an AI can do that
| more effectively than a human, why not?
|
| The author _could_ be better, because they at least have
| other info beyond the text to rely on, they can go off-script
| or add little details, etc.
| DrSiemer wrote:
| As somebody who has listened to hundreds of audiobooks, I
| can tell you authors are generally not the best choice to
| voice their own work. They may know every intent, but they
| are writers, not actors.
|
| The most skilled readers will make you want to read books
| _just because they narrated them_. They add a unique
| quality to the story, that you do not get from reading
| yourself or from watching a video adaptation.
|
| Currently I'm in The Age of Madness, read by Steven Pacey.
| He's fantastic. The late Roy Dotrice is worth a mention as
| well, for voicing Game of Thrones and claiming the Guinness
| world record for most distinct voices (224) in one series.
|
| It will be awesome if we can create readings automatically,
| but it will be a while before TTS can compete with the best
| readers out there.
| tyrauber wrote:
| Hey, do yourself a favor and listen to the fun example:
|
| > [S1] Oh fire! Oh my goodness! What's the procedure? What do we
| do people? The smoke could be coming through an air duct!
|
| Seriously impressive. Wish I could direct link the audio.
|
| Kudos to the Dia team.
| jinay wrote:
| For anyone who wants to listen, it's on this page:
| https://yummy-fir-7a4.notion.site/dia
| mrandish wrote:
| Wow. Thanks for posting the direct link to examples. Those
| sound incredibly good and would be impressive for a frontier
| lab. For two people over a few months, it's spectacular.
| DoctorOW wrote:
| A little overacted, it reminds me of the voice acting in
| those flash cartoons you'd see in the early days of YouTube.
| That's not to say it isn't good work, it still sounds
| remarkably human. Just silly humans :)
| nojs wrote:
| This is so good. Reminds me of The Office. I love how bad the
| other examples are.
| fwip wrote:
| The text is lifted from a scene in The Office:
| https://youtu.be/gO8N3L_aERg?si=y7PggNrKlVQm0qyX&t=82
| 3abiton wrote:
| This is oddly reminiscent of The Office. I wonder if TV shows
| were part of its training data!
| toebee wrote:
| Thank you!! Indeed, the script was inspired by a scene in The
| Office.
| notdian wrote:
| Made a small change and got it running on an M2 Pro 16GB
| MacBook Pro; the quality is amazing.
|
| https://github.com/nari-labs/dia/pull/4
| noiv wrote:
| Can confirm, it runs straightforwardly on 15.4.1 @ M4. Thanks!
| isoprophlex wrote:
| Incredible quality demo samples, well done. How's the performance
| for multilingual generation?
| 999900000999 wrote:
| Does this only support English?
|
| I would absolutely love something like this for practicing
| Chinese, or even just adding Chinese dialogue to a project.
| verghese wrote:
| How does this compare with Spark TTS?
|
| https://github.com/SparkAudio/Spark-TTS
| youssefabdelm wrote:
| Anyone know if it's possible to fine-tune it for cloning my voice?
| xbmcuser wrote:
| Wow, this is the first time I have felt that this could be the
| end of voice acting, audiobook narration, etc. With the speed
| at which things are changing, how soon before you can turn any
| book or novel into a complete audio/video production, movie, or
| TV show?
| a2128 wrote:
| What's the training process like? I have some data in my
| language that I'd love to use to train it, seeing as it's
| English-only.
| popalchemist wrote:
| This looks excellent, thank you for releasing it openly.
| hemloc_io wrote:
| Very cool!
|
| Insane how much low-hanging fruit there is for audio models
| right now. A team of two picking things up over a few months
| can build something that still competes with large players with
| tons of funding.
| toebee wrote:
| Thank you for the kind words <3
| lostmsu wrote:
| Does this only work for two voices? Can I generate an entire
| conversation between multiple people? Like this HN thread.
| mclau157 wrote:
| Will you support the other side with AI voice detection software
| to detect and block malicious voice snippets?
| rustc wrote:
| Is this Apache licensed or a custom one? The README contains
| this:
|
| > This project is licensed under the Apache License 2.0 - see the
| LICENSE file for details.
|
| > This project offers a high-fidelity speech generation model
| *intended solely for research and educational use*. The following
| uses are strictly forbidden:
|
| > Identity Misuse: Do not produce audio resembling real
| individuals without permission.
|
| > ...
|
| Specifically the phrase "intended solely for research and
| educational use".
| montroser wrote:
| Hmm, the "strictly forbidden" part seems more important than
| whatever their stated intentions are... Either way, it seems
| like it needs clarifying.
| zhyder wrote:
| V v cool: first time I've seen such expressiveness in TTS for
| laughs, coughs, yelling about a fire, etc!
|
| What're the recommended GPU cloud providers for using such open-
| weights models?
| codingmoh wrote:
| Hey, this is really cool! Curious how good the multi-language
| support is. Also - pretty wild that you trained the whole thing
| yourselves, especially without prior experience in speech models.
|
| Might actually be helpful for others if you ever feel like
| documenting how you got started and what the process looked like.
| I've never worked with TTS models myself, and honestly wouldn't
| know where to begin. Either way, awesome work. Big respect.
| eob wrote:
| Bravo -- this is fantastic.
|
| I've been waiting for this ever since reading some interview with
| Orson Scott Card ages ago. It turns out he thinks of his novels
| as radio theater, not books. Which is a very different way to
| experience the audio.
| vagabund wrote:
| The Hugging Face Spaces link doesn't work, FYI.
|
| Sounds awesome on the demo page though.
| noiv wrote:
| The demo page does fancy stuff when marking text and hitting
| cmd-d to create a bookmark :)
| Havoc wrote:
| Sounds really good & human! Got a fair number of unexpected
| artifacts though, e.g. 3 seconds of hissing noise before the
| dialogue, and music in the background when I added (happy) in
| an attempt to control tone. Also, I don't understand how to
| control the S1 and S2 speakers... is it just random based on
| temperature?
|
| > TODO Docker support
|
| Got this adapted pretty easily. Just use the latest NVIDIA CUDA
| container, throw Python and the required modules on it, and
| change the server to serve on 0.0.0.0. It does mean it pulls
| the model every time on startup though, which isn't ideal.
| yjftsjthsd-h wrote:
| > Does mean it pulls the model every time on startup though
| which isn't ideal
|
| Surely it just downloads to a directory that can be volume
| mapped?
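|
| By default Hugging Face caches downloads under
| ~/.cache/huggingface/hub, so mounting that path -- or pre-
| fetching into an explicit directory -- should avoid the
| re-pull. A sketch, with the repo id assumed from the README:
|
|     from huggingface_hub import snapshot_download
|
|     # fetch the weights once into a directory you can
|     # volume-mount into the container, e.g.
|     #   docker run -v /srv/models:/models ...
|     snapshot_download("nari-labs/Dia-1.6B", cache_dir="/models")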
| Havoc wrote:
| Yep. I just didn't spend the time to track down the location,
| tbh. Plus Hugging Face usually symlinks into a cache folder
| whose location I don't recall.
|
| Literally got CUDA containers working earlier today, so I
| haven't spent a huge amount of time figuring things out.
| jokethrowaway wrote:
| Looking forward to trying it. My current go-to solution is
| E5-F2 (great cloning, decent delivery, OK audio quality, a lot
| of incoherence here and there, forcing you to do multiple
| generations).
|
| I've just been massively disappointed by Sesame's CSM: in the
| Gradio demo on their website it was generating flawless
| dialogues with amazing voice cloning. When running it locally,
| the voice cloning performance is awful.
| xhkkffbf wrote:
| Are there different voices? Or only [s1] and [s2] in the
| examples?
| hiAndrewQuinn wrote:
| Is this English-only? I'm looking for a local model for Finnish
| dialogue to run.
| dindindin wrote:
| Was this trained on Planet Money / NPR podcasts? The last audio
| (continuation of prompt) sounds eerily like Planet Money, I had
| to double check if my Spotify had accidentally started playing.
| jelling wrote:
| NPR voice is a thing.
|
| It started with Ira Glass voice and now the default voice is
| someone that sounds like they're not certain they should be
| saying the very banal thing they are about to say, followed by
| a hand-shake protocol of nervous laughter.
| instagary wrote:
| Does this use the Mimi codec from Moshi? If so, it would be
| straightforward to get Dia running on iOS!
___________________________________________________________________
(page generated 2025-04-21 23:00 UTC)