[HN Gopher] Show HN: Dia, an open-weights TTS model for generati...
       ___________________________________________________________________
        
       Show HN: Dia, an open-weights TTS model for generating realistic
       dialogue
        
       Author : toebee
       Score  : 298 points
       Date   : 2025-04-21 17:07 UTC (5 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | toebee wrote:
        | Hey HN! We're Toby and Jay, creators of Dia. Dia is a
        | 1.6B-parameter open-weights model that generates dialogue
        | directly from a transcript.
       | 
       | Unlike TTS models that generate each speaker turn and stitch them
       | together, Dia generates the entire conversation in a single pass.
       | This makes it faster, more natural, and easier to use for
       | dialogue generation.
       | 
       | It also supports audio prompts -- you can condition the output on
       | a specific voice/emotion and it will continue in that style.
       | 
        | Demo page comparing it to ElevenLabs and Sesame-1B:
        | https://yummy-fir-7a4.notion.site/dia
       | 
       | We started this project after falling in love with NotebookLM's
       | podcast feature. But over time, the voices and content started to
        | feel repetitive. We tried to replicate the podcast feel with
        | APIs, but it did not sound like human conversation.
       | 
       | So we decided to train a model ourselves. We had no prior
       | experience with speech models and had to learn everything from
       | scratch -- from large-scale training, to audio tokenization. It
       | took us a bit over 3 months.
       | 
       | Our work is heavily inspired by SoundStorm and Parakeet. We plan
       | to release a lightweight technical report to share what we
       | learned and accelerate research.
       | 
        | We'd love to hear what you think! We are a tiny team, so open
        | source contributions are extra welcome. Please feel free to
        | check out the code and share any thoughts or suggestions with
        | us.
        
         | new_user_final wrote:
          | Easily 10 times better than the recent OpenAI voice model. I
          | don't like robotic voices.
          | 
          | The example voices seem overly loud and over-excited, like
          | Andrew Tate, Speed, or an advertisement. It's lacking calm,
          | normal conversation or normal podcast-like interaction.
        
         | gfaure wrote:
         | Amazing that you developed this over the course of three
         | months! Can you drop any insight into how you pulled together
         | the audio data?
        
           | isoprophlex wrote:
            | +1 to this, amazing how you managed to deliver this, and if
            | you're willing to share I'd be most interested in learning
            | what you did in terms of training data..!
        
         | nickthegreek wrote:
          | Are there any examples of the audio differences between this
          | and the larger model?
        
         | heystefan wrote:
          | Could one use case be generating an audiobook from existing
          | books? I wonder if I could fine-tune the "characters" that
          | speak these lines, since you said it generates the whole
          | conversation in a single pass. Wonder if that's a limitation
          | for this kind of use case (where speed is not imperative).
        
         | bzuker wrote:
         | hey, this looks (or rather, sounds) amazing! Does it work with
         | different languages or is it English only?
        
         | llm_nerd wrote:
         | This is a pretty incredible three month creation for a couple
         | of people who had no experience with speech models.
        
         | smusamashah wrote:
         | Hi! This is awesome for size and quality. I want to see a book
         | reading example or try it myself.
         | 
          | This is a tangent, but it would have been nicer if it weren't
          | a Notion site. You could put the same page on GitHub Pages
          | and it would be much lighter to open, navigate, and link to
          | (e.g. for people trying to link some audio).
        
       | strobe wrote:
        | just in case: another open-source project uses the same name
       | https://wiki.gnome.org/Apps/Dia/
       | 
       | https://gitlab.gnome.org/GNOME/dia
        
         | toebee wrote:
         | Thanks for the heads-up! We weren't aware of the GNOME Dia
         | project. Since we focus on speech AI, we'll make sure to
         | clarify that distinction.
        
           | aclark wrote:
           | Ditto this! Dia diagram tool user here just noticing the name
           | clash. Good luck with your Dia!! Assuming both can exist in
           | harmony. :-)
        
             | mrandish wrote:
             | > Assuming both can exist in harmony.
             | 
             | I'm sure they can... _talk it over._
             | 
             | I'll show myself out.
        
         | Magma7404 wrote:
         | I know it's a bit ridiculous to see that as some kind of
         | conspiracy, but I have seen a very long list of AI-related
         | projects that got the same name as a famous open-source
         | project, as if they wanted to hijack the popularity of those
         | projects, and Dia is yet another example. It was relatively
          | famous a few years ago, and you cannot have forgotten it if
          | you used Linux for more than a few weeks. It almost seems
          | done on purpose.
        
           | teddyh wrote:
           | The _generous_ interpretation is that the AI hype people just
           | _didn't know_ about those other projects, i.e. that they are
           | neither open source developers, nor users.
        
             | gapan wrote:
             | Of course, how could they have known? Doing a basic web
             | search before deciding on a name is so last year.
        
         | SoKamil wrote:
         | And another one, not open source but in AI sphere:
         | https://www.diabrowser.com/
        
         | freedomben wrote:
          | Fun, I can't get to it because I can't get past the "Making
          | sure you're not a bot!" page. It's just stuck at
          | "calculating...". I understand the desire to slow down AI
          | bots, but if all the GNOME apps are now behind this, they've
          | completely shut out a small-time contributor. I love to play
          | with GNOME apps and help out with things here and there, but
          | I'm not going to fight with this damn thing to do so.
        
       | stuartjohnson12 wrote:
       | Impressive project! We'd love to use something like this over at
       | Delfa (https://delfa.ai). How does this hold up from the
       | perspective of stability? I've spoken to various folks working on
       | voice models, and one thing that has consistently held Eleven
       | Labs ahead of the pack from my experience is that their models
        | seem to mostly avoid (albeit not being immune to) accent
        | shifts and distortions when confronted with unfamiliar medical
        | terminology.
       | 
       | A high quality, affordable TTS model that can consistently nail
       | medical terminology while maintaining an American accent has been
       | frustratingly elusive.
        
         | toebee wrote:
         | Interesting. I haven't thought of that problem before. I'm
         | guessing a large enough audio dataset for medical terminology
         | does not exist publicly.
         | 
         | But AFAIK, even if you have just a few hours of audio
         | containing specific terminology (and correct pronunciation),
         | fine-tuning on that data will significantly improve
         | performance.
        
       | IshKebab wrote:
       | Why does it say "join waitlist" if it's already available?
       | 
       | Also, you don't need to explicitly create and activate a venv if
       | you're using uv - it deals with that nonsense itself. Just `uv
       | sync`.
        
         | flakiness wrote:
          | Seek back a few tens of bytes; it states "Play with a larger
          | version of Dia".
        
         | toebee wrote:
         | We're envisioning a platform with a social aspect, so that is
         | the biggest difference. Also, bigger models!
         | 
          | We are aware that you do not need to create a venv when uv is
          | already installed; we just added it for people spinning up
          | new GPU machines in the cloud. But I'll update the README to
          | make that a bit clearer. Thanks for the feedback :)
        
       | ivape wrote:
       | Darn, don't have the appropriate hardware.
       | 
       |  _The full version of Dia requires around 10GB of VRAM to run._
       | 
        | If you have 16GB of VRAM, I guess you could pair this with a 3B
        | param model alongside it, or really probably only a 1B param
        | model with a reasonable context window.
        
         | toebee wrote:
         | We will work on a quantized version of the model, so hopefully
         | you will be able to run it soon!
         | 
         | We've seen Bark from Suno go from 16GB requirement -> 4GB
         | requirement + running on CPUs. Won't be too hard, just need
         | some time to work on it.
        
           | ivape wrote:
           | No doubt, these TTS models locally are what I'm looking for
           | because I'm so done typing and reading :)
        
       | sarangzambare wrote:
       | Impressive demo! We'd love to use this at https://useponder.ai
       | 
        | Time to first audio is crucial for us to reduce latency -
        | wondering if Dia works with output streaming? The Python code
        | snippet seems to imply that the entire audio buffer is
        | generated at once?
        
         | toebee wrote:
         | Sounds awesome! I think it won't be very hard to run it using
         | output streaming, although that might require beefier GPUs.
          | Send us an email and we can talk more - nari.ai.contact at
         | gmail dot com.
         | 
         | It's way past bedtime where I live, so will be able to get back
         | to you after a few hours. Thanks for the interest :)
        
           | sarangzambare wrote:
           | no worries, i will email you
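
On streaming: the repo does not document incremental decoding, but the
usual pattern for cutting time-to-first-audio is to flush fixed-size
chunks as they are produced rather than returning one buffer. A
model-free sketch of that interface (the chunk size is an arbitrary
assumption, and a real integration would yield from the decoder):

```python
def stream_chunks(pcm_bytes: bytes, chunk_size: int = 4096):
    """Yield fixed-size slices of an audio buffer, the way a streaming
    TTS endpoint would flush audio as it is generated."""
    for i in range(0, len(pcm_bytes), chunk_size):
        yield pcm_bytes[i:i + chunk_size]

# Here we just slice a pre-generated buffer to show the interface;
# 10000 bytes split into 4096-byte chunks gives 4096 + 4096 + 1808.
chunks = list(stream_chunks(b"\x00" * 10000))
```

A player can start as soon as the first chunk arrives, which is what
matters for perceived latency.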
        
       | xienze wrote:
       | How do you declare which voice should be used for a particular
        | speaker? And can it create a cloned speaker voice from a
        | sample?
        
         | toebee wrote:
         | You can add an audio prompt and prepend text corresponding to
         | it in the script. You can get a feel for it by trying the
         | second example in the Gradio interface!
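
To make the speaker-tag convention concrete: dialogue is written as one
transcript with inline [S1]/[S2] tags. The helper below is purely
illustrative, and the commented-out model calls are assumptions based
on the repo README (they require the dia package and a suitable GPU):

```python
def make_transcript(turns):
    """Build a Dia-style transcript: (speaker_number, text) pairs
    become inline [S1]/[S2] speaker tags in a single string."""
    return " ".join(f"[S{n}] {text}" for n, text in turns)

script = make_transcript([
    (1, "Oh fire! What's the procedure?"),
    (2, "Stay calm, everyone. (coughs)"),
])

# Hypothetical usage, following the repo README (untested here):
# from dia.model import Dia
# model = Dia.from_pretrained("nari-labs/Dia-1.6B")
# audio = model.generate(script, audio_prompt_path="reference_voice.mp3")
```

Prepending the audio prompt's own transcript before the new lines, as
the comment above describes, is what conditions the voice.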
        
       | brumar wrote:
        | Impressive! Is it English-only at the moment?
        
         | toebee wrote:
         | Unfortunately yes at the moment
        
       | toebee wrote:
       | It is way past bedtime here, will be getting back to comments
       | after a few hours of sleep! Thanks for all the kind words and
       | feedback
        
       | pzo wrote:
        | Sounds great. Hope for more language support in the future. In
        | comparison, Sesame CSM-1B sounds like it was trained on stoned
        | people.
        
       | film42 wrote:
       | Very very impressive.
        
       | Versipelle wrote:
       | This is really impressive; we're getting close to a dream of
       | mine: the ability to generate proper audiobooks from EPUBs. Not
       | just a robotic single voice for everything, but different,
       | consistent voices for each protagonist, with the LLM analyzing
       | the text to guess which voice to use and add an appropriate tone,
       | much like a voice actor would do.
       | 
        | I've tried "EPUB to audiobook" tools, but they are really miles
        | behind what a real narrator accomplishes and make the audiobook
        | impossible to engage with.
        
         | mclau157 wrote:
         | Realistic voice acting for audio books, realistic images for
         | each page, realistic videos for each page, oh wait I just
         | created a movie, maybe I can change the plot? Oh wait I just
         | created a video game
        
         | azinman2 wrote:
         | Wouldn't it be more desirable to hear an actual human on an
         | audiobook? Ideally the author?
        
           | senordevnyc wrote:
           | Honestly, I'd say that's true _only_ for the author. Anyone
           | else is just going to be interpreting the words to understand
           | how to best convey the character  / emotion / situation /
           | etc., just like an AI will have to do. If an AI can do that
           | more effectively than a human, why not?
           | 
           | The author _could_ be better, because they at least have
           | other info beyond the text to rely on, they can go off-script
           | or add little details, etc.
        
             | DrSiemer wrote:
             | As somebody who has listened to hundreds of audiobooks, I
             | can tell you authors are generally not the best choice to
             | voice their own work. They may know every intent, but they
             | are writers, not actors.
             | 
             | The most skilled readers will make you want to read books
             | _just because they narrated them_. They add a unique
             | quality to the story, that you do not get from reading
             | yourself or from watching a video adaptation.
             | 
             | Currently I'm in The Age of Madness, read by Steven Pacey.
             | He's fantastic. The late Roy Dotrice is worth a mention as
             | well, for voicing Game of Thrones and claiming the Guinness
             | world record for most distinct voices (224) in one series.
             | 
             | It will be awesome if we can create readings automatically,
             | but it will be a while before TTS can compete with the best
             | readers out there.
        
       | tyrauber wrote:
       | Hey, do yourself a favor and listen to the fun example:
       | 
        | > [S1] Oh fire! Oh my goodness! What's the procedure? What do
        | we do, people? The smoke could be coming through an air duct!
       | 
       | Seriously impressive. Wish I could direct link the audio.
       | 
       | Kudos to the Dia team.
        
         | jinay wrote:
         | For anyone who wants to listen, it's on this page:
         | https://yummy-fir-7a4.notion.site/dia
        
           | mrandish wrote:
           | Wow. Thanks for posting the direct link to examples. Those
           | sound incredibly good and would be impressive for a frontier
           | lab. For two people over a few months, it's spectacular.
        
           | DoctorOW wrote:
           | A little overacted, it reminds me of the voice acting in
           | those flash cartoons you'd see in the early days of YouTube.
           | That's not to say it isn't good work, it still sounds
           | remarkably human. Just silly humans :)
        
         | nojs wrote:
         | This is so good. Reminds me of The Office. I love how bad the
         | other examples are.
        
           | fwip wrote:
           | The text is lifted from a scene in The Office:
           | https://youtu.be/gO8N3L_aERg?si=y7PggNrKlVQm0qyX&t=82
        
         | 3abiton wrote:
          | This is oddly reminiscent of The Office. I wonder if TV shows
         | were part of its training data!
        
         | toebee wrote:
          | Thank you!! Indeed, the script was inspired by a scene in The
          | Office.
        
       | notdian wrote:
        | made a small change and got it running on an M2 Pro 16GB
        | MacBook Pro; the quality is amazing.
       | 
       | https://github.com/nari-labs/dia/pull/4
        
         | noiv wrote:
          | Can confirm, runs straightforwardly on 15.4.1@M4, THX.
        
       | isoprophlex wrote:
       | Incredible quality demo samples, well done. How's the performance
       | for multilingual generation?
        
       | 999900000999 wrote:
       | Does this only support English?
       | 
       | I would absolutely love something like this for practicing
       | Chinese, or even just adding Chinese dialogue to a project.
        
       | verghese wrote:
       | How does this compare with Spark TTS?
       | 
       | https://github.com/SparkAudio/Spark-TTS
        
       | youssefabdelm wrote:
       | Anyone know if possible to fine-tune for cloning my voice?
        
       | xbmcuser wrote:
        | Wow, first time I have felt that this could be the end of voice
        | acting/audiobook narration etc. With the speed at which things
        | are changing, how soon before you can turn any book or novel
        | into a complete audio/video movie or TV show?
        
       | a2128 wrote:
        | What's the training process like? I have some data in my
        | language that I'd love to use to train it, seeing as it's
        | English-only.
        
       | popalchemist wrote:
       | This looks excellent, thank you for releasing openly.
        
       | hemloc_io wrote:
       | Very cool!
       | 
        | Insane how much low-hanging fruit there is for audio models
        | right now. A team of two picking things up over a few months
        | can build something that still competes with large players
        | with tons of funding.
        
         | toebee wrote:
         | Thank you for the kind words <3
        
       | lostmsu wrote:
       | Does this only work for two voices? Can I generate an entire
       | conversation between multiple people? Like this HN thread.
        
       | mclau157 wrote:
       | Will you support the other side with AI voice detection software
       | to detect and block malicious voice snippets?
        
       | rustc wrote:
       | Is this Apache licensed or a custom one? The README contains
       | this:
       | 
       | > This project is licensed under the Apache License 2.0 - see the
       | LICENSE file for details.
       | 
       | > This project offers a high-fidelity speech generation model
       | *intended solely for research and educational use*. The following
       | uses are strictly forbidden:
       | 
       | > Identity Misuse: Do not produce audio resembling real
       | individuals without permission.
       | 
       | > ...
       | 
       | Specifically the phrase "intended solely for research and
       | educational use".
        
         | montroser wrote:
         | Hmm, the "strictly forbidden" part seems more important than
         | whatever are their stated intentions... Either way, it seems
         | like it needs clarifying.
        
       | zhyder wrote:
        | Very, very cool: first time I've seen such expressiveness in
        | TTS for laughs, coughs, yelling about a fire, etc.!
       | 
       | What're the recommended GPU cloud providers for using such open-
       | weights models?
        
       | codingmoh wrote:
       | Hey, this is really cool! Curious how good the multi-language
       | support is. Also - pretty wild that you trained the whole thing
       | yourselves, especially without prior experience in speech models.
       | 
       | Might actually be helpful for others if you ever feel like
       | documenting how you got started and what the process looked like.
       | I've never worked with TTS models myself, and honestly wouldn't
       | know where to begin. Either way, awesome work. Big respect.
        
       | eob wrote:
       | Bravo -- this is fantastic.
       | 
       | I've been waiting for this ever since reading some interview with
       | Orson Scott Card ages ago. It turns out he thinks of his novels
       | as radio theater, not books. Which is a very different way to
       | experience the audio.
        
       | vagabund wrote:
       | The huggingface spaces link doesn't work, fyi.
       | 
       | Sounds awesome in the demo page though.
        
       | noiv wrote:
       | The demo page does fancy stuff when marking text and hitting
       | cmd-d to create a bookmark :)
        
       | Havoc wrote:
        | Sounds really good and human! Got a fair number of unexpected
        | artifacts though, e.g. 3 seconds of hissing noise before the
        | dialogue, and music in the background when I added (happy) in
        | an attempt to control tone. Also, I don't understand how to
        | control the S1 and S2 speakers... is it just random based on
        | temp?
       | 
       | > TODO Docker support
       | 
        | Got this adapted pretty easily: just the latest nvidia cuda
        | container, throw python and the modules on it, and change the
        | server to serve on 0.0.0.0. It does mean it pulls the model
        | every time on startup though, which isn't ideal.
        
         | yjftsjthsd-h wrote:
         | > Does mean it pulls the model every time on startup though
         | which isn't ideal
         | 
         | Surely it just downloads to a directory that can be volume
         | mapped?
        
           | Havoc wrote:
            | Yep. I just didn't spend the time to track down the
            | location, tbh. Plus Hugging Face usually symlinks into a
            | cache folder whose location I don't recall.
           | 
           | Literally got cuda containers working earlier today so
           | haven't spent a huge amount of time figuring things out
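
For reference, huggingface_hub caches downloads under
~/.cache/huggingface by default and honors the HF_HOME environment
variable, so mounting that path as a Docker volume avoids re-pulling
the model on every start. A stdlib-only sketch of resolving the cache
root:

```python
import os

def hf_cache_root() -> str:
    """Return the Hugging Face cache root: HF_HOME if set, otherwise
    the default ~/.cache/huggingface used by huggingface_hub."""
    return os.environ.get(
        "HF_HOME",
        os.path.join(os.path.expanduser("~"), ".cache", "huggingface"),
    )

print(hf_cache_root())
```

With that, something along the lines of
`docker run -v "$HOME/.cache/huggingface:/root/.cache/huggingface" ...`
(illustrative, not a command from the repo) lets the container reuse
the host's downloaded weights.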
        
       | jokethrowaway wrote:
        | Looking forward to trying it. My current go-to solution is
        | E5-F2 (great cloning, decent delivery, OK audio quality, a lot
        | of incoherence here and there, forcing you to do multiple
        | generations).
       | 
        | I've just been massively disappointed by Sesame's CSM: in the
        | gradio demo on their website it was generating flawless
        | dialogues with amazing voice cloning, but when running it
        | locally the voice cloning performance is awful.
        
       | xhkkffbf wrote:
       | Are there different voices? Or only [s1] and [s2] in the
       | examples?
        
       | hiAndrewQuinn wrote:
       | Is this English-only? I'm looking for a local model for Finnish
       | dialogue to run.
        
       | dindindin wrote:
       | Was this trained on Planet Money / NPR podcasts? The last audio
       | (continuation of prompt) sounds eerily like Planet Money, I had
       | to double check if my Spotify had accidentally started playing.
        
         | jelling wrote:
         | NPR voice is a thing.
         | 
         | It started with Ira Glass voice and now the default voice is
         | someone that sounds like they're not certain they should be
         | saying the very banal thing they are about to say, followed by
         | a hand-shake protocol of nervous laughter.
        
       | instagary wrote:
        | Does this use the mimi codec by moshi? If so, it would be
        | straightforward to get Dia running on iOS!
        
       ___________________________________________________________________
       (page generated 2025-04-21 23:00 UTC)