hngopher.com

       [HN Gopher] Abogen - Generate audiobooks from EPUBs, PDFs and text
       ___________________________________________________________________
        
       Abogen - Generate audiobooks from EPUBs, PDFs and text
        
       Author : mzehrer
       Score  : 267 points
       Date   : 2025-08-10 05:56 UTC (17 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | nikolayasdf123 wrote:
       | can I choose any voice? would love to read software engineering
       | books in voice of Morgan Freeman, or maybe even better, Scarlett
       | Johansson
        
         | hulitu wrote:
         | Why not Stephen Hawking ?
        
           | hajimuz wrote:
           | Yeah, could be a buff like 500% brain supercharge.
        
           | throwup238 wrote:
           | Because the Stephen Hawking voice spends a quarter of its
           | time joking/complaining how it never got a Nobel Prize.
        
         | pyman wrote:
         | The voice of Mickey Mouse would be nice.
        
       | TOGoS wrote:
       | The demo video doesn't seem to have any audio in it! At least
       | none that either ffmpeg or whatever Firefox uses can recognize.
        
         | huseyinkeles wrote:
         | I can hear it on safari
        
         | Daunk wrote:
         | Same on my end, no audio in the video.
        
         | jamilton wrote:
         | Same here, but it worked when I opened it in Chrome. What a
         | weird error - you would think that playing an embedded mp4 with
         | audio wouldn't differ from browser to browser.
        
           | mnmalst wrote:
           | I was surprised by this as well at first but thinking about
           | it, it would make sense when they use an audio codec which is
           | not supported on the target system. In that case the video
           | can still play but the audio can't. I wasn't aware tho that
           | audio can be disabled separately.
        
         | ertian wrote:
         | Yeah, I've run a local Kokoro instance, and it doesn't work
         | with Firefox. This uses Kokoro under the hood.
        
           | noisem4ker wrote:
           | The demo clip is static and has the Kokoro output encoded as
           | the audio track. It's not Kokoro running and generating it in
           | your browser in real time.
        
         | noisem4ker wrote:
         | It's probably due to the unusual sound format, 24kHz mono PCM,
         | and the fact that it was somehow forced into a WebM container,
         | which only supports Vorbis and Opus officially.
         | 
         | It looks like the author created it using the "higher quality"
         | ffmpeg command line, except for the "webm" final extension,
         | producing the opposite of what's described as "an MP4 file
         | that's compatible with more devices".
         | 
         | https://github.com/denizsafak/abogen/tree/main/demo#for-high...
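
A minimal sketch of how a file like that could be repaired, assuming the diagnosis above is right. The codec table encodes only what the comment states (WebM officially carries Vorbis or Opus audio); `reencode_command` is a hypothetical helper, not part of abogen, and it only builds the ffmpeg argument list without running it. It also assumes an ffmpeg build that includes the `libopus` or `libvorbis` encoder.

```python
# Audio codecs WebM officially carries, mapped to ffmpeg encoder names.
WEBM_AUDIO_ENCODERS = {"opus": "libopus", "vorbis": "libvorbis"}


def reencode_command(src: str, dst: str, audio_codec: str = "opus") -> list[str]:
    """Build an ffmpeg command that copies the video and re-encodes the audio."""
    if not dst.lower().endswith(".webm"):
        raise ValueError("this sketch only knows the WebM rules")
    if audio_codec not in WEBM_AUDIO_ENCODERS:
        raise ValueError(f"{audio_codec!r} is not a WebM audio codec")
    return [
        "ffmpeg", "-i", src,
        "-c:v", "copy",                             # leave the video stream untouched
        "-c:a", WEBM_AUDIO_ENCODERS[audio_codec],   # re-encode only the audio
        dst,
    ]


if __name__ == "__main__":
    print(" ".join(reencode_command("demo.webm", "demo-fixed.webm")))
```

Passing the returned list to `subprocess.run` would perform the conversion; rejecting raw PCM up front is the point of the check.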
        
         | frumiousirc wrote:
         | Thanks for this. I thought I had some local issue with
         | waterfox. Pasting the (long) video URL to the terminal lets
         | mpv play it with audio.
        
       | 8s2ngy wrote:
       | I've been using Kokoro TTS with the CLI app, audiblez, mentioned
       | in the "Similar Projects" section of the README. The model is
       | fast and delivers impressive quality for its small size. Some
       | issues I have faced, however, are: a) It doesn't distinguish
       | periods at the end of sentences from the dots in abbreviations
       | such as "Mr." or "Mrs." The result is an awkward pause between
       | "Mr." and the name. b) It doesn't handle ellipses well. c) Words
       | are pronounced the same way regardless of context.
        
         | rkagerer wrote:
         | The Mr. / Mrs. thing feels like it would be a pretty easy fix,
         | at least to eliminate a lot of the more common cases.
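
A rough sketch of the pre-pass suggested here, assuming the text can be rewritten before it reaches the TTS engine. The `ABBREVIATIONS` table and `expand_abbreviations` are hypothetical names for illustration, not anything audiblez or Kokoro provides. It covers only the common cases: an abbreviation is expanded when a capitalized word follows, so a sentence ending in "Dr." followed by a new sentence would still be misread.

```python
import re

# Hypothetical lookup table; extend it for the text being narrated.
ABBREVIATIONS = {
    "Mr.": "Mister",
    "Mrs.": "Missus",
    "Dr.": "Doctor",
}

# Match a known abbreviation only when whitespace and a capitalized word follow.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(a) for a in ABBREVIATIONS) + r")(?=\s+[A-Z])"
)


def expand_abbreviations(text: str) -> str:
    """Replace title abbreviations with spoken forms so no sentence pause is inserted."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text)


if __name__ == "__main__":
    print(expand_abbreviations("Mr. Smith met Mrs. Jones."))
```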
        
           | hombre_fatal wrote:
           | ^ A thought that everyone has had at one point when
           | processing human text before learning the hard way (like end
           | of sentence detection). :P
           | 
           | The difference is that even weak LLMs are good at magically
           | doing this, so I wonder what the problem is for the TTS
           | mentioned above.
        
             | leobg wrote:
             | Kokoro is small and fast because all the text -> phoneme
             | conversion is done by "dumb code" and only the phoneme ->
             | sound part is done using a neural net.
        
         | fudged71 wrote:
         | Look into SSML phoneme tags. Some TTS engines support them. That
         | way you can use a powerful LLM to fix these issues ahead of TTS.
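
For illustration, a small sketch of what emitting SSML phoneme tags could look like. The `<speak>` and `<phoneme alphabet="ipa" ph="...">` elements come from the W3C SSML specification; `to_ssml` and the pronunciation dictionary are hypothetical, with the dictionary standing in for whatever an LLM would decide. Whether Kokoro or abogen accepts SSML is not stated in this thread.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def to_ssml(text: str, pronunciations: dict[str, str]) -> str:
    """Wrap words that have an IPA override in <phoneme> tags and escape the rest."""
    parts = []
    # The capture group keeps both the words and the separators between them.
    for token in re.split(r"(\w+)", text):
        ipa = pronunciations.get(token)
        if ipa is None:
            parts.append(escape(token))
        else:
            parts.append(
                f"<phoneme alphabet=\"ipa\" ph={quoteattr(ipa)}>{escape(token)}</phoneme>"
            )
    return "<speak>" + "".join(parts) + "</speak>"


if __name__ == "__main__":
    print(to_ssml("I live & work", {"live": "lɪv"}))
```

A word-level dictionary cannot tell "live" the verb from "live" the adjective; a context-aware version would take per-occurrence spans instead.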
        
       | anotherpaul wrote:
       | Does it turn it into spoken word or an audiobook? Because good
       | audiobooks often have voice actors that read the characters with
       | different emphasis and dialects. I imagine tools like chatgpt
       | could do this for a few sentences but what about an 8-20 hour
       | audiobook?
       | 
       | I think there are still basic hurdles to take before we can go
       | epub to audiobook in a quality that can compete with current
       | state of the art.
       | 
       | Or am I missing something?
        
         | jamilton wrote:
         | Elevenlabs has a feature for a "full cast"-type generation,
         | where different characters will get different voices. It's
         | certainly not automatically sensitive to dialect though.
         | 
         | It's probably possible to do with current systems, though. I
         | believe there are TTS systems that can use context/prompting to
         | change emphasis and other speech qualities, though I'm not sure
         | how reliably.
        
           | pyman wrote:
           | Is it open source?
        
           | vorgol wrote:
           | Have you heard results from it? How does it know for example,
           | when there is a romantic scene in the book, which voice to
           | read out as?
           | 
           | It's definitely an excited voice, but is it read out as in a
           | battle or as in a romantic scene?
        
           | tummler wrote:
           | I'm sure it's doable. I think you'd want to break it into a
           | few discrete steps for the best quality. First process the
           | book and identify key info like genre, tone, etc. Use that to
           | determine the best voice(s) and reading style, assign actors
           | for multiple characters/subjects. Maybe output some examples
           | to spot check for approval. Tweak based on that then generate
           | the audio. Prob a couple other steps in there and maybe a bit
           | of custom work to optimize in key areas. If someone wants to
           | do this as a side project I can help scope the architecture
           | and process but I don't want to code it. :p
        
           | fudged71 wrote:
           | I don't think they do it automatically, though. I think you
           | need to piece apart the transcript in their tool to decide
           | which voice to use where.
        
         | BenGosub wrote:
         | There are a few character voices that also can be mixed using
         | the mixer, achieving different nuances. You can then write your
         | own code to use different voices for different characters.
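
A minimal sketch of the "write your own code" step, assuming the simplest possible rule: text inside straight double quotes goes to one voice and everything else to a narrator voice. `assign_voices` is a hypothetical helper; the default voice names are the two Kokoro voices mentioned elsewhere in this thread, and mapping individual characters to voices would need speaker attribution that this sketch does not attempt.

```python
import re


def assign_voices(
    text: str,
    narrator_voice: str = "af_heart",
    dialogue_voice: str = "af_jessica",
) -> list[tuple[str, str]]:
    """Return (voice, segment) pairs, alternating between narration and quoted speech."""
    segments = []
    # With one capture group, odd-indexed items are the quoted spans.
    for index, part in enumerate(re.split(r'"([^"]*)"', text)):
        part = part.strip()
        if not part:
            continue
        voice = dialogue_voice if index % 2 else narrator_voice
        segments.append((voice, part))
    return segments


if __name__ == "__main__":
    for voice, segment in assign_voices('She said, "Hello there." Then she left.'):
        print(voice, "->", segment)
```

Each pair would then be synthesized separately and the audio concatenated in order.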
        
         | parineum wrote:
         | > Because good audiobooks often have voice actors that read the
         | characters with different emphasis and dialects.
         | 
         | I actually hate this. I like quotes to be read with the tone
         | and inflection implied by the context but I don't like the
         | different voices.
        
           | crazygringo wrote:
           | I'm with you. It's as if a book decided to use a different
           | font for each character's speech. It's distracting, not
           | helpful.
        
       | floppyd wrote:
       | I tried Kokoro for voicing blog posts and articles and wasn't
       | impressed to be honest. Right now Gemini 2.5 Flash TTS is a much
       | more capable system with generous free limits (about 10 minutes
       | per generation and about 90 minutes per day). Voices are not very
       | consistent between generations, but for shorter pieces it's not a
       | big deal (but will obviously be for books)
        
         | ekianjo wrote:
         | Kokoro is fine for TTS, but it lacks emotion. But for a model
         | of this size, that is kind of a given.
        
           | robin_reala wrote:
           | Ironic given the name: kokoro is Japanese for heart or
           | sentiment.
        
           | SirHumphrey wrote:
           | I played with ebook generation a bunch and find that (at
           | least for English text) around 1B is needed to get something
           | usable emotionally (Chatterbox is 0.5B, Orpheus is 3B).
        
       | scotty79 wrote:
       | I think the quality of the voice is super important for
       | audiobooks and I think we are just closing in on the required
       | quality with TTS.
       | 
       | I played a bit with Eleven labs voices and while they aren't bad
       | when I tried to make them read a fragment of a text that I wrote,
       | it sounded chaotic, boring, quite terrible, for anything longer than
       | a sentence or two. But when I tried their v3 voices which they
       | are currently in the process of rolling out, the same text
       | sounded consistent, emotional, engaging, simply amazing. I think
       | we are just crossing vocal uncanny valley.
        
         | porker wrote:
         | Strong agree that voice quality (and voice acting) is
         | important. I listen to a lot of fiction audiobooks, and will
         | listen to the end of a middling book with a good narrator, but
         | if the narration is flat or out of keeping with the characters
         | I'll stop after a chapter or two.
        
       | dsign wrote:
       | Nice!
       | 
       | As an aside, while this tool can be used to create an audiobook
       | from a book you have in text format, for your private
       | consumption, having an author employ something like this to
       | create files for distribution is extremely risky, even if they
       | acknowledge its use and intend those files to only be available
       | on their website.
       | 
       | Indie authors struggle a lot to promote their works, and the new
       | normal is that potential readers, the polite ones[^1], use the
       | slightest hint of AI usage to discard their title and move
       | on...as they are entitled to, since there are so many books.
       | 
       | I in particular have started to hire voice actors that have good
       | acting skills and good diction but for whom English is their
       | second language, or it's their first language but they speak
       | something else at home; sometimes I even ask them to go a notch
       | up with their accents. It helps with the non-AI recognition, and
       | it also increases the appeal of the book for people who would
       | like to try out something new. Once, I did an audition for a
       | project and was pleasantly surprised with how much life people
       | from around the Mediterranean basin were able to inject into
       | their renderings, compared with people from Britain and North
       | America.
       | 
       | [^1] Impolite readers set the town on fire, and then go about and
       | spread that fire to neighboring towns, for good measure.
        
         | baxtr wrote:
         | I am a big-time user of Amazon's WhisperSync feature. With that
         | feature I can simultaneously read the book and listen to it.
         | 
         | This is especially helpful when you're on the go but still want
         | to have a visual now and then or highlight text for later.
         | 
         | The problem is that many books don't offer that feature. There
         | is a built-in read function now in the kindle app, but it's
         | crap.
         | 
         | So, if you ask me, I'd prefer a good human-written book with an
         | additional AI voice on top to enable that feature for me.
        
           | em-bee wrote:
           | yeah, i don't see the problem. using a generated voice, no
           | matter how, only affects the audiobook, not the actual book.
           | if i don't like the voice i can ignore it. i am part of a
           | group that occasionally gets email from new authors wanting
           | us to review their books. and some of them sound really
           | interesting, and i'd love to read them, but i can only do
           | audiobook, so i would be very happy if the author went
           | through the effort to generate an audiobook that i could
           | listen to.
        
           | vahid4m wrote:
           | I'm obsessed with "simultaneously read the book and listen to
           | it". That's why I built WithAudio. You can check out the demo
           | here: https://desktop.with.audio/reader-demo
           | 
           | I'd love to hear any feedback you have. "prefer a good human-
           | written book with an additional AI voice on top to enable
           | that feature for me" is exactly what I prefer when it comes
           | to reading.
        
             | baxtr wrote:
             | Thanks! Will definitely check it out
        
             | montag wrote:
             | Very surprising that you're offering this without
             | subscription. Huge selling point. Next time the need
             | arises, I'll come back for WithAudio.
        
           | cyberax wrote:
           | There's a nice project that attempts to do this:
           | https://gitlab.com/storyteller-platform/storyteller
           | 
           | I've been meaning to use its position sync protocol with
           | KoReader, but it's not trivial.
        
         | crazygringo wrote:
         | > _and the new normal is that potential readers, the polite
         | ones[^1], use the slightest hint of AI usage to discard their
         | title and move on..._
         | 
         | Is that the new normal?
         | 
         | My impression is that when it comes to reading text, nobody
         | cares as long as the final product is good.
         | 
         | People don't want AI- _written_ books, but people have been
         | comfortably listening to AI voices reading text for a long time
         | now. Text-to-speech isn't really a controversial thing for
         | listening to articles or books.
         | 
         | (Which is very different from voice acting, for example, which
         | requires _acting_ not just reading.)
        
       | leke wrote:
       | How big is this app?
        
       | amaccuish wrote:
       | Amazing, but I'm personally waiting for the one that generates a
       | well-formatted ePub from a PDF.
        
       | poulpy123 wrote:
       | Perfect, I was looking for something like that! Is it GUI only,
       | or is there an API available? I would like to be able to share a
       | link or a text from my phone and get back the audio
        
       | logicprog wrote:
       | I've been using this to try to make audiobooks out of various
       | philosophy books I've been wanting to read, for accessibility
       | reasons, and I ran into a critical problem: if the input text fed
       | to Kokoro is too long, it'll start skipping words at the end or
       | in the middle, or fade out at the end; and abogen chunks the text
       | it feeds to Kokoro by sentence, so sentences of arbitrary length
       | are fed to Kokoro without any guarding. This produces unusable
       | audiobooks for me. I'm working on "vibe coding" my own Kokoro
       | based tkinter personal gui app for the same purpose that uses
       | nltk and some regex magic for better splitting.
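
A sketch of the kind of guarded splitting described here, using only the standard library rather than nltk. `split_for_tts` and the 200-character default are assumptions for illustration; the thread does not say what length Kokoro actually tolerates. Sentences that fit are kept whole, longer ones are split at clause punctuation, and anything still too long is packed word by word.

```python
import re


def _pack(pieces: list[str], max_chars: int) -> list[str]:
    """Greedily join pieces with spaces without exceeding max_chars."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def split_for_tts(text: str, max_chars: int = 200) -> list[str]:
    """Split text into chunks no longer than max_chars (a single longer word is kept whole)."""
    chunks = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        if len(sentence) <= max_chars:
            chunks.append(sentence)
            continue
        # Too long: prefer clause boundaries, then fall back to word boundaries.
        for group in _pack(re.split(r"(?<=[,;:])\s+", sentence), max_chars):
            if len(group) <= max_chars:
                chunks.append(group)
            else:
                chunks.extend(_pack(group.split(), max_chars))
    return chunks


if __name__ == "__main__":
    sample = "Short one. " + ", ".join(["alpha beta gamma"] * 20) + "."
    for chunk in split_for_tts(sample, max_chars=40):
        print(len(chunk), chunk)
```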
        
         | gavinray wrote:
         | I use "kokoro-tts" CLI, which has better chunking/splitting.
         | 
         | https://github.com/nazdridoy/kokoro-tts
         | 
         | It generates a directory of audio files, along with a metadata
         | file for ebook chapters
         | 
         | You have to use m4b-tool to stitch the audio files together
         | into an audiobook and include the chapter metadata, but it
         | works great:
         | 
         | https://github.com/sandreas/m4b-tool
         | 
         | I've been meaning to write a post on this workflow because it's
         | incredibly useful
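
As an illustration of what the chapter metadata amounts to, here is a sketch that writes chapter markers in ffmpeg's FFMETADATA text format from a list of per-chapter durations. This is a swapped-in technique, not how kokoro-tts or m4b-tool work internally; `ffmetadata_chapters` and the example titles and durations are hypothetical.

```python
import re


def _escape(value: str) -> str:
    """Backslash-escape the characters that are special in FFMETADATA files."""
    return re.sub(r"([=;#\\\n])", r"\\\1", value)


def ffmetadata_chapters(chapters: list[tuple[str, float]]) -> str:
    """Build FFMETADATA text from (title, duration_in_seconds) pairs, in playback order."""
    lines = [";FFMETADATA1"]
    start_ms = 0
    for title, seconds in chapters:
        end_ms = start_ms + round(seconds * 1000)
        lines += [
            "[CHAPTER]",
            "TIMEBASE=1/1000",       # START and END are in milliseconds
            f"START={start_ms}",
            f"END={end_ms}",
            f"title={_escape(title)}",
        ]
        start_ms = end_ms
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(ffmetadata_chapters([("Intro", 1.5), ("Part 1", 2)]))
```

The durations would come from the generated per-chapter audio files, and the resulting text can be supplied to ffmpeg as a metadata input when muxing.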
        
           | logicprog wrote:
           | I'll look into this! But I have to say I'm a bit attached to
           | the little app I've ended up having AI make for myself lol.
           | It's so cute, and it's mine!
        
         | RicoElectrico wrote:
         | I just can't stand how non-deterministic many deep learning
         | TTSes are. At least the classical ones have predictable
         | pronunciation which can be worked around if needed.
        
       | frumiousirc wrote:
       | This needs to be run from an environment where `pip` is available
       | as that tool is used during the running of the abogen app. Using
       | `uv tool run abogen` gets you started but then the app hangs at
       | model install time. `uv venv && uv pip install pip && source
       | .venv/bin/activate && abogen` lets it run properly.
       | 
       | Otherwise, it's a nicely packaged GUI. Well done!
       | 
       | I tried a PDF and the UI to select pages or sections is good and
       | generation is fast on my laptop's GTX 1650.
       | 
       | The result is an .ogg audio and .ass subtitle file. Played with
       | mpv allows listening and reading along in the terminal. Only
       | issue I have with the result is that visual line breaks from the
       | PDF are preserved resulting in long pauses "randomly" in the
       | middle of sentences. This greatly interrupts understanding of the
       | audio.
       | 
       | Edit: enabling the skipping of single newlines helps!
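
A minimal sketch of what skipping single newlines can mean for text extracted from a PDF: blank lines are treated as paragraph breaks and every other line break as a visual wrap. `unwrap_single_newlines` is a hypothetical helper; how abogen implements its option is not described in this thread.

```python
import re


def unwrap_single_newlines(text: str) -> str:
    """Join visually wrapped lines into paragraphs, keeping blank-line paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse all whitespace inside a paragraph, including the single newlines.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


if __name__ == "__main__":
    wrapped = "The quick\nbrown fox.\n\nNext\nparagraph."
    print(unwrap_single_newlines(wrapped))
```

Words hyphenated across a line break are left as they are, since a hyphen at a line end may be genuine.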
        
         | nnashrat wrote:
         | I just converted a 110 page book to wav in about an hour with a
         | RTX 4060.
         | 
         | I didn't have the newlines enabled though so it was pretty
         | useless.
         | 
         | Enabling makes this pretty awesome.
         | 
         | af_heart is a great voice to me while af_jessica I find
         | annoying. That is the main issue I have with audiobooks: the
         | randomness of liking the voice actor or not almost matters as
         | much as what the book says for me.
         | 
         | I knew this day was coming soon and I really am blown away. I
         | have got so used to audiobooks that it is hard to actually sit
         | and read a full book for me. I have about 20 books to convert
         | that would never have a market to bother having someone read
         | the book and in a voice I really like. Incredible.
        
       | gman83 wrote:
       | I love audiobooks, but I'm a stickler for good narration. I've
       | stopped listening to plenty of audiobooks because I didn't like
       | the narrator. I guess it will be a long time before I can use
       | something like this.
        
         | NBJack wrote:
         | I recall one series where R. C. Bray had been doing the
         | narration for several books, then for undisclosed reasons they
         | replaced him with another narrator. The drop in quality was so
         | bad I eventually gave up trying to finish the series (though
         | admittedly the author(s) didn't seem to be helping much with
         | the content).
         | 
         | Some narrators, like Wil Wheaton, are so entertaining to me I
         | actively search by what they have voiced.
         | 
         | In general, I have to agree the narrator can make or break a
         | series.
        
         | ratelimitsteve wrote:
         | Coming from the other side of this, I've had a good narrator
         | sell me entire series in the past. The Grim Noir Chronicles is
         | the first that comes to mind (idk if anyone remembers the
         | terrible sitcom Perfect Strangers but he played Balki and in
         | real life he has a buttery smooth baritone voice that I just
         | adore) and anything that Soundbooth Theater touches, partly for
         | Jeff Hays and partly because of the full-cast adaptations they
         | do that feel like old school radio plays to me. I see no reason
         | to use this over existing text to speech features. If i just
         | want to mechanically turn patterns of light into patterns of
         | vibration there's non-AI tech that will do that for free, and
         | AI narration doesn't do what human narration does yet.
        
         | criddell wrote:
         | What are some of your favorite audiobooks?
        
         | crazygringo wrote:
         | > _I've stopped listening to plenty of audiobooks because I
         | didn't like the narrator._
         | 
         | I'm with you on this, but my reaction is the opposite -- I'm
         | wondering if there are some books I couldn't stand to listen
         | to, that now I could with a nice neutral narration voice?
         | Instead of the weird untrained voice with weird vocal tics that
         | was the official narration?
        
       | xtracto wrote:
       | I assume it doesn't work well for books that have non-text
       | structured elements (code, diagrams, etc.) or images (which is
       | expected).
       | 
       | I wonder, is there some open source NN that can consume PDF pages
       | and produce a "pure prose" version of it. Say, a page with mixed
       | text and an image of a car engine would be output to the text and
       | then a detailed description of the image, or what it is
       | depicting.
        
       | numb7rs wrote:
       | You will want to reconsider the name if you plan to have a
       | presence in Australia or New Zealand. "Abo" is an ethnic slur
       | similar in offensiveness to the N-word.
        
         | yepyip wrote:
         | Everyone has a gimmick these days. If abos are that sensitive,
         | let them not have access to solutions--no need to worry about
         | them.
        
           | someperson wrote:
           | The project presumably is a portmanteau of "audio book
           | generator".
           | 
           | I agree that the project need not be renamed to remove the
           | single syllable that may be an obscure slur, especially since
           | every syllable may be an obscure slur in some language and
           | you can't expect somebody to learn them all just to avoid
           | them.
           | 
           | But there was no need to use that syllable as a slur.
        
         | isaacremuant wrote:
         | Don't worry. Australia will probably ban it for some reason
         | anyway. Better to be free.
         | 
         | Btw, Don't look into the name of a famous python formatter or
         | you might be offended.
        
       | m_sahaf wrote:
       | I imagine a pipeline between Calibre-Web[0] and audiobookshelf[1]
       | going through Abogen, where Calibre-Web supplies the books,
       | Abogen generates the audio version of it, and Audiobookshelf
       | serves them. Great solution for the hearing impaired.
       | 
       | [0] https://github.com/janeczku/calibre-web
       | 
       | [1] https://github.com/advplyr/audiobookshelf
        
       | lynx97 wrote:
       | DAISY would be a desirable output format.
        
       ___________________________________________________________________
       (page generated 2025-08-10 23:00 UTC)