[HN Gopher] Abogen - Generate audiobooks from EPUBs, PDFs and text
___________________________________________________________________
Abogen - Generate audiobooks from EPUBs, PDFs and text
Author : mzehrer
Score : 267 points
Date : 2025-08-10 05:56 UTC (17 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| nikolayasdf123 wrote:
| can I choose any voice? would love to read software engineering
| books in voice of Morgan Freeman, or maybe even better, Scarlett
| Johansson
| hulitu wrote:
| Why not Stephen Hawking ?
| hajimuz wrote:
| Yeah, could be a buff like 500% brain supercharge.
| throwup238 wrote:
| Because the Stephen Hawking voice spends a quarter of its
| time joking/complaining how it never got a Nobel Prize.
| pyman wrote:
| The voice of Mickey Mouse would be nice.
| TOGoS wrote:
| The demo video doesn't seem to have any audio in it! At least
| none that either ffmpeg or whatever Firefox uses can recognize.
| huseyinkeles wrote:
| I can hear it on safari
| Daunk wrote:
| Same on my end, no audio in the video.
| jamilton wrote:
| Same here, but it worked when I opened it in Chrome. What a
| weird error - you would think that playing an embedded mp4 with
| audio wouldn't differ from browser to browser.
| mnmalst wrote:
| I was surprised by this as well at first but thinking about
| it, it would make sense when they use an audio codec which is
| not supported on the target system. In that case the video
| can still play but the audio can't. I wasn't aware tho that
| audio can be disabled separately.
| ertian wrote:
| Yeah, I've run a local Kokoro instance, and it doesn't work
| with Firefox. This uses Kokoro under the hood.
| noisem4ker wrote:
| The demo clip is static and has the Kokoro output encoded as
| the audio track. It's not Kokoro running and generating it in
| your browser in real time.
| noisem4ker wrote:
| It's probably due to the unusual sound format, 24kHz mono PCM,
| and the fact that it was somehow forced into a WebM container,
| which only supports Vorbis and Opus officially.
|
| It looks like the author created it using the "higher quality"
| ffmpeg command line, except for the "webm" final extension,
| producing the opposite of what's described as "an MP4 file
| that's compatible with more devices".
|
| https://github.com/denizsafak/abogen/tree/main/demo#for-high...
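|
| If the diagnosis above is right, one possible repair is to
| re-encode only the audio track to Opus, which WebM does support. A
| minimal sketch that just builds the ffmpeg command; the file names
| and the 64k bitrate are assumptions, and copying the video stream
| only works if it is already a WebM-compatible codec:

```python
import shlex


def build_remux_command(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that keeps the video and re-encodes audio to Opus."""
    return [
        "ffmpeg",
        "-i", src,          # input file (hypothetical name)
        "-c:v", "copy",     # leave the video stream untouched
        "-c:a", "libopus",  # re-encode the audio track to Opus
        "-b:a", "64k",      # assumed bitrate, adjust to taste
        dst,
    ]


cmd = build_remux_command("demo.webm", "demo_opus.webm")
print(shlex.join(cmd))
```

| The command is only printed here, not executed; pass the list to
| subprocess.run to actually run it.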
| frumiousirc wrote:
| Thanks for this. I thought I had some local issue with
| Waterfox. Pasting the (long) video URL into the terminal lets
| mpv play it with audio.
| 8s2ngy wrote:
| I've been using Kokoro TTS with the CLI app, audiblez, mentioned
| in the "Similar Projects" section of the README. The model is
| fast and delivers impressive quality for its small size. Some
| issues I have faced, however, are: a) It doesn't distinguish
| periods at the end of sentences from the dots in abbreviations
| such as "Mr." or "Mrs." The result is an awkward pause between
| "Mr." and the name. b) It doesn't handle ellipses well. c) Words
| are pronounced the same way regardless of context.
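|
| A common workaround for issue (a) is to expand titles before the
| text reaches the TTS engine. A minimal sketch, assuming a small
| hand-written table (the entries and the capital-letter heuristic
| are illustrative, not something audiblez or Kokoro ships):

```python
import re

# Hypothetical, deliberately small table; a real list would be much longer.
ABBREVIATIONS = {
    "Mr.": "Mister",
    "Mrs.": "Missus",
    "Dr.": "Doctor",
}

# Match an abbreviation only when whitespace and a capitalised word follow,
# which is a rough signal that the dot does not end a sentence.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(a) for a in ABBREVIATIONS) + r")(?=\s+[A-Z])"
)


def expand_abbreviations(text: str) -> str:
    """Replace known title abbreviations with their spoken form."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text)


print(expand_abbreviations("Mr. Smith met Mrs. Jones."))
```

| This is a heuristic: "Dr." can also mean "Drive", and a sentence
| can legitimately end in an abbreviation followed by a capitalised
| word.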
| rkagerer wrote:
| The Mr. / Mrs. thing feels like it would be a pretty easy fix,
| at least to eliminate a lot of the more common cases.
| hombre_fatal wrote:
| ^ A thought that everyone has had at one point when
| processing human text before learning the hard way (like end
| of sentence detection). :P
|
| The difference is that even weak LLMs are good at magically
| doing this, so I wonder what the problem is for the TTS
| mentioned above.
| leobg wrote:
| Kokoro is small and fast because all the text -> phoneme
| conversion is done by "dumb code" and only the phoneme ->
| sound part is done using a neural net.
| fudged71 wrote:
| Look into SSML phoneme tags. Some TTS engines support them. That
| way you can use a powerful LLM to fix these issues ahead of TTS.
| anotherpaul wrote:
| Does it turn it into spoken word or an audiobook? Because good
| audiobooks often have voice actors that read the characters with
| different emphasis and dialects. I imagine tools like chatgpt
| could do this for a few sentences but what about an 8-20 hour
| audiobook?
|
| I think there are still basic hurdles to take before we can go
| epub to audiobook in a quality that can compete with current
| state of the art.
|
| Or am I missing something?
| jamilton wrote:
| Elevenlabs has a feature for a "full cast"-type generation,
| where different characters will get different voices. It's
| certainly not automatically sensitive to dialect though.
|
| It's probably possible to do with current systems, though. I
| believe there are TTS systems that can use context/prompting to
| change emphasis and other speech qualities, though I'm not sure
| how reliably.
| pyman wrote:
| Is it open source?
| vorgol wrote:
| Have you heard results from it? How does it know for example,
| when there is a romantic scene in the book, which voice to
| read out as?
|
| It's definitely an excited voice, but is it read out as in a
| battle or as in a romantic scene?
| tummler wrote:
| I'm sure it's doable. I think you'd want to break it into a
| few discrete steps for the best quality. First process the
| book and identify key info like genre, tone, etc. Use that to
| determine the best voice(s) and reading style, assign actors
| for multiple characters/subjects. Maybe output some examples
| to spot check for approval. Tweak based on that then generate
| the audio. Prob a couple other steps in there and maybe a bit
| of custom work to optimize in key areas. If someone wants to
| do this as a side project I can help scope the architecture
| and process but I don't want to code it. :p
| fudged71 wrote:
| I don't think they do it automatically, though. I think you
| need to piece apart the transcript in their tool to decide
| which voice to use where.
| BenGosub wrote:
| There are a few character voices that also can be mixed using
| the mixer, achieving different nuances. You can then write your
| own code to use different voices for different characters.
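|
| A rough sketch of the simplest version of that idea: one voice
| for narration, another for anything in double quotes. The voice
| names are ones mentioned elsewhere in this thread; the split rule
| is an assumption, and the actual synthesis call is left out
| because it depends on the TTS library used:

```python
import re

NARRATOR_VOICE = "af_heart"    # assumed choice for narration
DIALOGUE_VOICE = "af_jessica"  # assumed choice for quoted speech


def split_by_voice(text: str) -> list[tuple[str, str]]:
    """Return (voice, segment) pairs, switching voice on double-quoted spans."""
    segments = []
    # A capturing group makes re.split keep the quoted parts in the result.
    for part in re.split(r'("[^"]*")', text):
        if not part.strip():
            continue
        voice = DIALOGUE_VOICE if part.startswith('"') else NARRATOR_VOICE
        segments.append((voice, part.strip()))
    return segments


for voice, segment in split_by_voice('She said, "Hello." Then she left.'):
    print(voice, "->", segment)
```

| Telling characters apart (who said which quote) is the hard
| part and is not attempted here.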
| parineum wrote:
| > Because good audiobooks often have voice actors that read the
| characters with different emphasis and dialects.
|
| I actually hate this. I like quotes to be read with the tone
| and inflection implied by the context but I don't like the
| different voices.
| crazygringo wrote:
| I'm with you. It's as if a book decided to use a different
| font for each character's speech. It's distracting, not
| helpful.
| floppyd wrote:
| I tried Kokoro for voicing blog posts and articles and wasn't
| impressed to be honest. Right now Gemini 2.5 Flash TTS is a much
| more capable system with generous free limits (about 10 minutes
| per generation and about 90 minutes per day). Voices are not very
| consistent between generations, but for shorter pieces it's not a
| big deal (but will obviously be for books)
| ekianjo wrote:
| Kokoro is fine for TTS, but it lacks emotion. But for a model
| of this size, that is kind of given.
| robin_reala wrote:
| Ironic given the name: kokoro is Japanese for heart or
| sentiment.
| SirHumphrey wrote:
| I played with ebook generation a bunch and find that (at
| least for English text) around 1B is needed to get something
| usable emotionally (Chatterbox is 0.5B, Orpheus is 3B).
| scotty79 wrote:
| I think the quality of the voice is super important for
| audiobooks and I think we are just closing in on the required
| quality with TTS.
|
| I played a bit with ElevenLabs voices, and while they aren't bad,
| when I tried to make them read a fragment of a text that I wrote,
| it sounded chaotic, boring, quite terrible, for anything longer than
| a sentence or two. But when I tried their v3 voices which they
| are currently in the process of rolling out, the same text
| sounded consistent, emotional, engaging, simply amazing. I think
| we are just crossing vocal uncanny valley.
| porker wrote:
| Strong agree that voice quality (and voice acting) is
| important. I listen to a lot of fiction audiobooks, and will
| listen to the end of a middling book with a good narrator, but
| if the narration is flat or out of keeping with the characters
| I'll stop after a chapter or two.
| dsign wrote:
| Nice!
|
| As an aside, while this tool can be used to create an audiobook
| from a book you have in text format, for your private
| consumption, having an author employ something like this to
| create files for distribution is extremely risky, even if they
| acknowledge its use and intend those files to only be available
| on their website.
|
| Indie authors struggle a lot to promote their works, and the new
| normal is that potential readers, the polite ones[^1], use the
| slightest hint of AI usage to discard their title and move
| on...as they are entitled to, since there are so many books.
|
| I in particular have started to hire voice actors that have good
| acting skills and good diction but for whom English is their
| second language, or it's their first language but they speak
| something else at home; sometimes I even ask them to go a notch
| up with their accents. It helps with the non-AI recognition, and
| it also increases the appeal of the book for people who would
| like to try out something new. Once, I did an audition for a
| project and was pleasantly surprised with how much life people
| from around the Mediterranean basin were able to inject into
| their renderings, compared with people from Britain and North
| America.
|
| [^1] Impolite readers set the town on fire, and then go about and
| spread that fire to neighboring towns, for good measure.
| baxtr wrote:
| I am a big-time user of Amazon's WhisperSync feature. With that
| feature I can simultaneously read the book and listen to it.
|
| This is especially helpful when you're on the go but still want
| to have a visual now and then or highlight text for later.
|
| The problem is that many books don't offer that feature. There
| is a built-in read function now in the kindle app, but it's
| crap.
|
| So, if you ask me, I'd prefer a good human-written book with an
| additional AI voice on top to enable that feature for me.
| em-bee wrote:
| yeah, i don't see the problem. using a generated voice, no
| matter how, only affects the audiobook, not the actual book.
| if i don't like the voice i can ignore it. i am part of a
| group that occasionally gets email from new authors wanting
| us to review their books. and some of them sound really
| interesting, and i'd love to read them, but i can only do
| audiobook, so i would be very happy if the author went
| through the effort to generate an audiobook that i could
| listen to.
| vahid4m wrote:
| I'm obsessed with "simultaneously read the book and listen to
| it". That's why I built WithAudio. You can check out the demo
| here: https://desktop.with.audio/reader-demo
|
| I'd love to hear any feedback you have. "prefer a good human-
| written book with an additional AI voice on top to enable
| that feature for me" is exactly what I prefer when it comes
| to reading.
| baxtr wrote:
| Thanks! Will definitely check it out
| montag wrote:
| Very surprising that you're offering this without
| subscription. Huge selling point. Next time the need
| arises, I'll come back for WithAudio.
| cyberax wrote:
| There's a nice project that attempts to do this:
| https://gitlab.com/storyteller-platform/storyteller
|
| I've been meaning to use its position sync protocol with
| KoReader, but it's not trivial.
| crazygringo wrote:
| > _and the new normal is that potential readers, the polite
| ones[^1], use the slightest hint of AI usage to discard their
| title and move on..._
|
| Is that the new normal?
|
| My impression is that when it comes to reading text, nobody
| cares as long as the final product is good.
|
| People don't want AI- _written_ books, but people have been
| comfortably listening to AI voices reading text for a long time
| now. Text-to-speech isn't really a controversial thing for
| listening to articles or books.
|
| (Which is very different from voice acting, for example, which
| requires _acting_ not just reading.)
| leke wrote:
| How big is this app?
| amaccuish wrote:
| Amazing, but I'm personally waiting for the one that generates a
| well-formatted ePub from a PDF.
| poulpy123 wrote:
| Perfect, I was looking for something like that! Is it GUI only,
| or is there an API available? I would like to be able to share a
| link or a text from my phone and get back the audio.
| logicprog wrote:
| I've been using this to try to make audiobooks out of various
| philosophy books I've been wanting to read, for accessibility
| reasons, and I ran into a critical problem: if the input text fed
| to Kokoro is too long, it'll start skipping words at the end or
| in the middle, or fade out at the end; and abogen chunks the text
| it feeds to Kokoro by sentence, so sentences of arbitrary length
| are fed to Kokoro without any guarding. This produces unusable
| audiobooks for me. I'm working on "vibe coding" my own
| Kokoro-based tkinter personal GUI app for the same purpose that
| uses nltk and some regex magic for better splitting.
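|
| To illustrate the kind of guard described above, a minimal
| stdlib-only sketch that splits by sentence and then caps the
| length of each chunk. The 300-character limit is an assumption
| (the thread does not state where Kokoro starts skipping words),
| and the regexes are much cruder than nltk's sentence tokenizer:

```python
import re


def _pack(pieces: list[str], max_chars: int) -> list[str]:
    """Greedily join pieces with spaces, starting a new chunk when the cap is hit."""
    out, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                out.append(current)
            current = piece
    if current:
        out.append(current)
    return out


def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Split into sentences; break over-long ones at clause, then word boundaries."""
    chunks = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        if len(sentence) <= max_chars:
            chunks.append(sentence)
            continue
        # Prefer breaking after , ; : so pauses land in natural places.
        for clause in _pack(re.split(r"(?<=[,;:])\s+", sentence), max_chars):
            if len(clause) <= max_chars:
                chunks.append(clause)
            else:
                # Last resort: pack word by word. A single word longer than
                # max_chars is still emitted as-is.
                chunks.extend(_pack(clause.split(), max_chars))
    return chunks


sample = "Short one. " + "word " * 100 + "end."
for chunk in chunk_text(sample, max_chars=50):
    print(len(chunk), chunk)
```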
| gavinray wrote:
| I use "kokoro-tts" CLI, which has better chunking/splitting.
|
| https://github.com/nazdridoy/kokoro-tts
|
| It generates a directory of audio files, along with a metadata
| file for ebook chapters
|
| You have to use m4b-tool to stitch the audio files together
| into an audiobook and include the chapter metadata, but it
| works great:
|
| https://github.com/sandreas/m4b-tool
|
| I've been meaning to write a post on this workflow because it's
| incredibly useful
| logicprog wrote:
| I'll look into this! But I have to say I'm a bit attached to
| the little app I've ended up having AI make for myself lol.
| It's so cute, and it's mine!
| RicoElectrico wrote:
| I just can't stand how non-deterministic many deep learning
| TTSes are. At least the classical ones have predictable
| pronunciation which can be worked around if needed.
| frumiousirc wrote:
| This needs to be run from an environment where `pip` is available
| as that tool is used during the running of the abogen app. Using
| `uv tool run abogen` gets you started but then the app hangs at
| model install time. `uv venv && uv pip install pip && source
| .venv/bin/activate && abogen` lets it run properly.
|
| Otherwise, it's a nicely packaged GUI. Well done!
|
| I tried a PDF and the UI to select pages or sections is good and
| generation is fast on my laptop's GTX 1650.
|
| The result is an .ogg audio file and an .ass subtitle file. Playing
| them with mpv lets you listen and read along in the terminal. The
| only issue I have with the result is that visual line breaks from the
| PDF are preserved resulting in long pauses "randomly" in the
| middle of sentences. This greatly interrupts understanding of the
| audio.
|
| Edit: enabling the skipping of single newlines helps!
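|
| For text extracted outside the app, the same effect can be
| approximated by joining lone newlines while keeping blank-line
| paragraph breaks. A minimal sketch; this is not necessarily how
| abogen implements its option:

```python
import re


def unwrap_lines(text: str) -> str:
    """Replace single newlines with spaces; leave blank-line paragraph breaks alone."""
    # A newline is "single" when it has no newline directly before or after it.
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)


print(unwrap_lines("This is a line\nbroken by the PDF.\n\nNew paragraph."))
```

| Words hyphenated across a line break are not rejoined here.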
| nnashrat wrote:
| I just converted a 110 page book to wav in about an hour with a
| RTX 4060.
|
| I didn't have the newlines enabled though so it was pretty
| useless.
|
| Enabling makes this pretty awesome.
|
| af_heart is a great voice to me while af_jessica I find
| annoying. That is the main issue I have with audiobooks: the
| randomness of liking the voice actor or not almost matters as
| much as what the book says for me.
|
| I knew this day was coming soon and I really am blown away. I
| have gotten so used to audiobooks that it is hard to actually sit
| and read a full book for me. I have about 20 books to convert
| that would never have a market to bother having someone read
| the book and in a voice I really like. Incredible.
| gman83 wrote:
| I love audiobooks, but I'm a stickler for good narration. I've
| stopped listening to plenty of audiobooks because I didn't like
| the narrator. I guess it will be a long time before I can use
| something like this.
| NBJack wrote:
| I recall one series where R. C. Bray had been doing the
| narration for several books, then for undisclosed reasons they
| replaced him with another narrator. The drop in quality was so
| bad I eventually gave up trying to finish the series (though
| admittedly the author(s) didn't seem to be helping much with
| the content).
|
| Some narrators, like Wil Wheaton, are so entertaining to me I
| actively search by what they have voiced.
|
| In general, I have to agree the narrator can make or break a
| series.
| ratelimitsteve wrote:
| Coming from the other side of this, I've had a good narrator
| sell me entire series in the past. The Grimnoir Chronicles is
| the first that comes to mind (idk if anyone remembers the
| terrible sitcom Perfect Strangers but he played Balki and in
| real life he has a buttery smooth baritone voice that I just
| adore) and anything that Soundbooth Theater touches, partly for
| Jeff Hays and partly because of the full-cast adaptations they
| do that feel like old school radio plays to me. I see no reason
| to use this over existing text to speech features. If i just
| want to mechanically turn patterns of light into patterns of
| vibration there's non-AI tech that will do that for free, and
| AI narration doesn't do what human narration does yet.
| criddell wrote:
| What are some of your favorite audiobooks?
| crazygringo wrote:
| > _I've stopped listening to plenty of audiobooks because I
| didn't like the narrator._
|
| I'm with you on this, but my reaction is the opposite -- I'm
| wondering if there are some books I couldn't stand to listen
| to, that now I could with a nice neutral narration voice?
| Instead of the weird untrained voice with weird vocal tics that
| was the official narration?
| xtracto wrote:
| I assume it doesn't work well for books that have non-text
| structured elements (code, diagrams, etc.) or images (which is
| expected).
|
| I wonder, is there some open source NN that can consume PDF pages
| and produce a "pure prose" version of it. Say, a page with mixed
| text and an image of a car engine would be output to the text and
| then a detailed description of the image, or what it is
| depicting.
| numb7rs wrote:
| You will want to reconsider the name if you plan to have a
| presence in Australia or New Zealand. "Abo" is an ethnic slur
| similar in offensiveness to the N-word.
| yepyip wrote:
| Everyone has a gimmick these days. If abos are that sensitive,
| let them not have access to solutions--no need to worry about
| them.
| someperson wrote:
| The project presumably is a portmanteau of "audio book
| generator".
|
| I agree that the project need not be renamed to remove the
| single syllable that may be an obscure slur, especially since
| every syllable may be an obscure slur in some language and
| you can't expect somebody to learn them all just to avoid
| them.
|
| But there was no need to use that syllable as a slur.
| isaacremuant wrote:
| Don't worry. Australia will probably ban it for some reason
| anyway. Better to be free.
|
| Btw, Don't look into the name of a famous python formatter or
| you might be offended.
| m_sahaf wrote:
| I imagine a pipeline between Calibre-Web[0] and audiobookshelf[1]
| going through Abogen, where Calibre-Web supplies the books,
| Abogen generates the audio version of it, and Audiobookshelf
| serves them. Great solution for the hearing impaired.
|
| [0] https://github.com/janeczku/calibre-web
|
| [1] https://github.com/advplyr/audiobookshelf
| lynx97 wrote:
| DAISY would be a desirable output format.
___________________________________________________________________
(page generated 2025-08-10 23:00 UTC)