[HN Gopher] Generate audiobooks from E-books with Kokoro-82M
       ___________________________________________________________________
        
       Generate audiobooks from E-books with Kokoro-82M
        
       Author : csantini
       Score  : 375 points
       Date   : 2025-01-15 08:47 UTC (14 hours ago)
        
 (HTM) web link (claudio.uk)
 (TXT) w3m dump (claudio.uk)
        
       | jaggs wrote:
       | This looks really nice. And fast too it seems.
        
       | treetalker wrote:
       | For anyone looking for an easier alternative (and one without the
       | bugs the author describes, such as skipping some prefaces or
       | failing to detect some chapters), Voice Dream Reader on iOS (and
       | macOS) handles .epub and other e-books just fine and supports a
       | variety of built-in and external voices.
        
         | huhtenberg wrote:
         | Another subscription.
         | 
         | $80/yr.
         | 
         | Yaaaaaay.
        
           | treetalker wrote:
           | Unless something has changed, the iOS version is a one-time
           | purchase. I bought the app many years ago (8?) and have been
           | a happy user since.
           | 
           | Like you, though, I had that reaction to the subscription
           | model for macOS and therefore decided not to "buy" it when it
           | came out.
        
             | huhtenberg wrote:
             | They got greedy and decided to milk it. That's what
             | changed.
             | 
             | It's $80/yr for the iOS app.
        
               | treetalker wrote:
               | Oof, I believe they changed ownership so I must have been
               | grandfathered in. That's steep.
        
         | danman2 wrote:
         | Do you know if it's possible to train it to use my own voice?
        
         | rhizoma wrote:
         | Yes, I've used Voice Dream for years with Pocket articles &
         | ebooks because the Pocket app took up too much space and was
         | limited to web articles. The voice quality is ok for short
         | pieces or stints. The choice of voices is a bit robotic, but I
         | find it useful while making written notes in Split View.
        
         | freefaler wrote:
         | Kybook is a one-time payment and can use iOS TTS voices.
        
         | jdlyga wrote:
         | ElevenLabs Reader is the same thing, but much higher quality
         | voices for free. I've lost my place a few times so it's not
         | quite as reliable as VoiceDream. But you aren't paying an
         | expensive subscription with mediocre voices.
        
       | qurashee wrote:
       | This looks incredible! I've had an idea simmering in the back of
       | my mind for a while now: creating an audiobook from an ebook for
       | my commute using the voice of a specific audiobook narrator I
       | really enjoy. The concept struck me after coming across the
       | Infinite Conversation project here on HN. Unfortunately, I just
       | haven't found the time to bring it to life yet. :(
        
         | vinni2 wrote:
         | What about the copyright issue? You can't mimic the voice of a
         | narrator without their consent. OpenAI landed in trouble after
         | using Scarlett Johansson's voice in a demo.
         | 
         | https://www.theverge.com/2024/5/20/24161253/scarlett-johanss...
        
           | notachatbot123 wrote:
           | There are no limitations on this kind of thing for private
           | use.
        
             | vinni2 wrote:
             | Forgive me for not knowing it was for personal use.
        
             | qurashee wrote:
             | Indeed I was thinking about private use only.
        
           | benatkin wrote:
           | She only won in that OpenAI decided it wasn't worth the
           | trouble.
        
             | K0balt wrote:
             | Yeah, by my ear it was pretty clearly not SJ's voice-
             | likeness, although there were some superficial
             | similarities.
             | 
             | But some people could have mistaken it due to some regional
             | accent similarities, though it would be akin to
             | interpretation of any light southern drawl with a similar
             | timbre as being SJ.
        
               | mmahemoff wrote:
               | They also asked her for permission in advance, which was
               | never going to help their case.
        
           | amrrs wrote:
           | Kokoro actually mentions that they used only permissively
           | licensed voices
        
       | gunalx wrote:
       | Kokoro seemed pretty nice for the size. I guess it is not much
       | better than a lot of the simpler TTS. But at least it sounds
       | less mechanical than a few bad ones.
        
         | outofpaper wrote:
         | It is essentially a set of voice models building on
         | https://huggingface.co/spaces/styletts2/styletts2
         | 
         | The odd thing is that while they are releasing these great
         | sounding models, they are not documenting the training process.
         | What we want to know is what magic if any allowed them to
         | create such wonderful voices...
        
       | cess11 wrote:
       | I would for sure not want this for fiction, it's too obvious that
       | the voice has no understanding whatsoever of the text, but it's
       | probably pretty nice for converting short news texts or
       | notifications to audio.
        
         | vanderZwan wrote:
         | Your point is a valid one, but I want to add to it that it is
         | also a matter of expectations and how one listens.
         | 
         | Years ago, when I was dating someone who spoke Russian as one
         | of her native languages, we had to do a funny compromise when
         | watching films together with her parents: they didn't speak a
         | word of English, so we'd use the Russian dub with English
         | subtitles.
         | 
         | I noticed that the Russian dub was just _one man_ reading a
         | translation in a flat voice over what was happening on the
         | screen, no attempts at voice acting or matching the emotions.
         | Usually the dub would have a split second delay to the actual
         | lines, so you'd still hear the original voices for a moment
         | (and also a little bit in the background).
         | 
         | At first I found it very jarring, but they explained that this
         | flatness was a feature. You'll quickly learn to "filter out"
         | the voice while still hearing the translation, and the faint
         | presence of the original voices was enough to bring the
         | emotional flavor back. The lack of voice acting helped with the
         | filtering.
         | 
         | This turned out to apply to me as well, even though I don't
         | speak Russian! My brain subconsciously would filter out the
         | dub, and extract most of the original performance through the
         | subtitles and faint presence of the original voices. Obviously
         | the original version would have been a better experience for
         | me, but it was still very enjoyable.
         | 
         | Of course a generated audiobook is not a dub, as there is no
         | "original voice" to extract an emotional performance from. But
         | some listeners might still be able to do something similar. The
         | lack of understanding in the generated voice and its
         | predictable monotony might allow them to filter out everything
         | but the literal text, and then fill it in with their own
         | emotional interpretations. Still not as great as having a proper
         | storyteller who _does_ understand the text and knows how to
         | deliver dramatic lines, but perhaps not as bad as expected
         | either.
        
           | cess11 wrote:
           | It's not a "point", I didn't make an argument.
           | 
           | I dislike german and russian style dubs as well, I'd rather
           | learn a bit of the original language.
        
           | em-bee wrote:
           | indeed, audio books come in many forms, some are rather flat,
           | and some include different voices, even by different
           | speakers, or include a few voiced sound effects, laughing,
           | crying, singing, etc. TTS is extra flat, but if the quality
           | is good otherwise then it is like reading with my ears, and i
           | add the emotions myself.
        
           | aleksiy123 wrote:
           | Watching these as russian/english bilingual is very painful,
           | tho I grew up in western world so maybe I'm just not used to
           | it.
           | 
           | To add on a slight tangent. Many books/audiobooks just don't
           | exist in other languages at all. So even getting some
           | monotone is a lot better than getting nothing.
           | 
           | I think this is where these models really shine. Cheaply
           | creating cross language media and unlocking the
           | knowledge/media to underprivileged parts of the world.
        
             | vanderZwan wrote:
             | > _Watching these as russian/english bilingual is very
             | painful, tho I grew up in western world so maybe I'm just
             | not used to it._
             | 
             | I figured that their opinion probably wasn't universal,
             | hahaha.
             | 
             | And yes, it's at the very least a win for accessibility
        
           | arafalov wrote:
           | Here is the rest of that story.
           | 
           | When the foreign movies started to filter into the Soviet
           | Union's illegal movie theatres, you would get 3 or 4 movies
           | playing at once in one room. There would be a TV in each
           | corner of the room and 4 or 5 rows of plastic chairs in front
           | of it in an arch.
           | 
           | ALL of the movies were being revoiced by the same person. So,
           | if you were sitting in the back of the 5th row, you were
           | potentially getting the sound from an action movie, a comedy,
           | a horror movie and a romance at the same time. In the same
           | voice.
           | 
           | You learned to filter really well. So, if that's what they
           | were trained on, watching a single movie must have been very
           | relaxing.
        
             | vanderZwan wrote:
             | Looking at the modern internet experience it sounds like
             | the Soviet Union's illegal movie theatres were ahead of
             | their time!
        
         | calgoo wrote:
         | Audible has thousands of books available "for free" with their
         | membership that are all AI generated. I was the same in the
         | start, but after listening to a few, it really comes down to
         | the voice used. I spent 8h on a plane listening to 1 book, and
         | there were maybe 5 occasions where i had an issue with the
         | voice; and i think all were just "AI weirdness", similar to
         | chat LLMs messing up simple sentence structure or image
         | generating LLMs adding an extra finger.
        
           | cess11 wrote:
           | I don't think dominant suppliers like Audible should exist so
           | that matters little to me.
        
           | arafalov wrote:
           | The one I tried had a lot of issues. It was a music theory
           | book and it did not know how to pronounce C# (it kept saying
           | C 'hash'). It also referred to, but did not read out the
           | diagrams, or tables.
           | 
           | So, it was not just the voice, but the quality control
           | pipeline that was missing as well.
           | 
           | Maybe it mostly works for old plain text books, but if nobody
           | is checking.....
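The "C hash" problem described above is the kind of thing a text normalization pass before synthesis could catch. A minimal sketch, assuming music-theory text where a capital A-G followed by `#` is a note name; the function name `normalize_for_tts` is hypothetical and not part of any particular TTS tool:

```python
import re

# Assumption: in this book, "C#", "F#" etc. are note names, not the
# programming language. The rule would be wrong for a software text.
_SHARP_NOTE = re.compile(r"\b([A-G])#(?!\w)")


def normalize_for_tts(text: str) -> str:
    """Spell out sharp note names so the voice does not read '#' as 'hash'."""
    return _SHARP_NOTE.sub(r"\1 sharp", text)


print(normalize_for_tts("Play C# major, then resolve to F#."))
```

A rule like this only covers one symbol; diagrams and tables would still need a human or a separate step to describe them.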
        
       | mg wrote:
       | Would this also be the best option if you just want to convert
       | plain text files to audio?
        
         | bArray wrote:
         | Markdown and PDF would also be cool. I think it's just a case
         | of feeding the TTS model the right data at the right time. The
         | special sauce is in the model, there's really not much to the
         | code:
         | https://github.com/santinic/audiblez/blob/main/audiblez.py
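As a rough illustration of "feeding the TTS model the right data at the right time" for a plain text file, the sketch below groups sentences into bounded chunks and hands each one to a synthesis function. It is not taken from audiblez; `synthesize` is a hypothetical callable standing in for whatever TTS model is used, and the 400-character limit is an arbitrary assumption.

```python
import re
from typing import Callable, Iterable, List


def split_into_chunks(text: str, max_chars: int = 400) -> List[str]:
    """Group sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than
    being cut mid-sentence.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def text_to_audio(text: str, synthesize: Callable[[str], bytes]) -> Iterable[bytes]:
    """Yield one audio buffer per chunk, using the supplied TTS callable."""
    for chunk in split_into_chunks(text):
        yield synthesize(chunk)
```

Markdown or PDF input would only change the step that produces `text`; the chunking and synthesis loop would stay the same.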
        
       | katspaugh wrote:
       | Sounds better than many books on Audible.
        
       | ekianjo wrote:
       | japanese is not supported yet despite the claims. you can easily
       | realize that by running the examples provided.
        
       | laserbeam wrote:
       | On the one hand, this is very convenient. Probably cool for some
       | non-fiction.
       | 
       | On the other, some of my favorite audio books all stood out
       | because the narrator was interpreting the text really well, for
       | example by changing the pacing during chaotic moments. Or those
       | audiobooks with multiple narrators and different voices for each
       | character. Not to mention that sometimes the only cue you get for
       | who's speaking during dialogue is how the voice actor changes
       | their tone. I have mixed feelings about using this and losing
       | some of that quality.
       | 
       | I would totally use this over amateur ebooks or public domain
       | audiobooks like the ones on Project Gutenberg. As cool as it
       | is/was for someone to contribute to free books... as a listener
       | it was always jarring to switch to a new chapter and hear a
       | completely different voice and microphone quality for no reason.
        
         | ahoka wrote:
         | I guess this is still very useful if you are blind.
        
           | loktarogar wrote:
           | Yeah, for accessibility purposes on things that aren't
           | already narrated, this kind of thing is huge.
        
             | em-bee wrote:
             | that's the thing. it's not just for accessibility. anything
             | not already narrated is a fair target for TTS. i don't have
             | time to sit down and read books. all reading is done on the
             | go, while getting around or doing daily routines at home. i
             | have a small book that i am reading now, which should take
             | a few hours to finish, but in the time i manage to get done
             | reading it i will probably have listened to two or three
             | audio books.
             | 
             | oh, and it's also a boon for those who can't afford to buy
             | audiobooks.
        
               | vasco wrote:
               | You don't choose to spend your time reading books. You
               | probably roll your eyes when someone tells you they don't
               | have time for some activity you deem valuable. This is
               | the 'no time to exercise' debate in a different shape.
               | 
               | They are also different activities, with audio it's
               | easier to listen to more but retention is usually lower.
               | Not casting any elitist "you need to read" bullshit by
               | the way, but find it odd to define it in terms of lack of
               | time, and I really like both mediums.
        
               | em-bee wrote:
               | there is not much of a choice here. sure, i could use the
               | time i spend reading and commenting on HN to read books
               | instead. so technically speaking it is a choice. but i
               | want to do both and many other things besides also having
               | to work and a family to take care of. so the result is, i
               | can't afford the time to read without giving up other
               | things that are also important to me. listening to books
               | allows me to access books i would otherwise not be able
               | to read because of these priorities.
               | 
               | there are other factors as well. i love reading so much
               | that i tend to forget time around me. as a result reading
               | would cause me to neglect other duties. i can't allow
               | that, and therefore i am forced to avoid reading. i also
               | don't like long form reading on electronic devices, and
               | as a frequent traveler, printed books are simply not
               | practical and often not even accessible.
               | 
               | i agree with the retention issue, but i found that a much
               | larger factor for retention is how well i can follow the
               | story. a good story that is easy to get into is also
               | easier to retain. and finally, reading fiction is for
               | entertainment. i don't have to retain it.
        
               | esrauch wrote:
               | > You probably roll your eyes when someone tells you they
               | don't have time for some activity you deem valuable.
               | 
               | There's a few categories where it makes sense to roll
               | your eyes, like if they say they have no time to shower
               | or have never been to one of their kid's baseball games.
               | 
               | But for things that aren't basic human expectations, I
               | think you'd have to be a real jerk to roll your eyes at
               | someone not having time. No time to cook multi-pot
               | dishes? No time to exercise? No time to read? No time to
               | go to museums? No time to meet at the bar for a drink?
               | Any of them sensible.
               | 
               | No one can do everything, we all make our priorities and
               | it's well within their choice not to have any one optional
               | life thing at the top of their personal stack.
        
               | vasco wrote:
               | Agree completely, my point was indeed they are choices,
               | not lack of time. I think I came across too judgy even
               | trying not to. You made a better job of it.
        
               | hombre_fatal wrote:
               | This is a weird comment. They are just saying why they
               | prefer audiobooks thus why general TTS is useful for
               | them.
               | 
               | Why are you trying to argue about their preference? They
               | didn't cast any judgement on others with different
               | preferences.
               | 
               | This is nothing like "no time for exercise".
               | 
               | It's more like "I have no time (preference) to fire up
               | the wood stove so I use microwave" and then you come in
               | with "wow so you roll your eyes at us fire stove users?"
        
               | vasco wrote:
               | Two hours before you posted this there was already an
               | admission from me in a sister comment that I came across
               | too judgy and someone else made the point I tried better
               | than myself - not sure how much penitence I need to do
               | but sorry again :)
        
             | flir wrote:
             | I was just thinking about automatically slapping an mp3 on
             | every blog post, just an accessibility nicety.
             | 
             | Can someone with low vision tell me if this would be useful
             | to them? It may be that specialist tools already do this
             | better.
        
               | laserbeam wrote:
               | People use screen readers for accessibility. I would not
               | expect anyone to be able to "look for and find" your
               | mp3... I would instead expect them to use the tool they
               | normally use for accessibility.
               | 
               | The real question is "what tools are they already using
               | and how can I make sure those tools are providing higher
               | quality output?". There are standards in browsers for
               | these kinds of things (ways to hint navigation via
               | accessibility tools for example).
        
               | flir wrote:
               | > I would instead expect them to use the tool they
               | normally use for accessibility.
               | 
               | Yes, that was my second thought. But I'd rather ask
               | someone than rely on my assumptions.
        
         | felixhummel wrote:
         | I wholeheartedly agree.
         | https://en.m.wikipedia.org/wiki/Stephen_Briggs got me hooked on
         | Terry Pratchett's Discworld series. I loved "Going Postal".
        
           | IndrekR wrote:
           | I know someone who listened to Terry Pratchett's "Wachen!
           | Wachen!" audiobook on Spotify while living in Germany for a few
           | years. It was so well narrated that he also acquired some
           | peculiarities of local dialects used by specific characters
           | in the book. Locals in Bavaria were quite surprised by a
           | foreigner speaking that way.
        
         | micw wrote:
         | With this technology, one could produce high quality audio
         | books without having access to high quality narrators by
         | annotating the books with the voice, speed and such things.
         | 
         | I wonder if a standardized markup exists to do so.
        
           | KeplerBoy wrote:
           | Don't end-to-end trained models already do this to some
           | extent? Like raising the pitch towards a question mark, like
           | a human would.
           | 
           | TortoiseTTS has a few examples under prompt engineering on
           | their demo site:
           | https://nonint.com/static/tortoise_v2_examples.html
        
             | micw wrote:
             | That's a bit basic and random. Some models have the
             | features you describe. From the better models you get a
             | slightly different voice for text in quotes.
             | 
             | But the difference to good audio books is that you have
             | 
             | * different voices for the narrator and each character
             | * different emotions and/or speed in certain situations.
             | 
             | I guess you could use a LLM to "understand" and annotate an
             | existing book if there's a markup and then use TTS to
             | create an audio book from it and so automate most of the
             | process.
        
               | micw wrote:
               | Edit: I actually tried this. I prompted in ChatGPT:
               | 
               | "Annotate the following text with speakers and emotions
               | so that it can be turned into an audiobook via TTS",
               | followed by a short text from "The Hobbit" (The "Good
               | morning scene"). The result is very good.
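The comment above does not say what the annotated output looked like. Purely as an assumption, suppose the LLM is asked to emit one line per utterance in the form `[Speaker, emotion] text`. A downstream step could then parse that into segments and pick a voice per speaker. The format, the `Segment` type, and the neutral-narrator fallback are all hypothetical:

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed annotation format: "[Speaker, emotion] spoken text"
_ANNOTATED = re.compile(
    r"^\[(?P<speaker>[^,\]]+),\s*(?P<emotion>[^\]]+)\]\s*(?P<text>.+)$"
)


@dataclass
class Segment:
    speaker: str
    emotion: str
    text: str


def parse_annotated(script: str) -> List[Segment]:
    """Turn an annotated script into (speaker, emotion, text) segments."""
    segments: List[Segment] = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = _ANNOTATED.match(line)
        if match:
            segments.append(
                Segment(
                    match["speaker"].strip(),
                    match["emotion"].strip(),
                    match["text"].strip(),
                )
            )
        else:
            # Lines without an annotation fall back to a neutral narrator.
            segments.append(Segment("Narrator", "neutral", line))
    return segments
```

Each segment could then be routed to a different voice, which is the part a single-voice TTS run cannot do on its own.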
        
           | pegasus wrote:
           | They still wouldn't be high quality. It's just not possible
           | to capture the precise tone of voice in an annotation, and
           | that precision I believe really makes a difference. My
           | experience is that the deeper the narrator understands the
           | text and conveys that understanding, the easier it becomes
           | for me to absorb that information.
        
             | vasco wrote:
             | Have you tried those "podcast from a paper" models? They do
             | some of the things you are saying they don't, although it's
             | not 100% it's also miles ahead of for example human Polish
             | TV lectors, or other monotone style narrations.
        
           | albert_e wrote:
           | There is SSML for speech markup to indicate various
           | characteristics of speech like whispers, pronunciation, pace,
           | emphasis, etc.
           | 
           | With LLMs proving to be very good at generating code, it may
           | be reasonable to assume they can get good at generating SSML
           | as well.
           | 
           | Not sure if there is a more direct way to channel the
           | interpretation of the tone/context/emotion etc from prose
           | into generated voice qualities.
           | 
           | If we train some models on ebooks along with their
           | professionally produced human-narrated audiobooks, with
           | enough variety and volume of training data, the models might
           | capture the essence of that human-interpretation of written
           | text? Just maybe?
           | 
           | Amazon with its huge collection of Audible + Kindle library
           | -- if it can do this without violating any rights -- has a
           | huge corpus for this. They already have "whispersync" which
           | is a feature that syncs text in a kindle ebook with words in
           | corresponding audible audiobook.
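For readers unfamiliar with SSML, here is a minimal sketch of what generated markup might look like. It uses only standard SSML 1.1 elements (`speak`, `voice`, `prosody`, `break`). The voice names are placeholders, since real names are engine-specific, and many TTS engines support only a subset of SSML, so whether a given model accepts this is an open assumption.

```python
from typing import Dict, List
from xml.sax.saxutils import escape, quoteattr


def ssml_segment(text: str, voice: str, rate: str = "medium", pause_ms: int = 0) -> str:
    """Wrap one utterance in <voice> and <prosody>, with an optional pause after it."""
    parts = [
        f"<voice name={quoteattr(voice)}>",
        f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>",
        "</voice>",
    ]
    if pause_ms > 0:
        parts.append(f'<break time="{pause_ms}ms"/>')
    return "".join(parts)


def build_ssml(segments: List[Dict]) -> str:
    """Join segments into a single SSML 1.1 document."""
    body = "".join(ssml_segment(**segment) for segment in segments)
    return (
        '<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="en-US">{body}</speak>'
    )


print(build_ssml([
    {"text": "It was a quiet morning.", "voice": "narrator", "pause_ms": 300},
    {"text": "Who goes there?", "voice": "guard", "rate": "fast"},
]))
```

Escaping the text matters because book prose routinely contains `&` and `<`, which would otherwise produce invalid XML.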
        
             | micw wrote:
             | Good points, thank you! I just tested it. While ChatGPT was
             | very good in adding generic (textual) annotations, the
             | results for generating SSML were very poor (lack of voice
             | names, lack of distinction between narrator and character
             | etc).
             | 
             | Probably the results with a model trained for this plus
             | human audit could lead to very good results.
        
         | stavros wrote:
         | > On the other, some of my favorite audio books all stood out
         | because the narrator was interpreting the text really well
         | 
         | This (and everything else with AI) isn't saying "you don't need
         | good actors any more". It's saying "if you don't have an
         | audiobook, you can make a mediocre one automatically".
         | 
         | AI (text, images, videos, whatever) doesn't replace the top
         | end, it replaces the entire bottom-to-middle end.
        
           | j4coh wrote:
           | RIP to future top-enders that would normally have started out
           | on the bottom to middle end.
        
             | aredox wrote:
             | Bingo. AI is going to destroy any pathway for training and
             | accruing experience.
             | 
             | An embalming tech for our dying civilization.
        
               | lupusreal wrote:
               | Just like printing presses killed the profession of
               | copying books by hand, eliminating the training pathway
               | for illuminated manuscripts. Death of civilization itself
               | I say, damn those printing presses.
        
               | j4coh wrote:
               | If you see podcasts as useless in modern society as
               | illuminated manuscripts, no big loss I suppose, but I do
               | enjoy the human made ones and would be sad to see them go
               | extinct as the manuscripts did. And the same thing is
               | happening to other entry-level creative roles, some of
               | which you may personally regret the loss of too.
        
               | akho wrote:
               | I enjoy looking at illuminated manuscripts. Podcasts are
               | bullshit and can die in a ditch.
        
               | teekert wrote:
               | I enjoy podcasts but I still hope illuminated manuscripts
               | won't die in a ditch so other people can enjoy content
               | the way they prefer ;)
        
               | lupusreal wrote:
               | Actually I think illuminated manuscripts had more value,
               | insofar as they were art, than podcasts (99% of which are
               | vapid timewasters and/or friend simulators.) The good
               | podcasts are those few which involve interviewing
               | interesting people, and AI isn't replacing those.
               | 
               | There's a lot more to be said for the value of audio
               | books, but the accessibility gains of proliferated auto-
               | generated audiobooks outweigh the downside of losing a
               | small number of expertly produced audio books.
               | 
               | For context, I listen to audio books a lot, and for years
               | I have listened to traditional TTS readings of books too.
               | Better voice generation for books without audiobooks is a
               | great win for society.
        
               | littlestymaar wrote:
               | Given that the printing press was the root cause for the
               | century of religious wars that soaked Europe with blood,
               | and was key in the revolutions that overthrew absolute
               | monarchies all over Europe, I don't think it's as good an
               | example as you think it is.
               | 
               | Death of a civilization doesn't mean disappearance of
               | mankind or even overall regression on the long term.
        
               | megaloblasto wrote:
               | Do you have a source for that? I don't think the printing
               | press was the cause of religious wars any more than
               | bullets were the cause of WWII
        
               | baq wrote:
               | Easy access to the Bible text instead of being only read
               | to, hence high literacy of the faithful, was one of the
               | core tenets of some branches of Protestantism.
        
               | llamaimperative wrote:
               | Have you heard of the Protestant Reformation and the
               | following 120 years of war? The entire Protestant <>
               | Catholic blow up that consumed Europe was pretty directly
               | attributable to the printing press.
               | 
               | (To be clear, nothing is solely and exclusively caused by
               | any one thing. Causality is a very fuzzy concept. But
               | sans printing press, those wars certainly wouldn't have
               | happened when/where/how they did, if they ever happened
               | at all).
        
               | thoroughburro wrote:
               | This is common enough knowledge that "read, like, any
               | history" is an appropriate response. However, if you're
               | genuinely curious, here's a random link:
               | 
               | https://ehne.fr/en/encyclopedia/themes/european-
               | humanism/eur...
        
               | lupusreal wrote:
               | I blame canned food and trains for solving the logistics
               | problems that previously prevented massive wars.
        
               | littlestymaar wrote:
               | Napoleonic wars beg to differ.
        
               | sigilis wrote:
               | While they didn't have trains, the Napoleonic wars did
               | feature the first use of canned food to aid in logistical
               | supply of armies. You could argue that the lack of trains
               | (and can openers) probably meant that they jumped the gun
               | on starting giant wars. We Americans fixed that in the
               | Civil War, to great and deadly effect.
        
               | _DeadFred_ wrote:
               | An interesting one I read was public schools and their
               | creation of a national identity. Before public schools
               | there weren't really standardized languages forced upon
               | an entire nation, etc. The countryside was more one
               | country/people/language morphing into the next, not clean
               | delineated lines where country/language switched
               | instantly. It was also said borders were much more
               | open/abstract before the resultant shift as well.
        
               | turnsout wrote:
               | Those revolutions were ultimately positive. The
               | alternative would be the continued rule by monarchs and a
               | single powerful religion
        
               | littlestymaar wrote:
               | See my second paragraph. It can be ultimately positive
               | while still being civilization-ending.
        
               | chairmansteve wrote:
               | No comfort to the millions who died though.
        
               | oldgradstudent wrote:
               | There's a big difference.
               | 
               | Printing presses produce superior products.
               | 
               | A mediocre audiobook is certainly better than no
               | audiobook at all, but it is an inferior product to a well
               | produced audiobook.
        
               | gampleman wrote:
               | > Printing presses produce superior products.
               | 
               | That seems like a highly dubitable statement. Many hand
               | illuminated manuscripts are masterpieces of art. The
               | advantage of the printing press was chiefly economic,
               | making the cost of a copy dramatically lower, not an
               | increase in quality (especially so by the aesthetic
               | standards of the time).
        
               | karamanolev wrote:
               | Many (most, if not all) hand-made copies contained
               | errors, which printed books did not. They were much
               | closer to 1:1 copies.
        
               | jhbadger wrote:
               | If the mistake happened in the typesetting stage, printed
               | books could spread errors much more efficiently, as in
               | the infamous "wicked bible" of 1631, where a typesetting
               | error made the ten commandments contain the amusing
               | phrase "Thou shalt commit adultery". Surviving copies are
               | quite the collectors' item as most were destroyed.
               | 
               | https://en.wikipedia.org/wiki/Wicked_Bible
        
               | oldgradstudent wrote:
               | Usually, though, errors are corrected and every
               | printing has fewer errors than the previous one.
        
               | kamarg wrote:
               | What percentage of books get a second print run on a
               | printing press? And what's the process for that? Do they
               | have to reset each word for the second run? I genuinely
               | don't know how a physical process like typesetting can
               | result in increased accuracy on each print.
        
               | jhbadger wrote:
               | Indeed. Even Gutenberg had his Bibles touched up by
               | artists after they were printed (illuminated capital
               | letters and so on) because even he believed his printed
               | copies were inferior to the hand-made ones.
        
               | oldgradstudent wrote:
               | As a work of art, sure. But as books containing
               | information, printing presses produced superior products.
        
               | Workaccount2 wrote:
               | What we have today is early gen "practical" AI.
               | 
               | Even current SOTA models would almost certainly be able
               | to handle multiple speakers and pick-up on the intended
               | tone and intonation.
               | 
               | Don't make the mistake of thinking what we have today is
               | what we will still be working with in 5 or 10 years.
        
               | fidelramos wrote:
               | Some people will learn to use these AIs to make top-
               | quality audiobooks (and books, movies, TV shows,
               | comics...). It will be a more manual process than
               | pressing a button, but still orders of magnitude less
               | than what it took before. As a result there will be a
               | tsunami of high-quality content.
               | 
               | There will be curation and specialization. Previously
               | ignored niches now will be economically profitable. It
               | will be a Renaissance of creativity, and millions of jobs
               | will be created.
        
               | _DeadFred_ wrote:
               | It's kind of wild to me that the future will look like
               | the 80s imagined it all because AI killed the creative
               | seed corn when retro-future 80s was the aesthetic.
        
               | azeirah wrote:
               | We'll be ok lol, while it is a significant transition, it
               | IS just a transition in the media landscape.
               | 
               | AI is big and significant, but we'll be ok. There is also
               | no such "one" thing as "our civilisation". We're extremely
               | vast, complex and deeply interconnected networks of
               | ever-changing relationships.
               | 
               | AI does indeed represent the commoditisation of things we
               | used to really value like "craftsmanship in book
               | narration" and "intelligence". But we've had
               | commoditisations of similar media in the past.
               | 
               | Paper used to be extremely expensive, but as time went
               | on, it became more and more commoditised.
               | 
               | Memory used to be extremely expensive (2000-3000 years
               | ago, we needed to encode memory in _dance_, _stories_ and
               | _plays_. Holy shit). Now you can purchase enough memory
               | to store a billion books for maybe two hours of labor.
               | 
               | Most of these things don't really matter. What is
               | happening is that the media landscape is significantly
               | shifting, and that is a tale as old as history.
               | 
               | I do think the intellectual class will be affected the
               | most. People who understand this shift stand to benefit
               | enormously, while those who don't _might_ end up in a
               | super awful super low class.
               | 
               | And yet, all of that doesn't really matter if you just
               | move to, I dunno, Paramaribo or whatever. The people
               | there are pragmatic and friendly. They don't care about
               | AI too much. Or maybe New Zealand, or Iceland, or Peru,
               | or Nepal or I don't know.
               | 
               | The world isn't ending. Civilisation isn't being
               | destroyed at our core.
               | 
               | The media landscape is changing, classes are shifting,
               | power-relationships are changing. I suggest you think
               | deeply about where you want to live, what you stand for
               | and what is most important to you in life.
               | 
               | I don't need money or tech to be happy. I am fine with
               | just my cats, my closest friends and family and healthy
               | food.
               | 
               | If it happens to be the case that I need to leave tech or
               | that extremely high-end narrated audiobooks cease to
               | exist? Then all I have to say is "oh no, anyway".
               | 
               | We'll be fine. One way or another.
               | 
               | Just different.
        
             | credit_guy wrote:
             | By that time, AI will beat the toppest of the top enders.
             | Remember the time Deep Blue barely beat Kasparov? Now no
             | human, or group of humans can beat a chess engine, even one
             | that runs on an iPhone.
        
               | plastic3169 wrote:
               | I don't think chess is a good example of AI destroying
               | the path to the top. Chess is more popular now and humans
               | keep advancing even though it is futile effort against
               | computers.
        
               | rcxdude wrote:
               | And people are better at chess now in part because of
               | practicing with/against machines. But chess has never
               | been something you can make a living off of unless you
               | were at the very top.
        
             | sam_lowry_ wrote:
             | > RIP to future top-enders that would normally have started
             | out on the bottom to middle end.
             | 
             | This stance always reminds me of the Profession, a 1957
             | novella by Isaac Asimov that depicts pretty much the future
             | where there are only top performers and the ignorant crowd.
        
               | xyproto wrote:
               | He was a clear thinker.
        
             | Der_Einzige wrote:
             | Not RIP at all. "Meritocracy" was coined in a book
             | literally warning us about how terrible such a society
             | would be:
             | https://en.wikipedia.org/wiki/The_Rise_of_the_Meritocracy
             | 
             | The "top-enders" are the privileged who need to have some
             | of their gains for their intelligence redistributed to
             | others. The alternative is "survival of the smartest",
             | which is de-facto what we have today and what Young was
             | trying to warn us about.
        
             | gosub100 wrote:
             | I'm super opposed to AI, but I see this as a rare positive.
             | As someone already said, the win here is to have a
             | audiobook where one doesn't yet exist. hell, maybe the
             | tables will turn and the scrubs will do the hard work of
             | discovering which titles are popular with an audience, then
             | the ebook industry can capitalize on AI by hiring voice
             | actors to produce proper titles?
        
               | DidYaWipe wrote:
               | Not gonna happen. Once the AI shit is out there, people
               | will have consumed it by the time a real actor can create
               | (and edit) the audiobook.
        
             | anothermathbozo wrote:
             | Virtually every book I want this for has been around for
             | 70+ years and still no high or low quality audiobook has
             | been produced. How long do I have to wait for those
             | aspiring top-enders before an audiobook can be made
             | available?
        
               | Arainach wrote:
               | That has nothing to do with audiobook voice actors and
               | everything to do with copyright and who owns the rights
               | to the book (and whether they believe there's any money
               | to be made selling an audiobook version).
        
             | cmdtab wrote:
             | The value of distribution is increasing while the value of
             | content and product is decreasing for all but the top end.
        
             | CuriouslyC wrote:
             | It's common for shows to use big name actors as voices
             | because they draw an audience, nothing will change. Just
             | means a smaller pool of voice actors and they'll mostly be
             | good looking.
        
           | numpad0 wrote:
           | AI TTS has been available for quite some time. Tacotron V1 is
           | about 8 years old. I don't think we saw much bottom end
           | replacement.
           | 
           | IMGO(gut opinion), generative AI is a consumption aid, like a
           | strong antacid. It lets us be done with $content quicker, for
           | content = {book, art, noisy_email, coding_task}. There's
           | obvious preconceptions forming among us all from "generative"
           | nomenclature, but lots of surviving usages are rather
           | reductive in relevant useful manners.
        
             | sam_lowry_ wrote:
             | Yeah, let us not blame AI. Audible damaged the quality of
             | audiobooks more than AI.
        
           | no_wizard wrote:
           | Bottom end, really. Middle end is still superior to this AI
           | drivel.
        
         | dmazin wrote:
         | Absolutely.
         | 
         | Even on the non-fiction side, the narration for Gleick's The
         | Information adds something.
         | 
         | While I want this tool for all the stuff with no narration,
         | NYT/New Yorker/etc replacing human narrators with AI ones has
         | been so shitty. The human narrators sound _good_, not just
         | average. They add something. The AI narrators are simply _bad_.
        
         | WillAdams wrote:
         | Yes, but if the alternative is not having a book, or having to
         | listen to one poorly read (I love Librivox, but there are some
         | books which I just haven't been able to finish because of
         | readers, and many more which were nixed for family vacation
         | travel listening on that account), this may be workable.
        
         | rd11235 wrote:
         | I agree but the opposite can be true too. Sometimes the
         | narrator seems to target some general audience that doesn't fit
         | me at all, in a way that makes me cringe when I listen, until I
         | stop listening altogether. In these cases I'd rather listen to
         | a relatively flat narration from a tool like this.
        
         | whazor wrote:
         | A GenAI model that reads audiobooks with such dramatisation is
         | really my dream. There are so many books that I would want to
         | listen to, but still lack such an adaptation. Also it takes
         | months after the book release before the audiobook gets
         | released.
         | 
         | Just imagine what this would do for writers. They can get
         | instant feedback and adjust their book for the audiobook.
        
         | ldoughty wrote:
         | I agree with you, but also want to point out:
         | 
         | New authors, self-publishers, can't afford tens of thousands of
         | dollars to get an audiobook recorded professionally... This can
         | limit their distribution.
         | 
         | Authors might even choose not to make such a version (or lack
         | confidence to record themselves), so AI capable of making a
         | decently passable version would be nice -- something more than
         | reading text blandly. AI in theory could attempt to track the
         | scene and adjust.
        
           | DidYaWipe wrote:
           | You can get narrators to work on a royalty basis.
        
           | plorg wrote:
           | By observation the current approach is for authors to narrate
           | the book themselves if they think their readers will want it
           | and if they feel reasonably confident in their own narration.
        
         | gmuslera wrote:
         | Would a "better" AI do a "better" narration with a better
         | understanding of the text? Of course, that would imply a
         | different (and far bigger?) model.
         | 
         | Anyway, even if in theory it might, in practice things may end
         | even worse than doing it with a monotone voice.
        
         | taude wrote:
         | Agree with you on this.
         | 
         | My example, I was never a Wheel of Time fan, but the new audio
         | editions done by Rosamund Pike are quite the performance, and
         | make me like the story. She brings all the characters to life
         | in a way that's different than just reading. It's a true
         | performance.
        
       | Havoc wrote:
       | Wow that sample sounds really good
        
       | pprotas wrote:
       | I would love to have an e-reader that allows me to switch between
       | text and audio at the press of a button. Imagine reading your
       | book on the couch and then switching into audio mode while doing
       | the dishes seamlessly, by connecting bluetooth headphones.
        
         | InsideOutSanta wrote:
         | Kindles used to provide this feature, but publishers and/or the
         | Authors Guild stopped it, because audio rights and text rights
         | are handled differently. In other words, when Amazon sells you
         | a text book, it does not have the right to then also do TTS on
         | that text and let you listen to it.
         | 
         | There's some contemporary discussion of what happened here:
         | https://tidbits.com/2009/03/02/why-the-kindle-2-should-speak...
         | 
         | I think there is still integration with Audible, though. If you
         | buy a book on the Kindle and on Audible, the position will
         | sync, and you can switch between listening and reading without
         | losing your place in the book.
        
           | albert_e wrote:
           | Yes the feature is called WhisperSync -- I used it many years
           | ago and it was pretty good.
           | 
           | I tried it while on a treadmill so it allowed me to follow
           | the book with more focus without sacrificing much else.
        
             | thfuran wrote:
             | Isn't whisper sync the current version that relies on
             | owning both the ebook and audiobook?
        
           | Brybry wrote:
           | I used that TTS feature semi-regularly on a Kindle 2.
           | 
           | It wasn't a good experience but it was nice to be able to
           | keep 'reading' a book while I was exercising.
           | 
           | It worked for me for over a decade, until I broke the device.
           | I don't know if I never updated the firmware or if the fact I
           | used Calibre to convert books bypassed the feature gate.
        
           | hamzakc wrote:
           | I am not sure if this still works, but 2-3 years ago I
           | listened to a kindle book that I bought through my Echo show
           | device. It was pretty good. I listened to it while I was
           | cooking. It even allowed you to carry on where you left off.
           | But I did notice that a few pages were skipped as I had read
           | the book before. I have since packed away my echo show so I
           | can't verify if they have removed this feature or not.
        
         | freefaler wrote:
         | You can do it easily with non-DRM books (or DRM stripped
         | books):
         | 
         | For Android:
         | 
         | - Moon+ reader pro - some paid high-quality TTS voices (like
         | Acapella)
         | 
         | For iOS:
         | 
         | - Kybook reader and internal iOS voices (no external TTS voices
         | for the walled garden)
         | 
         | This works well enough to listen to a book while you walk and
         | when you get back home read on the WC from the place you
         | stopped.
         | 
         | Additionally if you buy a tablet or an android ebook reader,
         | you install the app there and you can continue on your
         | bigger/better device seamlessly.
         | 
         | Whisper-sync for the masses! Ahoy...
        
           | basedrum wrote:
           | But you need an android phone, and can't use a kobo or
           | similar e-ink reader?
        
             | freefaler wrote:
             | for ios you use Kybook on your iphone and your ipad. It
             | syncs positions between the devices. When you go for a
             | walk, open Kybook, start TTS. When back home, open your
             | tablet, you'll see the page TTS has stopped reading to.
        
               | figers wrote:
               | How does this compare to using Apple's iBook or Kindle
               | reader app and then the iPhone's built in text to voice
               | (the female British voice is pretty good).
        
               | freefaler wrote:
               | On iOS it is the same voice.
        
         | dsign wrote:
         | It is a supported feature in the epub 3.0 standard. It's
         | possible to distribute an epub with audio, and have the audio
         | sync to the HTML elements that form the ebook's text. And there
         | is an e-reader that actually supports this feature, I can't
         | remember which one now but it should be possible to find it
         | with Google.
         | 
         | It's more of an open problem how to create those epubs. I have
         | some code that can do it using Elevenlabs audio, but I imagine
         | it's way harder to have something similar for a human
         | narrator.... who's going to do the sync? Maybe we need a sync
         | AI.
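
A minimal sketch of the sync format dsign describes, assuming the EPUB 3 Media Overlays approach: a SMIL file pairs each text fragment (by element id) with a clip of an audio file. The function name `build_media_overlay`, the file names, and the timings are all hypothetical; real timings would have to come from the TTS engine or from an alignment step, and the resulting file would still need to be referenced from the package manifest.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "http://www.w3.org/ns/SMIL"


def build_media_overlay(text_doc, audio_file, clips):
    """Return a SMIL media-overlay document as a string.

    clips is a list of (element_id, begin_seconds, end_seconds) tuples,
    one per text fragment in text_doc that has matching audio.
    """
    # Serialize SMIL elements with a default namespace (no prefix).
    ET.register_namespace("", SMIL_NS)
    smil = ET.Element(f"{{{SMIL_NS}}}smil", {"version": "3.0"})
    body = ET.SubElement(smil, f"{{{SMIL_NS}}}body")
    for i, (element_id, begin, end) in enumerate(clips, start=1):
        # Each <par> plays one text fragment and one audio clip together.
        par = ET.SubElement(body, f"{{{SMIL_NS}}}par", {"id": f"par{i}"})
        ET.SubElement(par, f"{{{SMIL_NS}}}text",
                      {"src": f"{text_doc}#{element_id}"})
        ET.SubElement(par, f"{{{SMIL_NS}}}audio", {
            "src": audio_file,
            "clipBegin": f"{begin:.3f}s",
            "clipEnd": f"{end:.3f}s",
        })
    return ET.tostring(smil, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical chapter with two sentences, ids "s1" and "s2".
    print(build_media_overlay(
        "chapter1.xhtml", "audio/ch1.mp3",
        [("s1", 0.0, 2.5), ("s2", 2.5, 6.0)],
    ))
```

The hard part the comment points at is not this file but obtaining the `(id, begin, end)` triples, which is trivial when the audio is synthesized sentence by sentence and much harder for a human recording.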
        
         | llamaimperative wrote:
         | Boox Ultra Tab whatever the fuck (their product naming sucks) +
         | Readwise Reader = amazing for this
         | 
         | Not quite seamless but it works. It has a cursor that follows
         | the words as they're spoken, which allows you to read and
         | hear ("immersive reading") which I find to be extremely helpful
         | for maintaining focus.
        
         | monkeydust wrote:
         | Literally started doing that this week with Amazon Audible. I
         | gave in and started the three month 99c trial and downloaded the
         | app.
         | 
         | What surprised me in a good way was my Kindle app was aware of
         | this and asked if I wanted to download the audible version of
         | the current book I am reading.
         | 
         | Been listening on the way to work and then reading on the way
         | back. Enjoying it so far.
        
           | mmahemoff wrote:
           | Some Kindle books also have a checkbox to add the audio (for
           | a fee) when you buy it. Sometimes I've seen books discounted
           | to e.g. £0.99, but adding the audio might be £5.99. The
           | upsell seems to be a good hack for adding some revenue when
           | there's a deep discount being used to drive interest.
        
       | mrklol wrote:
       | How can this support more languages than the model itself?
        
         | Kye wrote:
         | The model might have stumbled on the generative AI equivalent
         | of IPA.
        
       | msoad wrote:
       | To people who are experts in AI TTS:
       | 
       | Why does ElevenLabs have such a lead in this space? It sounds better
       | than OpenAI and Google models
        
         | dbspin wrote:
         | Does it? The podcasts created by NotebookLM are completely
         | convincing, at least in terms of voice generation.
        
       | swores wrote:
       | Can anyone recommend an open source option that would allow
       | training on a custom voice (my own, so I'd be able to record as
       | many snippets as it needed to train on) to allow me to use it for
       | TTS generation without sharing it off my machine?
       | 
       | Edit: I'll wait to see if any recommendations get made here, if
       | not I might give this one a go: https://github.com/coqui-ai/TTS
        
         | numpad0 wrote:
         | I think you can probably generate TTS audio by classical means,
         | and voice2voice that audio through RVC or Beatrice V2. Haven't
         | looked into it in a while but Beatrice is apparently super fast
         | and CPU only.
        
         | phrotoma wrote:
         | https://github.com/DrewThomasson/ebook2audiobook
        
         | esskay wrote:
         | If I recall Coqui is very much a dead project, just one to be
         | aware of.
        
         | hm64 wrote:
         | Coqui is great, but in practice, I found Piper easier to set
         | up, train, and deploy as an ONNX file. Big thanks to the Sherpa
         | development team for their helpful resources:
         | https://k2-fsa.github.io/sherpa/onnx/tts/piper.html and to the
         | Rhasspy team for their training guide:
         | https://github.com/rhasspy/piper/blob/master/TRAINING.md.
         | 
         | I also found DEMUCS + Whisper + pydub to be a super helpful
         | combo for creating quality datasets.
        
         | drewbitt wrote:
         | There is a fork here https://github.com/idiap/coqui-ai-TTS
         | 'coqui-tts'
         | 
         | Though according to the TTS leaderboard, Fish Speech
         | https://github.com/fishaudio/fish-speech and Kokoro are higher.
         | 
         | https://huggingface.co/hexgrad/Kokoro-82M
         | 
         | https://huggingface.co/fishaudio/fish-speech-1.5
        
           | xnx wrote:
           | AFAIK Kokoro can't be fine tuned
        
         | jsemrau wrote:
         | I wrote this a while ago about xTTSv2 mixed with Nvidia's Nemo.
         | Maybe it kicks off your journey.
         | 
         | https://jdsemrau.substack.com/p/teaching-your-agent-to-speak...
        
       | lc64 wrote:
       | "was trained on <100 hours of audio"
       | 
       | How the hell was it trained on that little data ?
        
         | Havoc wrote:
         | Yeah that surprised me as well - seems low vs what is used on
         | text LLMs. To be fair 100 hours of speaking is a lot of
         | speaking though
        
           | edude03 wrote:
           | But it covers five(?) languages, so if all equal it's just 20
           | hours per language.
        
             | em-bee wrote:
             | in the linked audio sample it says the training data is
             | mostly english. also another comment claims that the
             | japanese quality is not good, so i'd be suspicious about
             | all the other languages.
        
         | bbminner wrote:
         | I suppose it means per speaker. And it is based on a simplified
         | StyleTTS 2, which from my small dive into the subject seems one
         | of the smaller models achieving great quality.
        
       | vinni2 wrote:
       | Can it also translate? I have family who would like audiobooks in
       | German but most are in English only.
        
         | em-bee wrote:
         | german is not listed as a supported language, so no. aside from
         | that, i would not want to use computer translation. unlike TTS,
         | which keeps getting better, translation quality still leaves a
         | lot to be desired.
        
           | vinni2 wrote:
           | Ah thanks just noticed that. But which voice to use for
           | French?
        
       | october8140 wrote:
       | All these AI text to voice models seem to ignore emotion. It
       | always sounds like a robot.
        
         | lyu07282 wrote:
         | Like with almost everything, it's an active area of research:
         | 
         | https://emosphere-tts.github.io/
         | 
         | We are getting there
        
           | boxed wrote:
           | Some of those samples sound like they are emoting in Korean
           | while speaking English.
        
             | lyu07282 wrote:
             | True, maybe an artifact of the training data, here is
             | another one:
             | 
             | https://www.microsoft.com/en-us/research/project/emoctrl-
             | tts...
        
         | croes wrote:
         | Emotion is the acting part of voice acting. Hard to copy with
         | AI
        
         | iagooar wrote:
         | I wonder if AI could create a "commentary" script that
         | instructs the TTS how to read certain words or chapters. The
         | commentary would be like an additional meta-track to help the
         | TTS make the best reading.
         | 
         | That should actually be possible to do already with existing
         | tech. I haven't seen if you can instruct Kokoro to read in a
         | certain way, does anyone know if this is possible?
        
         | arafalov wrote:
         | Try this one https://www.hume.ai/ - I found the demos (voice to
         | voice) interesting.
        
       | nottorp wrote:
       | Well there was some hope with ChatGPT that people will go back to
       | being able to process text communication.
       | 
       | Guess it was just a matter of time till someone figured out how
       | to use "AI" to resume encouraging illiteracy.
        
         | stavros wrote:
         | There was some hope with the rise of equestrianism that people
         | will go back to being able to shoe horses.
         | 
         | Guess it was just a matter of time till someone figured out how
         | to use "cars" to resume encouraging being unable to do a basic
         | farrier job.
        
           | nottorp wrote:
           | Except cars were faster than horses, while audio or video
           | content is much slower than reading.
        
             | stavros wrote:
             | Cars also have legs while audio doesn't, a point which is
             | equally irrelevant. If people don't need to read, they
             | don't need to read, and no matter how much a random
             | Internet commenter wants them to need it, it won't change
             | anything.
             | 
             | Skills atrophy for a reason. It's fine to let them. You may
             | as well be lamenting the lost art of long division.
        
               | nottorp wrote:
               | > Cars also have legs while audio doesn't, a point which
               | is equally irrelevant.
               | 
               | That's what a LLM would say :)
        
               | stavros wrote:
               | I'm sure an LLM wouldn't say anything as inane as that :P
        
             | hombre_fatal wrote:
             | You can multitask with audio content, so you can consume
             | content when you can't sit down to read. And you can even
             | potentially consume more volume like on a long daily
             | commute.
             | 
             | It's not the case that it's worse.
        
       | floppiplopp wrote:
       | It sounds okay, but it lacks emotion and is monotone for fiction,
       | it's the voice equivalent of the uncanny valley, which is
       | probably fine if you don't really care.
        
         | laserbeam wrote:
         | And when I don't care... to be honest I'm even OK with the dull
         | browser TTS implementation when reading your average substack
         | post. Shove the phone in my pocket, go shopping, get the gist
         | of the article.
        
       | yoavm wrote:
       | Was just looking for a TTS model to run locally for reading out
       | loud articles, and never heard about Kokoro before! This looks
       | great. I wonder if it can run in the browser somehow - could be a
       | nice WebExtension.
        
         | xkriva11 wrote:
         | What about the WASM running sherpa-onnx? No installation
         | required and can be served locally as well.
         | 
         | https://k2-fsa-web-assembly-tts-sherpa-onnx-en.static.hf.spa...
        
         | jiehong wrote:
         | I think most browsers support this already. Even maybe OS wide.
         | 
         | I know it should work for Firefox on an article in reader mode.
         | 
         | Or in MacOS you can select text and have it read out loud.
        
           | yoavm wrote:
           | I'm using Firefox and I do not see this option. Probably not
           | working on Linux?
        
             | sriacha wrote:
             | You might need to install/setup Speech Dispatcher. I was
             | just using this implementation with Piper:
             | https://github.com/Elleo/pied?tab=readme-ov-file.
             | 
             | However, an easier way to read articles aloud is with the Read
             | Aloud extension: https://github.com/ken107/read-aloud.
        
       | albert_e wrote:
       | I hope a plugin for Calibre ebook management software comes along
       | that makes it easier to convert select titles from your epub
       | library to decent audio versions -- and a decent open source app
       | for tablets and smartphones that can let us seamlessly consume
       | both the ebook and audiobook at will.
        
       | cwmoore wrote:
       | The word "kokoro" means "heart" in Japanese, which I learned
       | making the (heart shaped and paperback) puzzle books at
       | https://www.kakurokokoro.com/
        
         | terhechte wrote:
         | It's also the name of the AI in Terminator Zero
         | https://villains.fandom.com/wiki/Kokoro
         | 
         | I'm not sure if that is related here.
        
         | tkgally wrote:
         | Note that _kokoro_ (心) means "heart" in the sense of
         | "spirit," "soul," "mind," "emotions," etc. It _doesn't_ mean
         | "heart" in the sense of "internal organ that pumps blood." That
         | is _shinzō_ (心臓).
         | 
         | I once heard an American friend with so-so Japanese ability ask
         | a Japanese woman who had recently had a heart operation how her
         | _kokoro_ was doing, and she looked surprised and taken aback.
         | 
         | Side note: After I started reading HN in 2019, I was struck by
         | how many tech products mentioned here have Japanese names. I
         | compiled a list for a few years and eventually posted it:
         | 
         | https://news.ycombinator.com/item?id=31310370
        
       | TypoAtLineZero wrote:
       | I have a very similar setup locally, which uses Chrome with
       | the 'Read Aloud' plugin. I am capturing the audio stream via
       | QJackCtl/VLC. Voices, speed, pitch can be adjusted. Efficient and
       | quick to set up.
        
       | TheChaplain wrote:
       | For accessibility I think this is a great thing, but as
       | entertainment less so.
       | 
       | An example is The Hobbit and The Lord of the Rings: the narrator,
       | Rob Inglis, gives an amazing voice performance, lending depth to
       | environments and characters. And of course the songs!
        
       | basedrum wrote:
       | I want to be able to seamlessly read on my ebook reader and then
       | put in my headphones and go for a walk with the dog and resume on
       | audio where I left off. Then when I come back, my ereader is at
       | the right place where the audio finished and I can resume reading.
        
         | llamaimperative wrote:
         | Readwise Reader does this. A litttttle finicky at tracking read
         | location but it's workable
        
       | GaggiX wrote:
       | There is also this TTS: https://github.com/rhasspy/piper that is
       | pretty good (depending on the language) and extremely fast. It
       | would be cool to change the script to use Piper instead of Kokoro
       | in case you want to use a language that is not supported by
       | Kokoro or it's too slow; Piper supports a lot of them.
        
       | mikkom wrote:
       | What I really want and hope that someone does is to make an
       | audiobook service that converts books to audiobooks but so that
       | each character has its own voice.
       | 
       | Some audiobooks have this and I think it really makes the
       | experience much more engaging.
       | 
       | (Also maybe some background sound effects but not sure about
       | that, some books also have this and it's quite nice too)
        
       | ajsnigrutin wrote:
       | Just tried it, and "meh"...
       | 
       | It's one step above "normal" text-to-speech solutions, but not
       | much above it. The epub has "Chapter 1" as the title on the page,
       | and a lot of whitespace, and then "This was...." (actual text).
       | The software somehow managed to ignore all the whitespace and
       | read "chapter 1 this was.." as a single sentence, no pauses, no
       | nothing.
       | 
       | Blind? A great tool. Will it replace actual audiobooks? Well..
       | not yet at least.
        
       | carlosjobim wrote:
       | Why isn't the audiobook market strong enough that it would make
       | business sense to pay good narrators and actors for each book
       | published?
        
         | DidYaWipe wrote:
         | It is. But since when is "enough" enough for
         | monopolistic/oligopolistic corporations?
        
       | causi wrote:
       | I'm not able to try it until later, but regarding the sample
       | audio: The voice quality is quite good, but what's going on with
       | all the random pauses between words? It's very Captain Kirk.
        
       | cliftonpowell wrote:
       | There's another project called ebook2audiobook that produces
       | some decent results.
        
       | woolion wrote:
       | If you look for a lot of the great classics, audiobooks results
       | are inundated with basic TTS "audiobooks" that are impossible to
       | filter out. These are impossible to listen to because they lack
       | the proper intonation marking the end of sentences, making it
       | very tiring to parse. It might be better than tuna-can-sounding
       | recordings, especially if you want to hear them in traffic (a
       | common requirement), but that's about it. The alternative, if you
       | want real quality recordings, is to stop reading classics and
       | instead read the latest Japanime Isekai or murder mystery; these
       | have very good options on the market. Anyway, I don't think it needs
       | more justification that it covers a good niche usage.
       | 
       | I'm checking what the actual quality is (not a cherry-picked
       | example), but:
       | 
       | Started at: 13:20:04
       | Total characters: 264,081
       | Total words: 41548
       | Reading chapter 1 (197,687 characters)...
       | 
       | That was 1h30 ago, and there's no progress notification of any
       | kind, so I'm hoping it will finish sometime. It's using 100% of
       | all available CPUs so it's quite a bother. (This is "A Tale of a
       | Tub" by Swift; it's about half of a typical novel length.)
        
         | csantini wrote:
         | Yeah, that's a known issue, if the book is all in a single
         | chapter you don't get any sense of progress. I may fix that
         | next weekend
        
           | woolion wrote:
           | It's not in one Chapter, but Chapters are called "Section"
           | (and so ignored!). It should be simple to have a dictionary
           | of the different units that are used (I would assume "Part"
           | would fail too, as would the hilarious "Catpter" of some cat-
           | themed kid book, but that's more complicated I guess?).
           | 
           | It did finish and result is basically as good as the provided
           | example, so I'd say quite good! I'll plan to process some
           | book before going to bed next time!
           | 
           | Chapter 1 read in 6033.30 seconds (33 characters per second)
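
A minimal sketch of the "dictionary of units" idea above, assuming plain text where each heading sits on its own line. The names HEADING_UNITS, is_heading and split_into_units are hypothetical and are not taken from audiblez; this is not how the tool actually detects chapters, and headings with a trailing title (e.g. "Chapter 1: Title") are deliberately not handled.

```python
import re

# Hypothetical list of words that may introduce a chapter-like unit.
HEADING_UNITS = ("chapter", "section", "part", "book", "prologue", "epilogue")

# A heading line is a unit word, optionally followed by an Arabic or
# Roman numeral and an optional trailing "." or ":".
HEADING_RE = re.compile(
    r"^\s*(?:" + "|".join(HEADING_UNITS) + r")\b\s*(?:\d+|[ivxlcdm]+)?\s*[.:]?\s*$",
    re.IGNORECASE,
)


def is_heading(line):
    """Return True if the whole line looks like a unit heading."""
    return HEADING_RE.match(line) is not None


def split_into_units(text):
    """Split text into (title, body) pairs at heading lines.

    Text before the first heading is labelled "Front matter".
    Units whose body is empty are dropped.
    """
    units = []
    title, body = "Front matter", []
    for line in text.splitlines():
        if is_heading(line):
            content = "\n".join(body).strip()
            if content:
                units.append((title, content))
            title, body = line.strip(), []
        else:
            body.append(line)
    content = "\n".join(body).strip()
    if content:
        units.append((title, content))
    return units
```

With this approach "Section I." and "PART 2" would start new units, while a creative spelling such as "Catpter 3" would still be missed unless it is added to the list.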
        
       | callamdelaney wrote:
       | It's insufferable.
        
       | zoidb wrote:
       | Not directly related to the software, but interestingly on the
       | author's website there is a "Schedule a free call with me" link
       | (https://claudio.uk/templates/call.html). I wonder if randos on
       | the internet ever do that, and how it works out.
        
         | sam_lowry_ wrote:
         | His LLM will answer the call.
        
         | rpastuszak wrote:
         | I've been doing it for a few years (+200 calls) and have met a
         | ton of wonderful people this way.
         | 
         | https://untested.sonnet.io/notes/say-hi/
         | 
         | https://sonnet.io/posts/hi
        
       | herculity275 wrote:
       | Very nice! I fiddled with this idea a few months back but the
       | models available at the time were woefully slow on a macbook.
       | Will definitely give this a spin, there's a large category of web
       | serials and less popular translated novels that never get
       | audiobook releases.
        
       | delegate wrote:
       | The quality is great (amazing even), but I can't listen to AI
       | generated voices for more than 1 minute. I don't know why, I just
       | don't like it. I immediately skip the video on youtube if the
       | voice is AI generated.
       | 
       | Might be because our brains try to 'feel' the speaker, the
       | emotion, the pauses, the invisible smile, etc.
       | 
       | No doubt models will improve and will be harder to identify as AI
       | generated, but for now, as with diffusion images, I still notice
       | it and react by just moving on..
        
         | rockemsockem wrote:
         | That kinda means the quality isn't great or amazing. Good TTS
         | should be nearly or fully indistinguishable from a human speaker
         | and should include emoting, natural pauses, etc.
        
         | xdennis wrote:
         | Among other things, what I don't like is the hallucinated
         | stress. Take the classic example of:
         | 
         | > I never said she stole my money
         | 
         | It can have 7 different meanings based on which word you
         | stress.
         | 
         | The new AI voices sound very natural at a shallow level, but
         | overall pronounce things in odd ways. Not quite wrong, but
         | subtly unnatural which introduces some cognitive load.
         | 
         | Old TTS systems with their monotone voices are less confusing,
         | but sound very robotic.
        
           | DidYaWipe wrote:
           | erroneous or inappropriate ≠ hallucinated
        
         | CMay wrote:
         | Haven't really been following the latest in TTS ML, but I
         | expected this to be better or at least as good-bad as the stuff
         | you hear on YouTube. Somehow it sounds worse. It really is
         | jarring to listen to any of these ML voices and can't really
         | stand it. Nope out of every video that uses them and can't tell
         | if YouTube never recommends them to me for that reason, or just
         | because the recommendations around what I watch are just so
         | rarely going to be from some low reputation channel.
         | 
         | Take a moment here for a second though and think about it. Even
         | if these voices got to be really good, indistinguishable
         | almost... would I want to listen to it even then? If it was an
         | NPC's generated voice and generated dialogue in a game to help
         | enrich the world building, maybe in that context. On YouTube or
         | with newscasters? Probably not. Audio books? Think I would
         | still rather have it be a real person, because it's like
         | they're reading a story to me and it feels better if it's
         | coming from someone. There's also the unknown factor, where if
         | it's ML generated it's so sterile that the unknowns are kind of
         | gone.
         | 
         | Think about it like this, in the movie industry we had
         | practical effects that were charming in a way. You could think
         | about the physical things that had to occur to make that
         | happen. Movie magic. Now, everything is so CG it's like the
         | magic is gone. Even though you know people put serious hard
         | work into it, there's a kind of inauthenticity and just lack of
         | relevance to the real world that takes something away from it.
         | 
         | It's like a real magician has interesting tricks, while an
         | artificial magician is most likely just a liar.
         | 
         | Still, I grant that it makes some cool things possible and
         | there is potential if things are done right. Some positive
         | mixture of real humans and machine generated stuff so it isn't
         | devoid of anything connected to real life effort.
        
         | yjftsjthsd-h wrote:
         | > I immediately skip the video on youtube if the voice is AI
         | generated.
         | 
         | I mean, I do that because it's correlated with the content
         | being garbage. If I'm intentionally using it on content I want
         | to consume I expect it to be different, though I haven't gotten
         | around to trying it properly yet so I guess we'll see. (OTOH I
         | already listen to ebooks via pre-AI TTS, so I'm optimistic)
        
         | _DeadFred_ wrote:
         | For new generations/those coming up now this will be the norm
         | and not generate the negative reaction it does for us, it will
         | just be part of how the world is and has always been, and
         | eventually we will be the minority.
         | 
         | Future generations will never know a world where you don't
         | watch a 2 hour AI generated orientation video about the wonders
         | of working for Generic Corp when you start a new job.
        
         | karmasimida wrote:
         | Yeah same.
         | 
         | Doesn't mean the quality is bad. In fact I think Kokoro's
         | quality is amazing.
         | 
         | But it is not the right tool for narration, the kind of
         | training data they use makes the sound too flat, if that makes
         | sense.
        
       | maxglute wrote:
       | Sounds really nice at 3x-4x speed, which I can't say for high
       | quality TTS options last year. I'm wondering if there are metrics
       | out there for audio speed vs clarity.
        
       | monkeydust wrote:
       | I have been looking for something credible that can voice over
       | written emails (long form ones), documents and powerpoints
       | locally ...this might be just the thing!
        
       | plumbees wrote:
       | As a Mandarin learner, I find that the Chinese one lacks cadence,
       | which makes it very hard as a learner to comprehend. It's like a
       | machine gun of words without the subtle slight pause between sets
       | of words that I would normally lean on.
        
       | jaggs wrote:
       | I really like this a lot. The default provides a really good
       | audiobook feel, especially the Isabella voice. Any chance you
       | could add in an API hook for optional ElevenLabs use?
        
       | therealdrag0 wrote:
       | Do folks have a preferred toolkit for extracting text from web
       | articles? I'd like to TTS articles friends send me.
        
       | Dowwie wrote:
       | 2025 may be the year where we can generate a dramatic audiobook
       | with ambient music, sound effects, and theatrical narration using
       | neural networks. Many of the parts already exist.
        
       | DidYaWipe wrote:
       | Yes, because real narrator/actors are rolling in the dough. Let's
       | kill one more profession with trash.
        
         | bongodongobob wrote:
         | If it's trash then why would it kill the profession?
        
           | DidYaWipe wrote:
           | Because people will opt for readily-available free trash
           | instead of paying for high quality. And then that quality
           | isn't available to anyone at any price, so everyone loses.
           | 
           | If you haven't observed this in many other markets, you live
           | an unusual (or unobservant) life.
        
       | geor9e wrote:
       | This one sounds a bit robotic and takes ~4 hours per book on my
       | M1 laptop, so I'll keep looking. For now, I'm happy with my current
       | method - EPUBReader browser extension, which opens .epub as an
       | HTML page in Microsoft Edge browser, which has a "Read Aloud"
       | button set to the Stephan natural voice at 1.6 speed. Best
       | sounding voice I've ever heard, speaks fast, clear, crisp, with
       | natural inflections to the sentences, and if I want to jump to
       | somewhere I just left click the text at that spot. And it's
       | instant - no conversions. Downside is I have to stay in bluetooth
       | range of my laptop, so I'm still looking for a good phone based
       | method. Google Play Books works okay, but gets buggy at 1.6
       | speed.
        
       | nickpsecurity wrote:
       | The page says it was trained on under 100 hours of audio. Then,
       | the link says "we employ large pre-trained SLMs, such as WavLM,
       | as discriminators with our novel differentiable duration modeling
       | for end-to-end training." I don't have time to read the paper to
       | see what that means.
       | 
       | Depending on what that means, it might be more accurate to say it
       | was trained on 100 hours of audio and with the aid of another,
       | pre-trained model. The reader who thinks "only 100 hours?!" will
       | know to look at the pretraining requirements of the other model,
       | too.
        
       | leecarraher wrote:
       | In case you are wondering how audiblez becomes an executable in
       | the PATH from a pip install audiblez per the documentation
       | 
       | ... audiblez book.epub -l en-gb -v af_sky
       | 
       | it does not. Instead it installs a Python package with a CLI;
       | to run it you then have to prepend python and load the
       | module like this:
       | 
       | python3 -m audiblez book.epub -l en-gb -v af_sky
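
A small sketch of the fallback described above, using only the Python standard library. It assumes, as the comment reports, that the package can be run as a module with -m; the helper name build_command is hypothetical. It prefers an executable found on the PATH and otherwise falls back to the current interpreter with -m.

```python
import shutil
import sys


def build_command(tool, args):
    """Build an argv list for `tool`, falling back to `python -m tool`.

    shutil.which returns the full path of the executable if it is on
    the PATH, or None if it is not.
    """
    exe = shutil.which(tool)
    if exe is not None:
        return [exe, *args]
    # Assumption: the package exposes a module runnable with -m.
    return [sys.executable, "-m", tool, *args]
```

The resulting list could be passed to subprocess.run, for example build_command("audiblez", ["book.epub", "-l", "en-gb", "-v", "af_sky"]).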
        
       | skwee357 wrote:
       | Soon, AI will flood the market with mediocre everything: books,
       | audio books, art, movies, websites.
       | 
       | The saddest thing is that people will still continue to
       | participate in consuming these AI produced "goods".
        
         | abroadwin wrote:
         | If there's one thing our capitalist society has taught me it's
         | that people are always willing to endure a crappier product.
         | I'm not sure we've found the bottom yet...
        
         | vanderZwan wrote:
         | I think the saddest thing is that it's highly likely that real
         | people will start to produce aesthetics that look/sound/etc
         | like AI slop
        
           | skwee357 wrote:
           | True, and I think with the recent news around Spotify using
           | AI to fill their playlists, we are already getting there.
           | Just need to condition the public that there is no better
        
       | flypunk wrote:
       | I really liked it and added a variable speed argument:
       | https://github.com/santinic/audiblez/pull/4
        
       | sysworld wrote:
       | Finally! Been trying all the TTS models popping up on here for
       | ages, and they've all been pretty average, or don't work on Mac, or
       | only work on really short text, or are reeealy slow.
       | 
       | But this one works pretty quick, is easy to install, has some
       | passable voices. Finally I can start listening to those books
       | that have no audio version.
       | 
       | I'm a slow reader, so don't read many books. If a book doesn't
       | have an audiobook version, chances are I won't read it.
       | 
       | PS, I have used elevenlabs in the past for some small TTS
       | projects, but for a full book, it's price prohibitive for
       | personal use. (elevenlabs has some amazing voices)
       | 
       | Thank you to the dev/s who worked on this!
        
       | crorella wrote:
       | Nice! It would be great to have per character voices
        
         | boznz wrote:
         | this would be a game changer if done right. All good voice
         | actors can carry a dozen different 'voices' for characters
        
       ___________________________________________________________________
       (page generated 2025-01-15 23:01 UTC)