[HN Gopher] Generate audiobooks from E-books with Kokoro-82M
___________________________________________________________________
Generate audiobooks from E-books with Kokoro-82M
Author : csantini
Score : 375 points
Date : 2025-01-15 08:47 UTC (14 hours ago)
(HTM) web link (claudio.uk)
(TXT) w3m dump (claudio.uk)
| jaggs wrote:
| This looks really nice. And fast too it seems.
| treetalker wrote:
| For anyone looking for an easier alternative (and one without the
| bugs the author describes, such as skipping some prefaces or
| failing to detect some chapters), Voice Dream Reader on iOS (and
| macOS) handles .epub and other e-books just fine and supports a
| variety of built-in and external voices.
| huhtenberg wrote:
| Another subscription.
|
| $80/yr.
|
| Yaaaaaay.
| treetalker wrote:
| Unless something has changed, the iOS version is a one-time
| purchase. I bought the app many years ago (8?) and have been
| a happy user since.
|
| Like you, though, I had that reaction to the subscription
| model for macOS and therefore decided not to "buy" it when it
| came out.
| huhtenberg wrote:
| They got greedy and decided to milk it. That's what
| changed.
|
| It's $80/yr for the iOS app.
| treetalker wrote:
| Oof, I believe they changed ownership so I must have been
| grandfathered in. That's steep.
| danman2 wrote:
| Do you know if it's possible to train it to use my own voice?
| rhizoma wrote:
| Yes, I've used Voice Dream for years with Pocket articles &
| ebooks because the Pocket app took up too much space and was
| limited to web articles. The voice quality is ok for short
| pieces or stints. The choice of voices is a bit robotic, but I
| find it useful while making written notes in Split View.
| freefaler wrote:
| Kybook is 1 time payment and can use iOS TTS voices.
| jdlyga wrote:
| ElevenLabs Reader is the same thing, but much higher quality
| voices for free. I've lost my place a few times so it's not
| quite as reliable as VoiceDream. But you aren't paying an
| expensive subscription with mediocre voices.
| qurashee wrote:
| This looks incredible! I've had an idea simmering in the back of
| my mind for a while now: creating an audiobook from an ebook for
| my commute using the voice of a specific audiobook narrator I
| really enjoy. The concept struck me after coming across the
| Infinite Conversation project here on HN. Unfortunately, I just
| haven't found the time to bring it to life yet. :(
| vinni2 wrote:
| What about the copyright issue? You can't mimic the voice of a
| narrator without their consent. OpenAI landed in trouble after
| using Scarlett Johansson's voice in a demo.
|
| https://www.theverge.com/2024/5/20/24161253/scarlett-johanss...
| notachatbot123 wrote:
| No limitations on this kind of thing if you are in private
| use.
| vinni2 wrote:
| Forgive me for not knowing it was for personal use.
| qurashee wrote:
| Indeed I was thinking about private use only.
| benatkin wrote:
| She only won in that OpenAI decided it wasn't worth the
| trouble.
| K0balt wrote:
| Yeah, by my ear it was pretty clearly not SJ's voice-
| likeness, although there were some superficial
| similarities.
|
| But some people could have mistaken it due to some regional
| accent similarities, though it would be akin to
| interpretation of any light southern drawl with a similar
| timbre as being SJ.
| mmahemoff wrote:
| They also asked her for permission in advance, which was
| never going to help their case.
| amrrs wrote:
| Kokoro really mentions that they used only permissively
| licensed voices
| gunalx wrote:
| Kokoro seemed pretty nice for the size. I guess it is not much
| better than a lot of the simpler tts. But at least it sounds
| less mechanical than a few bad ones.
| outofpaper wrote:
| It is essentially a set of voice models building on
| https://huggingface.co/spaces/styletts2/styletts2
|
| The odd thing is that while they are releasing these great
| sounding models, they are not documenting the training process.
| What we want to know is what magic if any allowed them to
| create such wonderful voices...
| cess11 wrote:
| I would for sure not want this for fiction, it's too obvious that
| the voice has no understanding whatsoever of the text, but it's
| probably pretty nice for converting short news texts or
| notifications to audio.
| vanderZwan wrote:
| Your point is a valid one, but I want to add to it that it is
| also a matter of expectations and how one listens.
|
| Years ago, when I was dating someone who spoke Russian as one
| of her native languages, we had to do a funny compromise when
| watching films together with her parents: they didn't speak a
| word of English, so we'd use the Russian dub with English
| subtitles.
|
| I noticed that the Russian dub was just _one man_ reading a
| translation in a flat voice over what was happening on the
| screen, no attempts at voice acting or matching the emotions.
| Usually the dub would have a split second delay to the actual
| lines, so you'd still hear the original voices for a moment
| (and also a little bit in the background).
|
| At first I found it very jarring, but they explained that this
| flatness was a feature. You'll quickly learn to "filter out"
| the voice while still hearing the translation, and the faint
| presence of the original voices was enough to bring the
| emotional flavor back. The lack of voice acting helped with the
| filtering.
|
| This turned out to apply to me as well, even though I don't
| speak Russian! My brain subconsciously would filter out the
| dub, and extract most of the original performance through the
| subtitles and faint presence of the original voices. Obviously
| the original version would have been a better experience for
| me, but it was still very enjoyable.
|
| Of course a generated audiobook is not a dub, as there is no
| "original voice" to extract an emotional performance from. But
| some listeners might still be able to do something similar. The
| lack of understanding in the generated voice and its
| predictable monotony might allow them to filter out everything
| but the literal text, and then fill it in with their own
| emotional interpretations. Still not as great as having a proper
| storyteller who _does_ understand the text and knows how to
| deliver dramatic lines, but perhaps not as bad as expected
| either.
| cess11 wrote:
| It's not a "point", I didn't make an argument.
|
| I dislike german and russian style dubs as well, I'd rather
| learn a bit of the original language.
| em-bee wrote:
| indeed, audio books come in many forms, some are rather flat,
| and some include different voices, even by different
| speakers, or include a few voiced sound effects, laughing,
| crying, singing, etc. TTS is extra flat, but if the quality
| is good otherwise then it is like reading with my ears, and i
| add the emotions myself.
| aleksiy123 wrote:
| Watching these as russian/english bilingual is very painful,
| tho I grew up in western world so maybe I'm just not used to
| it.
|
| To add on a slight tangent. Many books/audiobooks just don't
| exist in other languages at all. So even getting some
| monotone is a lot better than getting nothing.
|
| I think this is where these models really shine. Cheaply
| creating cross language media and unlocking the
| knowledge/media to underprivileged parts of the world.
| vanderZwan wrote:
| > _Watching these as russian/english bilingual is very
| painful, tho I grew up in western world so maybe I'm just
| not used to it._
|
| I figured that their opinion probably wasn't universal,
| hahaha.
|
| And yes, it's at the very least a win for accessibility
| arafalov wrote:
| Here is the rest of that story.
|
| When the foreign movies started to filter into the Soviet
| Union's illegal movie theatres, you would get 3 or 4 movies
| playing at once in one room. There would be a TV in each
| corner of the room and 4 or 5 rows of plastic chairs in front
| of it in an arch.
|
| ALL of the movies were being revoiced by the same person. So,
| if you were sitting in the back of the 5th row, you were
| potentially getting the sound from an action movie, a comedy,
| a horror movie and a romance at the same time. In the same
| voice.
|
| You learned to filter really well. So, if that's what they
| were trained on, watching a single movie must have been very
| relaxing.
| vanderZwan wrote:
| Looking at the modern internet experience it sounds like
| the Soviet Union's illegal movie theatres were ahead of
| their time!
| calgoo wrote:
| Audible has thousands of books available "for free" with their
| membership that are all AI generated. I was the same in the
| start, but after listening to a few, it really comes down to
| the voice used. I spent 8h on a plane listening to 1 book, and
| there were maybe 5 occasions where i had an issue with the
| voice; and i think all were just "AI weirdness", similar to
| chat LLMs messing up simple sentence structure or image
| generating LLMs adding an extra finger.
| cess11 wrote:
| I don't think dominant suppliers like Audible should exist so
| that matters little to me.
| arafalov wrote:
| The one I tried, had a lot of issues. It was a music theory
| book and it did not know how to pronounce C# (it kept saying
| C 'hash'). It also referred to, but did not read out the
| diagrams, or tables.
|
| So, it was not just the voice, but the quality control
| pipeline that was missing as well.
|
| Maybe it mostly works for old plain text books, but if nobody
| is checking.....
| mg wrote:
| Would this also be the best option if you just want to convert
| plain text files to audio?
| bArray wrote:
| Markdown and PDF would also be cool. I think it's just a case
| of feeding the TTS model the right data at the right time. The
| special sauce is in the model, there's really not much to the
| code:
| https://github.com/santinic/audiblez/blob/main/audiblez.py
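Editor's sketch of the "right data at the right time" idea for plain text: split the text into sentence-sized chunks and hand each one to a TTS function. This is not the audiblez code. `split_into_chunks`, `text_to_audio`, and the `synthesize` callable are hypothetical names, and the 400-character limit is an arbitrary assumption rather than a documented model constraint.

```python
import re
from typing import Callable, List, TypeVar

Audio = TypeVar("Audio")


def split_into_chunks(text: str, max_chars: int = 400) -> List[str]:
    """Group whole sentences into chunks of at most max_chars where possible.

    A single sentence longer than max_chars is kept whole, not cut mid-sentence.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def text_to_audio(
    text: str, synthesize: Callable[[str], Audio], max_chars: int = 400
) -> List[Audio]:
    """Run a caller-supplied TTS function (hypothetical) over each chunk in order."""
    return [synthesize(chunk) for chunk in split_into_chunks(text, max_chars)]
```

Any real TTS call would be passed in as `synthesize`. Markdown or PDF input would need its own text extraction step before this point.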
| katspaugh wrote:
| Sounds better than many books on Audible.
| ekianjo wrote:
| japanese is not supported yet despite the claims. you can easily
| realize that by running the examples provided.
| laserbeam wrote:
| On the one hand, this is very convenient. Probably cool for some
| non-fiction.
|
| On the other, some of my favorite audio books all stood out
| because the narrator was interpreting the text really well, for
| example by changing the pacing during chaotic moments. Or those
| audiobooks with multiple narrators and different voices for each
| character. Not to mention that sometimes the only cue you get for
| who's speaking during dialogue is how the voice actor changes
| their tone. I have mixed feelings about using this and losing
| some of that quality.
|
| I would totally use this over amateur ebooks or public domain
| audiobooks like the ones on Project Gutenberg. As cool as it
| is/was for someone to contribute to free books... as a listener
| it was always jarring to switch to a new chapter and hear a
| completely different voice and microphone quality for no reason.
| ahoka wrote:
| I guess this is still very useful if you are blind.
| loktarogar wrote:
| Yeah, for accessibility purposes on things that aren't
| already narrated, this kind of thing is huge.
| em-bee wrote:
| that's the thing. it's not just for accessibility. anything
| not already narrated is a fair target for TTS. i don't have
| time to sit down and read books. all reading is done on the
| go, while getting around or doing daily routines at home. i
| have a small book that i am reading now, which should take
| a few hours to finish, but in the time i manage to get done
| reading it i will probably have listened to two or three
| audio books.
|
| oh, and it's also a boon for those who can't afford to buy
| audiobooks.
| vasco wrote:
| You don't choose to spend your time reading books. You
| probably roll your eyes when someone tells you they don't
| have time for some activity you deem valuable. This is
| the 'no time to exercise' debate in a different shape.
|
| They are also different activities, with audio it's
| easier to listen to more but retention is usually lower.
| Not casting any elitist "you need to read" bullshit by
| the way, but find it odd to define it in terms of lack of
| time, and I really like both mediums.
| em-bee wrote:
| there is not much of a choice here. sure, i could use the
| time i spend reading and commenting on HN to read books
| instead. so technically speaking it is a choice. but i
| want to do both and many other things besides also having
| to work and a family to take care of. so the result is, i
| can't afford the time to read without giving up other
| things that are also important to me. listening to books
| allows me to access books i would otherwise not be able
| to read because of these priorities.
|
| there are other factors as well. i love reading so much
| that i tend to forget time around me. as a result reading
| would cause me to neglect other duties. i can't allow
| that, and therefore i am forced to avoid reading. i also
| don't like long form reading on electronic devices, and
| as a frequent traveler, printed books are simply not
| practical and often not even accessible.
|
| i agree with the retention issue, but i found that a much
| larger factor for retention is how well i can follow the
| story. a good story that is easy to get into is also
| easier to retain. and finally, reading fiction is for
| entertainment. i don't have to retain it.
| esrauch wrote:
| > You probably roll your eyes when someone tells you they
| don't have time for some activity you deem valuable.
|
| There's a few categories where it makes sense to roll
| your eyes, like if they say they have no time to shower
| or have never been to one of their kid's baseball games.
|
| But for things that aren't basic human expectations, I
| think you'd have to be a real jerk to roll your eyes at
| someone not having time. No time to cook multi-pot
| dishes? No time to exercise? No time to read? No time to
| go to museums? No time to meet at the bar for a drink?
| Any of them sensible.
|
| No one can do everything, we all make our priorities and
| it's well within their choice not to have any one optional
| life thing at the top of their personal stack.
| vasco wrote:
| Agree completely, my point was indeed they are choices,
| not lack of time. I think I came across too judgy even
| trying not to. You made a better job of it.
| hombre_fatal wrote:
| This is a weird comment. They are just saying why they
| prefer audiobooks thus why general TTS is useful for
| them.
|
| Why are you trying to argue about their preference? They
| didn't cast any judgement on others with different
| preferences.
|
| This is nothing like "no time for exercise".
|
| It's more like "I have no time (preference) to fire up
| the wood stove so I use microwave" and then you come in
| with "wow so you roll your eyes at us fire stove users?"
| vasco wrote:
| Two hours before you posted this there was already an
| admission from me in a sister comment that I came across
| too judgy and someone else made the point I tried better
| than myself - not sure how much penitence I need to do
| but sorry again :)
| flir wrote:
| I was just thinking about automatically slapping an mp3 on
| every blog post, just an accessibility nicety.
|
| Can someone with low vision tell me if this would be useful
| to them? It may be that specialist tools already do this
| better.
| laserbeam wrote:
| People use screen readers for accessibility. I would not
| expect anyone to be able to "look for and find" your
| mp3... I would instead expect them to use the tool they
| normally use for accessibility.
|
| The real question is "what tools are they already using
| and how can I make sure those tools are providing higher
| quality output?". There are standards in browsers for
| these kinds of things (ways to hint navigation via
| accessibility tools for example).
| flir wrote:
| > I would instead expect them to use the tool they
| normally use for accessibility.
|
| Yes, that was my second thought. But I'd rather ask
| someone than rely on my assumptions.
| felixhummel wrote:
| I wholeheartedly agree.
| https://en.m.wikipedia.org/wiki/Stephen_Briggs got me hooked on
| Terry Pratchett's Discworld series. I loved "Going Postal".
| IndrekR wrote:
| I know someone who listened to Terry Pratchett's "Wachen!
| Wachen!" audiobook on Spotify while living in Germany for a few
| years. It was so well narrated that he also acquired some
| peculiarities of local dialects used by specific characters
| in the book. Locals in Bavaria were quite surprised by a
| foreigner speaking such language.
| micw wrote:
| With this technology, one could produce high quality audio
| books without having access to high quality narrators by
| annotating the books with the voice, speed and such things.
|
| I wonder if a standardized markup exists to do so.
| KeplerBoy wrote:
| Don't end to end trained models already do this to some
| extent? Like raising the pitch towards a question mark, like
| a human would.
|
| TortoiseTTS has a few examples under prompt engineering on
| their demo site:
| https://nonint.com/static/tortoise_v2_examples.html
| micw wrote:
| That's a bit basic and random. Some models have the
| features you describe. From the better models you get a
| slightly different voice for text in quotes.
|
| But the difference to good audio books is that you have
| * different voices for the narrator and each character
| * different emotions and/or speed in certain situations.
|
| I guess you could use a LLM to "understand" and annotate an
| existing book if there's a markup and then use TTS to
| create an audio book from it and so automate most of the
| process.
| micw wrote:
| Edit: I actually tried this. I prompted in ChatGPT:
|
| "Annotate the following text with speakers and emotions
| so that it can be turned into an audiobook via TTS",
| followed by a short text from "The Hobbit" (The "Good
| morning scene"). The result is very good.
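Editor's sketch of the step after the LLM annotation described above: parse the annotated text into segments and pick a voice per speaker before handing each segment to a TTS engine. The `[Speaker|emotion] text` line format, the `Segment` type, and the voice names are all hypothetical. The comment does not say what format ChatGPT actually produced.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Segment:
    speaker: str
    emotion: str
    text: str


# Hypothetical annotation format, one segment per line: [Speaker|emotion] text
LINE_PATTERN = re.compile(
    r"^\[(?P<speaker>[^|\]]+)\|(?P<emotion>[^\]]+)\]\s*(?P<text>.+)$"
)


def parse_annotated(script: str) -> List[Segment]:
    """Turn annotated lines into Segment objects, skipping lines that do not match."""
    segments: List[Segment] = []
    for line in script.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            segments.append(
                Segment(
                    speaker=match["speaker"].strip(),
                    emotion=match["emotion"].strip(),
                    text=match["text"].strip(),
                )
            )
    return segments


def assign_voices(
    segments: List[Segment],
    voices: Dict[str, str],
    default: str = "narrator_voice",
) -> List[Tuple[str, Segment]]:
    """Pair each segment with a voice name, falling back to a default voice."""
    return [(voices.get(seg.speaker, default), seg) for seg in segments]
```

How the emotion label would influence the audio depends entirely on the TTS engine. Here it is only carried along with each segment.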
| pegasus wrote:
| They still wouldn't be high quality. It's just not possible
| to capture the precise tone of voice in an annotation, and
| that precision I believe really makes a difference. My
| experience is that the deeper the narrator understands the
| text and conveys that understanding, the easier it becomes
| for me to absorb that information.
| vasco wrote:
| Have you tried those "podcast from a paper" models? They do
| some of the things you are saying they don't, although it's
| not 100% it's also miles ahead of for example human Polish
| TV lectors, or other monotone style narrations.
| albert_e wrote:
| There is SSML for speech markup to indicate various
| characteristics of speech like whispers, pronunciation, pace,
| emphasis, etc.
|
| With LLMs proving to be very good at generating code, it may
| be reasonable to assume they can get good at generating SSML
| as well.
|
| Not sure if there is a more direct way to channel the
| interpretation of the tone/context/emotion etc from prose
| into generated voice qualities.
|
| If we train some models on ebooks along with their
| professionally produced human-narrated audiobooks, with
| enough variety and volume of training data, the models might
| capture the essence of that human-interpretation of written
| text? Just maybe?
|
| Amazon with its huge collection of Audible + Kindle library
| -- if it can do this without violating any rights -- has a
| huge corpus for this. They already have "whispersync" which
| is a feature that syncs text in a kindle ebook with words in
| corresponding audible audiobook.
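Editor's sketch of what SSML markup looks like, built with Python's standard XML library. The `speak`, `s`, `break`, `prosody`, and `emphasis` elements come from the W3C SSML specification. The function name, its parameters, and the chosen attribute values are illustrative assumptions. Which elements a given TTS engine honors varies, and the XML namespace is omitted here for brevity.

```python
import xml.etree.ElementTree as ET


def build_ssml(narration: str, hushed: str, stressed: str) -> str:
    """Build a small SSML document: a sentence, a pause, a soft slow line, an emphasized line."""
    speak = ET.Element("speak", {"version": "1.1"})

    # Plain narrated sentence.
    sentence = ET.SubElement(speak, "s")
    sentence.text = narration

    # Half-second pause before the next line.
    ET.SubElement(speak, "break", {"time": "500ms"})

    # Quieter and slower delivery, an approximation of a hushed voice.
    soft = ET.SubElement(speak, "prosody", {"volume": "soft", "rate": "slow"})
    soft.text = hushed

    # Strong emphasis on the final line.
    strong = ET.SubElement(speak, "emphasis", {"level": "strong"})
    strong.text = stressed

    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("The door creaked open.", "Is anyone there?", "Nobody answered."))
```

An LLM asked to annotate prose would be emitting this kind of markup directly. Building it in code mainly helps guarantee the output is well-formed XML.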
| micw wrote:
| Good points, thank you! I just tested it. While ChatGPT was
| very good in adding generic (textual) annotations, the
| result for generating SSML was very poor (lack of voice
| names, lack of distinction between narrator and character
| etc).
|
| Probably the results with a model trained for this plus
| human audit could lead to very good results.
| stavros wrote:
| > On the other, some of my favorite audio books all stood out
| because the narrator was interpreting the text really well
|
| This (and everything else with AI) isn't saying "you don't need
| good actors any more". It's saying "if you don't have an
| audiobook, you can make a mediocre one automatically".
|
| AI (text, images, videos, whatever) doesn't replace the top
| end, it replaces the entire bottom-to-middle end.
| j4coh wrote:
| RIP to future top-enders that would normally have started out
| on the bottom to middle end.
| aredox wrote:
| Bingo. AI is going to destroy any pathway for training and
| accruing experience.
|
| An embalming tech for our dying civilization.
| lupusreal wrote:
| Just like printing presses killed the profession of
| copying books by hand, eliminating the training pathway
| for illuminated manuscripts. Death of civilization itself
| I say, damn those printing presses.
| j4coh wrote:
| If you see podcasts as useless in modern society as
| illuminated manuscripts, no big loss I suppose, but I do
| enjoy the human made ones and would be sad to see them go
| extinct as the manuscripts did. And the same thing is
| happening to other entry-level creative roles, some of
| which you may personally regret the loss of too.
| akho wrote:
| I enjoy looking at illuminated manuscripts. Podcasts are
| bullshit and can die in a ditch.
| teekert wrote:
| I enjoy podcasts but I still hope illuminated manuscripts
| won't die in a ditch so other people can enjoy content
| the way they prefer ;)
| lupusreal wrote:
| Actually I think illuminated manuscripts had more value,
| insofar as they were art, than podcasts (99% of which are
| vapid timewasters and/or friend simulators.) The good
| podcasts are those few which involve interviewing
| interesting people, and AI isn't replacing those.
|
| There's a lot more to be said for the value of audio
| books, but the accessibility gains of proliferated auto-
| generated audiobooks outweigh the downside of losing a
| small number of expertly produced audio books.
|
| For context, I listen to audio books a lot, and for years
| I have listened to traditional TTS readings of books too.
| Better voice generation for books without audiobooks is a
| great win for society.
| littlestymaar wrote:
| Given that the printing press was the root cause for the
| century of religious wars that soaked Europe with blood,
| and was key in the revolutions that overthrew absolute
| monarchies all over Europe, I don't think it's as good an
| example as you think it is.
|
| Death of a civilization doesn't mean disappearance of
| mankind or even overall regression on the long term.
| megaloblasto wrote:
| Do you have a source for that? I don't think the printing
| press was the cause of religious wars any more than
| bullets were the cause of WWII
| baq wrote:
| Easy access to the Bible text instead of being only read
| to, hence high literacy of the faithful, was one of the
| core tenets of some branches of Protestantism.
| llamaimperative wrote:
| Have you heard of the Protestant Reformation and the
| following 120 years of war? The entire Protestant <>
| Catholic blow up that consumed Europe was pretty directly
| attributable to the printing press.
|
| (To be clear, nothing is solely and exclusively caused by
| any one thing. Causality is a very fuzzy concept. But
| sans printing press, those wars certainly wouldn't have
| happened when/where/how they did, if they ever happened
| at all).
| thoroughburro wrote:
| This is common enough knowledge that "read, like, any
| history" is an appropriate response. However, if you're
| genuinely curious, here's a random link:
|
| https://ehne.fr/en/encyclopedia/themes/european-
| humanism/eur...
| lupusreal wrote:
| I blame canned food and trains for solving the logistics
| problems that previously prevented massive wars.
| littlestymaar wrote:
| Napoleonic wars beg to differ.
| sigilis wrote:
| While they didn't have trains, the Napoleonic wars did
| feature the first use of canned food to aid in logistical
| supply of armies. You could argue that the lack of trains
| (and can openers) probably meant that they jumped the gun
| on starting giant wars. We Americans fixed that in the
| Civil War, to great and deadly effect.
| _DeadFred_ wrote:
| An interesting one I read was public schools and their
| creation of a national identity. Before public schools
| there weren't really standardized languages forced upon
| an entire nation, etc. The countryside was more one
| country/people/language morphing into the next, not clean
| delineated lines where country/language switched
| instantly. It was also said borders were much more
| open/abstract before the resultant shift as well.
| turnsout wrote:
| Those revolutions were ultimately positive. The
| alternative would be the continued rule by monarchs and a
| single powerful religion
| littlestymaar wrote:
| See my second paragraph. It can be ultimately positive
| while still being civilization-ending.
| chairmansteve wrote:
| No comfort to the millions who died though.
| oldgradstudent wrote:
| There's a big difference.
|
| Printing presses produce superior products.
|
| A mediocre audiobook is certainly better than no
| audiobook at all, but it is an inferior product to a well
| produced audiobook.
| gampleman wrote:
| > Printing presses produce superior products.
|
| That seems like a highly dubitable statement. Many hand
| illuminated manuscripts are masterpieces of art. The
| advantage of the printing press was chiefly economical
| making the cost of a copy dramatically less, not an
| increase in quality (especially so by the aesthetical
| standards of the time).
| karamanolev wrote:
| Many (most, if not all) hand-made copies contained
| errors, which printed books did not. They were much
| closer to 1:1 copies.
| jhbadger wrote:
| If the mistake happened in the typesetting stage, printed
| books could spread errors much more efficiently, as in
| the infamous "wicked bible" of 1631, where a typesetting
| error made the ten commandments contain the amusing
| phrase "Thou shalt commit adultery". Surviving copies are
| quite the collectors' item as most were destroyed.
|
| https://en.wikipedia.org/wiki/Wicked_Bible
| oldgradstudent wrote:
| Usually, though, errors are corrected and every
| printing has fewer errors than the previous one.
| kamarg wrote:
| What percentage of books get a second print run on a
| printing press? And what's the process for that? Do they
| have to reset each word for the second run? I genuinely
| don't know how a physical process like typesetting can
| result in increased accuracy on each print.
| jhbadger wrote:
| Indeed. Even Gutenberg had his Bibles touched up by
| artists after they were printed (illuminated capital
| letters and so on) because even he believed his printed
| copies were inferior to the hand-made ones.
| oldgradstudent wrote:
| As a work of art, sure. But as books containing
| information, printing presses produced superior products.
| Workaccount2 wrote:
| What we have today is early gen "practical" AI.
|
| Even current SOTA models would almost certainly be able
| to handle multiple speakers and pick-up on the intended
| tone and intonation.
|
| Don't make the mistake of thinking what we have today is
| what we will still be working with in 5 or 10 years.
| fidelramos wrote:
| Some people will learn to use these AIs to make top-
| quality audiobooks (and books, movies, TV shows,
| comics...). It will be a more manual process than
| pressing a button, but still orders of magnitude less
| than what it took before. As a result there will be a
| tsunami of high-quality content.
|
| There will be curation and specialization. Previously
| ignored niches now will be economically profitable. It
| will be a Renaissance of creativity, and millions of jobs
| will be created.
| _DeadFred_ wrote:
| It's kind of wild to me that the future will look like
| the 80s imagined it all because AI killed the creative
| seed corn when retro-future 80s was the aesthetic.
| azeirah wrote:
| We'll be ok lol, while it is a significant transition, it
| IS just a transition in the media landscape.
|
| AI is big and significant, but we'll be ok. There is also
| no such "one" thing as "our civilisation". We're deeply
| interconnected, extremely vast and complex
| networks of ever-changing relationships.
|
| AI does indeed represent the commoditisation of things we
| used to really value like "craftsmanship in book
| narration" and "intelligence". But we've had
| commoditisations of similar media in the past.
|
| Paper used to be extremely expensive, but as time went
| on, it became more and more commoditised.
|
| Memory used to be extremely expensive (2000-3000 years
| ago, we needed to encode memory in _dance_, _stories_ and
| _plays_. Holy shit). Now you can purchase enough memory
| to store a billion books for maybe two hours of labor.
|
| Most of these things don't really matter. What is
| happening is that the media landscape is significantly
| shifting, and that is a tale as old as history.
|
| I do think the intellectual class will be affected the
| most. People who understand this shift stand to benefit
| enormously, while those who don't _might_ end up in a
| super awful super low class.
|
| And yet, all of that doesn't really matter if you just
| move to, I dunno, Paramaribo or whatever. The people
| there are pragmatic and friendly. They don't care about
| AI too much. Or maybe New Zealand, or Iceland, or Peru,
| or Nepal or I don't know.
|
| The world isn't ending. Civilisation isn't being
| destroyed at our core.
|
| The media landscape is changing, classes are shifting,
| power-relationships are changing. I suggest you think
| deeply about where you want to live, what you stand for
| and what is most important to you in life.
|
| I don't need money or tech to be happy. I am fine with
| just my cats, my closest friends and family and healthy
| food.
|
| If it happens to be the case that I need to leave tech or
| that extremely high-end narrated audiobooks cease to
| exist? Then all I have to say is "oh no, anyway".
|
| We'll be fine. One way or another.
|
| Just different.
| credit_guy wrote:
| By that time, AI will beat the toppest of the top enders.
| Remember the time Deep Blue barely beat Kasparov? Now no
| human, or group of humans can beat a chess engine, even one
| that runs on an iPhone.
| plastic3169 wrote:
| I don't think chess is a good example of AI destroying
| the path to the top. Chess is more popular now and humans
| keep advancing even though it is futile effort against
| computers.
| rcxdude wrote:
| And people are better at chess now in part because of
| practicing with/against machines. But chess has never
| been something you can make a living off of unless you
| were at the very top.
| sam_lowry_ wrote:
| > RIP to future top-enders that would normally have started
| out on the bottom to middle end.
|
| This stance always reminds me of the Profession, a 1957
| novella by Isaac Asimov that depicts pretty much the future
| where there are only top performers and the ignorant crowd.
| xyproto wrote:
| He was a clear thinker.
| Der_Einzige wrote:
| Not RIP at all. "Meritocracy" was coined in a book
| literally warning us about how terrible such a society
| would be:
| https://en.wikipedia.org/wiki/The_Rise_of_the_Meritocracy
|
| The "top-enders" are the privileged who need to have some
| of their gains for their intelligence redistributed to
| others. The alternative is "survival of the smartest",
| which is de-facto what we have today and what Young was
| trying to warn us about.
| gosub100 wrote:
| I'm super opposed to AI, but I see this as a rare positive.
| As someone already said, the win here is to have an
| audiobook where one doesn't yet exist. hell, maybe the
| tables will turn and the scrubs will do the hard work of
| discovering which titles are popular with an audience, then
| the ebook industry can capitalize on AI by hiring voice
| actors to produce proper titles?
| DidYaWipe wrote:
| Not gonna happen. Once the AI shit is out there, people
| will have consumed it by the time a real actor can create
| (and edit) the audiobook.
| anothermathbozo wrote:
| Virtually every book I want this for has been around for
| 70+ years and still no high or low quality audiobook has
| been produced. How long do I have to wait for those
| aspiring top-enders before an audiobook can be made
| available?
| Arainach wrote:
| That has nothing to do with audiobook voice actors and
| everything to do with copyright and who owns the rights
| to the book (and whether they believe there's any money
| to be made selling an audiobook version).
| cmdtab wrote:
| The value of distribution is increasing while the value of
| content and product is decreasing for all but the top end.
| CuriouslyC wrote:
| It's common for shows to use big name actors as voices
| because they draw an audience, nothing will change. Just
| means a smaller pool of voice actors and they'll mostly be
| good looking.
| numpad0 wrote:
| AI TTS has been available for quite some time. Tacotron V1 is
| about 8 years old. I don't think we saw much bottom end
| replacement.
|
| IMGO(gut opinion), generative AI is a consumption aid, like a
| strong antacid. It lets us be done with $content quicker, for
| content = {book, art, noisy_email, coding_task}. There's
| obvious preconceptions forming among us all from "generative"
| nomenclature, but lots of surviving usages are rather
| reductive in relevant useful manners.
| sam_lowry_ wrote:
| Yeah, let us not blame AI. Audible damaged the quality of
| audiobooks more than AI.
| no_wizard wrote:
| Bottom end really, Middle end is still superior to this AI
| drivel.
| dmazin wrote:
| Absolutely.
|
| Even on the non-fiction side, the narration for Gleick's The
| Information adds something.
|
| While I want this tool for all the stuff with no narration,
| NYT/New Yorker/etc replacing human narrators with AI ones has
| been so shitty. The human narrators sound _good_ , not just
| average. They add something. The AI narrators are simply _bad_.
| WillAdams wrote:
| Yes, but if the alternative is not having a book, or having to
| listen to one poorly read (I love Librivox, but there are some
| books which I just haven't been able to finish because of
| readers, and many more which were nixed for family vacation
| travel listening on that account), this may be workable.
| rd11235 wrote:
| I agree but the opposite can be true too. Sometimes the
| narrator seems to target some general audience that doesn't fit
| me at all, in a way that makes me cringe when I listen, until I
| stop listening altogether. In these cases I'd rather listen to
| a relatively flat narration from a tool like this.
| whazor wrote:
| A GenAI model that reads audiobooks with such dramatisation is
| really my dream. There are so many books that I would want to
| listen to, but still lack such an adaptation. Also it takes
| months after the book release before the audiobook gets
| released.
|
| Just imagine what this would do for writers. They can get
| instant feedback and adjust their book for the audiobook.
| ldoughty wrote:
| I agree with you, but also want to point out:
|
| New authors, self-publishers, can't afford tens of thousands of
| dollars to get an audiobook recorded professionally... This can
| limit their distribution.
|
| Authors might even choose not to make such a version (or lack
| confidence to record themselves), so AI capable of making a
| decently passable version would be nice -- something more than
| reading text blandly. AI in theory could attempt to track the
| scene and adjust.
| DidYaWipe wrote:
| You can get narrators to work on a royalty basis.
| plorg wrote:
| By observation the current approach is for authors to narrate
| the book themselves if they think their readers will want it
| and if they feel reasonably confident in their own narration.
| gmuslera wrote:
| Would a "better" AI do a "better" narration with a better
| understanding of the text? Of course, that would imply a
| different (and far bigger?) model.
|
| Anyway, even if in theory it might, in practice things may end
| even worse than doing it with a monotone voice.
| taude wrote:
| Agree with you on this.
|
| My example, I was never a Wheel of Time fan, but the new audio
| editions done by Rosamund Pike are quite the performance, and
| make me like the story. She brings all the characters to life
| in a way that's different than just reading. It's a true
| performance.
| Havoc wrote:
| Wow that sample sounds really good
| pprotas wrote:
| I would love to have an e-reader that allows me to switch between
| text and audio at the press of a button. Imagine reading your
| book on the couch and then switching into audio mode while doing
| the dishes seamlessly, by connecting bluetooth headphones.
| InsideOutSanta wrote:
| Kindles used to provide this feature, but publishers and/or the
| Authors Guild stopped it, because audio rights and text rights
| are handled differently. In other words, when Amazon sells you
| a text book, it does not have the right to then also do TTS on
| that text and let you listen to it.
|
| There's some contemporary discussion of what happened here:
| https://tidbits.com/2009/03/02/why-the-kindle-2-should-speak...
|
| I think there is still integration with Audible, though. If you
| buy a book on the Kindle and on Audible, the position will
| sync, and you can switch between listening and reading without
| losing your place in the book.
| albert_e wrote:
| Yes the feature is called WhisperSync -- I used it many years
| ago and it was pretty good.
|
| I tried it while on a treadmill so it allowed me to follow
| the book with more focus without sacrificing much else.
| thfuran wrote:
| Isn't whisper sync the current version that relies on
| owning both the ebook and audiobook?
| Brybry wrote:
| I used that TTS feature semi-regularly on a Kindle 2.
|
| It wasn't a good experience but it was nice to be able to
| keep 'reading' a book while I was exercising.
|
| It worked for me for over a decade, until I broke the device.
| I don't know if I never updated the firmware or if the fact I
| used Calibre to convert books bypassed the feature gate.
| hamzakc wrote:
| I am not sure if this still works, but 2-3 years ago I
| listened to a kindle book that I bought through my Echo show
| device. It was pretty good. I listened to it while I was
| cooking. It even allowed you to carry on where you left off.
| But I did notice that a few pages were skipped as I had read
| the book before. I have since packed away my echo show so I
| can't verify if they have removed this feature or not.
| freefaler wrote:
| You can do it easily with non-DRM books (or DRM stripped
| books):
|
| For Android:
|
| - Moon+ reader pro - some paid high-quality TTS voices (like
| Acapella)
|
| For iOS:
|
| - Kybook reader and internal iOS voices (no external TTS voices
| for the walled garden)
|
| This works well enough to listen to a book while you walk and
| when you get back home read on the WC from the place you
| stopped.
|
| Additionally if you buy a tablet or an android ebook reader,
| you install the app there and you can continue on your
| bigger/better device seamlessly.
|
| Whisper-sync for the masses! Ahoy...
| basedrum wrote:
| But you need an android phone, and can't use a kobo or
| similar e-ink reader?
| freefaler wrote:
| For iOS you use Kybook on your iPhone and your iPad. It
| syncs positions between the devices. When you go for a
| walk, open Kybook, start TTS. When back home, open your
| tablet, you'll see the page TTS has stopped reading to.
| figers wrote:
| How does this compare to using Apple's iBook or Kindle
| reader app and then the iPhone's built in text to voice
| (the female British voice is pretty good).
| freefaler wrote:
| On iOS it is the same voice.
| dsign wrote:
| It is a supported feature in the epub 3.0 standard. It's
| possible to distribute an epub with audio, and have the audio
| sync to the HTML elements that form the ebook's text. And there
| is an e-reader that actually supports this feature, I can't
| remember which one now but it should be possible to find it
| with Google.
|
| It's more of an open problem how to create those epubs. I have
| some code that can do it using ElevenLabs audio, but I imagine
| it's way harder to have something similar for a human
| narrator.... who's going to do the sync? Maybe we need a sync
| AI.
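|
| For what it's worth, the EPUB 3 feature described above is, as far
| as I know, called Media Overlays: a small SMIL file pairs each text
| fragment with a clip of the audio file. Below is a minimal Python
| sketch of generating one. The function name, the file names, and the
| (fragment id, start, end) clip tuples are made up for illustration,
| and the timings are assumed to come from some separate alignment
| step; this is not the code mentioned in the comment.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "http://www.w3.org/ns/SMIL"
EPUB_NS = "http://www.idpf.org/2007/ops"


def build_media_overlay(text_href, audio_href, clips):
    """Build a SMIL document pairing text fragments with audio clips.

    clips is a list of (fragment_id, start_seconds, end_seconds) tuples.
    The fragment ids must match element ids in the XHTML file (assumed).
    """
    # Serialize SMIL as the default namespace and EPUB as "epub:".
    ET.register_namespace("", SMIL_NS)
    ET.register_namespace("epub", EPUB_NS)

    smil = ET.Element(f"{{{SMIL_NS}}}smil", {"version": "3.0"})
    body = ET.SubElement(smil, f"{{{SMIL_NS}}}body")
    seq = ET.SubElement(
        body, f"{{{SMIL_NS}}}seq", {f"{{{EPUB_NS}}}textref": text_href}
    )

    for index, (fragment_id, start, end) in enumerate(clips, start=1):
        # One <par> per fragment: the text and audio play in parallel.
        par = ET.SubElement(seq, f"{{{SMIL_NS}}}par", {"id": f"par{index}"})
        ET.SubElement(
            par, f"{{{SMIL_NS}}}text", {"src": f"{text_href}#{fragment_id}"}
        )
        ET.SubElement(
            par,
            f"{{{SMIL_NS}}}audio",
            {
                "src": audio_href,
                "clipBegin": f"{start:.3f}s",
                "clipEnd": f"{end:.3f}s",
            },
        )

    return ET.tostring(smil, encoding="unicode")
```

| Calling build_media_overlay("ch1.xhtml", "audio/ch1.mp3",
| [("s1", 0.0, 2.5)]) returns the SMIL text as a string. A real EPUB
| would also need the overlay declared in the package manifest, which
| this sketch leaves out.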
| llamaimperative wrote:
| Boox Ultra Tab whatever the fuck (their product naming sucks) +
| Readwise Reader = amazing for this
|
| Not quite seamless but it works. It has a cursor that follows
| the words as they're spoken to, which allows you to read and
| hear ("immersive reading") which I find to be extremely helpful
| for maintaining focus.
| monkeydust wrote:
| Literally started doing that this week with Amazon Audible. I
| gave in and started the three month 99c trial and downloaded the
| app.
|
| What surprised me in a good way was my Kindle app was aware of
| this and asked if I wanted to download the audible version of
| the current book I am reading.
|
| Been listening on the way to work and then reading on the way
| back. Enjoying it so far.
| mmahemoff wrote:
| Some Kindle books also have a checkbox to add the audio (for
| a fee) when you buy it. Sometimes I've seen books discounted
| to e.g. £0.99, but adding the audio might be £5.99. The
| upsell seems to be a good hack for adding some revenue when
| there's a deep discount being used to drive interest.
| mrklol wrote:
| How can this support more languages than the model itself?
| Kye wrote:
| The model might have stumbled on the generative AI equivalent
| of IPA.
| msoad wrote:
| To people who are experts in AI TTS:
|
| Why does ElevenLabs have such a lead in this space? It sounds
| better than OpenAI and Google models.
| dbspin wrote:
| Does it? The podcasts created by NotebookLM are completely
| convincing, at least in terms of voice generation.
| swores wrote:
| Can anyone recommend an open source option that would allow
| training on a custom voice (my own, so I'd be able to record as
| many snippets as it needed to train on) to allow me to use it for
| TTS generation without sharing it off my machine?
|
| Edit: I'll wait to see if any recommendations get made here, if
| not I might give this one a go: https://github.com/coqui-ai/TTS
| numpad0 wrote:
| I think you can probably generate TTS audio by classical means,
| and voice2voice that audio through RVC or Beatrice V2. Haven't
| looked into it in a while but Beatrice is apparently super fast
| and CPU only.
| phrotoma wrote:
| https://github.com/DrewThomasson/ebook2audiobook
| esskay wrote:
| If I recall Coqui is very much a dead project, just one to be
| aware of.
| hm64 wrote:
| Coqui is great, but in practice, I found Piper easier to set
| up, train, and deploy as an ONNX file. Big thanks to the Sherpa
| development team for their helpful resources:
| https://k2-fsa.github.io/sherpa/onnx/tts/piper.html and to the
| Rhasspy team for their training guide:
| https://github.com/rhasspy/piper/blob/master/TRAINING.md.
|
| I also found DEMUCS + Whisper + pydub to be a super helpful
| combo for creating quality datasets.
| drewbitt wrote:
| There is a fork here https://github.com/idiap/coqui-ai-TTS
| 'coqui-tts'
|
| Though according to the TTS leaderboard, Fish Speech
| https://github.com/fishaudio/fish-speech and Kokoro are higher.
|
| https://huggingface.co/hexgrad/Kokoro-82M
|
| https://huggingface.co/fishaudio/fish-speech-1.5
| xnx wrote:
| AFAIK Kokoro can't be fine tuned
| jsemrau wrote:
| I wrote this a while ago about xTTSv2 mixed with Nvidia's Nemo.
| Maybe it kicks off your journey.
|
| https://jdsemrau.substack.com/p/teaching-your-agent-to-speak...
| lc64 wrote:
| "was trained on <100 hours of audio"
|
| How the hell was it trained on that little data ?
| Havoc wrote:
| Yeah that surprised me as well - seems low vs what is used on
| text LLMs. To be fair, 100 hours of speaking is a lot of
| speaking though
| edude03 wrote:
| But it covers five? Languages so if all equal it's just 20
| hours per language.
| em-bee wrote:
| in the linked audio sample it says the training data is
| mostly english. also another comment claims that the
| japanese quality is not good, so i'd be suspicious about
| all the other languages.
| bbminner wrote:
| I suppose it means per speaker. And it is based on a simplified
| StyleTTS 2, which from my small dive into the subject seems one
| of the smaller models achieving great quality.
| vinni2 wrote:
| Can it also translate? I have family who would like audiobooks in
| German but most are in English only.
| em-bee wrote:
| german is not listed as a supported language, so no. aside from
| that, i would not want to use computer translation. unlike TTS,
| which keeps getting better, translation quality still leaves a
| lot to be desired.
| vinni2 wrote:
| Ah thanks just noticed that. But which voice to use for
| French?
| october8140 wrote:
| All these AI text to voice models seem to ignore emotion. It
| always sounds like a robot.
| lyu07282 wrote:
| Like with almost everything, it's an active area of research:
|
| https://emosphere-tts.github.io/
|
| We are getting there
| boxed wrote:
| Some of those samples sound like they are emoting in Korean
| while speaking English.
| lyu07282 wrote:
| True, maybe an artifact of the training data, here is
| another one:
|
| https://www.microsoft.com/en-us/research/project/emoctrl-
| tts...
| croes wrote:
| Emotion is the acting part of voice acting. Hard to copy with
| AI
| iagooar wrote:
| I wonder if AI could create a "commentary" script that
| instructs the TTS how to read certain words or chapters. The
| commentary would be like an additional meta-track to help the
| TTS make the best reading.
|
| That should actually be possible to do already with existing
| tech. I haven't seen if you can instruct Kokoro to read in a
| certain way, does anyone know if this is possible?
| arafalov wrote:
| Try this one https://www.hume.ai/ - I found the demos (voice to
| voice) interesting.
| nottorp wrote:
| Well there was some hope with ChatGPT that people will go back to
| being able to process text communication.
|
| Guess it was just a matter of time till someone figured out how
| to use "AI" to resume encouraging illiteracy.
| stavros wrote:
| There was some hope with the rise of equestrianism that people
| will go back to be able to shoe horses.
|
| Guess it was just a matter of time till someone figured out how
| to use "cars" to resume encouraging being unable to do a basic
| farrier job.
| nottorp wrote:
| Except cars were faster than horses, while audio or video
| content is much slower than reading.
| stavros wrote:
| Cars also have legs while audio doesn't, a point which is
| equally irrelevant. If people don't need to read, they
| don't need to read, and no matter how much a random
| Internet commenter wants them to need it, it won't change
| anything.
|
| Skills atrophy for a reason. It's fine to let them. You may
| as well be lamenting the lost art of long division.
| nottorp wrote:
| > Cars also have legs while audio doesn't, a point which
| is equally irrelevant.
|
| That's what an LLM would say :)
| stavros wrote:
| I'm sure an LLM wouldn't say anything as inane as that :P
| hombre_fatal wrote:
| You can multitask with audio content, so you can consume
| content when you can't sit down to read. And you can even
| potentially consume more volume like on a long daily
| commute.
|
| It's not the case that it's worse.
| floppiplopp wrote:
| It sounds okay, but it lacks emotion and is monotone for fiction,
| it's the voice equivalent of the uncanny valley, which is
| probably fine if you don't really care.
| laserbeam wrote:
| And when I don't care... to be honest I'm even OK with the dull
| browser TTS implementation when reading your average substack
| post. Shove the phone in my pocket, go shopping, get the gist
| of the article.
| yoavm wrote:
| Was just looking for a TTS model to run locally for reading out
| loud articles, and never heard about Kokoro before! This looks
| great. I wonder if it can run in the browser somehow - could be a
| nice WebExtension.
| xkriva11 wrote:
| What about the WASM running sherpa-onnx? No installation
| required and can be served locally as well.
|
| https://k2-fsa-web-assembly-tts-sherpa-onnx-en.static.hf.spa...
| jiehong wrote:
| I think most browsers support this already. Even maybe OS wide.
|
| I know it should work for Firefox on an article in reader mode.
|
| Or in MacOS you can select text and have it read out loud.
| yoavm wrote:
| I'm using Firefox and I do not see this option. Probably not
| working on Linux?
| sriacha wrote:
| You might need to install/setup Speech Dispatcher. I was
| just using this implementation with Piper:
| https://github.com/Elleo/pied?tab=readme-ov-file.
|
| However easier way to read articles aloud is with Read
| Aloud extension: https://github.com/ken107/read-aloud.
| albert_e wrote:
| I hope a plugin for Calibre ebook management software comes along
| that makes it easier to convert select titles from your epub
| library to decent audio versions -- and a decent open source app
| for tablets and smartphones that can let us seamlessly consume
| both the ebook and audiobook at will.
| cwmoore wrote:
| The word "kokoro" means "heart" in Japanese, which I learned
| making the (heart shaped and paperback) puzzle books at
| https://www.kakurokokoro.com/
| terhechte wrote:
| It's also the name of the AI in Terminator Zero
| https://villains.fandom.com/wiki/Kokoro
|
| I'm not sure if that is related here.
| tkgally wrote:
| Note that _kokoro_ (心) means "heart" in the sense of
| "spirit," "soul," "mind," "emotions," etc. It _doesn't_ mean
| "heart" in the sense of "internal organ that pumps blood." That
| is _shinzo_ (心臓).
|
| I once heard an American friend with so-so Japanese ability ask
| a Japanese woman who had recently had a heart operation how her
| _kokoro_ was doing, and she looked surprised and taken aback.
|
| Side note: After I started reading HN in 2019, I was struck by
| how many tech products mentioned here have Japanese names. I
| compiled a list for a few years and eventually posted it:
|
| https://news.ycombinator.com/item?id=31310370
| TypoAtLineZero wrote:
| I have a very similar setup locally, which uses Chrome with
| the 'Read Aloud' plugin. I am capturing the audio stream via
| QJackCtl/VLC. Voices, speed, pitch can be adjusted. Efficient and
| quick to set up.
| TheChaplain wrote:
| For accessibility I think this is a great thing, but as
| entertainment less so.
|
| Example is Hobbit and Lord of the Rings, the narrator Rob Inglis,
| makes an amazing voice performance giving depth to environments
| and characters. And of course the songs!
| basedrum wrote:
| I want to be able to seamlessly read on my ebook reader and then
| put in my headphones and go for a walk with the dog and resume on
| audio where I left off. Then when I come back, my ereader is at
| the right place where the audio finished and I can resume reading.
| llamaimperative wrote:
| Readwise Reader does this. A litttttle finicky at tracking read
| location but it's workable
| GaggiX wrote:
| There is also this TTS: https://github.com/rhasspy/piper that is
| pretty good (depending on the language) and extremely fast, would
| be cool to change the script to use Piper instead of Kokoro in
| case you want to use a language that is not supported by Kokoro
| or it's too slow, Piper supports a lot of them.
| mikkom wrote:
| What I really want and hope that someone does is to make an
| audiobook service that converts books to audiobooks but so that
| each character has its own voice.
|
| Some audiobooks have this and I think it really makes the
| experience much more engaging.
|
| (Also maybe some background sound effects but not sure about
| that, some books also have this and it's quite nice too)
| ajsnigrutin wrote:
| Just tried it, and "meh"...
|
| It's one step above "normal" text-to-speech solutions, but not
| much above it. The epub has "Chapter 1" as the title on the page,
| and a lot of whitespace, and then "This was...." (actual text).
| The software somehow managed to ignore all the whitespace and
| read "chapter 1 this was.." as a single sentence, no pauses, no
| nothing.
|
| Blind? A great tool. Will it replace actual audiobooks? Well..
| not yet at least.
| carlosjobim wrote:
| Why isn't the audiobook market strong enough that it would make
| business sense to pay good narrators and actors for each book
| published?
| DidYaWipe wrote:
| It is. But since when is "enough" enough for
| monopolistic/oligopolistic corporations?
| causi wrote:
| I'm not able to try it until later, but regarding the sample
| audio: The voice quality is quite good, but what's going on with
| all the random pauses between words? It's very Captain Kirk.
| cliftonpowell wrote:
| There's another project called ebook2audiobook that produces
| some decent results.
| woolion wrote:
| If you look for a lot of the great classics, audiobooks results
| are inundated with basic TTS "audiobooks" that are impossible to
| filter out. These are impossible to listen to because they lack
| the proper intonation marking the end of sentences, making it
| very tiring to parse. It might be better than tuna can sounding
| recordings, especially if you want to hear them in traffic (a
| common requirement), but that's about it. The alternative, if you
| want real quality recordings, is to stop reading classics and
| instead read the latest Japanime Isekai or murder mystery; these have
| very good options on the market. Anyway, I don't think it needs
| more justification that it covers a good niche usage.
|
| I'm checking what the actual quality is (not a cherry-picked
| example), but:
|
| Started at: 13:20:04
| Total characters: 264,081
| Total words: 41548
| Reading chapter 1 (197,687 characters)...
|
| That was 1h30 ago, and there's no progress notification of any
| kind, so I'm hoping it will finish sometime. It's using 100% of
| all available CPUs so it's quite a bother. (this is "tale of a
| tub" by Swift, it's about half of a typical novel length)
| csantini wrote:
| Yeah, that's a known issue, if the book is all in a single
| chapter you don't get any sense of progress. I may fix that
| next weekend
| woolion wrote:
| It's not in one Chapter, but Chapters are called "Section"
| (and so ignored!). It should be simple to have a dictionary
| of the different units that are used (I would assume "Part"
| would fail too, as would the hilarious "Catpter" of some cat-
| themed kid book, but that's more complicated I guess?).
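|
| A rough Python sketch of that dictionary idea follows. To be clear,
| this is not how audiblez actually detects chapters (I haven't read
| that code); the word list, the regex, and the function name are all
| hypothetical, and it only looks at a title string such as an entry
| from the book's table of contents.

```python
import re

# Hypothetical list of unit words; extend it for the books you have.
HEADING_WORDS = ("chapter", "section", "part", "book")

# A unit word, then whitespace, then an Arabic or Roman numeral.
HEADING_RE = re.compile(
    r"^\s*(?:" + "|".join(HEADING_WORDS) + r")\s+(?:\d+|[ivxlcdm]+)\b",
    re.IGNORECASE,
)


def looks_like_heading(title):
    """Return True if the title starts with a known unit word and a number."""
    return HEADING_RE.match(title) is not None
```

| With this, "Section IX" and "Part 2" are accepted alongside
| "Chapter 1", while "Catpter 3" is still rejected. The Roman numeral
| check is loose, so some ordinary words after "Part" could slip
| through as false positives.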
|
| It did finish and result is basically as good as the provided
| example, so I'd say quite good! I'll plan to process some
| book before going to bed next time!
|
| Chapter 1 read in 6033.30 seconds (33 characters per second)
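|
| As a sanity check on those numbers, and as a hint at what a progress
| estimate could look like, here is a small Python sketch using only
| the figures quoted in this thread. The function names are made up,
| and it assumes the synthesis rate stays constant for the whole book,
| which may not hold in practice.

```python
def chars_per_second(chars_done, elapsed_seconds):
    """Average synthesis rate so far."""
    return chars_done / elapsed_seconds


def estimate_total_seconds(total_chars, rate):
    """Projected time for the whole book at a constant rate."""
    return total_chars / rate


# Figures from the log lines quoted in this thread.
rate = chars_per_second(197_687, 6033.30)
total = estimate_total_seconds(264_081, rate)

print(f"{rate:.0f} characters per second")
print(f"about {total / 3600:.1f} hours for the whole book")
```

| This prints 33 characters per second and about 2.2 hours, so the
| reported rate is consistent with the chapter length and time above.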
| callamdelaney wrote:
| It's insufferable.
| zoidb wrote:
| Not directly related to the software, but interestingly on the
| author's website there is a Schedule a free call with me
| (https://claudio.uk/templates/call.html). I wonder if randos on
| the internet ever do that, and how it works out.
| sam_lowry_ wrote:
| His LLM will answer the call.
| rpastuszak wrote:
| I've been doing it for a few years (+200 calls) and have met a
| ton of wonderful people this way.
|
| https://untested.sonnet.io/notes/say-hi/
|
| https://sonnet.io/posts/hi
| herculity275 wrote:
| Very nice! I fiddled with this idea a few months back but the
| models available at the time were woefully slow on a macbook.
| Will definitely give this a spin, there's a large category of web
| serials and less popular translated novels that never get
| audiobook releases.
| delegate wrote:
| The quality is great (amazing even), but I can't listen to AI
| generated voices for more than 1 minute. I don't know why, I just
| don't like it. I immediately skip the video on youtube if the
| voice is AI generated.
|
| Might be because our brains try to 'feel' the speaker, the
| emotion, the pauses, the invisible smile, etc.
|
| No doubt models will improve and will be harder to identify as AI
| generated, but for now, as with diffusion images, I still notice
| it and react by just moving on..
| rockemsockem wrote:
| That kinda means the quality isn't great or amazing. Good TTS
| should be nearly or fully indistinguishable from a human speaker and
| should include emoting, natural pauses, etc
| xdennis wrote:
| Among other things, what I don't like is the hallucinated
| stress. Take the classic example of:
|
| > I never said she stole my money
|
| It can have 7 different meanings depending on which word you
| stress.
|
| The new AI voices sound very natural at a shallow level, but
| overall pronounce things in odd ways. Not quite wrong, but
| subtly unnatural which introduces some cognitive load.
|
| Old TTS systems with their monotonic voices are less confusing,
| but sound very robotic.
| DidYaWipe wrote:
| erroneous or inappropriate ≠ hallucinated
| CMay wrote:
| Haven't really been following the latest in TTS ML, but I
| expected this to be better or at least as good-bad as the stuff
| you hear on YouTube. Somehow it sounds worse. It really is
| jarring to listen to any of these ML voices and can't really
| stand it. Nope out of every video that uses them and can't tell
| if YouTube never recommends them to me for that reason, or just
| because the recommendations around what I watch are just so
| rarely going to be from some low reputation channel.
|
| Take a moment here for a second though and think about it. Even
| if these voices got to be really good, indistinguishable
| almost... would I want to listen to it even then? If it was an
| NPC's generated voice and generated dialogue in a game to help
| enrich the world building, maybe in that context. On YouTube or
| with newscasters? Probably not. Audio books? Think I would
| still rather have it be a real person, because it's like
| they're reading a story to me and it feels better if it's
| coming from someone. There's also the unknown factor, where if
| it's ML generated it's so sterile that the unknowns are kind of
| gone.
|
| Think about it like this, in the movie industry we had
| practical effects that were charming in a way. You could think
| about the physical things that had to occur to make that
| happen. Movie magic. Now, everything is so CG it's like the
| magic is gone. Even though you know people put serious hard
| work into it, there's a kind of inauthenticity and just lack of
| relevance to the real world that takes something away from it.
|
| It's like a real magician has interesting tricks, while an
| artificial magician is most likely just a liar.
|
| Still, I grant that it makes some cool things possible and
| there is potential if things are done right. Some positive
| mixture of real humans and machine generated stuff so it isn't
| devoid of anything connected to real life effort.
| yjftsjthsd-h wrote:
| > I immediately skip the video on youtube if the voice is AI
| generated.
|
| I mean, I do that because it's correlated with the content
| being garbage. If I'm intentionally using it on content I want
| to consume I expect it to be different, though I haven't gotten
| around to trying it properly yet so I guess we'll see. (OTOH I
| already listen to ebooks via pre-AI TTS, so I'm optimistic)
| _DeadFred_ wrote:
| For new generations/those coming up now this will be the norm
| and not generate the negative reaction it does for us, it will
| just be part of how the world is and has always been, and
| eventually we will be the minority.
|
| Future generations will never know a world where you don't
| watch a 2 hour AI generated orientation video about the wonders
| of working for Generic Corp when you start a new job.
| karmasimida wrote:
| Yeah same.
|
| Doesn't mean the quality is bad. In fact I think Kokoro's
| quality is amazing.
|
| But it is not the right tool for narration, the kind of
| training data they use makes the sound too flat, if that makes
| sense.
| maxglute wrote:
| Sounds really nice at 3x-4x speed, which I can't say for high
| quality TTS options last year. I'm wondering if there's metrics
| out there for audio speed vs clarity.
| monkeydust wrote:
| I have been looking for something credible that can voice over
| written emails (long form ones), documents and powerpoints
| locally ...this might be just the thing!
| plumbees wrote:
| As a mandarin learner, I find that the Chinese one lacks cadence,
| which makes it very hard as a learner to comprehend. It's like a
| machine gun of words without the subtle slight pause between sets
| of words that I would normally lean on.
| jaggs wrote:
| I really like this a lot. The default provides a really good
| audiobook feel, especially the Isabella voice. Any chance you
| could add in an API hook for optional ElevenLabs use?
| therealdrag0 wrote:
| Do folks have a preferred toolkit for extracting text from web
| articles? I'd like to TTS articles friends send me.
| Dowwie wrote:
| 2025 may be the year where we can generate a dramatic audiobook
| with ambient music, sound effects, and theatrical narration using
| neural networks. Many of the parts already exist.
| DidYaWipe wrote:
| Yes, because real narrator/actors are rolling in the dough. Let's
| kill one more profession with trash.
| bongodongobob wrote:
| If it's trash then why would it kill the profession?
| DidYaWipe wrote:
| Because people will opt for readily-available free trash
| instead of paying for high quality. And then that quality
| isn't available to anyone at any price, so everyone loses.
|
| If you haven't observed this in many other markets, you live
| an unusual (or unobservant) life.
| geor9e wrote:
| This one sounds a bit robotic and takes ~4 hours per book on my
| M1 laptop, so I'll keep looking. For now, I'm happy with my current
| method - EPUBReader browser extension, which opens .epub as an
| HTML page in Microsoft Edge browser, which has a "Read Aloud"
| button set to the Stephan natural voice at 1.6 speed. Best
| sounding voice I've ever heard, speaks fast, clear, crisp, with
| natural inflections to the sentences, and if I want to jump to
| somewhere I just left click the text at that spot. And it's
| instant - no conversions. Downside is I have to stay in bluetooth
| range of my laptop, so I'm still looking for a good phone based
| method. Google Play Books works okay, but gets buggy at 1.6
| speed.
| nickpsecurity wrote:
| The page says it was trained on under 100 hours of audio. Then,
| the link says "we employ large pre-trained SLMs, such as WavLM,
| as discriminators with our novel differentiable duration modeling
| for end-to-end training." I don't have time to read the paper to
| see what that means.
|
| Depending on what that means, it might be more accurate to say it
| was trained on 100 hours of audio and with the aid of another,
| pre-trained model. The reader who thinks "only 100 hours?!" will
| know to look at the pretraining requirements of the other model,
| too.
| leecarraher wrote:
| in case you are wondering how audiblez becomes an executable in
| the PATH from a pip install audiblez per the documentation
|
| ... audiblez book.epub -l en-gb -v af_sky.
|
| it does not, instead it installs a python package with a cli
| interface, to run you then have to prepend python and load the
| module like this:
|
| python3 -m audiblez book.epub -l en-gb -v af_sky.
| skwee357 wrote:
| Soon, AI will flood the market with mediocre everything: books,
| audio books, art, movies, websites.
|
| The saddest thing is that people will still continue to
| participate in consuming these AI produced "goods".
| abroadwin wrote:
| If there's one thing our capitalist society has taught me it's
| that people are always willing to endure a crappier product.
| I'm not sure we've found the bottom yet...
| vanderZwan wrote:
| I think the saddest thing is that it's highly likely that real
| people will start to produce aesthetics that look/sound/etc
| like AI slop
| skwee357 wrote:
| True, and I think with the recent news around Spotify using
| AI to fill their playlists, we are already getting there.
| Just need to condition the public that there is no better
| flypunk wrote:
| I really liked it and added a variable speed argument:
| https://github.com/santinic/audiblez/pull/4
| sysworld wrote:
| Finally! Been trying all the TTS models popping up on here for
| ages, and they've all been pretty average, or not work on Mac, or
| only work on really short text, or be reeealy slow.
|
| But this one works pretty quick, is easy to install, has some
| passable voices. Finally I can start listening to those books
| that have no audio version.
|
| I'm a slow reader, so don't read many books. If a book doesn't
| have an audiobook version, chances are I won't read it.
|
| PS, I have used elevenlabs in the past for some small TTS
| projects, but for a full book, it's price prohibitive for
| personal use. (elevenlabs has some amazing voices)
|
| Thank you to the dev/s who worked on this!
| crorella wrote:
| Nice! It would be great to have per character voices
| boznz wrote:
| this would be a game changer if done right. All good voice
| actors can carry a dozen different 'voices' for characters
___________________________________________________________________
(page generated 2025-01-15 23:01 UTC)