[HN Gopher] ESpeak-ng: speech synthesizer with more than one hun...
___________________________________________________________________
ESpeak-ng: speech synthesizer with more than one hundred languages
and accents
Author : nateb2022
Score : 226 points
Date : 2024-05-02 01:06 UTC (21 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| mewse-hn wrote:
| No example output? Here's a youtube video where he plays with
| this software
|
| https://www.youtube.com/watch?v=493xbPIQBSU
| scoot wrote:
| With timestamp, as there's hardly any TTS speech in that video:
| https://youtu.be/493xbPIQBSU?t=605
| bufferoverflow wrote:
| Oh man, it sounds awful, like 15-year-old tech.
|
| I've been spoiled by modern AI generated voices that sound
| indistinguishable from humans to me.
| tredre3 wrote:
| It is 30-year-old tech.
| vlovich123 wrote:
| Anyone know why the default voice is set to be so bad?
| mrob wrote:
| Why specifically do you consider it to be bad? Espeak-ng is
| primarily an accessibility tool, used as the voice synthesizer
| for screen readers. Clarity at high speed is more important
| than realism.
| vlovich123 wrote:
| That can't be a serious question. Go look at the
| accessibility voice for Windows or Mac and then compare the
| way it sounds. Both of those are more human-like with
| better pronunciation.
| rhdunn wrote:
| The default voice sounds robotic for several reasons. It has a
| low sample rate to conserve space. It is built using a mix of
| techniques that make it difficult to reconstruct the original
| waveform exactly. And it uses things like artificial noise for
| the plosives, etc.
|
| The default voice is optimized for space and speed instead of
| quality of the generated audio.
| codedokode wrote:
| But today disk space is not an issue.
| vlovich123 wrote:
| I'll suggest that's the wrong optimization to make for an
| accessibility tool. Modern CPUs are more than capable of
| handling its speed requirements by several orders of
| magnitude (they can decode H.265 in real time, for god's sake,
| without HW acceleration). The same goes for size.
|
| It's simply the wrong tuning tradeoff.
| follower wrote:
| As I've learned over time (and other people in these comments
| have clarified) it turns out that evaluating "quality" of Text
| To Speech is somewhat dependent on the domain in which the
| audio output is being used (obviously with overlaps), broadly:
|
| * accessibility
|
| * non-accessibility (e.g. voice interfaces; narration; voice
| over)
|
| The qualities of the generated speech which are favoured may
| differ significantly between the two domains, e.g. AIUI non-
| accessibility focused TTS often prioritises "realism" &
| "naturalness" while more accessibility focussed TTS often
| prioritizes clarity at high words-per-minute speech rates
| (which often sounds distinctly _non_-"realistic").
|
| And, AIUI espeak-ng has historically been more focused on the
| accessibility domain.
| vlovich123 wrote:
| I don't have any disabilities so I don't know if espeak-ng is
| better on the pure accessibility axis. But given that macOS
| tends to be received quite well by the accessibility crowd &
| accessibility is definitely a focus from what I observed
| internally, and given that macOS has much higher realism &
| naturalness out of the box, I'm going to posit that it's not
| the linear tradeoff argument you've made & that espeak-ng's
| defaults aren't tuned well out of the box.
| droopyEyelids wrote:
| Another project falls victim to the tragic "ng" relative naming,
| leaving it without options for future generations
| nialv7 wrote:
| They can name the next iteration ESpeak-DS9 ;)
| hibikir wrote:
| I actually have seen that done at a former employer, a very
| large agribusiness. I bet there are more examples of that
| very specific, not so intended versioning system out there.
| retrac wrote:
| Classic speech synthesis is interesting, in that relatively
| simple approaches produce useful results. Formant synthesis
| takes relatively simple sounds, and modifies them according to
| the various distinctions the human speech tract can make. The
| basic vowel quality can be modelled as two sine waves that change
| over time. (Nothing more complex than what's needed to generate
| touch tone dialing tones, basically.) Add a few types of buzzing
| or clicking noises before or after that for consonants, and
| you're halfway there. The technique predates computers; it's
| basically the same technique used by the original voder [1] just
| under computer control.
|
| Join that with algorithms which can translate English into
| phonetic tokens with relatively high accuracy, and you have
| speech synthesis. Make the dictionary big enough, add enough
| finesse, and a few hundred rules about transitioning from phoneme
| to phoneme, and it produces relatively understandable speech.
|
| Part of me feels that we are losing something, moving away from
| these classic approaches to AI. It used to be that, to teach a
| machine how to speak, or translate, the designer of the system
| had to understand how language worked. Sometimes these models
| percolated back into broader thinking about language. Formant
| synthesis ended up being an inspiration to some ideas for how the
| brain recognizes phonemes. (Or maybe that worked in both
| directions.) It was thought that further advances would come from
| better theories about language, better abstractions. Deep
| learning has produced far better systems than the classic
| approach, but they also offer little in terms of understanding or
| simplifying.
|
| [1] https://en.wikipedia.org/wiki/Voder
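|
| To make the "two sine waves that change over time" idea concrete,
| here is a tiny illustrative Python sketch (my own toy, nothing to
| do with eSpeak's actual code): two formants gliding from rough
| /a/ values to rough /i/ values, written out as a WAV file.
|
|     # Two-formant glide: /a/ (~730, 1090 Hz) -> /i/ (~270, 2290 Hz).
|     import math, wave, struct
|
|     RATE = 16000
|     N = int(RATE * 1.0)                   # one second of audio
|     start, end = (730.0, 1090.0), (270.0, 2290.0)
|     phase = [0.0, 0.0]
|     samples = []
|     for n in range(N):
|         t = n / N                         # 0..1 across the glide
|         s = 0.0
|         for i in range(2):
|             f = start[i] + (end[i] - start[i]) * t
|             phase[i] += 2 * math.pi * f / RATE
|             s += (0.6, 0.4)[i] * math.sin(phase[i])
|         samples.append(int(32767 * 0.5 * s))
|
|     with wave.open("vowel_glide.wav", "wb") as w:
|         w.setnchannels(1)
|         w.setsampwidth(2)
|         w.setframerate(RATE)
|         w.writeframes(struct.pack("<%dh" % N, *samples))
|
| It sounds nothing like a finished synthesizer, but adding a
| pitch-rate buzz source and noise bursts for consonants is exactly
| the "add a few types of buzzing or clicking" step described above.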
| senkora wrote:
| It is possible to synthesize an English voice with a 1.5MB
| model using http://cmuflite.org/ or some of the Apple VoiceOver
| voices, which is just crazy to me. Most of the model is diphone
| samples for pairs of phonemes.
|
| I don't know of any way to go smaller than that with software.
| I tried, but it seems like a fundamental limit for English.
| userbinator wrote:
| _I don't know of any way to go smaller than that with
| software. I tried, but it seems like a fundamental limit for
| English._
|
| If you include "robotic" speech, then there's
| https://en.wikipedia.org/wiki/Software_Automatic_Mouth in a
| few tens of KB, and the demoscene has done similar in around
| 1/10th that. All formant synths, of course, not the sample-
| based ones that you're referring to.
| vidarh wrote:
| SAM is awful but at the same time tantalisingly _close_
| (one of the demos of it apparently draws on a great reverse
| engineered version by Sebastian Macke and refactoring
| efforts by a couple of others including me - I spent too
| many hours listening to SAM output...) - especially when
| compared to the still awful Festival/Flite models - that
| I keep wanting to see what a better generic formant synth
| used as a constraint on an ML model would produce.
|
| That is, instead of allowing a generic machine learning
| model to output unconstrained audio, train it on the basis
| of letting it produce low bitrate input/control values for
| a formant synth instead, and see just how small you can
| push the model.
| tapper wrote:
| As a blind dude who listens to a synth all the time, I
| don't care about the size of the thing, I just want good
| sound quality.
| vidarh wrote:
| That's fair, but those two _are_ in constant tension. We
| can do very good speech if we ship gigs of samples.
| Better quality smaller models make it easier to get
| better quality speech in more places.
| miki123211 wrote:
| You probably could, using what I call the "eastern-european
| method." Record one wave period of each phoneme, perhaps two
| for plosives, downsample to 8 or 11 kHz 8 bit, and repeat
| that recording on-the-fly enough times to make the right
| sound. If you're thinking "mod file", you're on the right
| track.
|
| For phonetically simple languages, such a system can easily
| fit on a microcontroller with kilobytes of RAM and a slow
| CPU. English might require a little bit more on the text-to-
| phoneme stage, but you can definitely go far below 1MB.
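|
| A rough sketch of that idea in Python (purely illustrative; the
| names and numbers are made up, and a real system would store
| recorded cycles rather than synthesizing them):
|
|     import math
|
|     RATE = 8000
|
|     def fake_stored_cycle(f1, f2, pitch=120):
|         # Stand-in for one recorded pitch period of a phoneme.
|         n = RATE // pitch
|         return [0.6 * math.sin(2 * math.pi * f1 * i / RATE) +
|                 0.4 * math.sin(2 * math.pi * f2 * i / RATE)
|                 for i in range(n)]
|
|     def hold_phoneme(cycle, ms):
|         # The whole trick: repeat the single stored cycle until
|         # the phoneme's duration is filled.
|         out = []
|         while len(out) < RATE * ms // 1000:
|             out.extend(cycle)
|         return out
|
|     # "a" then "i", 150 ms each; the entire "voice" is a handful
|     # of tiny sample tables plus text-to-phoneme rules.
|     audio = (hold_phoneme(fake_stored_cycle(730, 1090), 150) +
|              hold_phoneme(fake_stored_cycle(270, 2290), 150))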
| rhdunn wrote:
| That's effectively what eSpeak-ng is doing.
|
| For the CMU flite voices they represent the data as LPC
| (linear predictive coding) data with residual remainder
| (residual excited LPC). The HTS models use simple neural
| networks to predict the waveforms -- IIRC, these are
| similar to RNNs.
|
| The MBROLA models use OLA (overlap-add) to overlap small
| waveform samples. They also use diphone samples taken from
| midpoint to midpoint in order to create better phoneme
| transitions.
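|
| The overlap-add part can be sketched in a few lines of Python
| (my simplification, not MBROLA's actual code): neighbouring
| diphone chunks are cross-faded over a short region so the join
| lands mid-phoneme, where the waveform is most stable.
|
|     def overlap_add(chunks, overlap):
|         # chunks: lists of float samples, each longer than `overlap`
|         out = list(chunks[0])
|         for chunk in chunks[1:]:
|             for i in range(overlap):
|                 w = i / overlap              # linear fade 0 -> 1
|                 out[-overlap + i] = (out[-overlap + i] * (1 - w)
|                                      + chunk[i] * w)
|             out.extend(chunk[overlap:])
|         return out
|
|     # e.g. overlap_add([diphone_k_a, diphone_a_t], overlap=64)
|     # joins "k-a" and "a-t" inside the shared /a/ instead of
|     # butting two phonemes together at their edges.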
| Animats wrote:
| Most of the links there are dead.
|
| That's a descendant of Festival Singer, which was well
| respected in its day.
|
| What's a current practical text-to-speech system that's open
| source, local, and not huge?
| anthk wrote:
| Flite.
| Animats wrote:
| That's another spinoff of Festival Singer. What's good
| from the LLM era?
| follower wrote:
| Depending on your definition of "huge", you might find
| Piper TTS fits your requirements:
| https://github.com/rhasspy/piper
|
| The size of the associated voice files varies but there are
| options that are under 100MB:
| https://huggingface.co/rhasspy/piper-voices/tree/main/en
| anthk wrote:
| I always wanted something like Flite but for Spanish.
| aidenn0 wrote:
| The Speak & Spell did it with a 32kB [edit: previously
| incorrectly wrote 16kB] ROM and a TMS0280.
| bhaney wrote:
| > It used to be that, to teach a machine how to [X], the
| designer of the system had to understand how [X] worked.
|
| It does feel like we're rapidly losing this relationship in
| general. I think it's going to be a good thing overall for
| productivity and the advancement of mankind, but it definitely
| takes a lot of the humanity out of our collective
| accomplishments. I feel warm and fuzzy when a person goes on a
| quest to deeply understand a subject and then shares the fruits
| of their efforts with everyone else, but I don't feel like that
| when someone points at a subject and says "hey computer, become
| good at that" with similar end results.
| userbinator wrote:
| _I think it 's going to be a good thing overall for
| productivity and the advancement of mankind, but it
| definitely takes a lot of the humanity out of our collective
| accomplishments._
|
| I think AI will only cause us to become stuck in another
| local maximum, since not understanding how something works
| can only lead to imitation at best, and not inspiration.
| vidarh wrote:
| I'm not convinced, because I think there will be a drive to
| distill down models and constrain them, and try to train
| models with access to "premade blocks" of functionality we
| know should help.
|
| E.g. we know human voices can be produced well with formant
| synthesis because we know how the human vocal tract is
| shaped. So you can "give" a model a formant synth, and try
| to train smaller models outputting to it.
|
| I think there's going to be a whole lot of research
| possibilities in placing constraints and training smaller
| models, and even training ensembles of models constrained
| in how they're interacting and their relative sizes to try
| to "force" extraction of functionality.
|
| E.g. we have reasonable estimates of the lowest bitrate raw
| audio that produces passable voice. Now consider training
| two models A, B, where A => B => audio, and the "channel"
| between A and B is constrained to a small fraction of the
| bitrate that'd let A do all the work, and where the size of
| B is set at a level you've first struggled to get passable
| TTS output from.
|
| Try to squeeze the bitrate and/or the size of B down and
| see if you can get something to emerge where analysing what
| happens in B is doable.
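|
| For anyone who finds that hard to picture, a very rough
| PyTorch-flavoured sketch of the shape of the setup (all names,
| sizes and layer choices here are hypothetical, just to show
| where the narrow channel sits):
|
|     import torch
|     import torch.nn as nn
|
|     # A: linguistic features -> a narrow, bounded control stream.
|     model_a = nn.Sequential(nn.Linear(64, 32), nn.Tanh(),
|                             nn.Linear(32, 4), nn.Tanh())
|     # B: that stream -> audio frames; B's size is what gets squeezed.
|     model_b = nn.Sequential(nn.Linear(4, 64), nn.Tanh(),
|                             nn.Linear(64, 128), nn.Tanh())
|
|     text_feats = torch.randn(1, 50, 64)  # 50 frames of dummy input
|     codes = model_a(text_feats)          # the "channel": 4 values/frame
|     frames = model_b(codes)              # trained against real audio
|
| Quantising `codes` (with a straight-through estimator) would pin
| the channel to an explicit bitrate, which is the constraint being
| described above.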
| nuc1e0n wrote:
| I've said this before on HN, but neural networks are not a
| tabula rasa. Their structure is created by people. With
| better domain understanding we can make better structures
| for AI models. It's not an either-or situation.
| T_MacThrowFace wrote:
| isn't the ultimate point of AI though, that instead of the
| traditional situation where the machines are the tools of
| humans, we become the (optional) tools for the AI? The AGI
| will do the understanding for us, and we'll profit by
| getting what we want from the black box, like a dog
| receiving industrially manufactured treats it cannot
| comprehend from its owner.
| vidarh wrote:
| I feel the same. Relatively simple formant models get "close
| enough" that it feels like you should be able to do well with
| very little.
|
| One of the things I've long wanted to do but not found time
| for, is to take a few different variants of formant synths, and
| try to train a simple TTS model to control one instead of
| producing "raw" output. It's amazing what TTS models can do
| with "raw" output, but we know our brains aren't producing raw,
| unconstrained digital audio, and so I think there's a lot of
| potential for understanding more and simplifying if you train
| models constrained to produce outputs we know ought to be
| sufficient, and push their size as far down as we can.
| vidarh wrote:
| Too late to edit, but to anyone who needs "convincing" of
| the flexibility of a formant synthesizer, you should 1) play
| with Pink Trombone[1], a Javascript formant synthesizer with
| a UI that lets you graphically manipulate a vocal tract, and
| 2) have a look at this programmable version of it[2]
|
| [1] https://dood.al/pinktrombone/
|
| [2] https://github.com/zakaton/Pink-Trombone
| drcongo wrote:
| Thanks for those links, that's superb. Sounds surprisingly
| like the "Oh long Johnson" cat -
| https://www.youtube.com/watch?v=kkwiQmGWK4c
| codedokode wrote:
| What a fun toy! But I mostly get sounds of a drunk man who
| tries to say something but cannot.
| vidarh wrote:
| Check out the videos in the second link - it gives some
| better examples of what you can do with it.
| tapper wrote:
| Funny that the Pink Trombone website does not work with
| screen readers.
| vidarh wrote:
| It's entirely visual in that it is a graphic of a vocal
| tract where you can move the tongue and shape the mouth,
| so in that form trying to make it do anything with screen
| readers would mean basically doing what the second link
| does: creating an API for it and hooking the backend up to
| an entirely different UI.
|
| The videos on the other link do show that the guy who
| has done that API has done stuff to the UI to allow
| interfacing to it in non-graphical ways, but sadly I
| don't see any online demos of those alternative user
| interfaces anywhere, which is a great shame. Sadly it
| doesn't look like the videos are very accessible either,
| as they're mostly demos with no commentary about what is
| going on on the screen, without which they're mostly just
| random sounds.
| follower wrote:
| Yeah, Pink Trombone is _awesome_ [0]. :D
|
| Thanks for the link to the programmable version--I don't
| think I'd been aware of that previously...
|
| [0] And, from personal experience, also rather difficult to
| "safely" search for if you don't quite remember its name
| exactly. :D
| miki123211 wrote:
| Do you have any good resources on this?
|
| I took a few stabs at understanding Klatt, but I feel like I
| had far too little DSP, math and linguistic intuitions back
| then to fully comprehend it; perhaps I should take another
| stab now.
| maksimur wrote:
| > The technique predates computers; it's basically the same
| technique used by the original voder [1] just under computer
| control.
|
| Something similar from the 800s is the Euphonia talking machine
| ( https://en.m.wikipedia.org/wiki/Euphonia_(device) ).
| nkozyra wrote:
| * 1800s
|
| Clicked that thinking someone had made a talking machine in
| the Middle Ages :)
| maksimur wrote:
| Oops :)
| beeboobaa3 wrote:
| > Part of me feels that we are losing something, moving away
| from these classic approaches to AI
|
| Absolutely. Seems a large number of software developers have
| moved on from trying to understand how things work to solve the
| problem, and they are now instead just essentially throwing
| shit at a magical wall until something sticks long enough.
| WalterBright wrote:
| Now try it with a trumpet! Herb Alpert's "The Trolley Song" is
| an underrated masterpiece of control over the sound a trumpet
| makes. No synthesized trumpet sound has ever done anything like
| this.
|
| https://www.youtube.com/watch?v=mqr9E9Q-P5o
| pcdoodle wrote:
| Haven't heard that one, it's great, thank you.
| barfbagginus wrote:
| Quality of life improvements are much much more important than
| understandable models of speech, so we should live with,
| appreciate, and work to interpret and improve the current
| generation of complex neural TTS models.
|
| I depend on TTS to overcome dyslexia, but I also struggle with
| auditory processing disorder that causes me to misunderstand
| words. As a result, classical TTS does not help me read faster
| or more accurately than struggling through my dyslexia. It
| causes me to rapidly fatigue, zone out, and rewind often, in a
| way that is more severe than when I sight read.
|
| On the other hand, modern neural TTS is a huge enabler. My
| error rate, rewind rate, and fatigue are much better thanks to
| the natural tone, articulation, and prosody. I'm able to read
| for hours this way, and my productivity is higher than sight
| reading alone. This unlocks long and complex readings that I
| would never complete by sight reading alone, like papers in
| history, philosophy, and law. Previously I was limited to
| reading math, computer science, and engineering work, where I
| heavily depended on diagrams and math formulas to help me gloss
| over dense text readings.
|
| The old tech had no impact on my life, given my combination of
| reading and listening difficulty, since it was not
| comparatively better than sight reading. But my life changed
| about 6 years ago with neural TTS. The improvement has been
| massive, and has helped me work with many non-technical
| readings that I would previously give up on.
|
| The main issue I see now is not that neural models are hard to
| understand. For better or worse, we're able to improve the
| models just by throwing capital and ML PhDs at the problem. The
| problem I see is that the resulting technology is proprietary
| and not freely available for the people whose lives it would
| change.
|
| We should work towards a future where people can depend on
| useful and free TTS that improves their quality of life. I
| don't think simple synthetic models will be enough. We must
| work to seize control of models that can provide the same
| quality of life improvements that new proprietary models can
| provide. And we must make these models free for everyone to
| use!
| vidarh wrote:
| It's not at all a given that these two things are in conflict.
| The best path towards free TTS might well turn out to be to
| identify ways of making smaller models that are cheaper to
| train and improve on if/when we can split out things we know
| to do (be it with separate neural models or other methods)
| and train models to "fill the gaps" instead of the entire
| end-to-end process.
|
| There are also plenty of places where the current modern
| "neural" models are too compute intensive / costly to run,
| and so picking just the current big models isn't an option
| for all uses.
| SoftTalker wrote:
| Can I get my map navigation prompts in the voice of Yoda please?
|
| "At the roundabout, the second exit take."
|
| "At your destination, arrived have you."
| 082349872349872 wrote:
| Quenya is an option, so (assuming you speak it) you _could_ get
| your map navigation prompts in the voice of Galadriel...
|
| "A star shall shine on the hour of our taking the second exit."
|
| "You have reached your Destination, fair as the Sea and the Sun
| and the Snow upon the Mountain!"
| albertzeyer wrote:
| You mean the voice, or the grammar? The grammar part is outside
| of the scope of a synthesizer. That's completely up to the
| user.
|
| Or do you want a model which translates normal English into
| Yoda-English (at the text level) and then attaches a speech
| synthesizer to that?
|
| Or I guess an end-to-end speech synthesizer, a big neural
| network which operates on the whole sentence at once, could
| also internally learn to do that grammar transformation.
| scoot wrote:
| Yes (if you use a TomTom): https://tomtom.gps-data-
| team.com/celebrity_voices/Yoda-Star_...
| zambonidriver wrote:
| Is it an LLM? What base model does it use?
| celestache wrote:
| eSpeak uses what is known as formant synthesis, and no LLM as
| far as I know.
| nmstoker wrote:
| Definitely no LLM! Espeak dates from at least 10 years before
| LLMs appeared and was based on the approach used on Acorn
| computers in the 80s and 90s.
| webprofusion wrote:
| "More than hundred"
| Aachen wrote:
| Fwiw, in many languages that's correct. Coming from Dutch's
| "meer dan honderd", being taught to say _one_ hundred is like
| teaching an English person to say "more than one ten" for
| values >10
| fisian wrote:
| I used it on Android and it seems to be one of very few apps that
| can replace the default Google services text-to-speech engine.
|
| However, I wasn't satisfied with the speech quality so now I'm
| using RHVoice. RHVoice seems to produce more natural/human-
| sounding output to me.
| wakeupcall wrote:
| Depending on context, I cycle between espeak-ng with mbrola-en
| or RHVoice, but even plain espeak shouldn't be discarded.
|
| RHVoice sounds slightly more natural in some cases, but one
| advantage of espeak-ng is that the text parsing logic is
| cleaner, by default.
|
| For example, RHVoice likes to spell out a lot of regular text
| formatting. One example would be spelling " -- " as dash-dash
| instead of pausing between sentences. So while text sounds a
| little more natural, it's actually harder to understand in
| context unless the text is clean to begin with.
|
| I don't know if speech-dispatcher does this for you, but I'm
| using a shell script and some regex rules to make the text
| cleaner for TTS, which I don't need when using espeak-ng.
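|
| For what it's worth, the cleaning step can be as simple as a
| short filter like this (a rough Python approximation of the
| idea; the exact rules depend on your source text), piped
| between the text source and the synthesizer:
|
|     import re, sys
|
|     def clean_for_tts(text):
|         # Strip decoration that gets spelled out ("dash dash" etc.)
|         text = re.sub(r"\s+--\s+", ". ", text)    # " -- " -> pause
|         text = re.sub(r"[*_#>|`]+", " ", text)    # markdown/quoting
|         text = re.sub(r"-{3,}|={3,}", " ", text)  # horizontal rules
|         return re.sub(r"[ \t]{2,}", " ", text).strip()
|
|     if __name__ == "__main__":
|         sys.stdout.write(clean_for_tts(sys.stdin.read()))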
|
| Another tradeoff: espeak-ng with the mbrola voices doesn't offer
| all the inflection customization options you have with the "robotic-
| sounding" voices. When accelerating speech, these options make
| a qualitative difference in my experience.
|
| I can see why each of these can have its place.
| tapper wrote:
| On Android I use ETI-Eloquence, but you can't get a legal one.
| Google it and look on the Blind Help website. There is an APK.
| dheera wrote:
| Why is the quality of open source TTS so horribly, horribly,
| horribly behind the commercial neural ones? This is nowhere near
| the quality of Google, Microsoft, or Amazon TTS, yet for image
| generation and LLMs almost everything outside of OpenAI seems to
| be open-sourced.
| str3wer wrote:
| almost like there's a few billion dollars difference in their
| budgets
| dheera wrote:
| Sure but it's been 15 years and the quality of the espeak
| command is equally horrible. I would have expected some
| changes ... especially considering even the free TTS inside
| Google Chrome is actually pretty decent; that could just be
| extracted and packaged up as a new version of espeak.
| anthk wrote:
| Festival is nicer; Flite would run on a toaster, and MBROLA
| can work with eSpeak, but its data is restricted for
| commercial usage.
| albertzeyer wrote:
| The quality also depends on the type of model. I'm not really
| sure what ESpeak-ng actually uses. The classical TTS approaches
| often use some statistical model (e.g. HMM) + some vocoder. You
| can get to intelligible speech pretty easily but the quality is
| bad (w.r.t. how natural it sounds).
|
| There are better open source TTS models. E.g. check
| https://github.com/neonbjb/tortoise-tts or
| https://github.com/NVIDIA/tacotron2. Or here for more:
| https://www.reddit.com/r/MachineLearning/comments/12kjof5/d_...
| ClawsOnPaws wrote:
| I'm glad that it doesn't. A lot of us use these voices as an
| accessibility tool in our screen readers. They need to perform
| well and be understandable at a very high rate, and they need
| to be very responsive. ESpeak is one of the most responsive
| speech synths out there, so for a screen reader the delay from
| key press to speech output is extremely low. Adding AI would just make
| this a lot slower and unpredictable, and unusable for daily
| work, at least right now. This is anecdotal, but part of what
| makes a synth work well at high speech rates is predictability.
| I know how a speech synth is going to say something exactly.
| This lets me put more focus on the thing I'm doing rather than
| trying to decipher what the synth is saying. Neural TTS always
| has differences in how they say a thing, and at times, those
| differences can be large enough to trip me up. Then I'm
| focusing on the speech again and not what I'm doing. But ESpeak
| is very predictable, so I can let my brain do the pattern
| matching and focus actively on something else.
| exceptione wrote:
| Deepspeech from Mozilla is open source. Do you know this one?
|
| From the samples I listened to, it sounds great to me.
| dheera wrote:
| I have to try this, thanks! Unfortunately I couldn't find
| samples on their git repo and it looks like it isn't apt-
| gettable. Maybe that's part of the reason.
|
| They should make it so that I can do
|
|     sudo apt-get install deepspeech
|     sudo ln -s /usr/bin/deepspeech /usr/bin/espeak
|
| Anything more than that is an impediment to mass adoption.
|
| Seems they need some new product management ...
| follower wrote:
| As I understand it DeepSpeech is no longer actively
| maintained by Mozilla:
| https://github.com/mozilla/DeepSpeech/issues/3693
|
| For Text To Speech, I've found Piper TTS useful (for
| situations where "quality"=="realistic"/"natual"):
| https://github.com/rhasspy/piper
|
| For Speech to Text (which AIUI DeepSpeech provided), I've had
| some success with Vosk: https://github.com/alphacep/vosk-api
| miki123211 wrote:
| Blind person here, ESpeak-ng is literally what I use on all of my
| devices for most of my day, every day.
|
| I switched to it in early childhood, at a time when human-
| sounding synthesizers were notoriously slow and noticeably
| unresponsive, and just haven't found anything better ever since.
| I've used Vocalizer for a while, which is what iOS and Mac OS
| ship with, but then third-party synthesizer support was added and
| I switched right back.
| maxglute wrote:
| How fast do you set speech playback speed/rate?
|
| I tried a bunch of speech synthesis, with speed and
| intelligibility in mind.
|
| ESpeak-ng is barely intelligible past ~500 words per minute, and
| just generally unpleasant to listen to. Maybe my brain just
| can't acclimatize to it.
|
| Microsoft Zira Mobile (unlocked on a Win11 desktop via a
| registry edit) sounds much more natural and intelligible at max
| Windows SAPI speech rate, which I estimate is around ~600 wpm,
| equivalent to most conversational/casual spoken word at 2x
| speed. I wish Windows could increase playback even further; my
| brain can process 900-1200 words per minute or 3x-4x normal playback
| speed.
|
| On Android, Google's "United States - 1" sounds a little
| awkward but is also intelligible at 3x-4x speed.
| trwm wrote:
| Similar to OP: if information is low density, like a legal
| contract, I can do 1200wpm after a few hours of getting used
| to it. Daily normal is 600wpm; if the text is heavy going
| enough I have to drop it down to 100wpm and put it on loop.
|
| As usual, the limit isn't how fast human I/O is but how fast
| human processing works.
| maxglute wrote:
| Yeah, 600wpm is passive listening. 900-1200wpm is listening to
| a lecture on YouTube at 3-4x speed. Skim listening for
| content I'm familiar with. Active listening for things I
| just want to speed through. It's context dependent; I find
| I can ramp up from 600 to 1200 and get into a flow state of
| listening.
|
| >text is heavy going enough I have to drop it down to 100
| wpm
|
| What is heavy text for you? Like very dense technical text?
|
| >put it on loop
|
| I find this very helpful as well, but for content I
| consume that's not very technical, I listen at ~600wpm and
| loop it multiple times. It's like listening to a song to
| death, engraining it on a vocal / storytelling level.
|
| E: semi-related reply to a deleted comment about
| processing speed that I can no longer respond to. Posting
| here because it's related.
|
| Some speech synthesizers are much more intelligible at
| higher speeds, which aids processing at higher wpm. What
| I've been trying to find is the most intelligible speech
| synthesis voice for the upper limit of concentrated/burst
| listening, which for me is around 1200wpm / 4x speed; many
| have weird audio artefacts past 3x. There are synthesis
| engines whose high-speed intelligibility improves if the
| text is processed with SSML markup to add longer pauses
| after punctuation (see the sketch at the end of this
| comment). Just little tweaks that make processing easier.
| It doesn't apply to all content or contexts, but I think
| some consumption is suitable for it, and it's something
| that can be trained like many mental tasks; dedicated
| speech synthesis, like fancy sports equipment, improves
| top-end performance.
|
| IMO it's also something neural models can be tuned for.
| There are some podcasters/audiobook narrators who are
| "easy" listening at 3x speed vs others because they just
| have better enunciation/cadence at the same word density.
| Most voices out there, from traditional SAPI models to
| neural, are... very mid as fast "narrators". I think we
| need to bundle speech synthesis with content awareness -
| AI to filter content, then synthesize speech that
| emphasizes/slows down on significant information and
| breezes past filler - just present information more
| efficiently for consumption.
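|
| The SSML tweak mentioned above is roughly this (illustrative
| only; which tags are honoured, and how, varies by engine):
|
|     import html, re
|
|     def add_ssml_breaks(text, pause_ms=250, rate="300%"):
|         escaped = html.escape(text)
|         # Insert an explicit pause after sentence punctuation so
|         # it survives even at very high speaking rates.
|         with_breaks = re.sub(r"([.!?;])\s+",
|                              r'\1 <break time="%dms"/> ' % pause_ms,
|                              escaped)
|         return ('<speak><prosody rate="%s">%s</prosody></speak>'
|                 % (rate, with_breaks))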
| agumonkey wrote:
| Thanks for the heads-up. May I ask if you know of websites /
| articles that explain daily setups for blind people? I had
| issues that required me not to rely on sight and I couldn't
| find much.
| deknos wrote:
| Is this better than the classical espeak which is available in
| opensource repositories?
|
| I would be very glad if there were a truly open source, locally
| hosted text-to-speech package with good human-sounding female and
| male voices for German/English/French/Spanish/Russian/Arabic...
| yorwba wrote:
| When you install espeak with a distro package manager, you're
| quite likely to get espeak-ng.
| follower wrote:
| Based on your description of your requirements Piper TTS might
| be of interest to you: https://github.com/rhasspy/piper
| nmstoker wrote:
| I always feel sympathy for the devs on this project as they get
| so many issues raised by people that are largely lazy (since the
| solution is documented and/or they left out obvious details) or
| plain wrong. I suspect it's a side effect of espeak-ng being
| behind various other tools and in particular critical to many
| screen readers, thus you can see why the individuals need help
| even if they struggle to ask for it effectively.
| liotier wrote:
| I hoped "-ng" would be standing for Nigeria - which would have
| been most fitting, considering Nigeria's linguistic diversity !
| sandbach wrote:
| Anyone interested in formants and speech synthesis should have a
| look at Praat[0], a marvellous piece of free software that can do
| all kinds of speech analysis, synthesis, and manipulation.
|
| https://www.fon.hum.uva.nl/praat/
| readmemyrights wrote:
| I'm quite surprised to find this on HN, synthesizers like espeak
| and eloquence (ibm TTS) have fallen out of favor these days. I'm
| a blind person who uses espeak on all my devices except my
| macbook, where unfortunately I can't install the speech
| synthesizer because it apparently only supports MacOS 13
| (installing the library itself works fine though).
|
| Most times I try to use modern "natural-sounding" voices, they
| take a while to initialize, and when you speed them up, at a
| certain point the words mix together into meaningless noise,
| while eloquence and espeak handle the same rate just great -
| well, for me at least.
|
| I was thinking about this a few days back while trying out
| piper-tts [0]: how supposedly "more advanced" synthesizers powered
| by AI use up more RAM, CPU, and disk space to deliver a voice
| which doesn't sound much better than something like RHVoice and
| gets things like inflection wrong. And that's the English voice;
| the voice for my language (Serbian) makes espeak sound human, and
| according to piper-tts it's "medium" quality.
|
| Funny story about synthesizers taking a while to initialize:
| there's a local IT company here that specializes in speech
| synthesis and their voices take so long to load they had to say
| "<company> Mary is initializing..." whenever you start your
| screen reader or such. Was annoying but in a fun way. Their newer
| Serbian voices also have this "feature" where they try to
| pronounce properly some English words they come upon. They also
| have another "feature" where they try to pronounce words right that
| were spelled without accent marks or such, and like with most of
| these kinds of "features" they combine badly and hilariously. For
| example if you asked them to pronounce "topic" it would pronounce
| it as "topich, which was fun while browsing forums or such.
|
| [0] https://github.com/rhasspy/piper
| bArray wrote:
| I think it would be good if they provided some samples in the
| readme. For example, it would be good if their list of
| languages/accents could be sampled [1].
|
| [1] https://github.com/espeak-ng/espeak-
| ng/blob/master/docs/lang...
|
| > eSpeak NG uses a "formant synthesis" method. This allows many
| languages to be provided in a small size. The speech is clear,
| and can be used at high speeds, but is not as natural or smooth
| as larger synthesizers which are based on human speech
| recordings. It also supports Klatt formant synthesis, and the
| ability to use MBROLA as backend speech synthesizer.
|
| I've been using eSpeak for many years now. It's superb for
| resource constrained systems.
|
| I always wondered whether it would be possible to have a semi-
| context aware, but not neural network, approach.
|
| I quite like the sound of Mimic 3, but it seems to be mostly
| abandoned: https://github.com/MycroftAI/mimic3
| follower wrote:
| FYI re: Mimic 3: the main developer Michael Hansen (a.k.a
| synesthesiam) (who also previously developed Larynx TTS) now
| develops Piper TTS (https://github.com/rhasspy/piper) which is
| essentially a "successor" to the earlier projects.
|
| IIUC ongoing development of Piper TTS is now financially
| supported by the recently announced Open Home Foundation (which
| is great news as IMO synesthesiam has almost single-handedly
| revolutionized the quality level--in terms of
| naturalness/realism--of FLOSS TTS over the past few years and
| it would be a real loss if financial considerations stalled
| continued development):
| https://www.openhomefoundation.org/projects/ (Ok, on re-reading
| OHF is more generally funding development of Rhasspy of which
| Piper TTS is one component.)
| replete wrote:
| I listen to ebooks with TTS. On Android via F-Droid, the speech
| packs in this software are extremely robotic.
|
| There aren't many options for degoogled Android users. In the end
| I settled for the Google Speech Services and disabled network
| access and used the default voice. GSS has its issues and voices
| don't download properly, but the default voice is tolerable in
| this situation.
| tapper wrote:
| You can get ETI-Eloquence for Android.
| jryb wrote:
| When speaking Chinese, it says the tone number in English after
| each character. So "Ni Hao " is pronounced "ni three hao three".
| Am I using this wrong? I'm running `espeak-ng -v cmn "Ni Hao "`.
|
| If this is just how it is, the "more than one hundred languages"
| claim is a bit suspect.
| TomK32 wrote:
| I was curious: some 300 minority languages are spoken in China,
| spread across 55 minority ethnic groups:
| https://en.wikipedia.org/wiki/Languages_of_China
| follower wrote:
| After some brief research it seems the issue you're seeing may
| be a known bug in at least some versions/releases of espeak-ng.
|
| Here are some potentially related links if you'd like to dig
| deeper:
|
| * "questions about mandarin data packet #1044":
| https://github.com/espeak-ng/espeak-ng/issues/1044
|
| * "ESpeak NJ-1.51's Mandarin pronunciation is corrupted
| #12952": https://github.com/nvaccess/nvda/issues/12952
|
| * "The pronunciation of Mandarin Chinese using ESpeak NJ in
| NVDA is not normal #1028": https://github.com/espeak-ng/espeak-
| ng/issues/1028
|
| * "When espeak-ng translates Chinese (cmn), IPA tone symbols
| are not output correctly #305":
| https://github.com/rhasspy/piper/issues/305
|
| * "Please default ESpeak NG's voice role to 'Chinese (Mandarin,
| latin as Pinyin)' for Chinese to fix #12952 #13572":
| https://github.com/nvaccess/nvda/issues/13572
|
| * "Cmn voice not correctly translated #1370":
| https://github.com/espeak-ng/espeak-ng/issues/1370
| devinprater wrote:
| ESpeak is pretty great, and now that Piper is using it, hopefully
| strange issues, like it saying "nineteen hundred eighty four" for
| the year 1984, can be fixed.
| follower wrote:
| Yeah, it would be nice if the financial backing behind
| Rhasspy/Piper led to improvements in espeak-ng too, but based on
| my own development-related experience with the espeak-ng code
| base (related elsewhere in the thread), I suspect it would be
| significantly easier to extract the specific required text to
| phonemes functionality or (to a certain degree) reimplement it
| (or use a different project as a base[3]) than to more
| closely/fully integrate changes with espeak-ng itself[4]. :/
|
| It seems Piper currently abstracts its phonemize-related
| functionality with a library[0] that currently makes use of a
| espeak-ng fork[1].
|
| Unfortunately it also seems license-related issues may have an
| impact[2] on whether Piper continues to make use of espeak-ng.
|
| For your specific example of handling 1984 as a year, my
| understanding is that espeak-ng _can_ handle situations like
| that via parameters/configuration, but in my experience there
| can be unexpected interactions between different
| configuration/API options[6].
|
| [0] https://github.com/rhasspy/piper-phonemize
|
| [1] https://github.com/rhasspy/espeak-ng
|
| [2] https://github.com/rhasspy/piper-
| phonemize/issues/30#issueco...
|
| [3] Previously I've made note of some potential options here:
| https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...
|
| [4] For example, as I note here[5] there's currently at least
| _four_ different ways to access espeak-ng's phoneme-related
| functionality--and it seems that they _all_ differ in their
| output, sometimes consistently and other times dependent on
| configuration (e.g. audio output mode, spoken punctuation) and
| probably also input. :/
|
| [5] https://gitlab.com/RancidBacon/floss-various-
| contribs/-/blob...
|
| [6] For example, see my test cases for some other numeric-
| related configuration options here:
| https://gitlab.com/RancidBacon/floss-various-contribs/-/blob...
| synergy20 wrote:
| Just used it a few days ago; the quality is honestly subpar.
|
| I use Chrome's extension 'Read Aloud', which is as natural as you
| can get.
| follower wrote:
| It's been mentioned elsewhere in the comments, but espeak-ng has
| historically prioritized accessibility use cases, which is a
| domain where "quality" doesn't necessarily correlate with
| "naturalness" (e.g. there is a preference for clarity at high
| words-per-minute rates of speech where the speech doesn't sound
| "natural" but is still understandable, for people who have
| acclimatized to it through daily use, at least :) ).
| iamleppert wrote:
| SORA AI should integrate this into their LLM.
| spdustin wrote:
| Now I just want DECTalk ported to MacOS. The original Stephen
| Hawking voice.
|
| I have an Emic2 board I use (through UART so my ESP32 can send
| commands to it) and I use Home Assistant to send notifications to
| it. My family are science nerds like me, so when the voice of
| Stephen Hawking tells us there is someone at the door, it brings
| a lot of joy to us.
| DoktorDelta wrote:
| I found this, might be promising:
| https://www.reddit.com/r/DecTalk/comments/rhz01n/dectalk_for...
| incontrol_77 wrote:
| Here you go: https://github.com/dectalk/dectalk
| spdustin wrote:
| Awesome, thank you!
| manzanarama wrote:
| hugging face?
| follower wrote:
| Based on my own recent experience[0] with espeak-ng, IMO the
| project is currently in a really tough situation[3]:
|
| * the project seems to provide real value to a huge number of
| people who rely on it for reasons of accessibility (even more so
| for non-English languages); and,
|
| * the project is a valuable trove of knowledge about multiple
| languages--collected & refined over multiple decades by both
| linguistic specialists and everyday speakers/readers; but...
|
| * the project's code base is very much of "a different era"
| reflecting its mid-90s origins (on RISC OS, no less :) ) and a
| somewhat piecemeal development process over the following decades
| --due in part to a complex Venn diagram of skills, knowledge &
| familiarity required to make modifications to it.
|
| Perhaps the prime example of the last point is that `espeak-ng`
| has a _hand-rolled XML parser_ --which attempts to handle both
| valid & invalid SSML markup--and markup parsing is interleaved
| with internal language-related parsing in the code. And this is
| implemented in C.
|
| [Aside: Due to this I would _strongly_ caution against feeding
| "untrusted" input to espeak-ng in its current state but
| unfortunately that's what most people who rely on espeak-ng for
| accessibility purposes inevitably do while browsing the web.]
|
| [TL;DR: More detail/repros/observations on espeak-ng issues here:
|
| * https://gitlab.com/RancidBacon/floss-various-contribs/-/blob...
|
| * https://gitlab.com/RancidBacon/floss-various-contribs/-/blob...
|
| * https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...
|
| ]
|
| Contributors to the project are not unaware of the issues with
| the code base (which are exacerbated by the difficulty of even
| tracing the execution flow in order to understand how the library
| operates) nor that it would benefit from a significant
| refactoring effort.
|
| However as is typical with such projects which greatly benefit
| individual humans but don't offer an opportunity to generate
| significant corporate financial return, a lack of developers with
| sufficient skill/knowledge/time to devote to a significant
| refactoring means a "quick workaround" for an specific individual
| issue is often all that can be managed.
|
| This is often exacerbated by outdated/unclear/missing
| documentation.
|
| IMO there are two contribution approaches that could help the
| project moving forward while requiring the least amount of
| specialist knowledge/experience:
|
| * Improve visibility into the code by adding logging/tracing to
| make it easier to see why a particular code path gets taken.
|
| * Integrate an existing XML parser as a "pre-processor" to ensure
| that only valid/"sanitized"/cleaned-up XML is passed through to
| the SSML parsing code--this would increase robustness/safety and
| facilitate future removal of XML parsing-specific workarounds
| from the code base (leading to less tangled control flow) and
| potentially future removal/replacement of the entire bespoke XML
| parser.
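|
| As a sketch of what that second suggestion means in practice
| (illustrative only, in Python rather than the project's C, and
| assuming lxml for its error-recovering parser):
|
|     import subprocess
|     from lxml import etree
|
|     def sanitize_ssml(raw):
|         parser = etree.XMLParser(recover=True)  # tolerate broken markup
|         try:
|             root = etree.fromstring(raw.encode("utf-8"), parser)
|         except etree.XMLSyntaxError:
|             return raw                          # nothing salvageable
|         if root is None:
|             return raw                          # fall back to plain text
|         return etree.tostring(root, encoding="unicode")
|
|     ssml = sanitize_ssml(
|         '<speak>Hello <break time="300ms"/> <oops world</speak>')
|     subprocess.run(["espeak-ng", "-m", ssml])   # -m: treat input as markup
|
| The point being that espeak-ng's own SSML handling would then
| only ever see well-formed XML, and the bespoke parser's
| workarounds for invalid markup could eventually be retired.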
|
| Of course, the project is not short on ideas/suggestions for how
| to improve the situation but, rather, direct developer
| contributions so... _shrug_
|
| In light of this, last year when I was developing the personal
| project[0] which made use of a dependency that in turn used
| espeak-ng I wanted to try to contribute something more tangible
| than just "ideas" so began to write-up & create reproductions for
| some of the issues I encountered while using espeak-ng and at
| least document the current behaviour/issues I encountered.
|
| Unfortunately while doing so I kept encountering _new_ issues
| which would lead to the start of yet another round of debugging
| to try to understand what was happening in the new case.
|
| Perhaps inevitably this effort eventually stalled--due to a
| combination of available time, a need to attempt to prioritize
| income generation opportunities and the downsides of living with
| ADHD--before I was able to share the fruits of my research.
| (Unfortunately I seem to be way better at discovering & root-
| causing bugs than I am at writing up the results...)
|
| However I just now used the espeak-ng project being mentioned on
| HN as a catalyst to at least upload some of my notes/repros to a
| public repo (see links in the TL;DR section above) in the hope that
| maybe they will be useful to someone who might have the
| time/inclination to make a more direct code contribution to the
| project. (Or, you know, prompt someone to offer to fund my
| further efforts in this area... :) )
|
| [0] A personal project to "port" my "Dialogue Tool for Larynx
| Text To Speech" project[1] to use the more recent Piper TTS[2]
| system which makes use of espeak-ng for transforming text to
| phonemes.
|
| [1] https://rancidbacon.itch.io/dialogue-tool-for-larynx-text-
| to... & https://gitlab.com/RancidBacon/larynx-
| dialogue/-/tree/featur...
|
| [2] https://github.com/rhasspy/piper
|
| [3] Very much no shade toward the project intended.
___________________________________________________________________
(page generated 2024-05-02 23:01 UTC)