[HN Gopher] ESpeak-ng: speech synthesizer with more than one hun...
       ___________________________________________________________________
        
       ESpeak-ng: speech synthesizer with more than one hundred languages
       and accents
        
       Author : nateb2022
       Score  : 226 points
       Date   : 2024-05-02 01:06 UTC (21 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | mewse-hn wrote:
       | No example output? Here's a youtube video where he plays with
       | this software
       | 
       | https://www.youtube.com/watch?v=493xbPIQBSU
        
         | scoot wrote:
         | With timestamp, as there's hardly any TTS speech in that video:
         | https://youtu.be/493xbPIQBSU?t=605
        
           | bufferoverflow wrote:
           | Oh man, it sounds awful, like 15-year old tech.
           | 
           | I've been spoiled by modern AI generated voices that sound
           | indistinguishable from humans to me.
        
             | tredre3 wrote:
              | It is 30-year-old tech.
        
       | vlovich123 wrote:
       | Anyone know why the default voice is set to be so bad?
        
         | mrob wrote:
         | Why specifically do you consider it to be bad? Espeak-ng is
         | primarily an accessibility tool, used as the voice synthesizer
         | for screen readers. Clarity at high speed is more important
         | than realism.
        
           | vlovich123 wrote:
            | That can't be a serious question. Go look at the
            | accessibility voice for Windows or Mac and then compare the
            | way it sounds. Both of those are more human-like, with
            | better pronunciation.
        
         | rhdunn wrote:
         | The default voice sounds robotic for several reasons. It has a
         | low sample rate to conserve space. It is built using a mix of
         | techniques that make it difficult to reconstruct the original
         | waveform exactly. And it uses things like artificial noise for
         | the plosives, etc.
         | 
         | The default voice is optimized for space and speed instead of
         | quality of the generated audio.
        
           | codedokode wrote:
           | But today disk space is not an issue.
        
           | vlovich123 wrote:
           | I'll suggest that's the wrong optimization to make for an
           | accessibility tool. Modern CPUs are more than capable of
           | handling its speed requirements by several orders of
            | magnitude (they can decode h265 in real time, for god's sake,
            | without HW acceleration). And the same goes for size.
           | 
           | It's simply the wrong tuning tradeoff.
        
         | follower wrote:
         | As I've learned over time (and other people in these comments
         | have clarified) it turns out that evaluating "quality" of Text
         | To Speech is somewhat dependent on the domain in which the
         | audio output is being used (obviously with overlaps), broadly:
         | 
         | * accessibility
         | 
         | * non-accessibility (e.g. voice interfaces; narration; voice
         | over)
         | 
         | The qualities of the generated speech which are favoured may
         | differ significantly between the two domains, e.g. AIUI non-
         | accessibility focused TTS often prioritises "realism" &
         | "naturalness" while more accessibility focussed TTS often
         | prioritizes clarity at high words-per-minute speech rates
          | (which often sounds distinctly _non_-"realistic").
         | 
         | And, AIUI espeak-ng has historically been more focused on the
         | accessibility domain.
        
           | vlovich123 wrote:
            | I don't have any disabilities so I don't know if espeak-ng is
            | better on the pure accessibility axis. But given that MacOS
            | tends to be received quite well by the accessibility crowd
            | (accessibility is definitely a focus, from what I observed
            | internally), and that MacOS has much higher realism &
            | naturalness out of the box, I'm going to posit that it's not
            | the linear tradeoff you've described & that espeak-ng's
            | defaults aren't tuned well out of the box.
        
       | droopyEyelids wrote:
       | Another project falls victim to the tragic "ng" relative naming,
       | leaving it without options for future generations
        
         | nialv7 wrote:
         | They can name the next iteration ESpeak-DS9 ;)
        
           | hibikir wrote:
           | I actually have seen that done at a former employer, a very
           | large agribusiness. I bet there are more examples of that
           | very specific, not so intended versioning system out there.
        
       | retrac wrote:
       | Classic speech synthesis is interesting, in that relatively
        | simple approaches produce useful results. Formant synthesis
       | takes relatively simple sounds, and modifies them according to
       | the various distinctions the human speech tract can make. The
       | basic vowel quality can be modelled as two sine waves that change
       | over time. (Nothing more complex than what's needed to generate
       | touch tone dialing tones, basically.) Add a few types of buzzing
       | or clicking noises before or after that for consonants, and
       | you're halfway there. The technique predates computers; it's
       | basically the same technique used by the original voder [1] just
       | under computer control.
       | 
       | Join that with algorithms which can translate English into
       | phonetic tokens with relatively high accuracy, and you have
       | speech synthesis. Make the dictionary big enough, add enough
       | finesse, and a few hundred rules about transitioning from phoneme
        | to phoneme, and it produces relatively understandable speech.
       | 
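        | As a toy illustration of the two-sine-wave vowel idea above (a
        | sketch only; real formant synthesizers like eSpeak filter a
        | buzz/noise source through resonances rather than summing sines,
        | and the F1/F2 values here are just assumed ballpark figures),
        | gliding two sine "formants" from /a/-ish to /i/-ish frequencies
        | already sounds vowel-like:
        | 
        |     # toy two-"formant" vowel glide; writes vowel_glide.wav
        |     import math, wave, struct
        |     RATE, DUR = 16000, 1.0
        |     F1 = (700, 300)    # assumed F1 glide, /a/-ish -> /i/-ish
        |     F2 = (1100, 2300)  # assumed F2 glide
        |     samples, p1, p2 = [], 0.0, 0.0
        |     for n in range(int(RATE * DUR)):
        |         a = n / (RATE * DUR)  # interpolation factor
        |         p1 += 2 * math.pi * (F1[0] + (F1[1] - F1[0]) * a) / RATE
        |         p2 += 2 * math.pi * (F2[0] + (F2[1] - F2[0]) * a) / RATE
        |         s = 0.5 * math.sin(p1) + 0.3 * math.sin(p2)
        |         samples.append(int(s * 0.5 * 32767))
        |     with wave.open("vowel_glide.wav", "w") as w:
        |         w.setnchannels(1)
        |         w.setsampwidth(2)
        |         w.setframerate(RATE)
        |         w.writeframes(struct.pack("<%dh" % len(samples), *samples))
        | 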
       | Part of me feels that we are losing something, moving away from
       | these classic approaches to AI. It used to be that, to teach a
       | machine how to speak, or translate, the designer of the system
       | had to understand how language worked. Sometimes these models
       | percolated back into broader thinking about language. Formant
       | synthesis ended up being an inspiration to some ideas for how the
       | brain recognizes phonemes. (Or maybe that worked in both
        | directions.) It was thought that further advances would come from
       | better theories about language, better abstractions. Deep
       | learning has produced far better systems than the classic
       | approach, but they also offer little in terms of understanding or
       | simplifying.
       | 
       | [1] https://en.wikipedia.org/wiki/Voder
        
         | senkora wrote:
         | It is possible to synthesize an English voice with a 1.5MB
         | model using http://cmuflite.org/ or some of the Apple VoiceOver
         | voices, which is just crazy to me. Most of the model is diphone
         | samples for pairs of phonemes.
         | 
         | I don't know of any way to go smaller than that with software.
         | I tried, but it seems like a fundamental limit for English.
        
           | userbinator wrote:
            | _I don't know of any way to go smaller than that with
           | software. I tried, but it seems like a fundamental limit for
           | English._
           | 
           | If you include "robotic" speech, then there's
           | https://en.wikipedia.org/wiki/Software_Automatic_Mouth in a
           | few tens of KB, and the demoscene has done similar in around
           | 1/10th that. All formant synths, of course, not the sample-
           | based ones that you're referring to.
        
             | vidarh wrote:
              | SAM is awful but at the same time tantalisingly _close_
              | (one of its demos apparently draws on a great reverse-
              | engineered version by Sebastian Macke and refactoring
              | efforts by a couple of others, including me - I spent too
              | many hours listening to SAM output...) - especially when
              | compared to the still-awful Festival/Flite models - that I
              | keep wanting to see what a better generic formant synth,
              | used as a constraint on an ML model, would produce.
             | 
             | That is, instead of allowing a generic machine learning
             | model to output unconstrained audio, train it on the basis
             | of letting it produce low bitrate input/control values for
             | a formant synth instead, and see just how small you can
             | push the model.
        
               | tapper wrote:
                | As a blind dude who listens to a synth all the time, I
                | don't care about the size of the thing, I just want good
                | sound quality.
        
               | vidarh wrote:
               | That's fair, but those two _are_ in constant tension. We
               | can do very good speech if we ship gigs of samples.
               | Better quality smaller models make it easier to get
               | better quality speech in more places.
        
           | miki123211 wrote:
           | You probably could by what I call the "eastern-european
           | method." Record one wave period of each phoneme, perhaps two
           | for plosives, downsample to 8 or 11 kHz 8 bit, and repeat
           | that recording on-the-fly enough times to make the right
           | sound. If you're thinking "mod file", you're on the right
           | track.
           | 
           | For phonetically simple languages, such a system can easily
           | fit on a microcontroller with kilobytes of RAM and a slow
           | CPU. English might require a little bit more on the text-to-
           | phoneme stage, but you can definitely go far below 1MB.
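            | 
            | A minimal sketch of the trick (the sawtooth period below is
            | just a stand-in for a recorded 8-bit phoneme period):
            | 
            |     # repeat one stored wave period to sustain a phoneme
            |     import wave
            |     RATE, PITCH = 8000, 120          # 8 kHz, ~120 Hz voice
            |     period_len = RATE // PITCH       # samples per period
            |     period = bytes(255 * i // period_len
            |                    for i in range(period_len))
            |     out = period * (RATE // (2 * period_len))   # ~0.5 s
            |     with wave.open("phoneme.wav", "w") as w:
            |         w.setnchannels(1)
            |         w.setsampwidth(1)            # 8-bit, as above
            |         w.setframerate(RATE)
            |         w.writeframes(out)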
        
             | rhdunn wrote:
             | That's effectively what eSpeak-ng is doing.
             | 
             | For the CMU flite voices they represent the data as LPC
             | (linear predictive coding) data with residual remainder
             | (residual excited LPC). The HTS models use simple neural
             | networks to predict the waveforms -- IIRC, these are
             | similar to RNNs.
             | 
             | The MBROLA models use OLA (overlapped add) to overlap small
             | waveform samples. They also use diphone samples taken from
             | midpoint to midpoint in order to create better phoneme
             | transitions.
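              | 
              | For the OLA part, a toy sketch of the splice itself (not
              | MBROLA's actual code): crossfade the tail of one snippet
              | into the head of the next rather than butt-joining them.
              | 
              |     # overlap-add join of two waveform snippets (numpy)
              |     import numpy as np
              |     def ola_join(a, b, overlap):
              |         fade = np.hanning(2 * overlap)[:overlap]  # rising
              |         mixed = (a[-overlap:] * (1 - fade)
              |                  + b[:overlap] * fade)
              |         return np.concatenate([a[:-overlap], mixed,
              |                                b[overlap:]])
              |     t = np.arange(2000) / 16000.0  # toy "diphone" snippets
              |     joined = ola_join(np.sin(2 * np.pi * 220 * t),
              |                       np.sin(2 * np.pi * 260 * t),
              |                       overlap=256)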
        
           | Animats wrote:
           | Most of the links there are dead.
           | 
           | That's a descendant of Festival Singer, which was well
           | respected in its day.
           | 
           | What's a current practical text-to-speech system that's open
           | source, local, and not huge?
        
             | anthk wrote:
             | Flite.
        
               | Animats wrote:
               | That's another spinoff of Festival Singer. What's good
               | from the LLM era?
        
             | follower wrote:
             | Depending on your definition of "huge", you might find
             | Piper TTS fits your requirements:
             | https://github.com/rhasspy/piper
             | 
             | The size of the associated voice files varies but there are
             | options that are under 100MB:
             | https://huggingface.co/rhasspy/piper-voices/tree/main/en
        
           | anthk wrote:
           | I always wanted something like Flite but for Spanish.
        
           | aidenn0 wrote:
           | The speak & spell did it with a 32kB [edit, previously
           | incorrectly wrote 16kB] ROM and a TMS0280.
        
         | bhaney wrote:
         | > It used to be that, to teach a machine how to [X], the
         | designer of the system had to understand how [X] worked.
         | 
         | It does feel like we're rapidly losing this relationship in
         | general. I think it's going to be a good thing overall for
         | productivity and the advancement of mankind, but it definitely
         | takes a lot of the humanity out of our collective
         | accomplishments. I feel warm and fuzzy when a person goes on a
         | quest to deeply understand a subject and then shares the fruits
         | of their efforts with everyone else, but I don't feel like that
         | when someone points at a subject and says "hey computer, become
         | good at that" with similar end results.
        
           | userbinator wrote:
            | _I think it's going to be a good thing overall for
           | productivity and the advancement of mankind, but it
           | definitely takes a lot of the humanity out of our collective
           | accomplishments._
           | 
           | I think AI will only cause us to become stuck in another
           | local maximum, since not understanding how something works
           | can only lead to imitation at best, and not inspiration.
        
             | vidarh wrote:
             | I'm not convinced, because I think there will be a drive to
             | distill down models and constrain them, and try to train
             | models with access to "premade blocks" of functionality we
             | know should help.
             | 
             | E.g. we know human voices can be produced well with formant
             | synthesis because we know how the human vocal tract is
             | shaped. So you can "give" a model a formant synth, and try
             | to train smaller models outputting to it.
             | 
             | I think there's going to be a whole lot of research
             | possibilities in placing constraints and training smaller
             | models, and even training ensembles of models constrained
             | in how they're interacting and their relative sizes to try
             | to "force" extraction of functionality.
             | 
              | E.g. we have reasonable estimates of the lowest bitrate raw
             | audio that produces passable voice. Now consider training
             | two models A, B, where A => B => audio, and the "channel"
             | between A and B is constrained to a small fraction of the
             | bitrate that'd let A do all the work, and where the size of
             | B is set at a level you've first struggled to get passable
             | TTS output from.
             | 
             | Try to squeeze the bitrate and/or the size of B down and
             | see if you can get something to emerge where analysing what
             | happens in B is doable.
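              | 
              | In toy, shape-level terms (PyTorch, made-up sizes - nothing
              | here is a working TTS, it only shows where the constrained
              | channel between A and B would sit):
              | 
              |     import torch, torch.nn as nn
              |     CHANNEL = 8   # deliberately tiny per-frame controls
              |     A = nn.Sequential(nn.Linear(256, 64), nn.ReLU(),
              |                       nn.Linear(64, CHANNEL))
              |     B = nn.Sequential(nn.Linear(CHANNEL, 64), nn.ReLU(),
              |                       nn.Linear(64, 80))
              |     text_feats = torch.randn(100, 256)    # fake features
              |     controls = torch.tanh(A(text_feats))  # bounded channel
              |     frames = B(controls)                  # e.g. mel frames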
        
             | nuc1e0n wrote:
             | I've said this before on HN, but neural networks are not a
             | tabula rasa. Their structure is created by people. With
             | better domain understanding we can make better structures
             | for AI models. It's not an either-or situation.
        
             | T_MacThrowFace wrote:
             | isn't the ultimate point of AI though, that instead of the
             | traditional situation where the machines are the tools of
             | humans, we become the (optional) tools for the AI? The AGI
             | will do the understanding for us, and we'll profit by
             | getting what we want from the black box, like a dog
             | receiving industrially manufactured treats it cannot
             | comprehend from its owner.
        
         | vidarh wrote:
          | I feel the same. Relatively simple formant models get "close
          | enough" that it feels like you should be able to do well with
          | very little.
         | 
         | One of the things I've long wanted to do but not found time
         | for, is to take a few different variants of formant synths, and
         | try to train a simple TTS model to control one instead of
         | producing "raw" output. It's amazing what TTS models can do
         | with "raw" output, but we know our brains aren't producing raw,
         | unconstrained digital audio, and so I think there's a lot of
         | potential to understanding more and simplifying if you train
         | models constrained to produce outputs we know ought to be
         | sufficient, and push their size as far down as we can.
        
           | vidarh wrote:
            | Too late to edit, but to anyone who needs "convincing" of
           | the flexibility of a formant synthesizer, you should 1) play
           | with Pink Trombone[1], a Javascript formant synthesizer with
           | a UI that lets you graphically manipulate a vocal tract, and
           | 2) have a look at this programmable version of it[2]
           | 
           | [1] https://dood.al/pinktrombone/
           | 
           | [2] https://github.com/zakaton/Pink-Trombone
        
             | drcongo wrote:
             | Thanks for those links, that's superb. Sounds surprisingly
             | like the "Oh long Johnson" cat -
             | https://www.youtube.com/watch?v=kkwiQmGWK4c
        
             | codedokode wrote:
             | What a fun toy! But I mostly get sounds of a drunk man who
             | tries to say something but cannot.
        
               | vidarh wrote:
               | Check out the videos in the second link - it gives some
               | better examples of what you can do with it.
        
             | tapper wrote:
             | Funny that the Pink Trombone website does not work with
             | screen readers.
        
               | vidarh wrote:
               | It's entirely visual in that it is a graphic of a vocal
               | tract where you can move the tongue and shape the mouth,
               | so in that form trying to make it do anything with screen
                | readers would mean basically doing what the second link
                | does: creating an API for it and hooking the backend up
                | to an entirely different UI.
               | 
                | The videos on the other link do show that the guy who
               | has done that API has done stuff to the UI to allow
               | interfacing to it in non-graphical ways, but sadly I
               | don't see any online demos of those alternative user
               | interfaces anywhere, which is a great shame. Sadly it
               | doesn't look like the videos are very accessible either,
               | as they're mostly demos with no commentary about what is
               | going on on the screen, without which they're mostly just
               | random sounds.
        
             | follower wrote:
             | Yeah, Pink Trombone is _awesome_ [0]. :D
             | 
             | Thanks for the link to the programmable version--I don't
             | think I'd been aware of that previously...
             | 
             | [0] And, from personal experience, also rather difficult to
             | "safely" search for if you don't quite remember its name
             | exactly. :D
        
         | miki123211 wrote:
         | Do you have any good resources on this?
         | 
          | I took a few stabs at understanding Klatt, but I feel like I
          | had far too little DSP, math, and linguistic intuition back
          | then to fully comprehend it; perhaps I should take another
          | stab now.
        
         | maksimur wrote:
         | > The technique predates computers; it's basically the same
         | technique used by the original voder [1] just under computer
         | control.
         | 
         | Something similar from the 800s is the Euphonia talking machine
         | ( https://en.m.wikipedia.org/wiki/Euphonia_(device) ).
        
           | nkozyra wrote:
           | * 1800s
           | 
           | Clicked that thinking someone had made a talking machine in
           | the Middle Ages :)
        
             | maksimur wrote:
             | Oops :)
        
         | beeboobaa3 wrote:
         | > Part of me feels that we are losing something, moving away
         | from these classic approaches to AI
         | 
          | Absolutely. It seems a large number of software developers
          | have moved on from trying to understand how things work to
          | solve the problem, and are now instead just essentially
          | throwing shit at a magical wall until something sticks long
          | enough.
        
         | WalterBright wrote:
         | Now try it with a trumpet! Herb Alpert's "The Trolley Song" is
         | an underrated masterpiece of control over the sound a trumpet
         | makes. No synthesized trumpet sound has ever done anything like
         | this.
         | 
         | https://www.youtube.com/watch?v=mqr9E9Q-P5o
        
           | pcdoodle wrote:
           | Haven't heard that one, it's great, thank you.
        
         | barfbagginus wrote:
         | Quality of life improvements are much much more important than
         | understandable models of speech, so we should live with,
         | appreciate, and work to interpret and improve the current
         | generation of complex neural TTS models.
         | 
         | I depend on TTS to overcome dyslexia, but I also struggle with
         | auditory processing disorder that causes me to misunderstand
          | words. As a result, classical TTS does not help me read faster
         | or more accurately than struggling through my dyslexia. It
         | causes me to rapidly fatigue, zone out, and rewind often, in a
         | way that is more severe than when I sight read.
         | 
         | On the other hand, modern neural TTS is a huge enabler. My
         | error rate, rewind rate, and fatigue are much better thanks to
         | the natural tone, articulation, and prosody. I'm able to read
         | for hours this way, and my productivity is higher than sight
         | reading alone. This unlocks long and complex readings that I
         | would never complete by sight reading alone, like papers in
         | history, philosophy, and law. Previously I was limited to
         | reading math, computer science, and engineering work, where I
         | heavily depended on diagrams and math formulas to help me gloss
         | over dense text readings.
         | 
         | The old tech had no impact on my life, given my combination of
         | reading and listening difficulty, since it was not
         | comparatively better than sight reading. But my life changed
         | about 6 years ago with neural TTS. The improvement has been
         | massive, and has helped me work with many non-technical
         | readings that I would previously give up on.
         | 
          | The main issue I see now is not that neural models are hard to
         | understand. For better or worse, we're able to improve the
         | models just by throwing capital and ML PhDs at the problem. The
         | problem I see is that the resulting technology is proprietary
         | and not freely available for the people whose lives it would
         | change.
         | 
         | We should work towards a future where people can depend on
         | useful and free TTS that improves their quality of life. I
         | don't think simple synthetic models will be enough. We must
         | work to seize control of models that can provide the same
         | quality of life improvements that new proprietary models can
         | provide. And we must make these models free for everyone to
         | use!
        
           | vidarh wrote:
            | It's not at all a given that these two things are in conflict.
           | The best path towards free TTS might well turn out to be to
           | identify ways of making smaller models that are cheaper to
           | train and improve on if/when we can split out things we know
           | to do (be it with separate neural models or other methods)
           | and train models to "fill the gaps" instead of the entire
           | end-to-end process.
           | 
           | There are also plenty of places where the current modern
           | "neural" models are too compute intensive / costly to run,
           | and so picking just the current big models isn't an option
           | for all uses.
        
       | SoftTalker wrote:
       | Can I get my map navigation prompts in the voice of Yoda please?
       | 
       | "At the roundabout, the second exit take."
       | 
       | "At your destination, arrived have you."
        
         | 082349872349872 wrote:
         | Quenya is an option, so (assuming you speak it) you _could_ get
         | your map navigation prompts in the voice of Galadriel...
         | 
         | "A star shall shine on the hour of our taking the second exit."
         | 
         | "You have reached your Destination, fair as the Sea and the Sun
         | and the Snow upon the Mountain!"
        
         | albertzeyer wrote:
         | You mean the voice, or the grammar? The grammar part is outside
         | of the scope of a synthesizer. That's completely up to the
         | user.
         | 
         | Or you want a model which translates normal English into Yoda-
         | English (on text level) and then attach a speech synthesizer on
         | that?
         | 
         | Or I guess an end-to-end speech synthesizer, a big neural
         | network which operates on the whole sentence at once, could
         | also internally learn to do that grammar transformation.
        
         | scoot wrote:
         | Yes (if you use a TomTom): https://tomtom.gps-data-
         | team.com/celebrity_voices/Yoda-Star_...
        
       | zambonidriver wrote:
       | Is it an LLM? What base model does it use?
        
         | celestache wrote:
         | eSpeak uses what is known as formant synthesis, and no LLM as
         | far as I know.
        
           | nmstoker wrote:
           | Definitely no LLM! Espeak dates from at least 10 years before
           | LLMs appeared and was based on the approach used on Acorn
           | computers in the 80s and 90s.
        
       | webprofusion wrote:
       | "More than hundred"
        
         | Aachen wrote:
          | Fwiw, in many languages that's correct. Coming from the Dutch
          | "meer dan honderd", being taught to say _one_ hundred is like
          | teaching an English speaker to say "more than one ten" for
          | values >10
        
       | fisian wrote:
       | I used it on Android and it seems to be one of very few apps that
       | can replace the default Google services text-to-speech engine.
       | 
       | However, I wasn't satisfied with the speech quality so now I'm
       | using RHVoice. RHVoice seems to produce more natural/human-
        | sounding output to me.
        
         | wakeupcall wrote:
         | Depending on context, I cycle between espeak-ng with mbrola-en
         | or RHVoice, but even plain espeak shouldn't be discarded.
         | 
         | RHVoice sounds slightly more natural in some cases, but one
         | advantage of espeak-ng is that the text parsing logic is
         | cleaner, by default.
         | 
          | For example, RHVoice likes to spell out a lot of regular text
         | formatting. One example would be spelling " -- " as dash-dash
         | instead of pausing between sentences. So while text sounds a
         | little more natural, it's actually harder to understand in
         | context unless the text is clean to begin with.
         | 
         | I don't know if speech-dispatcher does this for you, but I'm
         | using a shell script and some regex rules to make the text
         | cleaner for TTS which I don't need when using espeak-ng.
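          | 
          | Something along these lines (the rules below are hypothetical
          | examples, not the actual script):
          | 
          |     # strip formatting the synth would otherwise read aloud
          |     import re
          |     RULES = [
          |         (re.compile(r"\s--\s"), ". "),    # " -- " -> pause
          |         (re.compile(r"[*_#`]+"), ""),     # markdown-ish marks
          |         (re.compile(r"\s{2,}"), " "),     # collapse whitespace
          |     ]
          |     def clean_for_tts(text):
          |         for pattern, repl in RULES:
          |             text = pattern.sub(repl, text)
          |         return text.strip()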
         | 
         | Another tradeoff: espeak-ng with the mbrola doesn't offer all
         | the inflexion customization options you have with the "robotic-
         | sounding" voices. When accelerating speech, these options make
         | a qualitative difference in my experience.
         | 
         | I can see why each of these can have its place.
        
         | tapper wrote:
          | On Android I use ETI-Eloquence, but you can't get a legal copy.
          | Google it and look on the Blind Help website; there is an APK.
        
       | dheera wrote:
       | Why is the quality of open source TTS so horribly, horribly,
       | horribly behind the commercial neural ones? This is nowhere near
       | the quality of Google, Microsoft, or Amazon TTS, yet for image
       | generation and LLMs almost everything outside of OpenAI seems to
       | be open-sourced.
        
         | str3wer wrote:
          | almost like there's a few billion dollars of difference in
          | their budgets
        
           | dheera wrote:
           | Sure but it's been 15 years and the quality of the espeak
           | command is equally horrible. I would have expected some
           | changes ... especially considering even the free TTS inside
           | Google Chrome is actually pretty decent, that could just be
           | extracted and packaged up as a new version of espeak.
        
         | anthk wrote:
          | Festival is nicer; Flite would run on a toaster, and MBROLA
          | can work with eSpeak, but the data is restricted for
          | commercial usage.
        
         | albertzeyer wrote:
         | The quality also depends on the type of model. I'm not really
         | sure what ESpeak-ng actually uses? The classical TTS approaches
         | often use some statistical model (e.g. HMM) + some vocoder. You
         | can get to intelligible speech pretty easily but the quality is
         | bad (w.r.t. how natural it sounds).
         | 
         | There are better open source TTS models. E.g. check
         | https://github.com/neonbjb/tortoise-tts or
         | https://github.com/NVIDIA/tacotron2. Or here for more:
         | https://www.reddit.com/r/MachineLearning/comments/12kjof5/d_...
        
         | ClawsOnPaws wrote:
         | I'm glad that it doesn't. A lot of us use these voices as an
         | accessibility tool in our screen readers. They need to perform
          | well and be understandable at a very high rate, and they need
          | to be very responsive. ESpeak is one of the most responsive
          | speech synths out there, so for a screen reader the latency
          | from key press to speech output is extremely low. Adding AI
          | would just make this a lot slower and more unpredictable, and
          | unusable for daily work, at least right now. This is
          | anecdotal, but part of what
         | makes a synth work well at high speech rates is predictability.
         | I know how a speech synth is going to say something exactly.
         | This let's me put more focus on the thing I'm doing rather than
         | trying to decipher what the synth is saying. Neural TTS always
         | has differences in how they say a thing, and at times, those
         | differences can be large enough to trip me up. Then I'm
         | focusing on the speech again and not what I'm doing. But ESpeak
         | is very predictable, so I can let my brain do the pattern
         | matching and focus actively on something else.
        
         | exceptione wrote:
          | DeepSpeech from Mozilla is open source. Did you know about
          | this one?
          | 
          | From the samples I listened to, it sounds great to me.
        
           | dheera wrote:
            | I have to try this, thanks! Unfortunately I couldn't find
           | samples on their git repo and it looks like it isn't apt-
           | gettable. Maybe that's part of the reason.
           | 
            | They should make it so that I can do
            | 
            |     sudo apt-get install deepspeech
            |     sudo ln -s /usr/bin/deepspeech /usr/bin/espeak
           | 
           | Anything more than that is an impediment to mass adoption.
           | 
           | Seems they need some new product management ...
        
           | follower wrote:
           | As I understand it DeepSpeech is no longer actively
           | maintained by Mozilla:
           | https://github.com/mozilla/DeepSpeech/issues/3693
           | 
           | For Text To Speech, I've found Piper TTS useful (for
            | situations where "quality"=="realistic"/"natural"):
           | https://github.com/rhasspy/piper
           | 
           | For Speech to Text (which AIUI DeepSpeech provided), I've had
           | some success with Vosk: https://github.com/alphacep/vosk-api
        
       | miki123211 wrote:
       | Blind person here, ESpeak-ng is literally what I use on all of my
       | devices for most of my day, every day.
       | 
        | I switched to it in early childhood, at a time when human-
       | sounding synthesizers were notoriously slow and noticeably
       | unresponsive, and just haven't found anything better ever since.
       | I've used Vocalizer for a while, which is what iOS and Mac OS
       | ship with, but then third-party synthesizer support was added and
       | I switched right back.
        
         | maxglute wrote:
         | How fast do you set speech playback speed/rate?
         | 
         | I tried a bunch of speech synthesis, with speed and
         | intelligibility in mind.
         | 
          | ESpeak-ng is barely intelligible past ~500 words per minute, and
         | just generally unpleasant to listen to. Maybe my brain just
         | can't acclimatize to it.
         | 
          | Microsoft Zira Mobile (unlocked on a Win11 desktop via a
          | registry edit) sounds much more natural and intelligible at
          | max Windows SAPI
         | speech rate, which I estimate is around ~600 and equivalent to
         | most conversation/casual spoken word at 2x speed. I wish
         | windows could increase playback even further, my brain can
         | process 900-1200 words per minute or 3x-4x normal playback
         | speed.
         | 
         | On Android, Google's "United States - 1" sounds a little
         | awkward but also intelligible at 3x-4x speed.
        
           | trwm wrote:
            | Similar to OP: if the information is low density, like a
            | legal contract, I can do 1200wpm after a few hours of getting
            | used to it. Daily normal is 600wpm; if the text is heavy
            | going enough I have to drop it down to 100 wpm and put it on
            | loop.
            | 
            | As usual, the limit isn't how fast human IO is but how fast
            | human processing works.
        
             | maxglute wrote:
              | Yeah 600wpm is passive listening. 900-1200wpm is listening
              | to a lecture on YouTube at 3-4x speed. Skim listening for
             | content I'm familiar with. Active listening for things I
             | just want to speed through. It's context dependent, I find
             | I can ramp up 600-1200 and get into flow state of
             | listening.
             | 
             | >text is heavy going enough I have to drop it down to 100
             | wpm
             | 
             | What is heavy text for you? Like very dense technical text?
             | 
             | >put it on loop
             | 
              | I find this very helpful as well, but for content I
              | consume that's not very technical, I listen at ~600wpm and
              | loop it multiple times. It's like listening to a song to
              | death. Engraining it on a vocal / storytelling level.
             | 
              | E: semi-related to a deleted comment about processing
              | speed that I can no longer reply to. Posting here because
              | it's related.
             | 
              | Some speech synthesizers are much more intelligible at
              | higher speeds, which aids processing at higher wpms. What
              | I've been trying to find is the most intelligible speech
              | synthesis voice for the upper limit of concentrated/burst
              | listening, which for me is around 1200wpm / 4x speed; many
              | have weird audio artefacts past 3x. There are synthesis
              | engines whose high-speed intelligibility improves if the
              | text is processed with SSML markup to add longer pauses
              | after punctuation. Just little tweaks that make processing
              | easier. This doesn't apply to all content or contexts, but
              | I think some kinds of consumption are suitable for it, and
              | it's something that can be trained like many mental tasks;
              | dedicated speech synthesis, like fancy sports equipment,
              | improves top-end performance.
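              | 
              | The SSML tweak is roughly this (the tag names are standard
              | SSML; the 250ms value is just an example):
              | 
              |     # add explicit pauses after sentence punctuation
              |     import re
              |     from xml.sax.saxutils import escape
              |     def to_ssml(text, pause_ms=250):
              |         body = re.sub(r"([.!?;:])\s+",
              |                       r'\1 <break time="%dms"/> ' % pause_ms,
              |                       escape(text))
              |         return "<speak>%s</speak>" % body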
             | 
             | IMO also something neural model can be tuned for. There are
             | some podcasters/audiobook narrators who are "easy"
             | listening at 3x speed vs others because they just have
             | better enunciation/cadence at same word density. Most
              | voices out there, from traditional SAPI models to neural,
              | are... very mid fast "narrators". I think we need to bundle
              | speech synthesis with content awareness - AI to filter
              | content, then synthesize speech that emphasizes/slows down
              | on significant information and breezes past filler - just
              | presenting information more efficiently for consumption.
        
         | agumonkey wrote:
          | Thanks for the heads-up. May I ask if you know of websites /
          | articles that explain daily setups for blind people? I had
          | issues that required me not to rely on sight and I couldn't
          | find much.
        
       | deknos wrote:
       | Is this better than the classical espeak which is available in
       | opensource repositories?
       | 
       | I would be very glad if there's a truly open source local hosted
       | text to speech software which brings good human sounding speech
       | in woman/man german/english/french/spanish/russian/arabic
       | language...
        
         | yorwba wrote:
         | When you install espeak with a distro package manager, you're
         | quite likely to get espeak-ng.
        
         | follower wrote:
         | Based on your description of your requirements Piper TTS might
         | be of interest to you: https://github.com/rhasspy/piper
        
       | nmstoker wrote:
       | I always feel sympathy for the devs on this project as they get
       | so many issues raised by people that are largely lazy (since the
       | solution is documented and/or they left out obvious detail) or
        | plain wrong. I suspect it's a side effect of espeak-ng being
       | behind various other tools and in particular critical to many
       | screen readers, thus you can see why the individuals need help
       | even if they struggle to ask for it effectively.
        
       | liotier wrote:
        | I hoped "-ng" stood for Nigeria - which would have been most
        | fitting, considering Nigeria's linguistic diversity!
        
       | sandbach wrote:
       | Anyone interested in formants and speech synthesis should have a
       | look at Praat[0], a marvellous piece of free software that can do
       | all kinds of speech analysis, synthesis, and manipulation.
       | 
       | https://www.fon.hum.uva.nl/praat/
        
       | readmemyrights wrote:
        | I'm quite surprised to find this on HN; synthesizers like espeak
        | and eloquence (IBM TTS) have fallen out of favor these days. I'm
       | a blind person who uses espeak on all my devices except my
       | macbook, where unfortunately I can't install the speech
       | synthesizer because it apparently only supports MacOS 13
       | (installing the library itself works fine though).
       | 
        | Most times I try to use modern "natural-sounding" voices they
        | take a while to initialize, and when you speed them up past a
        | certain point the words mix together into meaningless noise,
        | while eloquence and espeak handle the same rate just fine -
        | well, for me at least.
       | 
        | I was thinking about this a few days back while I was trying out
        | piper-tts [0]: how supposedly "more advanced" synthesizers powered
        | by AI use up more RAM and CPU and disk space to deliver a voice
       | which doesn't sound much better than something like RH voice and
       | gets things like inflection wrong. And that's the english voice,
       | the voice for my language (serbian) makes espeak sound human and
       | according to piper-tts it's "medium".
       | 
       | Funny story about synthesizers taking a while to initialize,
       | there's a local IT company here that specializes in speech
       | synthesis and their voices take so long to load they had to say
       | "<company> Mary is initializing..." whenever you start your
        | screen reader or such. Was annoying but in a fun way. Their newer
        | Serbian voices also have this "feature" where they try to properly
        | pronounce some English words they come upon. They also have
        | another "feature" where they try to pronounce words right that
        | were spelled without accent marks or such, and like with most of
        | these kinds of "features" they combine badly and hilariously. For
        | example, if you asked them to pronounce "topic" it would come out
        | as "topich", which was fun while browsing forums or such.
       | 
       | [0] https://github.com/rhasspy/piper
        
       | bArray wrote:
        | I think it would be good if they provided some samples in the
        | readme. For example, it would be helpful if their list of
        | languages/accents could be sampled [1].
       | 
       | [1] https://github.com/espeak-ng/espeak-
       | ng/blob/master/docs/lang...
       | 
       | > eSpeak NG uses a "formant synthesis" method. This allows many
       | languages to be provided in a small size. The speech is clear,
       | and can be used at high speeds, but is not as natural or smooth
       | as larger synthesizers which are based on human speech
       | recordings. It also supports Klatt formant synthesis, and the
       | ability to use MBROLA as backend speech synthesizer.
       | 
       | I've been using eSpeak for many years now. It's superb for
       | resource constrained systems.
       | 
       | I always wondered whether it would be possible to have a semi-
       | context aware, but not neural network, approach.
       | 
       | I quite like the sound of Mimic 3, but it seems to be mostly
       | abandoned: https://github.com/MycroftAI/mimic3
        
         | follower wrote:
         | FYI re: Mimic 3: the main developer Michael Hansen (a.k.a
         | synesthesiam) (who also previously developed Larynx TTS) now
         | develops Piper TTS (https://github.com/rhasspy/piper) which is
         | essentially a "successor" to the earlier projects.
         | 
         | IIUC ongoing development of Piper TTS is now financially
         | supported by the recently announced Open Home Foundation (which
          | is great news as IMO synesthesiam has almost single-handedly
         | revolutionized the quality level--in terms of
         | naturalness/realism--of FLOSS TTS over the past few years and
         | it would be a real loss if financial considerations stalled
         | continued development):
         | https://www.openhomefoundation.org/projects/ (Ok, on re-reading
         | OHF is more generally funding development of Rhasspy of which
         | Piper TTS is one component.)
        
       | replete wrote:
       | I listen to ebooks with TTS. On Android via FDroid the speech
       | packs in this software are extremely robotic.
       | 
       | There aren't many options for degoogled Android users. In the end
       | I settled for the Google Speech Services and disabled network
       | access and used the default voice. GSS has its issues and voices
       | don't download properly, but the default voice is tolerable in
       | this situation.
        
         | tapper wrote:
          | You can get ETI-Eloquence for Android.
        
       | jryb wrote:
       | When speaking Chinese, it says the tone number in English after
       | each character. So "Ni Hao " is pronounced "ni three hao three".
       | Am I using this wrong? I'm running `espeak-ng -v cmn "Ni Hao "`.
       | 
       | If this is just how it is, the "more than one hundred languages"
       | claim is a bit suspect.
        
         | TomK32 wrote:
         | I was curious, 300 minority languages are spoken in China,
         | spread across 55 minority ethnic groups
         | https://en.wikipedia.org/wiki/Languages_of_China
        
         | follower wrote:
         | After some brief research it seems the issue you're seeing may
          | be a known bug in at least some versions/releases of espeak-ng.
         | 
         | Here's some potentially related links if you'd like to dig
         | deeper:
         | 
         | * "questions about mandarin data packet #1044":
         | https://github.com/espeak-ng/espeak-ng/issues/1044
         | 
         | * "ESpeak NJ-1.51's Mandarin pronunciation is corrupted
         | #12952": https://github.com/nvaccess/nvda/issues/12952
         | 
         | * "The pronunciation of Mandarin Chinese using ESpeak NJ in
         | NVDA is not normal #1028": https://github.com/espeak-ng/espeak-
         | ng/issues/1028
         | 
         | * "When espeak-ng translates Chinese (cmn), IPA tone symbols
         | are not output correctly #305":
         | https://github.com/rhasspy/piper/issues/305
         | 
         | * "Please default ESpeak NG's voice role to 'Chinese (Mandarin,
         | latin as Pinyin)' for Chinese to fix #12952 #13572":
         | https://github.com/nvaccess/nvda/issues/13572
         | 
         | * "Cmn voice not correctly translated #1370":
         | https://github.com/espeak-ng/espeak-ng/issues/1370
        
       | devinprater wrote:
        | ESpeak is pretty great, and now that Piper is using it, hopefully
        | strange issues like it saying "nineteen hundred eighty four" for
        | the year 1984 can be fixed.
        
         | follower wrote:
         | Yeah, it would be nice if the financial backing behind
         | Rhasspy/Piper led to improvements in espeak-ng too but based on
         | my own development-related experience with the espeak-ng code
         | base (related elsewhere in the thread) I suspect it would be
         | significantly easier to extract the specific required text to
         | phonemes functionality or (to a certain degree) reimplement it
         | (or use a different project as a base[3]) than to more
         | closely/fully integrate changes with espeak-ng itself[4]. :/
         | 
         | It seems Piper currently abstracts its phonemize-related
         | functionality with a library[0] that currently makes use of a
         | espeak-ng fork[1].
         | 
         | Unfortunately it also seems license-related issues may have an
         | impact[2] on whether Piper continues to make use of espeak-ng.
         | 
         | For your specific example of handling 1984 as a year, my
         | understanding is that espeak-ng _can_ handle situations like
          | that via parameters/configuration but in my experience there
         | can be unexpected interactions between different
         | configuration/API options[6].
         | 
         | [0] https://github.com/rhasspy/piper-phonemize
         | 
         | [1] https://github.com/rhasspy/espeak-ng
         | 
         | [2] https://github.com/rhasspy/piper-
         | phonemize/issues/30#issueco...
         | 
         | [3] Previously I've made note of some potential options here:
         | https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...
         | 
         | [4] For example, as I note here[5] there's currently at least
          | _four_ different ways to access espeak-ng's phoneme-related
         | functionality--and it seems that they _all_ differ in their
         | output, sometimes consistently and other times dependent on
         | configuration (e.g. audio output mode, spoken punctuation) and
         | probably also input. : /
         | 
         | [5] https://gitlab.com/RancidBacon/floss-various-
         | contribs/-/blob...
         | 
         | [6] For example, see my test cases for some other numeric-
         | related configuration options here:
         | https://gitlab.com/RancidBacon/floss-various-contribs/-/blob...
        
       | synergy20 wrote:
       | just used it a few days ago, the quality is honestly subpar.
       | 
        | I use Chrome's 'Read Aloud' extension, which is as natural as you
       | can get.
        
         | follower wrote:
         | It's been mentioned elsewhere in the comments but espeak-ng has
         | historically prioritized accessibility use cases which is a
         | domain where "quality" doesn't necessarily correlate with
         | "naturalness" (e.g. there is a preference for clarity at high
         | words-per-minute rates of speech where the speech doesn't sound
         | "natural" but is still understandable, for people who have
         | acclimatized to it through daily use, at least :) ).
        
       | iamleppert wrote:
       | SORA AI should integrate this into their LLM.
        
       | spdustin wrote:
       | Now I just want DECTalk ported to MacOS. The original Stephen
       | Hawking voice.
       | 
       | I have an Emic2 board I use (through UART so my ESP32 can send
       | commands to it) and I use Home Assistant to send notifications to
       | it. My family are science nerds like me, so when the voice of
       | Stephen Hawking tells us there is someone at the door, it brings
       | a lot of joy to us.
        
         | DoktorDelta wrote:
         | I found this, might be promising:
         | https://www.reddit.com/r/DecTalk/comments/rhz01n/dectalk_for...
        
         | incontrol_77 wrote:
         | Here you go: https://github.com/dectalk/dectalk
        
           | spdustin wrote:
           | Awesome, thank you!
        
       | manzanarama wrote:
       | hugging face?
        
       | follower wrote:
       | Based on my own recent experience[0] with espeak-ng, IMO the
       | project is currently in a really tough situation[3]:
       | 
       | * the project seems to provide real value to a huge number of
       | people who rely on it for reasons of accessibility (even more so
       | for non-English languages); and,
       | 
       | * the project is a valuable trove of knowledge about multiple
       | languages--collected & refined over multiple decades by both
       | linguistic specialists and everyday speakers/readers; but...
       | 
       | * the project's code base is very much of "a different era"
       | reflecting its mid-90s origins (on RISC OS, no less :) ) and a
       | somewhat piecemeal development process over the following decades
       | --due in part to a complex Venn diagram of skills, knowledge &
       | familiarity required to make modifications to it.
       | 
       | Perhaps the prime example of the last point is that `espeak-ng`
       | has a _hand-rolled XML parser_ --which attempts to handle both
       | valid & invalid SSML markup--and markup parsing is interleaved
       | with internal language-related parsing in the code. And this is
       | implemented in C.
       | 
       | [Aside: Due to this I would _strongly_ caution against feeding
       | "untrusted" input to espeak-ng in its current state but
       | unfortunately that's what most people who rely on espeak-ng for
       | accessibility purposes inevitably do while browsing the web.]
       | 
       | [TL;DR: More detail/repros/observations on espeak-ng issues here:
       | 
       | * https://gitlab.com/RancidBacon/floss-various-contribs/-/blob...
       | 
       | * https://gitlab.com/RancidBacon/floss-various-contribs/-/blob...
       | 
       | * https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...
       | 
       | ]
       | 
       | Contributors to the project are not unaware of the issues with
       | the code base (which are exacerbated by the difficulty of even
       | tracing the execution flow in order to understand how the library
       | operates) nor that it would benefit from a significant
       | refactoring effort.
       | 
       | However as is typical with such projects which greatly benefit
       | individual humans but don't offer an opportunity to generate
       | significant corporate financial return, a lack of developers with
       | sufficient skill/knowledge/time to devote to a significant
        | refactoring means a "quick workaround" for a specific individual
       | issue is often all that can be managed.
       | 
       | This is often exacerbated by outdated/unclear/missing
       | documentation.
       | 
       | IMO there are two contribution approaches that could help the
       | project moving forward while requiring the least amount of
       | specialist knowledge/experience:
       | 
       | * Improve visibility into the code by adding logging/tracing to
       | make it easier to see why a particular code path gets taken.
       | 
        | * Integrate an existing XML parser as a "pre-processor" to ensure
        | that only valid/"sanitized"/cleaned-up XML is passed through to
        | the SSML parsing code--this would increase robustness/safety and
        | facilitate future removal of XML parsing-specific workarounds
        | from the code base (leading to less tangled control flow) and
        | potentially future removal/replacement of the entire bespoke XML
        | parser (a rough sketch of this idea follows below).
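        | 
        | A rough sketch of that second idea (illustrative only; a real
        | integration would also whitelist SSML elements/attributes before
        | handing the result to espeak-ng):
        | 
        |     # accept only well-formed SSML, else fall back to plain text
        |     import xml.etree.ElementTree as ET
        |     from xml.sax.saxutils import escape
        |     def sanitize_ssml(raw):
        |         try:
        |             root = ET.fromstring(raw)  # rejects malformed markup
        |         except ET.ParseError:
        |             return "<speak>%s</speak>" % escape(raw)
        |         if root.tag != "speak":
        |             return ("<speak>%s</speak>"
        |                     % escape("".join(root.itertext())))
        |         return ET.tostring(root, encoding="unicode")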
       | 
       | Of course, the project is not short on ideas/suggestions for how
       | to improve the situation but, rather, direct developer
       | contributions so... _shrug_
       | 
       | In light of this, last year when I was developing the personal
       | project[0] which made use of a dependency that in turn used
       | espeak-ng I wanted to try to contribute something more tangible
       | than just "ideas" so began to write-up & create reproductions for
       | some of the issues I encountered while using espeak-ng and at
       | least document the current behaviour/issues I encountered.
       | 
       | Unfortunately while doing so I kept encountering _new_ issues
       | which would lead to the start of yet another round of debugging
       | to try to understand what was happening in the new case.
       | 
       | Perhaps inevitably this effort eventually stalled--due to a
       | combination of available time, a need to attempt to prioritize
       | income generation opportunities and the downsides of living with
       | ADHD--before I was able to share the fruits of my research.
       | (Unfortunately I seem to be way better at discovering & root-
       | causing bugs than I am at writing up the results...)
       | 
       | However I just now used the espeak-ng project being mentioned on
       | HN as a catalyst to at least upload some of my notes/repros to a
        | public repo (see links in TLDR section above) in the hopes that
       | maybe they will be useful to someone who might have the
       | time/inclination to make a more direct code contribution to the
       | project. (Or, you know, prompt someone to offer to fund my
       | further efforts in this area... :) )
       | 
       | [0] A personal project to "port" my "Dialogue Tool for Larynx
       | Text To Speech" project[1] to use the more recent Piper TTS[2]
       | system which makes use of espeak-ng for transforming text to
       | phonemes.
       | 
       | [1] https://rancidbacon.itch.io/dialogue-tool-for-larynx-text-
       | to... & https://gitlab.com/RancidBacon/larynx-
       | dialogue/-/tree/featur...
       | 
       | [2] https://github.com/rhasspy/piper
       | 
       | [3] Very much no shade toward the project intended.
        
       ___________________________________________________________________
       (page generated 2024-05-02 23:01 UTC)