[HN Gopher] Speech Synthesis on Linux (2020)
       ___________________________________________________________________
        
       Speech Synthesis on Linux (2020)
        
       Author : ducktective
       Score  : 137 points
       Date   : 2021-09-25 08:42 UTC (14 hours ago)
        
 (HTM) web link (darkshadow.io)
 (TXT) w3m dump (darkshadow.io)
        
       | wiz21c wrote:
        | Listening to the 2001 examples (I'm sorry, Dave...) I wonder:
        | would it be possible to train an AI to copy a voice based only on
        | a few samples? It'd have to "model" the voice on just a few
        | minutes of speech... But I'd love my computer to use HAL's voice
        | for sure!
        
       | TylerLives wrote:
        | I followed the instructions and I'm getting:
        | 
        |     Warning: HTS_fopen: Cannot open hts/htsvoice.
        |     aplay: main:666: bad speed value 0
        |     aplay: main:666: bad speed value 0
        |     nil
        | 
        | I'm using Pop OS 20.04 (based on the same version of Ubuntu) and
        | Festival 2.5.0
        
         | frumiousirc wrote:
          | Perhaps these lines from Debian 11's /etc/festival.scm are
          | missing from your install; adding them may help:
          | 
          |     ;; Debian-specific: Use aplay to play audio
          |     (Parameter.set 'Audio_Command "aplay -q -c 1 -t raw -f s16 -r $SR $FILE")
          |     (Parameter.set 'Audio_Method 'Audio_Command)
          | 
          | In any case, I do not see this error on Debian.
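          | 
          | As a quick sanity check that Festival can reach your audio
          | device at all, you can pipe text straight to it (a minimal
          | sketch; the phrase is arbitrary):
          | 
          |     import subprocess
          | 
          |     # festival --tts reads text from stdin and speaks it
          |     subprocess.run(["festival", "--tts"],
          |                    input="Hello from Festival.",
          |                    text=True, check=True)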
         | 
          | I did, however, go through the motions of installing the nitech
         | voices _before_ reading that they only work with older versions
         | of festival. Doh!
        
           | superkuh wrote:
            | Yep. This is one of the primary reasons I keep a 2010-era
            | Ubuntu 10.04 box around: a working platform for festival 1.9x
            | and high quality TTS voices. Modern distros packaging 2.x are
            | a big step backwards. I've tried to get things working as
            | well as they did in Ubuntu 10.04 on Ubuntu 14.04 and it's not
            | really possible; it's even worse on Debian 10/11.
        
           | TylerLives wrote:
           | Those lines were already in the config file. I'll try
           | compiling Festival from source when I get home. Some people
           | say that worked for them.
        
       | felixr wrote:
        | https://github.com/coqui-ai/TTS the continuation of Mozilla TTS
        | produces quite nice results if you pick the right models
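        | 
        | For instance, a minimal sketch that drives Coqui TTS's
        | command-line entry point after `pip install TTS` (the model name
        | is one of the pretrained models listed by `tts --list_models`;
        | the text and output path are placeholders):
        | 
        |     import subprocess
        | 
        |     subprocess.run([
        |         "tts",
        |         "--text", "Hello from Coqui TTS.",
        |         "--model_name", "tts_models/en/ljspeech/tacotron2-DDC",
        |         "--out_path", "hello.wav",
        |     ], check=True)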
        
         | tomcumming wrote:
          | How far away are we from a Debian package that will work with
          | speechd?
        
           | pabs3 wrote:
            | A real Debian package is unlikely. To start with, Debian
            | doesn't have a GPU farm for retraining the model from the
            | source data; the training probably requires proprietary GPU
            | drivers; and a subset of the source data is probably
            | proprietary.
           | 
           | https://salsa.debian.org/deeplearning-team/ml-policy
           | 
           | Stuffing an existing model into a .deb is of course fairly
           | easy.
        
             | codetrotter wrote:
              | Speaking of retraining, I think it's also potentially a bit
              | hairy with regard to reproducible builds, don't you think?
              | My impression is that machine learning models are often
              | initialized with random values before training begins, and
              | some of them may use additional random data during training
              | as well.
        
               | DiogenesKynikos wrote:
               | You can fix the seed for the pseudorandom number
               | generator.
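                | 
                | A minimal sketch of what that looks like in a
                | PyTorch-based training script (assuming PyTorch; full
                | determinism can require more, e.g. cuDNN settings):
                | 
                |     import random
                |     import numpy as np
                |     import torch
                | 
                |     SEED = 42
                |     random.seed(SEED)        # Python's own PRNG
                |     np.random.seed(SEED)     # NumPy (e.g. data shuffling)
                |     torch.manual_seed(SEED)  # weight initialization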
        
         | karussell wrote:
          | I can confirm this. The setup was also relatively easy for me.
        
       | butz wrote:
        | How do I get some decent-sounding English voices into Firefox, to
        | use in reader view and with the Speech API? By default on Fedora
        | 34, Firefox offers hundreds of variations (?) of the same
        | robotic-sounding voice.
        
         | ajot wrote:
         | I had the same question some months ago, as it helps me to
         | focus and read long articles. You have to do some fiddling but
         | it's doable.
         | 
         | https://askubuntu.com/questions/953509/how-can-i-change-the-...
        
       | pizza wrote:
        | Cool article; I do like the extensibility provided via the Unix
        | philosophy. Another thing you can do nowadays is use
        | off-the-shelf deep neural networks to do TTS, e.g.:
       | 
       | https://github.com/NVIDIA/tacotron2
       | 
       | https://github.com/mozilla/TTS
       | 
       | https://github.com/CorentinJ/Real-Time-Voice-Cloning
       | 
       | https://github.com/coqui-ai/TTS
       | 
        | They're not all easy to set up, however.
        
         | ducktective wrote:
          | Why do so many ML-based projects not release a binary or a
          | package?
        
           | echelon wrote:
           | An overwhelming number of reasons.
           | 
           | Because you need data, trained models, etc.
           | 
           | Because data scientists aren't typically product people or
           | software engineers with UX in mind.
           | 
           | Because ML packages are brittle and tied to specific hardware
           | configurations.
           | 
           | Because the ML world is evolving rapidly. It's quick, dirty,
           | and messy.
           | 
           | View these as stepping stones for research and product
           | development.
           | 
           | (I created https://vo.codes using a lot of these, fwiw, in an
           | attempt to make it easy.)
        
           | Blikkentrekker wrote:
            | Because they wouldn't know the specific ABI of your system,
            | which on many systems changes in a rolling fashion as well.
            | 
            | They would have to release a great many different builds, or
            | alternatively bundle all the libraries with them.
        
             | moron4hire wrote:
              | That's why all major software is released exclusively as
              | source code /s
        
               | Blikkentrekker wrote:
               | They have the time and money to research this issue on
               | every system, and they typically do bundle libraries.
               | 
                | I once saw a comparison showing that the _LibreOffice_
                | package _Debian_ itself provided was 20% of the size of
                | the package _LibreOffice_ provided targeting _Debian_.
                | The latter would not receive the same security bugfixes
                | to libraries, but of course also not the same problems
                | that often arise on _Debian_ when they arrogantly patch
                | libraries they barely understand and create their own
                | unique security problems.
        
       | amelius wrote:
       | This article would be better if they provided some speech
       | examples. Demonstrate to the reader what they will get before
       | they go through all the trouble of installing the software.
        
       | 57844743385 wrote:
       | I did a lot of work researching all the available text to speech
       | systems a couple of years ago.
       | 
       | The cloud based systems from Google, Microsoft, Amazon and IBM
       | are much better than anything else, and within them, the neural
       | network based systems, which appear to be a sort of different
       | product category, are far and away the best of all. The neural
       | voices are approaching natural voice intonation and have an
       | almost believable ability to read text.
       | 
        | The ones that sounded most natural were IBM Watson and Google's
        | neural voices.
        | 
        | Amazon Polly appeared to be the furthest behind of all the cloud
        | systems... a really average-sounding product.
        | 
        | Of the local TTS systems, the one built into macOS sounds the
        | best... but they were all very average at best. All the Linux
        | ones frankly sounded like garbage relative to the state of the
        | art.
       | 
       | Things might have advanced with the cloud systems over the past
       | couple of years but I didn't get the impression the cloud
       | companies were putting much effort into research and development.
        
         | a9h74j wrote:
          | I had reason to sample the IBM performance recently. It is
          | impressive. Do you know if NN-based systems have been trained
          | on, say, audiobooks for which the text is also available?
        
         | bluebirdfirewin wrote:
          | I searched for a TTS service recently and found WellSaid Labs.
          | It's a SaaS product but the quality is astonishing. It's also
          | fast at rendering the audio, taking approximately twice the
          | length of the audio file. Here is an MIT Technology Review
          | article about it:
          | https://www.technologyreview.com/2021/07/09/1028140/ai-voice...
        
       | smcameron wrote:
        | There's also pico2wave (libttspico), which, to me, with the
        | "-l=en-GB" flag, sounds the best _by far_ of any offline TTS that
        | I've tried.
       | 
       | You can hear it in this video:
       | https://www.youtube.com/watch?v=tfcme7maygw&t=131s
        
         | giuseppeciuni wrote:
          | I agree. I use pico2wave too; after testing other TTS systems,
          | pico2wave has the best voice among offline options. I use it
          | combined with Home Assistant: whenever a window trigger fires,
          | pico2wave generates a wav file, which is played by the aplay
          | command and transmitted to a '90s stereo hi-fi. The result is:
          | "the window x is opening because x".
          | 
          | The Italian voice sounds great.
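          | 
          | A minimal sketch of that pipeline (the message and wav path
          | are placeholders; on Debian/Ubuntu, pico2wave ships in the
          | libttspico-utils package):
          | 
          |     import subprocess
          | 
          |     msg = "The kitchen window is opening."
          |     subprocess.run(["pico2wave", "--lang=en-GB",
          |                     "--wave=/tmp/tts.wav", msg], check=True)
          |     subprocess.run(["aplay", "-q", "/tmp/tts.wav"], check=True)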
        
       | [deleted]
        
       | kongin wrote:
       | I find that people who have never used text to speech think that
       | the closer it is to real speech the better it is.
       | 
       | Which is simply not the case.
       | 
       | Artificial speech is to human speech what typography is to
       | handwriting.
       | 
        | For example, espeak is by far my number one choice for reading
       | anything, because the voice models it uses can be sped up to 1k
       | wpm and still be understandable. This is basically a superpower
       | when skimming boring documentation of any type. Throw in basic
       | tesseract OCR and in a 45 minute sitting I can go through 30k
       | words of any document that can be displayed on a computer screen.
       | 
       | It's not that I'm stuck with a terrible robotic voice, it's that
       | I don't want anything "better" in the same way that I don't see
       | much value going past the command line for most tools when you
       | can use ncurses.
        
         | simion314 wrote:
          | Same here. For accessibility purposes the current voices are
          | good enough, though I admit that at the beginning I lost a lot
          | of time trying to find good voices, until I trained myself on
          | faster speeds.
          | 
          | So probably most people here researched this topic not for
          | accessibility reasons but for "commercial" uses, like creating
          | some kind of service where chat bots speak to you, or
          | transcribing articles for regular people (without eye problems)
          | to listen to.
        
         | rockemsockem wrote:
         | Would this be true for things like audiobooks and particularly
         | interesting long-form articles on the web that you actually
         | want to listen to as opposed to "getting through"?
        
         | toast0 wrote:
          | It depends on the use case. Most of us who don't use TTS
          | directly are exposed to it through public announcement systems
          | in transit stations and airports, through phone systems, or
          | sometimes through voice verification codes.
         | 
         | More natural speech patterns would be useful in those venues.
        
           | fiddlerwoaroof wrote:
            | I wonder if people will hit an uncanny valley here: my
            | experience with video animation was that sometime around "The
            | Polar Express", animated-movie makers realized that audiences
            | didn't really want more and more realistic animation.
        
             | toast0 wrote:
             | I think there's certainly a good enough point, yeah. But
             | last time I was on a CalTrain platform, their TTS
             | announcement still couldn't pronounce CalTrain, unless it's
             | expected to be pronounced call train.
        
       | okamiueru wrote:
        | I played around with flite a decade ago, and even then it felt
        | nowhere close to the fidelity of other speech synthesis examples.
        | I find it surprising that there still isn't anything better than
        | festival/flite. It sounded like a clunky robot then, and it still
        | does today. Surely some of the many research projects have
        | released their work as open source?
       | 
       | Work like, say https://arxiv.org/abs/1806.04558 [paper]
       | 
       | https://github.com/CorentinJ/Real-Time-Voice-Cloning [repo]
        
         | danuker wrote:
         | The HTS voice from NIT recommended in the article
         | (voice_cmu_us_slt_arctic_hts) actually sounds much better than
         | the clunky robot from a decade ago. Hear it here:
         | 
         | https://youtu.be/MmcLFJQpv2o?t=85
         | 
         | Edit: or on the online demo; select "HMM-based method (HTS
         | 2011) - Combilex" > "SLT (English American female)".
         | 
         | https://www.cstr.ed.ac.uk/projects/festival/onlinedemo.html
        
           | okamiueru wrote:
            | It does indeed sound much better, yes. But that voice was
            | already there a decade ago. It's not... hm. Let me just say
            | that I don't wish to disparage the work done on those
            | projects, as I do think it is great. Maybe my point is better
            | illustrated by listening to this video showcasing the project
            | I mentioned, as machine learning techniques have progressed
            | immensely over the last decade:
            | https://www.youtube.com/watch?v=-O_hYhToKoA
           | 
            | There are of course great benefits to something simple to
            | use. I remember cross-compiling flite for a custom SDL-based
            | android/windows/linux project to generate voice lines for an
            | in-game robot companion (nothing came of it, though). It
            | probably would not be nearly as feasible to do the same with
            | some dependency-heavy machine learning library.
           | 
            | Now, I haven't done any research to find better examples of
            | projects. I was just surprised at how identical the options
            | the article describes are to what was available 12 years ago.
        
         | usui wrote:
          | Yes, I think that consumer-level state-of-the-art speech
          | synthesis is still pretty far from acceptable. Amazon Polly
          | doesn't sound too great, and presumably it has more than enough
          | big data to leverage and cloud computing to work with.
         | 
         | https://aws.amazon.com/polly/
         | https://www.youtube.com/watch?v=00D0YZ9GQX4
         | 
         | Either we're just not there yet technologically (hard to
         | believe), or there isn't a will to make good speech synthesis
         | available to commoners.
        
           | inside_out_life wrote:
            | Some of Amazon's voices sound amazing to me; I've actually
            | tested a few of them and people couldn't tell they were
            | synthetic. Watson's voices are nice too. (AFAIK Amazon bought
            | the Polish company IVONA for their Polly TTS system, which
            | was long regarded as one of the best.)
        
             | ObscureScience wrote:
              | How do you think ReadSpeaker's voices compare?
              | https://www.readspeaker.com
        
           | danuker wrote:
            | I find these voices much more intelligible than, say, Stephen
           | Hawking's TTS.
        
         | ClawsOnPaws wrote:
         | I thought I'd just add another fun fact/data point here. This
         | is obviously my personal opinion. I have to use TTS to use my
         | computer with a screen reader, and for that, I mostly prefer
         | more synthetic speech. When I read long form text like books,
         | articles, etc. I do prefer more natural voices, but for doing
         | actual work like reading code or simply using user interfaces,
         | I like the predictability of more synthetic/algorithmic speech.
         | Apple added the neural Siri voices to the new VoiceOver. They
         | sound incredible but the quality of the voice also brings
         | latency with it. Something like ESpeak is much, much more
         | performant and predictable, and it speeds up much better. I use
         | my TTS at a very fast rate and I find that the more natural a
         | voice, the harder it is to understand at that speech rate.
         | Neural voices speak the same phrase of text differently every
         | time it's uttered. Slightly different intonation, slightly
         | different speech rhythm. This makes it hard to listen out for
         | patterns. So for me there's definitely still a place for
         | synthetic speech.
        
           | mwcampbell wrote:
           | In fact (as I'm sure you know), one of the most beloved
           | speech synthesizers among English-speaking blind users is a
           | closed-source product called ETI-Eloquence that has been
           | basically dead for nearly 20 years. (It was ported to Android
           | several years ago, but that port was discontinued because
           | they couldn't update it for 64-bit.) No recent speech
           | synthesizer has quite matched its consistent intelligibility,
           | particularly at high speeds. espeak-ng comes close, but it
           | has a bad reputation (mostly, I think, leftover from earlier
           | versions of espeak that really weren't very good).
           | 
           | Edit: Sample of ETI-Eloquence at my preferred speed:
           | https://mwcampbell.us/audio/eloquence-sample-2021-09-25.mp3
           | (yes, it mispronounces "espeak")
           | 
           | Edit 2: To elaborate on what I mean by "mostly dead": In 2009
           | I was tasked with adding support for ETI-Eloquence to a
           | Windows screen reader I developed. At that time, Nuance was
           | still selling Eloquence to companies like the one I worked
           | for back then. When I got the SDK, the timestamps on the
           | files, particularly the main DLLs, were from 2002. As far as
           | I know, an updated SDK for Windows was never released. I'm
           | thankful for Windows's legendary emphasis on backward
           | compatibility, particularly compared to Apple platforms and
           | even Android.
           | 
            | Finally, a sample of espeak-ng (in the NVDA screen reader) at
            | my preferred speed:
            | https://mwcampbell.us/audio/espeak-ng-sample-2021-09-25.mp3
            | I use the default British pronunciation even though I'm
            | American, because the American pronunciation is noticeably
            | off.
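            | 
            | For anyone who wants to try the same configuration, a minimal
            | sketch (the rate is a placeholder; tune -s to taste):
            | 
            |     import subprocess
            | 
            |     # -v en-gb picks the British variant; -s is words/minute
            |     subprocess.run(["espeak-ng", "-v", "en-gb", "-s", "450",
            |                     "It reacts pretty much instantly."],
            |                    check=True)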
        
             | ClawsOnPaws wrote:
             | > In fact (as I'm sure you know), one of the most beloved
             | speech synthesizers among English-speaking blind users is a
             | closed-source product called ETI-Eloquence that has been
             | basically dead for nearly 20 years.
             | 
             | This is exactly the speech synthesizer I use daily. I've
             | gotten so used to it over the years that switching away
             | from it is painful. On Apple platforms, though, using it is
             | not an option. So I use Karen. Used to use Alex, but Karen
             | appears to be slightly more responsive and tries to do less
             | human stuff when reading. Responsiveness is a very
             | important factor, actually. Probably more so than people
             | might realize. Eloquence and ESpeak react pretty much
                | instantly whereas other voices might take 100 ms or so.
             | This is a very big deal for me. Just like how one would
             | like instant visual feedback on their screen, it's the same
             | for me with speech. The less latency, the better. My
             | problem with ESpeak is that it sounds very rough and
             | metallic whereas Eloquence has a much warmer sound to it. I
             | pitch mine down slightly to get an even warmer sound. Being
             | pleasant on the ears is super important if you listen to
             | the thing many, many hours a day.
        
               | mwcampbell wrote:
               | I agree with you that Eloquence sounds warmer than
                | eSpeak. I wish there were an open-source speech
               | synthesizer comparable to Eloquence or even DECtalk. That
               | approach to speech synthesis is old enough now that I'm
               | sure there are published algorithms whose patents have
               | expired. The problem, of course, would be funding the
               | work on a good open-source implementation.
        
               | app4soft wrote:
                | What about _RHVoice_?[0,1,2]
               | 
               | [0] https://github.com/RHVoice/RHVoice
               | 
               | [1] https://rhvoice.org/en-voices/
               | 
                | [2] https://f-droid.org/en/packages/com.github.olga_yakovleva.rh...
        
             | machawinka wrote:
             | That is a bit too fast for me. I am surprised your brain
              | can process at such a high speed. I guess it is a matter of
             | practice.
        
           | chrismorgan wrote:
           | I've heard exactly this from a couple of blind people I've
           | interacted with too.
        
           | machawinka wrote:
            | Great insight. It is monotonous, but it really helps to stay
            | in the flow for a few hours of productive work.
        
         | Blikkentrekker wrote:
          | I don't understand this approach when audio deepfakes exist
          | that can quite realistically make Ayn Rand read arbitrary
          | texts[1]. Is it simply a matter of processing power?
         | 
         | [1]:
         | https://www.youtube.com/watch?v=hDVuh4A-q3Q&ab_channel=Vocal...
        
       | miki123211 wrote:
        | The current state of open-source (or even freeware) speech
        | synthesis is pretty sad, to be honest.
       | 
        | You have eSpeak, which is GPLv3, so including it in your own
        | software is a problem. RH Voice can be compiled without GPL code,
        | but its language support is pretty limited. There's also SAM,
        | which is incredibly easy to port and incredibly light on
        | resources, but its licensing status is unknown, it's English-only,
        | and it just sounds bad, even to somebody used to robotic synths.
       | 
       | If you're developing for a popular platform, it probably has
          | something built-in, but if you're developing for embedded, you
          | need to pay thousands of dollars to Cerence (spun off from
          | Nuance) to even get started.
        
         | simion314 wrote:
          | If you make a product, then why not run the GPL code as a
          | background application and send it commands on what to speak?
          | It would be fair to contribute back any improvements if you
          | were able to add to the GPL stuff.
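          | 
          | A minimal sketch of that split using speech-dispatcher's Python
          | bindings (assuming the python3-speechd package; the client
          | talks to the daemon over IPC, so the GPL engine stays in its
          | own process):
          | 
          |     import speechd
          | 
          |     client = speechd.SSIPClient("my-product")  # name is arbitrary
          |     client.set_output_module("espeak-ng")      # the GPL engine
          |     client.speak("Hello from the background synthesizer.")
          |     client.close()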
        
         | sigg3 wrote:
         | > You have eSpeak, which is GPL V3, so including it in your own
         | software is a problem.
         | 
          | Can't you just include it as a separate module and provide any
          | improvements to it specifically upstream?
        
           | miki123211 wrote:
            | If you're doing this on Linux, you get IPC-related (or worse,
            | process-creation-related) latency. This is not a problem for,
            | e.g., occasional weather announcements, but it is a big issue
            | when creating things for the blind. If the speech synthesizer
            | speaks each time you press a key, and you generally need to
            | know what it said to decide whether to scroll further or open
            | the focused item, every bit of latency matters. That's one of
            | the reasons why blind people prefer robotic-sounding speech
            | synthesizers; they're usually less CPU-intensive, which
            | increases responsiveness.
           | 
           | If the device you're developing for uses some proprietary
           | firmware, a custom module might not even be an option.
        
         | [deleted]
        
         | mwcampbell wrote:
         | Do you happen to know if ETI-Eloquence is now owned by Cerence,
         | or if Microsoft got it with the Nuance acquisition? I'm afraid
         | I was a little too eager to suggest that Microsoft open-source
         | Eloquence when the news of the Nuance acquisition first came
         | out.
        
           | miki123211 wrote:
           | Vocalizer is definitely Cerence, that I'm sure of. All the
           | automotive stuff, which Vocalizer is a part of, was spun off
           | before the acquisition.
           | 
           | As an aside, because of Vocalizer's use in automotive, it
           | will probably be the only high-ish quality speech engine that
           | won't become fully cloud-based. VFO's claims about the
           | continued use of Vocalizer in JAWS seem to confirm that.
           | 
           | Regarding Eloquence itself, its status is not really known. I
           | would be extremely surprised if it was owned by Microsoft,
           | though. There's a hypothesis that nobody really knows who
           | actually owns it, there were multiple companies that assisted
           | in its development, including IBM. The product was so
           | unimportant to Nuance these days that they might not even
           | have considered it when doing the spinoff, leaving its
           | ownership uncertain. If this hypothesis is untrue, though,
           | I'd strongly suspect that Cerence is the owner, not
           | Microsoft.
        
       | dosman33 wrote:
       | I did a bunch of work with TTS about 15 years ago for a project
       | and had landed on Festival and the Nitech voices as the best free
       | option at that time. It's interesting that this seems to still be
       | the best free non-cloud option available.
       | 
        | What a lot of people don't realize is that Festival is intended
        | for creating new TTS voices based on your own voice. The fact
        | that it generates TTS is an artifact of its main function. I've
        | never messed with that functionality myself, but I always wonder
        | if someone could train a synthetic voice to sound better with a
        | larger sample set. The Nitech voices are definitely better, so
        | it's certainly possible to coax Festival into doing a better job.
        
       | jmiskovic wrote:
       | A nice enhancement for the system is having TTS read out the
       | currently selected text, triggered by a key shortcut.
       | 
        | I tried festival, but it was too complicated and my version
        | couldn't run the better voice models.
       | 
       | Instead I've used this repo to use upgraded flite:
       | https://github.com/kastnerkyle/hmm_tts_build/
       | 
       | I have mapped keyboard shortcuts Win+1 for normal speed, Win+2
       | for faster and Win+3 for really fast reading speed. I can use it
       | while reading, to enhance my focus. Neat.
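        | 
        | A minimal sketch of such a shortcut script (assuming xsel is
        | installed; espeak stands in for whatever TTS command you prefer,
        | and the rate argument maps to the Win+1/2/3 bindings):
        | 
        |     import subprocess
        |     import sys
        | 
        |     rate = sys.argv[1] if len(sys.argv) > 1 else "300"  # words/min
        |     # Grab the current X primary selection...
        |     sel = subprocess.run(["xsel", "-o"], capture_output=True,
        |                          text=True, check=True).stdout
        |     # ...and read it aloud
        |     subprocess.run(["espeak", "-s", rate, sel], check=True)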
        
       | synesthesiam wrote:
       | I created Larynx (https://github.com/rhasspy/larynx) to address
       | shortcomings I saw in Linux speech synthesis:
       | 
       | * Licensing (MIT)
       | 
       | * Quality (judge for yourself: https://rhasspy.github.io/larynx/)
       | 
       | * Speed (faster than real-time on amd64/aarch64)
       | 
       | * Voices/language support (9 languages, 50 voices)
       | 
        | I'm working now on integrating Larynx with
        | speech-dispatcher/Orca. The next version of Larynx will also
        | support a subset of SSML :)
        
         | tailspin2019 wrote:
         | Some of those sound really good.
         | 
         | I was going to comment that you didn't have any en_gb listed,
         | but it seems there's a bunch under en_us :)
         | 
         | Some rather good brit'ish accents in there me old mate!
        
           | synesthesiam wrote:
           | Thanks! They seemed to work fine with en_us phonemes, so I
           | haven't created a separate en_gb set yet. Maybe someday :)
        
         | FeepingCreature wrote:
         | Sweet, I've been hoping for good Linux TTS!
        
         | phkahler wrote:
          | Can it run on a Raspberry Pi?
        
           | synesthesiam wrote:
           | Yes! There's a Docker image and Debian package for both
           | 32-bit and 64-bit ARM. The 64-bit version is significantly
           | faster (especially with low quality set).
        
         | dm319 wrote:
          | Thought I'd give this a go, but I'm getting lots of errors
          | along the lines of 'Expected shape from model of {...} does not
          | match actual shape of {...} for output audio'. Tried the Debian
          | and Python methods of installation on an AMD Ryzen X13.
          | 
          | EDIT: despite those errors I can create output.wav. However,
          | interactive mode crashes with "No such file or directory:
          | 'play'".
        
           | synesthesiam wrote:
           | The shape warnings don't seem to matter (something to do with
            | the onnx runtime). Interactive mode needs sox installed (it
            | provides the "play" command), or you can specify a
            | --play-command.
        
       | hjek wrote:
        | Anyone know how to get that voice to work in Reader Mode in
        | Firefox on Debian?
        
       ___________________________________________________________________
       (page generated 2021-09-25 23:01 UTC)