[HN Gopher] Speech Synthesis on Linux (2020)
___________________________________________________________________
Speech Synthesis on Linux (2020)
Author : ducktective
Score : 137 points
Date : 2021-09-25 08:42 UTC (14 hours ago)
(HTM) web link (darkshadow.io)
(TXT) w3m dump (darkshadow.io)
| wiz21c wrote:
| Listening to the 2001 examples (I'm sorry, Dave...) I wonder:
| would it be possible to train an AI to copy a voice based only
| on a few samples? It'd have to "model" the voice from only a
| few minutes of speech... But I'd love my computer to use HAL's
| voice for sure!
| TylerLives wrote:
| I followed the instructions and I'm getting:
|         Warning: HTS_fopen: Cannot open hts/htsvoice.
|         aplay: main:666: bad speed value 0
|         aplay: main:666: bad speed value 0
|         nil
|
| I'm using Pop OS 20.04 (based on the same version of Ubuntu)
| and Festival 2.5.0.
| frumiousirc wrote:
| Perhaps these lines from Debian 11's /etc/festival.scm are
| missing from your install; adding them may help:
|         ;; Debian-specific: Use aplay to play audio
|         (Parameter.set 'Audio_Command
|           "aplay -q -c 1 -t raw -f s16 -r $SR $FILE")
|         (Parameter.set 'Audio_Method 'Audio_Command)
|
| In any case, I do not see this error on Debian.
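|
| (If editing the system-wide festival.scm isn't an option,
| Festival also reads a per-user init file at startup, so the
| same two lines should work from ~/.festivalrc.)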
|
| I did, however, go through the motions of installing the
| nitech voices _before_ reading that they only work with older
| versions of festival. Doh!
| superkuh wrote:
| Yep. This is one of the primary reasons I keep a 2010-era
| Ubuntu 10.04 box around: a working platform for festival 1.9x
| and high-quality TTS voices. Modern distros packaging 2.x are
| a big step backwards. I've tried to get things working as well
| as they did in Ubuntu 10.04 on Ubuntu 14.04 and it's not
| really possible; it's even worse on Debian 10/11.
| TylerLives wrote:
| Those lines were already in the config file. I'll try
| compiling Festival from source when I get home. Some people
| say that worked for them.
| felixr wrote:
| https://github.com/coqui-ai/TTS (the continuation of Mozilla
| TTS) produces quite nice results if you pick the right models.
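|
| For instance, with a recent release installed from PyPI (the
| model name below is just one of the published English models,
| picked for illustration):
|         pip install TTS
|         tts --list_models
|         tts --text "Hello from Linux." \
|             --model_name "tts_models/en/ljspeech/tacotron2-DDC" \
|             --out_path hello.wav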
| tomcumming wrote:
| How far away are we from a debian package that will work with
| speechd?
| pabs3 wrote:
| A real Debian package is unlikely. To start with, Debian
| doesn't have a GPU farm for retraining the model from the
| source data; beyond that, the training probably requires
| proprietary GPU drivers, and a subset of the source data is
| probably proprietary.
|
| https://salsa.debian.org/deeplearning-team/ml-policy
|
| Stuffing an existing model into a .deb is of course fairly
| easy.
| codetrotter wrote:
| Speaking of retraining, I think it's also potentially a bit
| hairy with regard to reproducible builds, don't you think?
| My impression is that machine learning models are often
| initialized with random values before training begins, and
| some of them may use additional random data while training
| as well.
| DiogenesKynikos wrote:
| You can fix the seed for the pseudorandom number
| generator.
| karussell wrote:
| I can confirm this. The setup was also relatively easy for me.
| butz wrote:
| How do you get some decent-sounding English voices into
| Firefox, for use in reader view and with the Speech API? By
| default on Fedora 34, Firefox offers hundreds of variations
| (?) of the same robotic-sounding voice.
| ajot wrote:
| I had the same question some months ago, as it helps me to
| focus and read long articles. You have to do some fiddling but
| it's doable.
|
| https://askubuntu.com/questions/953509/how-can-i-change-the-...
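|
| The gist of that answer, as I understand it: on Linux,
| Firefox gets its Web Speech voices from speech-dispatcher, so
| the voices on offer are whatever output modules speechd is
| configured with, e.g. (module choice illustrative):
|         # /etc/speech-dispatcher/speechd.conf
|         DefaultModule festival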
| pizza wrote:
| Cool article; I do like the extensibility provided via the
| Unix philosophy. Another thing you can do nowadays is use
| off-the-shelf deep neural networks to do TTS, e.g.:
|
| https://github.com/NVIDIA/tacotron2
|
| https://github.com/mozilla/TTS
|
| https://github.com/CorentinJ/Real-Time-Voice-Cloning
|
| https://github.com/coqui-ai/TTS
|
| They're not all easy to set up, however.
| ducktective wrote:
| Why do so many ML-based projects not release a binary or
| package?
| echelon wrote:
| An overwhelming number of reasons.
|
| Because you need data, trained models, etc.
|
| Because data scientists aren't typically product people or
| software engineers with UX in mind.
|
| Because ML packages are brittle and tied to specific hardware
| configurations.
|
| Because the ML world is evolving rapidly. It's quick, dirty,
| and messy.
|
| View these as stepping stones for research and product
| development.
|
| (I created https://vo.codes using a lot of these, fwiw, in an
| attempt to make it easy.)
| Blikkentrekker wrote:
| Because they wouldn't know the specific ABI of your system,
| which on many systems changes in a rolling fashion as well.
|
| They would have to release a great many different builds or
| alternatively bundle all libraries with them.
| moron4hire wrote:
| That's why all major software is released exclusively as
| source code /s
| Blikkentrekker wrote:
| They have the time and money to research this issue on
| every system, and they typically do bundle libraries.
|
| I once saw a comparison with _LibreOffice_ that showed
| that the package _Debian_ itself provided was 20% of
| the size of the package _LibreOffice_ provided targeting
| _Debian_, which would not receive the same benefit of
| security bugfixes to libraries, but of course also not
| the same problems that often arise on _Debian_ when they
| arrogantly patch libraries they barely understand and
| create their own unique security problems.
| amelius wrote:
| This article would be better if they provided some speech
| examples. Demonstrate to the reader what they will get before
| they go through all the trouble of installing the software.
| 57844743385 wrote:
| I did a lot of work researching all the available text to speech
| systems a couple of years ago.
|
| The cloud based systems from Google, Microsoft, Amazon and IBM
| are much better than anything else, and within them, the neural
| network based systems, which appear to be a sort of different
| product category, are far and away the best of all. The neural
| voices are approaching natural voice intonation and have an
| almost believable ability to read text.
|
| The ones that sounded most natural were IBM Watson and
| Google's neural voices.
|
| Amazon Polly appeared to be the furthest behind of all the
| cloud systems... a really average-sounding product.
|
| Of the local TTS systems, the one built into macOS sounds the
| best... but they were all very average at best. All the Linux
| ones frankly sounded like garbage relative to the state of
| the art.
|
| Things might have advanced with the cloud systems over the past
| couple of years but I didn't get the impression the cloud
| companies were putting much effort into research and development.
| a9h74j wrote:
| I had reason to sample the IBM performance recently. It is
| impressive. Do you know if NN-based systems have been trained
| on, say, audiobooks for which the text is also available?
| bluebirdfirewin wrote:
| I searched for a TTS service recently and found wellsaidlabs.
| It's a SaaS product but the quality is astonishing. It's also
| fast to render the audio, taking approximately 2 times the
| length of the audio file. Here is an article about it from
| MIT Technology Review:
| https://www.technologyreview.com/2021/07/09/1028140/ai-voice...
| smcameron wrote:
| There's also pico2wave (libttspico), which, to me, with the
| "-l en-GB" flag, sounds the best _by far_ of any offline TTS
| that I've tried.
|
| You can hear it in this video:
| https://www.youtube.com/watch?v=tfcme7maygw&t=131s
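|
| A minimal pipeline, assuming aplay for playback (pico2wave
| only writes to a file, so you synthesize and then play it):
|         pico2wave -l en-GB -w /tmp/say.wav "Hello there." \
|             && aplay -q /tmp/say.wav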
| giuseppeciuni wrote:
| I agree. I use pico2wave too, after testing other TTS
| systems; pico2wave has the best voice among offline systems.
| I use it combined with Home Assistant: whenever a window
| trigger fires, pico2wave generates a wav file, which is
| played by the aplay command through a '90s stereo hi-fi. The
| result is: "the window X is opening because X."
|
| The Italian voice sounds great.
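|
| The core of that announcement is a one-liner; something like
| this, with the message and paths as placeholders:
|         MSG="La finestra della cucina è aperta"
|         pico2wave -l it-IT -w /tmp/announce.wav "$MSG" \
|             && aplay /tmp/announce.wav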
| [deleted]
| kongin wrote:
| I find that people who have never used text to speech think that
| the closer it is to real speech the better it is.
|
| Which is simply not the case.
|
| Artificial speech is to human speech what typography is to
| handwriting.
|
| For example espeak is by far my number one choice for reading
| anything, because the voice models it uses can be sped up to 1k
| wpm and still be understandable. This is basically a superpower
| when skimming boring documentation of any type. Throw in
| basic tesseract OCR and in a 45-minute sitting I can go
| through 30k words of any document that can be displayed on a
| computer screen.
|
| It's not that I'm stuck with a terrible robotic voice, it's that
| I don't want anything "better" in the same way that I don't see
| much value going past the command line for most tools when you
| can use ncurses.
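|
| Concretely, something like this (speeds and file names are
| illustrative; -s sets words per minute, default 175, and
| espeak-ng accepts far higher rates than stock espeak):
|         espeak -s 450 -f boring-docs.txt
|         tesseract screenshot.png stdout | espeak -s 450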
| simion314 wrote:
| Same here. For accessibility purposes the current voices are
| good enough, though I admit that at the beginning I lost a
| lot of time trying to find good voices, until I trained
| myself on faster speeds.
|
| So probably most people here researched this topic not for
| accessibility reasons but for "commercial" uses, like
| creating some kind of service where chat bots speak to you,
| or transcribing articles for regular people (without eye
| problems) to listen to.
| rockemsockem wrote:
| Would this be true for things like audiobooks and particularly
| interesting long-form articles on the web that you actually
| want to listen to as opposed to "getting through"?
| toast0 wrote:
| It depends on the use case. Many people who don't use TTS
| directly are still exposed to it through public announcement
| systems in transit stations and airports, phone systems, and
| sometimes voice verification codes.
|
| More natural speech patterns would be useful in those venues.
| fiddlerwoaroof wrote:
| I wonder if people will hit an uncanny valley here: my
| experience with video animation was that sometime around
| "The Polar Express", animated-movie makers realized that
| audiences didn't really want more and more realistic
| animation.
| toast0 wrote:
| I think there's certainly a good enough point, yeah. But
| last time I was on a CalTrain platform, their TTS
| announcement still couldn't pronounce CalTrain, unless it's
| expected to be pronounced call train.
| okamiueru wrote:
| Having played around with flite a decade ago, and feeling
| even then that it was nowhere close to the fidelity of other
| speech synthesis examples, I find it surprising that there
| still isn't anything better than festival/flite. It sounded
| like a clunky robot then, and still does today. Surely some
| of the many research projects have released their work as
| open source?
|
| Work like, say, https://arxiv.org/abs/1806.04558 [paper]
|
| https://github.com/CorentinJ/Real-Time-Voice-Cloning [repo]
| danuker wrote:
| The HTS voice from NIT recommended in the article
| (voice_cmu_us_slt_arctic_hts) actually sounds much better than
| the clunky robot from a decade ago. Hear it here:
|
| https://youtu.be/MmcLFJQpv2o?t=85
|
| Edit: or on the online demo; select "HMM-based method (HTS
| 2011) - Combilex" > "SLT (English American female)".
|
| https://www.cstr.ed.ac.uk/projects/festival/onlinedemo.html
| okamiueru wrote:
| It does indeed sound much better, yes. But that voice was
| already there a decade ago. It's not... hm. Let me just say
| that I don't wish to disparage the work done on those
| projects, as I do think it is great. Maybe my point is better
| illustrated by taking a listen to this video, which showcases
| the project I mentioned; machine learning techniques have
| progressed immensely in the last decade:
| https://www.youtube.com/watch?v=-O_hYhToKoA
|
| There are of course great benefits to something simple to
| use. I remember cross-compiling flite to run in a custom
| SDL-based android/windows/linux project, to generate voice
| lines for an in-game robot companion (nothing came of it,
| though). It probably would not be nearly as feasible to do
| the same with some dependency-heavy machine learning library.
|
| Now, I haven't done any research to find better examples of
| projects. I was just surprised at how identical the options
| the article describes are to what was available 12 years ago.
| usui wrote:
| Yes, I think that consumer-level state-of-the-art speech
| synthesis is still pretty far from acceptable. Amazon Polly
| doesn't sound too great, and presumably it has more than
| enough big data to leverage and cloud computing to work
| with.
|
| https://aws.amazon.com/polly/
| https://www.youtube.com/watch?v=00D0YZ9GQX4
|
| Either we're just not there yet technologically (hard to
| believe), or there isn't a will to make good speech synthesis
| available to commoners.
| inside_out_life wrote:
| Some of Amazon's voices sound amazing to me; I've actually
| tested a few of them and people couldn't tell they're
| synthetic. Watson's voices are nice too. (AFAIK Amazon bought
| the Polish company IVONA for its Polly TTS system, which was
| long regarded as one of the best.)
| ObscureScience wrote:
| How do you think ReadSpeaker's voices compare?
| https://www.readspeaker.com
| danuker wrote:
| I find these voices much more intelligible than say, Stephen
| Hawking's TTS.
| ClawsOnPaws wrote:
| I thought I'd just add another fun fact/data point here. This
| is obviously my personal opinion. I have to use TTS to use my
| computer with a screen reader, and for that, I mostly prefer
| more synthetic speech. When I read long form text like books,
| articles, etc. I do prefer more natural voices, but for doing
| actual work like reading code or simply using user interfaces,
| I like the predictability of more synthetic/algorithmic speech.
| Apple added the neural Siri voices to the new VoiceOver. They
| sound incredible but the quality of the voice also brings
| latency with it. Something like ESpeak is much, much more
| performant and predictable, and it speeds up much better. I use
| my TTS at a very fast rate and I find that the more natural a
| voice, the harder it is to understand at that speech rate.
| Neural voices speak the same phrase of text differently every
| time it's uttered. Slightly different intonation, slightly
| different speech rhythm. This makes it hard to listen out for
| patterns. So for me there's definitely still a place for
| synthetic speech.
| mwcampbell wrote:
| In fact (as I'm sure you know), one of the most beloved
| speech synthesizers among English-speaking blind users is a
| closed-source product called ETI-Eloquence that has been
| basically dead for nearly 20 years. (It was ported to Android
| several years ago, but that port was discontinued because
| they couldn't update it for 64-bit.) No recent speech
| synthesizer has quite matched its consistent intelligibility,
| particularly at high speeds. espeak-ng comes close, but it
| has a bad reputation (mostly, I think, leftover from earlier
| versions of espeak that really weren't very good).
|
| Edit: Sample of ETI-Eloquence at my preferred speed:
| https://mwcampbell.us/audio/eloquence-sample-2021-09-25.mp3
| (yes, it mispronounces "espeak")
|
| Edit 2: To elaborate on what I mean by "mostly dead": In 2009
| I was tasked with adding support for ETI-Eloquence to a
| Windows screen reader I developed. At that time, Nuance was
| still selling Eloquence to companies like the one I worked
| for back then. When I got the SDK, the timestamps on the
| files, particularly the main DLLs, were from 2002. As far as
| I know, an updated SDK for Windows was never released. I'm
| thankful for Windows's legendary emphasis on backward
| compatibility, particularly compared to Apple platforms and
| even Android.
|
| Finally, a sample of espeak-ng (in the NVDA screen reader) at
| my preferred speed:
| https://mwcampbell.us/audio/espeak-ng-sample-2021-09-25.mp3
| I use the default British pronunciation even though I'm
| American, because the American pronunciation is noticeably
| off.
| ClawsOnPaws wrote:
| > In fact (as I'm sure you know), one of the most beloved
| speech synthesizers among English-speaking blind users is a
| closed-source product called ETI-Eloquence that has been
| basically dead for nearly 20 years.
|
| This is exactly the speech synthesizer I use daily. I've
| gotten so used to it over the years that switching away
| from it is painful. On Apple platforms, though, using it is
| not an option. So I use Karen. Used to use Alex, but Karen
| appears to be slightly more responsive and tries to do less
| human stuff when reading. Responsiveness is a very
| important factor, actually. Probably more so than people
| might realize. Eloquence and ESpeak react pretty much
| instantly whereas other voices might take 100 ms or so.
| This is a very big deal for me. Just like how one would
| like instant visual feedback on their screen, it's the same
| for me with speech. The less latency, the better. My
| problem with ESpeak is that it sounds very rough and
| metallic whereas Eloquence has a much warmer sound to it. I
| pitch mine down slightly to get an even warmer sound. Being
| pleasant on the ears is super important if you listen to
| the thing many, many hours a day.
| mwcampbell wrote:
| I agree with you that Eloquence sounds warmer than
| eSpeak. I wish there was an open-source speech
| synthesizer comparable to Eloquence or even DECtalk. That
| approach to speech synthesis is old enough now that I'm
| sure there are published algorithms whose patents have
| expired. The problem, of course, would be funding the
| work on a good open-source implementation.
| app4soft wrote:
| What about _RHVoice_?[0,1]
|
| [0] https://github.com/RHVoice/RHVoice
|
| [1] https://rhvoice.org/en-voices/
|
| [2] https://f-droid.org/en/packages/com.github.olga_yakovleva.rh...
| machawinka wrote:
| That is a bit too fast for me. I am surprised your brain can
| process speech at such a high rate. I guess it is a matter of
| practice.
| chrismorgan wrote:
| I've heard exactly this from a couple of blind people I've
| interacted with too.
| machawinka wrote:
| Great insight. It is monotonous, but it really helps to stay
| in the flow for a few hours of productive work.
| Blikkentrekker wrote:
| I don't understand this approach when audio deepfakes exist
| that can quite realistically make Ayn Rand read arbitrary
| texts[1]. -- Is it simply a matter of processing power?
|
| [1]:
| https://www.youtube.com/watch?v=hDVuh4A-q3Q&ab_channel=Vocal...
| miki123211 wrote:
| The current state of open source (or even freeware) speech
| synthesis is pretty sad, to be honest.
|
| You have eSpeak, which is GPLv3, so including it in your own
| software is a problem. RHVoice can be compiled without GPL
| code, but its language support is pretty limited. There's
| also SAM, which is incredibly easy to port and incredibly
| light on resources, but its licensing status is unknown, it's
| English-only, and it just sounds bad, even to somebody used
| to robotic synths.
|
| If you're developing for a popular platform, it probably has
| something built in, but if you're developing for embedded,
| you need to pay thousands of dollars to Cerence (formerly
| Nuance) to even get started.
| simion314 wrote:
| If you make a product, then why not run the GPL code as a
| background application and send it commands about what to
| speak? It would be fair to contribute back any improvements
| you were able to make to the GPL code.
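|
| A minimal sketch of that approach: run the GPL engine as its
| own process and pipe text to it, so only text crosses the
| process boundary (espeak-ng speaks text from stdin when given
| no arguments):
|         echo "Build finished." | espeak-ng -s 250
| For lower latency you would keep one long-lived process and
| feed it over a pipe rather than spawning per utterance.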
| sigg3 wrote:
| > You have eSpeak, which is GPL V3, so including it in your own
| software is a problem.
|
| Can't you just include it as a separate module and provide
| any improvements to it specifically upstream?
| miki123211 wrote:
| If you're doing this on Linux, you get IPC-related (or worse,
| process-creation-related) latency. This is not a problem for,
| e.g., occasional weather announcements, but it is a big issue
| when creating things for the blind. If the speech synthesizer
| speaks each time you press a key, and you generally need to
| know what it said to decide whether you want to scroll
| further or open the focused item, every bit of latency
| matters. That's one of the reasons why blind people prefer
| robotic-sounding speech synthesizers; they're usually less
| CPU-intensive, which increases responsiveness.
|
| If the device you're developing for uses some proprietary
| firmware, a custom module might not even be an option.
| [deleted]
| mwcampbell wrote:
| Do you happen to know if ETI-Eloquence is now owned by Cerence,
| or if Microsoft got it with the Nuance acquisition? I'm afraid
| I was a little too eager to suggest that Microsoft open-source
| Eloquence when the news of the Nuance acquisition first came
| out.
| miki123211 wrote:
| Vocalizer is definitely Cerence, that I'm sure of. All the
| automotive stuff, which Vocalizer is a part of, was spun off
| before the acquisition.
|
| As an aside, because of Vocalizer's use in automotive, it
| will probably be the only high-ish quality speech engine that
| won't become fully cloud-based. VFO's claims about the
| continued use of Vocalizer in JAWS seem to confirm that.
|
| Regarding Eloquence itself, its status is not really known. I
| would be extremely surprised if it were owned by Microsoft,
| though. There's a hypothesis that nobody really knows who
| actually owns it; multiple companies assisted in its
| development, including IBM. The product was so unimportant to
| Nuance by the end that they might not even have considered it
| when doing the spinoff, leaving its ownership uncertain. If
| this hypothesis is untrue, though, I'd strongly suspect that
| Cerence is the owner, not Microsoft.
| dosman33 wrote:
| I did a bunch of work with TTS about 15 years ago for a project
| and had landed on Festival and the Nitech voices as the best free
| option at that time. It's interesting that this seems to still be
| the best free non-cloud option available.
|
| What a lot of people don't realize is that Festival is
| intended for creating new TTS voices based on your own voice.
| The fact that it generates TTS is an artifact of its main
| function. I've never messed with that functionality myself,
| but I always wonder if someone could train a synthetic voice
| to sound better with a larger sample set. The Nitech voices
| are definitely better, so it's certainly possible to
| encourage Festival to do a better job.
| jmiskovic wrote:
| A nice enhancement for the system is having TTS read out the
| currently selected text, triggered by a key shortcut.
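|
| With xsel (or xclip) installed, that's a one-liner; bind
| something like this to a desktop keyboard shortcut (engine
| and speed are illustrative):
|         sh -c 'xsel -o | espeak-ng -s 300'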
|
| I tried festival, but it was too complicated and my version
| was too old to run the better voice models.
|
| Instead I've used this repo to get an upgraded flite:
| https://github.com/kastnerkyle/hmm_tts_build/
|
| I have mapped keyboard shortcuts Win+1 for normal speed, Win+2
| for faster and Win+3 for really fast reading speed. I can use it
| while reading, to enhance my focus. Neat.
| synesthesiam wrote:
| I created Larynx (https://github.com/rhasspy/larynx) to address
| shortcomings I saw in Linux speech synthesis:
|
| * Licensing (MIT)
|
| * Quality (judge for yourself: https://rhasspy.github.io/larynx/)
|
| * Speed (faster than real-time on amd64/aarch64)
|
| * Voices/language support (9 languages, 50 voices)
|
| I'm working now on integrating Larynx with speech-
| dispatcher/Orca. The next version of Larynx will also support a
| subset of SSML :)
| tailspin2019 wrote:
| Some of those sound really good.
|
| I was going to comment that you didn't have any en_gb listed,
| but it seems there's a bunch under en_us :)
|
| Some rather good brit'ish accents in there me old mate!
| synesthesiam wrote:
| Thanks! They seemed to work fine with en_us phonemes, so I
| haven't created a separate en_gb set yet. Maybe someday :)
| FeepingCreature wrote:
| Sweet, I've been hoping for good Linux TTS!
| phkahler wrote:
| Can it run on a raspberry pi?
| synesthesiam wrote:
| Yes! There's a Docker image and Debian package for both
| 32-bit and 64-bit ARM. The 64-bit version is significantly
| faster (especially with low quality set).
| dm319 wrote:
| Thought I'd give this a go, but I'm getting lots of errors
| along the lines of "Expected shape from model of {...} does
| not match actual shape of {...} for output audio". Tried the
| Debian and Python methods of installation on an AMD Ryzen
| X13.
|
| EDIT: despite those errors I can create output.wav. However,
| interactive mode crashes with "No such file or directory:
| 'play'".
| synesthesiam wrote:
| The shape warnings don't seem to matter (something to do with
| the onnx runtime). Interactive mode needs sox installed, or
| for you to specify a --play-command.
| hjek wrote:
| Anyone know how to get that voice to work in Reader Mode in
| Firefox on Debian?
___________________________________________________________________
(page generated 2021-09-25 23:01 UTC)