[HN Gopher] Google open-sources the Lyra audio codec
___________________________________________________________________
Google open-sources the Lyra audio codec
Author : chmaynard
Score : 137 points
Date : 2021-04-06 16:20 UTC (6 hours ago)
(HTM) web link (opensource.googleblog.com)
(TXT) w3m dump (opensource.googleblog.com)
| Thaxll wrote:
| Can't wait to try that in ffmpeg!
| unixhero wrote:
| I used to be fond of Google products.
| chintan wrote:
| Now what could go wrong? 4 dots becomes 3 dots...[1]
|
| 1). Silicon Valley - Finale S6E7
| https://www.youtube.com/watch?v=48Y77jSSHGU
| ncmncm wrote:
| A more useful system would take Opus-compressed data as input and
| feature-extract that, presumably faster than this thing. Bonus
| for not requiring a proprietary library like
| libsparse_inference.so.
|
| Also, instead of encoding independent 40ms segments, it should be
| much better to encode 10ms segments given the previous 30ms.
| ProAm wrote:
| Isn't Lyra the name of Facebook's cryptocurrency as well? I
| cannot remember if that project was shelved.
| mgraczyk wrote:
| Are you thinking of Libra, now called Diem?
|
| https://en.wikipedia.org/wiki/Diem_(digital_currency)
| ProAm wrote:
| Yes I was. Thanks, I had them mixed up.
| rektide wrote:
| There's huge wins but the grandiosity of "enabling voice calls"
| is grating. I don't think this will open many users to voice
| communication. It will reduce data-costs in a way that has an
| impact on a significant amount of people's bottom line. But I
| feel manipulated with the current headline, and by the long
| extended lack of ability to mix the very real hope with some
| measure of humility.
| p1mrx wrote:
| This seems kind of unnecessary, compared to Opus at ~10 kbps. If
| you're sending IPv6+UDP in 40 ms chunks, that's 9.6 kbps just
| from the packet headers (25 Hz * 40+8 bytes).
|
| When the voice payload is smaller than the packet headers, you're
| well into diminishing returns territory.
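| As a rough check of the arithmetic above, here is a minimal Python
| sketch. It assumes one packet per 40 ms frame and counts only the
| fixed 40-byte IPv6 header and the 8-byte UDP header; RTP, link-layer
| framing and encryption overhead are ignored, and the function name is
| made up for illustration.

```python
# Per-packet header sizes in bytes (fixed IPv6 header and UDP header).
IPV6_HEADER_BYTES = 40
UDP_HEADER_BYTES = 8


def header_overhead_bps(frame_ms: float) -> float:
    """Bits per second spent on IPv6+UDP headers, one packet per frame."""
    packets_per_second = 1000 / frame_ms
    return packets_per_second * (IPV6_HEADER_BYTES + UDP_HEADER_BYTES) * 8


if __name__ == "__main__":
    overhead = header_overhead_bps(40)
    payload = 3000  # Lyra's advertised 3 kbps payload
    print(f"header overhead: {overhead / 1000:.1f} kbps")
    print(f"payload share of total: {payload / (payload + overhead):.0%}")
```

| With a 3 kbps payload, the headers alone come to roughly three times
| the size of the voice data under these assumptions.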
| posguy wrote:
| Opus at 8Kbps sounds better, and commodity, inexpensive
| hardware like the Grandstream HT802 Analog Telephone Adapter
| supports this codec today (along with any cheap Android phone).
|
| Lyra as it stands today will not support anything outside of
| x86-64 and ARM64 without rewriting the proprietary kernel it
| relies on.
| pjc50 wrote:
| One thing I'm slightly worried about "machine learning" in
| compression rather than conventional everything-is-sines
| mathematical approaches is the possibility of odd nonlinear
| errors. Remember the photocopier that worked by OCR and would
| occasionally mis-transcribe numbers?
|
| I don't mind compressing a phoneme to <unintelligible> as much as
| I would mind it compressing it to a clearly audible _different_
| phoneme.
| 1-6 wrote:
| One day, voice cloning may become so powerful that only word
| data and intonations will become part of the datastream. There
| could be various 'layers' in which encodes/decodes can occur.
| Voice Cloning would be at the very top of the stack.
| minikites wrote:
| >Remember the photocopier that worked by OCR and would
| occasionally mis-transcribe numbers?
|
| For those who don't remember:
| http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...
| smnrchrds wrote:
| This one too:
|
| https://petapixel.com/2020/08/17/gigapixel-ai-accidentally-a...
| tyingq wrote:
| >photocopier that worked by OCR
|
| The interesting bit was that it wasn't supposed to work by
| OCR...that had been deliberately turned off. The compression
| was too clever.
| cbhl wrote:
| [disclaimer: Personal opinion, not that of my employer.]
|
| I had a coworker play me before/after of an early version of
| the codec "babbling" and it was definitely uncanny valley. It
| looks like some work has been done on the problem since then.
|
| The second paper linked in the README.md of the repo talks
| about a few strategies to reduce 'babbling' or 'babble'. For
| your reference, here's the citation and the link to the PDF.
|
| Denton, T., Luebs, A., Lim, F. S., Storus, A., Yeh, H., Kleijn,
| W. B., & Skoglund, J. (2021). Handling Background Noise in
| Neural Speech Generation. arXiv preprint arXiv:2102.11906.
|
| https://arxiv.org/pdf/2102.11906.pdf
| rexreed wrote:
| The OCR issue was the first thing I thought about. Machine
| learning is probabilistic, not deterministic, so in the case of
| S being converted to 5 (or 6 to 8, etc.), which definitely
| impacts numerical data in the case of the OCR stuff, we can
| expect similar voice mis-classifications. Perhaps "You're fine"
| might get misclassified as "you're fired".
| thaumasiotes wrote:
| > Remember the photocopier that worked by OCR and would
| occasionally mis-transcribe numbers?
|
| That was perfectly ordinary compression?
|
| The phenomenon is all over the place, most visible in
| autocorrect.
| bayindirh wrote:
| It was ordinary compression, something called JBIG2. It did
| not mistranscribe, but marked slightly different number or
| character blocks as the same, resulting in replaced parts in
| images.
|
| In other words, its match tolerance was a bit too lax, so it
| got poisoned by blocks in its own dictionary, thinking it
| already had the blocks for things it had just scanned.
|
| More details can be found in [0] and [1].
|
| [0]: https://www.theregister.com/2013/08/06/xerox_copier_flaw_mea...
|
| [1]: http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...?
| Wowfunhappy wrote:
| Yes! This is why I always turn off autocorrect! It's true
| that I absolutely make more typos without it, but at least
| they're obvious as typos, and not different words that
| potentially change the meaning of the sentence.
| beagle3 wrote:
| Are you aware that the same exact uncompressed recording sounds
| different depending on context? This is known as the McGurk
| effect.
|
| Very worth your two minutes if you're not yet familiar with the
| effect: https://www.youtube.com/watch?v=2k8fHR9jKVM
| dqv wrote:
| This already happens with existing compression algorithms.
| Certain vowel sounds get collapsed, so someone will say, for
| example, "66" and it will come out on the other side as "6".
| Very annoying because you can't exactly coach a layperson on
| how to talk "the right way" to not trigger this vowel collapse.
| bobthechef wrote:
| > you can't exactly coach a layperson on how to talk "the
| right way" to not trigger this vowel collapse
|
| I've never noticed. At any rate, we should not coach people
| to adapt to technology in this way. It is Procrustean and
| anti-human and unnecessarily places a burden on people that
| belongs to the software and the developer.
| nyanpasu64 wrote:
| For what it's worth, amateur radio operators already have
| specialized rules and techniques for speech, to improve
| clarity over a muffled noisy analog radio channel.
| tyingq wrote:
| > how to talk "the right way"
|
| Not suggesting it as a fix, but this did remind me of the
| military phonetic alphabet, which includes numbers too.
|
| 3 is "tree", 4 is "fow er", 5 is "fife", 9 is "niner". The
| rest of the numbers are mostly as-is, but you'll hear very
| deliberate enunciation, like "Zee Row" for 0.
| WaitWaitWha wrote:
| whiskey hotel yankee delta oscar india hotel alpha victor
| echo tango oscar sierra papa echo alpha kilo tango hotel
| echo lima alpha november golf uniform alpha golf echo oscar
| foxtrot tango hotel echo mike alpha charlie hotel india
| november echo ? tango hotel alpha tango india sierra india
| november sierra alpha november echo!
| tyingq wrote:
| | perl -pe 's/(\w)\w+/\1/g'
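| For anyone without Perl handy, a rough Python equivalent of the
| one-liner above might look like this. It is only a sketch: the
| function name is invented for illustration, and it keeps the same
| regex, so single-character tokens and punctuation pass through
| unchanged.

```python
import re


def first_letters(text: str) -> str:
    """Replace every multi-character word with its first character."""
    return re.sub(r"(\w)\w+", r"\1", text)


if __name__ == "__main__":
    # Spaces between words are kept, just as with the Perl version.
    print(first_letters("whiskey hotel yankee"))
```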
| toast0 wrote:
| Humans adapt a whole hell of a lot easier than machines.
|
| Sure, it would be nice to have clean high bandwidth, low
| latency voice channels to everywhere so you could drop
| pins and expect the other side to hear it. Unfortunately,
| high bandwidth never really happened, and some places
| never ran land lines to everyone's home, and nobody wants
| to pay the high price of circuit switched voice when
| packet switched voice mostly works good enough and is
| enormously cheaper.
| posguy wrote:
| But is Lyra a significant improvement over modern Opus at
| 8Kbps? You can buy a Grandstream HT802 for ~$30 and its
| DSP can decode Opus today, whereas Lyra will require
| orders of magnitude more power to decode while providing
| much worse reproduction accuracy.
| est31 wrote:
| Back when Lyra was announced [0], I listened to the released
| samples and it changed an "m" sound to an "l" sound.
|
| [0]: https://news.ycombinator.com/item?id=26309553
| jagger27 wrote:
| I find that Lyra sounds good at first but it can chop off hard
| consonants in certain scenarios. It sort of sounds like slightly
| slurred speech. Anyone else getting that impression from their
| samples?
| ent101 wrote:
| Discontinued in 3... 2... 1...
| devops000 wrote:
| Give them at least 3-4 years ;)
| squarefoot wrote:
| Doesn't seem that much better compared to Codec2, which is already
| fully Open Source (LGPL), even taking into account that Codec2's
| example originals are already of much worse quality than the
| ones on Lyra's website. I'd be curious to hear both working on
| the same set of audio samples.
|
| https://www.rowetel.com/?page_id=452
| Seirdy wrote:
| Agreed; codec2 doesn't alter speech as aggressively, require
| proprietary components, or have as strong a connection to
| Google.
| devops000 wrote:
| 404: NOT FOUND
|
| https://basis-universal-webgl.vercel.app/texture/
|
| Where else can I see a demo?
| dang wrote:
| Recent past threads on this:
|
| _Lyra audio codec enables high-quality voice calls at 3 kbps
| bitrate_ - https://news.ycombinator.com/item?id=26300229 - March
| 2021 (198 comments)
|
| _Lyra: A New Very Low-Bitrate Codec for Speech Compression_ -
| https://news.ycombinator.com/item?id=26279891 - Feb 2021 (25
| comments)
|
| Is there significant new information here?
| https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so...
|
| Edit: it seems the SNI is the open-sourcing. I've changed the
| title to say that now. Corporate press releases are generally an
| exception to HN's rules about titles and original sources:
| https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor....
| miohtama wrote:
| If I remember correctly the original landline audio was 64 kbps,
| 8000 Hz. So Lyra is 1/20 of this. And probably still sounds
| better.
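| The ratio can be checked with a few lines of Python. This sketch
| assumes 8 bits per sample (as in G.711 companded PCM) to arrive at the
| 64 kbps figure, and takes Lyra's advertised 3 kbps at face value; the
| names are illustrative only.

```python
SAMPLE_RATE_HZ = 8000   # narrowband telephony sample rate from the comment
BITS_PER_SAMPLE = 8     # assumption: 8-bit companded samples, as in G.711
LYRA_BPS = 3000         # Lyra's advertised bitrate


def pcm_bitrate_bps(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Raw bitrate of a mono PCM stream."""
    return sample_rate_hz * bits_per_sample


if __name__ == "__main__":
    pcm_bps = pcm_bitrate_bps(SAMPLE_RATE_HZ, BITS_PER_SAMPLE)
    print(f"PCM: {pcm_bps} bps, ratio to Lyra: {pcm_bps / LYRA_BPS:.1f}x")
```

| So "1/20" is in the right ballpark; the ratio comes out slightly
| above 21 under these assumptions.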
| posguy wrote:
| PCMU/PCMA (G.711 μ-law and G.711 A-law) are not original landline
| quality audio, but rather what Bell Systems felt they could
| get away with passing off as a toll quality call in 1972.
|
| Lyra will likely sound better, but the reproduction accuracy
| is apt to be quite a bit poorer as many others have
| commented. G.711 was created to require nearly no processing
| (it's nearly raw PCM data from a sound card after all) while
| operating at reasonable bitrates; Lyra looks much more
| computationally intensive and will likely only run on
| smartphones in the next few years.
|
| Edit: Is Lyra a significant improvement over modern Opus at
| 8Kbps? You can buy a Grandstream HT802 analog telephone
| adapter for ~$30 and its DSP can decode Opus today, whereas
| Lyra will require orders of magnitude more power to decode
| while providing much worse reproduction accuracy.
| YarickR2 wrote:
| Original landline audio was/is analog
| colanderman wrote:
| For reference, analog landlines specified 24 dB SNR [1] and
| 300-3300 Hz passband [2], giving ~24 kbps information rate [3].
|
| [1] https://www.tschmidt.com/writings/POTS_Modem_Impairments.htm
|
| [2] https://en.wikipedia.org/wiki/Plain_old_telephone_service
|
| [3] https://en.wikipedia.org/wiki/Shannon%E2%80%93Hartley_theore...
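| The ~24 kbps figure follows from the Shannon-Hartley formula,
| C = B * log2(1 + S/N). A minimal sketch, assuming the 300-3300 Hz
| passband gives B = 3000 Hz and that the 24 dB SNR is a power ratio;
| the function name is illustrative.

```python
import math


def shannon_capacity_bps(bandwidth_hz: float, snr_db: float) -> float:
    """Shannon-Hartley channel capacity in bits per second."""
    snr_linear = 10 ** (snr_db / 10)  # convert dB to a linear power ratio
    return bandwidth_hz * math.log2(1 + snr_linear)


if __name__ == "__main__":
    capacity = shannon_capacity_bps(3300 - 300, 24)
    print(f"capacity: {capacity / 1000:.1f} kbps")
```

| The result lands close to the ~24 kbps quoted above. It is an upper
| bound on the information rate of the channel, not a codec bitrate.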
| sandGorgon wrote:
| is the training code opensource ?
| levosmetalo wrote:
| I hope this never takes off.
|
| There's this whole machine learning, optimization, etc. story, but the
| end goal is that Google can easily transcribe your voice calls and
| store it as text. Then it can apply all shady practices that it
| previously was too expensive to do because storing voice and
| extracting information from it required huge storage costs and
| actual human labour.
|
| Or worse, just imagine what some government you don't trust could
| do with all those voice call transcripts.
| rektide wrote:
| This will make voices radically more correlatable, most likely.
| It's a more effective model for voice, it has run endless
| regressions & found better patterns to model human sounds upon.
| That could well make processing & comparing pieces of speech
| data less computationally expensive.
|
| I don't see much relation to surveillance & transcription
| issues. This technology does not, would not change the field of
| battle significantly, if such a battle were about. Which it
| probably is, in some countries, perhaps even applying to
| Google-touched, -relayed, or Google-held data.
| sreekotay wrote:
| I mean...this is them open sourcing it?
| mgraczyk wrote:
| This codec has nothing to do with what you're worried about.
| There's no current technical limitation preventing what you're
| describing. Google doesn't do it because it makes no sense for
| their business and because your phone calls aren't routed
| through Google's servers. Governments outside the US are
| already doing it.
| cbdumas wrote:
| Anyone listening to the sample audio linked to in the article
| should read this note from the last time this was discussed on
| HN: https://news.ycombinator.com/item?id=26309787
|
| Summary: the Lyra audio samples are louder which muddies the
| comparison
| pessimizer wrote:
| I've been waiting for an audio codec that could actually silently
| change the words I've said.
|
| https://www.zdnet.com/article/xerox-scanners-alter-numbers-i...
| plzbo wrote:
| Link to the website of the person who found the error in the
| first place:
| http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...
| sreekotay wrote:
| In practical terms, very impressive. Anyone know what latency is
| like? Feels like a domain where people who have not experienced low
| latency full duplex cannot fully appreciate why voice has faded
| in everyday life...
| Ajedi32 wrote:
| Sounds like at least +40ms of latency:
|
| > features, are extracted in chunks of 40ms, then compressed
| and sent over the network
| markstos wrote:
| Notable that the two post authors sign it with " - Chrome",
| indicating I presume they are Chrome team members.
| londons_explore wrote:
| Google misses the mark here...
|
| Bad internet connectivity in the developing world isn't "only
| 56kbps" as some people think.
|
| It's "random bursts of fast with random 30 second gaps of no
| connectivity at all". It's routed through 3 layers of proxies and
| firewalls which block random stuff and not others, while
| disconnecting long running connections.
|
| Oh, and it'll be expensive per MB.
|
| To that end, Lyra helps with the expense of a data connection,
| but is unusable for long voice calls. What would help more is a
| text chat system like WhatsApp.
|
| Oh right - WhatsApp is already wildly popular in most of the
| developing world for mostly this reason.
| herodoturtle wrote:
| Heya, please could you unpack your reasoning a little bit more?
|
| You said:
|
| > WhatsApp is already wildly popular in most of the developing
| world for mostly this reason.
|
| I can't speak for the majority of the developing world, but
| here in South Africa, WhatsApp is indeed the predominant
| communications app.
|
| That being said, WhatsApp voice calls are also used here quite
| a bit.
|
| So with that in mind, and reading from the article:
|
| > Lyra compresses raw audio down to 3kbps for quality that
| compares favourably to other codecs
|
| To me 3kbps sounds pretty great, and might actually work out
| cheaper / better than one might imagine.
|
| So I'm just wondering, how does WhatsApp voice call data usage
| compare to Lyra?
|
| Also whilst South Africa is indeed a developing country (where,
| among other things, the price of data is proportionately high
| relative to average household income), the cellular network
| infrastructure is excellent.
|
| So I don't think the random bursts of connectivity you describe
| are as big of an issue here, whereas the price of data most
| certainly is.
|
| In which case, I can definitely see a market for Lyra (assuming
| the 3kbps is indeed vastly superior to WhatsApp's data usage
| for a voice call).
|
| Hope that makes sense but I'd be happy to extrapolate a little
| further :-)
| villasv wrote:
| > Oh right - WhatsApp is already wildly popular in most of the
| developing world for mostly this reason.
|
| Not only that, but carriers will often advertise plans with
| "unlimited Internet for Facebook and WhatsApp" (a punch in the
| face of net neutrality).
|
| So not only does WhatsApp have more impact with audio messages when
| audio calls are too unstable, audio calls already substitute
| for the bulk of phone calls even for people who have shitty data
| plans.
|
| This is what my carrier says on their most basic offering:
|
| > What does WhatsApp Unlimited mean?
|
| > The benefit is granted automatically, without the need for
| activation. And the use of the app is unlimited to send
| messages, audios, photos, videos, in addition to making voice
| calls. Only video calls that are discounted from the internet
| package, as well as access to external links.
| setr wrote:
| In the middle east I noticed a baffling-to-me usage of
| whatsapp: people were simply exchanging voice messages back and
| forth instead of calling. [0]
|
| Presumably for exactly the reason you've stated.
|
| [0] I later tried it myself with a friend, but you end up
| losing the benefits of both worlds -- you can't search or
| review old messages effectively (as you would text), and it's
| significantly slower than calling.
| throwaway81523 wrote:
| Another reason for end-to-end speech encryption: to keep your
| cleartext voice signal away from these overaggressive codecs
| changing the words. I can understand the need for a super low
| bandwidth codec on top of Mt. Everest, but 64 kbit PCM was good
| enough for our grandparents' landlines (or 13 kbit GSM for their
| mobiles) and it's good enough for us.
| LeoPanthera wrote:
| What a spectacular failure of imagination. Why change anything
| ever, right? I suppose dial-up modems were good enough for you
| too.
|
| Everyone is imagining that codecs like this will "change your
| words" but no-one has provided examples of that _actually
| happening_. I don't believe it.
| robert_foss wrote:
| Encoding takes >40ms? Opus takes 5-26.5ms. Apparently 150ms[1] is
| the generally accepted upper bound for call latency.
|
| I think the article could do with some
| bandwidth/quality/latency/power comparisons to other codecs.
|
| [1] https://en.wikipedia.org/wiki/Latency_(audio)
| Bedon292 wrote:
| I don't think it is discussing encoding time in the article, it
| says "features are extracted in chunks of 40ms". My reading is
| that it's breaking down the speech into 40ms chunks, compressing
| it, and sending that.
| 0b01 wrote:
| But since the buffer size has to be 40ms, the minimum
| latency is 40ms.
| ksec wrote:
| >These speech attributes, also called features, are extracted
| in chunks of 40ms, then compressed and sent over the network.
|
| So while Encoding doesn't take 40ms, the latency + encoding
| will indeed be 40ms+.
|
| 150ms is the End to End Latency, which is basically everything
| from Encoding + Network + Decoding. We can't beat the speed of
| light on our fibre network. We can certainly do something with
| Encoding and Decoding. And Lyra doesn't seem to help with that
| case here. Something I pointed out last time Lyra was on HN.
|
| I think Opus defaults to 20ms with the option of a 10ms slot
| (excluding encoding speed) at the expense of quality. What we
| really need is a higher bitrate, lower latency and higher quality
| codec. Which is sort of the exact opposite of what Lyra is
| offering.
| RL_Quine wrote:
| > _We can't beat the speed of light on our fibre network._
|
| Speed of light in what? We can absolutely be faster than
| fibre optics, which are quite slow relatively speaking
| (2/3rds that of light in a vacuum).
| ksec wrote:
| We won't be replacing Glass Fibre with Vacuum Fibre anytime
| soon. And I have been following this tech for long, but I
| do wish I am very wrong.
| BlueTemplar wrote:
| Starlink ?
| tymekpavel wrote:
| Satellite links are orders of magnitude slower than
| fiber.
| blendergeek wrote:
| > Satellite links are orders of magnitude slower than
| fiber.
|
| Minimum end-to-end latency for communications from
| opposite points of the earth is much lower for Starlink
| style LEO satellites than for fiber.
| ksec wrote:
| Which is only in the case of "opposite points of the
| earth", otherwise you are just adding ~700 km of distance
| between two points. The point is even if we have perfect
| speed-of-light data transfer over a direct line, we are
| fundamentally limited by it and nothing can be done. But
| Encoding, Decoding, Time Slots and quality are all things
| that we have control of and should be looked into more
| seriously.
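| To put rough numbers on the fibre versus LEO argument, here is a
| back-of-the-envelope sketch of one-way propagation delay only. All of
| the geometry is an assumption made for illustration: fibre follows the
| great circle at 2/3 c (the figure given upthread), while the satellite
| path goes straight up to a 550 km shell, along an arc at that
| altitude, and straight back down at c. Real routing, hops, queuing and
| codec delay are ignored.

```python
import math

C_KM_PER_S = 299_792.458   # speed of light in vacuum
FIBRE_FACTOR = 2 / 3       # light in glass fibre, per the comment upthread
EARTH_RADIUS_KM = 6371.0   # mean Earth radius
LEO_ALTITUDE_KM = 550.0    # assumption: altitude of a typical LEO shell


def fibre_ms(surface_km: float) -> float:
    """One-way delay over an ideal great-circle fibre run."""
    return surface_km / (C_KM_PER_S * FIBRE_FACTOR) * 1000


def leo_ms(surface_km: float) -> float:
    """One-way delay via an idealized up / along-the-shell / down path."""
    angle = surface_km / EARTH_RADIUS_KM  # central angle in radians
    arc_km = angle * (EARTH_RADIUS_KM + LEO_ALTITUDE_KM)
    return (2 * LEO_ALTITUDE_KM + arc_km) / C_KM_PER_S * 1000


if __name__ == "__main__":
    antipodal_km = math.pi * EARTH_RADIUS_KM
    for km in (1000, antipodal_km):
        print(f"{km:8.0f} km: fibre {fibre_ms(km):6.1f} ms, "
              f"LEO {leo_ms(km):6.1f} ms")
```

| Under these assumptions the satellite detour costs more than it saves
| on short paths and wins on antipodal ones, which is consistent with
| both sides of the exchange above. Either way, propagation is only one
| part of the end-to-end budget.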
| BlueTemplar wrote:
| Aren't they still heavily expected to feature in
| connecting that "next billion" ?
| tymekpavel wrote:
| Yes, because they are convenient for other reasons (don't
| require infrastructure over land) which makes them
| suitable for connecting rural areas where it doesn't make
| sense to run fiber. But fiber will always be the fastest
| you can get, and if you get fiber in a vacuum, you could
| theoretically achieve near-speed of light communication.
| Satellites won't get you anywhere close to that, even if
| you use lasers, because there are always atmospheric
| disturbances that introduce latency.
| wmf wrote:
| Internet latency is much higher than it could be, even
| using fiber: https://arxiv.org/abs/1811.10737
|
| And adding an HFT-style microwave backbone could reduce
| Internet latency even more:
| https://arxiv.org/abs/1809.10897
| azinman2 wrote:
| Ya I was just coming here to say the same thing. 40ms _just in
| the codec_ feels like a lot. Because that's not even including
| time to pull in audio from the hardware (could be 20ms or more
| in Android devices), time to upload, and time to have it across
| the Internet, and then time to decode + play on the receiver.
| That adds up pretty quickly. I'm guessing 40ms was chosen
| because it is some sweet spot of having enough data to get a
| worthwhile compression on, but it's one of these things where
| technology, however impressive it might be, is slowly giving us
| a worse experience over time in the pursuit of digitization.
| robert_foss wrote:
| From my understanding the 40ms is just the feature extraction
| part. The encoding also does quantization, which surely adds
| to this number.
| stefan_ wrote:
| The favorite way to cheat compression contests. Buffer more
| data, get more compression.
| te0006 wrote:
| "Please note that there is a closed-source kernel used for math
| operations that is linked via a shared object called
| libsparse_inference.so. We provide the libsparse_inference.so
| library to be linked, but are unable to provide source for it.
| This is the reason that a specific toolchain/compiler is
| required." - README
| ncmncm wrote:
| [update: proprietary .so]
|
| They should re-implement the needed bits of libsparse_inference
| before releasing this thing. Otherwise it's just a distraction.
|
| Probably they should get it building with something other than
| Bazel, too.
| danaliv wrote:
| It's not a kernel module, it's a compute kernel. Nothing to
| do with operating systems. They provide versions for
| android-arm64 and linux-x86_64.
|
| The fine README says it builds and runs on Ubuntu 20.04.
| posguy wrote:
| Ah, so Lyra today will not work on RISC-V, i386, Power,
| MIPS, lower end or older ARM chips like the Allwinner H3
| (very popular in Single Board Computers) and any other new
| architecture that comes out?
| skybrian wrote:
| Yes, that will have to be removed as part of the effort of
| porting it to new platforms.
| jacobn wrote:
| Sounds like NVIDIA's Maxine [1], but for voice?
|
| 1: https://developer.nvidia.com/maxine
| flohofwoe wrote:
| Why is the demo link towards the bottom of the post pointing to
| the Basis Universal repository (which is a texture compressor)?
|
| https://github.com/BinomialLLC/basis_universal/tree/master/w...
|
| Copy-pasta error, or did they run the post through Lyra? ;)
| astlouis44 wrote:
| This is going to be VERY useful for WebXR social platforms.
___________________________________________________________________
(page generated 2021-04-06 23:00 UTC)