[HN Gopher] Lyra audio codec enables high-quality voice calls at...
___________________________________________________________________
Lyra audio codec enables high-quality voice calls at 3 kbps bitrate
Author : homarp
Score : 109 points
Date : 2021-03-01 06:30 UTC (16 hours ago)
(HTM) web link (www.cnx-software.com)
(TXT) w3m dump (www.cnx-software.com)
| sophiebits wrote:
| Previous discussion:
| https://news.ycombinator.com/item?id=26279891
| skyde wrote:
| this is super impressive
| walrus01 wrote:
| Microsoft also just announced a 6kbps advanced audio codec:
|
| https://techcommunity.microsoft.com/t5/microsoft-teams-blog/...
| trevorishere wrote:
| I'd be curious to know how this compares to Microsoft's Satin
| codec, now used in Teams, which is ML-driven.
| londons_explore wrote:
| This is blogspam that doesn't add meaningfully to the
| original[1].
|
| [1]: https://ai.googleblog.com/2021/02/lyra-new-very-low-bitrate-...
| jimktrains2 wrote:
| I wonder how this compares with codec2[0] which is decent at
| around 3kbps and can go even lower.
|
| [0] https://en.m.wikipedia.org/wiki/Codec_2
| motiejus wrote:
| Nothing about licensing or patents. I assume the worst (read:
| unusable for small businesses)?
|
| 10+ years ago I worked in a small voip shop, where we had very
| high quality (low jitter), but low bandwidth connection. I
| researched many codecs of the time (2010-ish).
|
| We liked speex, because it can be used "without strings
| attached". Also, I can choose the quality depending on the
| bandwidth. Although for low bandwidth g729 was better. Which we
| couldn't use because of royalties (but allowed myself to test
| it).
|
| We chose alaw/ulaw when bandwidth was not a concern, and speex
| when it was.
|
| Since it does not mention usability outside of google, I also
| find this comparison unfair or incomplete: if you are comparing a
| proprietary codec, compare it to g729. If you are comparing a
| codec to speex, it should be open/free.
|
| Edit: grammar
| lights0123 wrote:
| I wouldn't be as pessimistic--they invented VP9 and were
| important to AV1, and continue to promote both and use them
| everywhere.
| marcodiego wrote:
| With good enough licensing it could possibly replace speex.
|
| It is sad that we have to think about licensing and patents of
| technologies instead of only how good or advanced they are.
| 1996 wrote:
| > Nothing about licensing or patents. I assume the worst (read:
| unusable for small businesses)?
|
| If there's a free software implementation, and a company
| offering the service based in the EU (or shop around and find
| any other jurisdiction where software patents don't matter),
| it's often YOLO - but call that "legal arbitrage" if you want
| to sound fancy :)
| kreetx wrote:
| What are the current choices for "CD quality" speech
| compression (lossy but indiscernible) at the moment? Just had a
| discussion with a friend of keeping an always-on speech
| recorder on and wondered about disk space consumption.
| Aloha wrote:
| I'm over here preferring g711/ulaw because I prefer the hard
| roll off at 4kc.
| bscphil wrote:
| These days the correct comparison would be to Opus, which is
| similarly unencumbered and performs fantastically at low
| bitrates (and has a speech specific mode for even lower
| bitrates, because it's a hybrid of two codecs). It's also
| extremely low latency, so there's now no reason to accept
| trade-offs. (For the same bitrate as alaw/ulaw, you can get
| high quality full band music with Opus.)
|
| These days it's more or less the standard for realtime voice.
| WebRTC uses it, most of the popular realtime voice applications
| use it as well, as does Signal.
| etaioinshrdlu wrote:
| Important to remember that this type of codec can be used as a
| backup for higher-bandwidth codecs. You don't necessarily need to
| hear its artifacts all the time. The higher level codec also
| only needs to encode the differences between the prediction and
| ground truth. The same thinking applies to video especially of
| faces. Neural nets are a huge leap forward for this type of data
| compression and will likely be used pretty much everywhere in the
| future with great success.
| hatsunearu wrote:
| http://www.rowetel.com/wordpress/?page_id=452
|
| Take a look at this too. Also runnable on low power devices. And
| there was some work of using AI to enhance the codec2 encoded
| bits too.
| ddevault wrote:
| Yeah, with Codec 2 setting the gold standard, I don't find this
| very impressive. I find this more intelligible at one-third of
| the Lyra bitrate:
|
| http://www.rowetel.com/downloads/codec2/hts2a_1300.wav
|
| Codec 2 does a better job of isolating the parts of sound which
| are most necessary to intelligible speech, without necessarily
| caring too much about preserving the original qualities of the
| speaker's voice or environment.
|
| Fun fact: Codec 2 can be used to transmit voice over IRC:
|
| https://github.com/asiekierka/voirc
| londons_explore wrote:
| I had to listen to your wav sample 4 times before
| understanding what it was saying... To me, that isn't
| intelligible... Perhaps with practice one could learn to
| understand it, but that isn't really what I want from my
| audio codec.
| vidarh wrote:
| The biggest challenge with evaluating all of these, is that
| once you've listened to a comprehensible version of one of
| these samples, they all sound more intelligible. I had
| problems with the example too. After hearing the original
| it's now easy. It makes it really hard to properly assess
| the intelligibility for developers without decent sized
| panels of people to help evaluating them.
| alvarlagerlof wrote:
| Completely unrelated but damn do they need to update the
| illustration at the top of that page. It's hideous.
| [deleted]
| faebi wrote:
| Maybe in the future, all we need is a speech example, some AI and
| the continuous transmission of text for low data voice
| transmission?
| dheera wrote:
| I think this will be the rough direction, but not exactly text,
| rather some other efficient, machine-readable embedding of
| speech that is also able to carry tone and rhythm effectively
| and pronunciation accurately and unambiguously.
| cecja wrote:
| Why speak then?
| wmf wrote:
| Basically yes. "Features, or distinctive speech attributes,
| are extracted from speech every 40ms and are then compressed
| for transmission. The features themselves are log mel
| spectrograms, a list of numbers representing the speech
| energy in different frequency bands, which have traditionally
| been used for their perceptual relevance because they are
| modeled after human auditory response."
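|
| To put the quoted 40ms feature interval in perspective, here is a
| rough sketch of the bit budget per feature frame at the headline
| 3 kbps. The function name is made up for illustration; only the
| 3 kbps and 40ms figures come from the article.

```python
def bits_per_frame(bitrate_bps: int, frame_ms: int) -> float:
    """Bits available to encode one frame at a constant bitrate."""
    return bitrate_bps * frame_ms / 1000


# 3 kbps with one feature frame every 40 ms (figures from the article).
lyra_bits = bits_per_frame(3000, 40)
lyra_bytes = lyra_bits / 8

print(f"{lyra_bits:.0f} bits ({lyra_bytes:.0f} bytes) per 40 ms frame")
```

| So each 40ms of speech has to be described in roughly 15 bytes,
| before any container or transport overhead.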
| viraptor wrote:
| Only if you're ready to kill all the intonation nuance. If
| you're ok with that, why not stick to just reading text? At
| least we can use emotes in there.
| temp-dude-87844 wrote:
| When Google's announcement [1] was posted a few days ago, I
| listened to their samples and heard an odd effect in the
| "chocolate bread" sample (the video chat example) [1], which is
| not mirrored in this article.
|
| On that sample, I felt [2] that the Lyra version exaggerates the
| pronunciation of the phrase 'with chocolate' in a way that
| meaningfully differs from the speaker's original. It weakens the
| voiced 'th' to nothingness, and overshoots both the lead
| consonant and first vowel of 'choc', and then proceeds to wash
| the entire rest of the sentence with a peculiar brightened voice
| that's high, lacks consonant definition, and is close to ringing.
|
| I'm guessing it's actually style transfer, because though the
| result sounds not much like the speaker's original, the result is
| reminiscent of the speech pattern and accent that people with
| East Asian and Southeast Asian ancestry adopt when speaking
| American English. It was surprising, given that the speaker
| doesn't sound like that in the original. I wonder if others hear
| this too.
|
| While Lyra sounds richer and wider-band than Opus or Speex at
| these bitrates, the degradations and artifacts of those codecs
| are universally recognized (through years of familiarity with
| telephones) as compression artifacts and not innate features of
| the speaker themselves. Therefore listeners can be expected to be
| sympathetic to the quality issues and not attribute the whole of
| the sound on the speaker's person.
|
| If AI-trained voice synthesizer codecs become the norm, and it
| performs well on most speakers, that expectation will go away,
| and the resulting audio will be attributed wholly to the speaker.
| That increases the impact of mistakes and misrepresentations
| introduced by the codec, unbeknownst to the speaker and listener.
|
| [1] https://ai.googleblog.com/2021/02/lyra-new-very-low-bitrate-...
|
| [2] https://news.ycombinator.com/item?id=26282519
| BugsJustFindMe wrote:
| > _'with chocolate' in a way that meaningfully differs from
| the speaker's original. It weakens the voiced 'th' to
| nothingness_
|
| I honestly don't hear a 'th' in the original.
|
| > _It was surprising, given that the speaker doesn't sound
| like that in the original._
|
| I disagree. Note that the speaker says "these bread". The three
| possibilities for those two words--"these bread", "thiiiis
| bread", and "these breads" with a dropped "s"--would all be
| weird things for a native English speaker to say for different
| reasons relating to either wrong pronunciation of "this" or
| "breads" or the fact that bread is its own collective noun and
| therefore we typically require separate qualifiers like "these
| buns" or "these loaves" when separating multiple individual
| "pieces" (another) into a non-collective. We ask for "some
| bread" or "a piece of bread", but we don't say "a bread" or
| "some breads" unless we are discussing categorical types of
| bread ("ciabatta and rye are breads") rather than instances of
| such, and only one type of bread is represented in the video.
|
| The Lyra reproduction has a band-pass filtered quality to it,
| but I find it still remarkably representative of the reference.
| cbdumas wrote:
| I agree completely, I've listened to the reference sample
| probably ten times now and I can only hear /wI/
| scotty79 wrote:
| "When a man looks for something beyond his reach ..."
|
| The word "looks" sounds completely wrong for me with Lyra. To
| the point of completely not understanding what this word is
| supposed to be (first example with your [1] link).
| est31 wrote:
| For me "looks" sounds fine but the word before, "Man", sounds
| like "Lan". So to me the opus sample sounds more
| understandable. Even though the "quality" of Lyra is better,
| that shouldn't be the score to optimize for, but fidelity of
| the compression. It's not helpful if the compression
| algorithm generates a beautiful flower from a flower image
| but it's a red flower instead of a blue one like the
| original. Gives me Xerox vibes...
| ampdepolymerase wrote:
| Are the speech models sufficiently generic across all
| languages?
| cityzen wrote:
| I read your comment before I watched that video and I can't
| stop laughing. It sounds ridiculous!
| [deleted]
| [deleted]
| rectang wrote:
| What I want to know is whether Lyra takes any longer to encode
| than the alternatives.
|
| Because as far as I can tell, nobody cares in the slightest about
| latency.
|
| Phone calls are getting to be like writing postcards to each
| other. Speak in a whole paragraph. Wait several seconds for the
| latency to clear. Then the other party responds with a whole
| paragraph, waits several seconds for the latency to clear...
|
| Improvements to fidelity are nice-to-have, but I would like some
| real-time in my real-time communications, please.
| kixiQu wrote:
| In the article,
|
| > This trick enables Lyra to not only run on cloud servers, but
| also on-device on mid-range phones in real time (with a
| processing latency of 90ms, which is in line with other
| traditional speech codecs).
|
| Does that not cover it?
| regularfry wrote:
| It covers it, but it's not exactly brilliant. 200ms is the
| point at which conversation breaks down. If half that budget
| has gone on the codec, not much has to happen on the wire for
| it to be noticeable.
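|
| The budget argument above can be sketched as simple arithmetic.
| The 200ms threshold and 90ms codec figure are the ones quoted in
| this thread; the helper name is hypothetical, and what the
| remainder gets spent on (network, jitter buffer, audio I/O) is an
| assumption, not something the article breaks down.

```python
CONVERSATION_BUDGET_MS = 200  # threshold quoted in the comment above
LYRA_PROCESSING_MS = 90       # processing latency quoted from the article


def remaining_budget_ms(budget_ms: int, *stage_ms: int) -> int:
    """Latency left over after subtracting each pipeline stage."""
    return budget_ms - sum(stage_ms)


left = remaining_budget_ms(CONVERSATION_BUDGET_MS, LYRA_PROCESSING_MS)
print(f"{left} ms left for network, jitter buffer and audio I/O")
```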
| jtsiskin wrote:
| The "mid-range phones" has me suspicious, I wish they defined
| that better. And 90ms is much higher than what Opus is
| supposed to achieve
| reaperducer wrote:
| I had the pleasure of using a real landline just before the
| pandemic. Honest wire-to-wire connection between two ranches,
| so no silly VOIP steps between.
|
| It was fantastic.
|
| You don't appreciate how much latency is destroying our ability
| to communicate verbally until you go back to the old way.
|
| One example is arguing. It's no wonder people used to be able
| to argue with one another on a telephone. You could raise your
| voice and still hear the other side and adjust your speech in
| real time. Today it's just one party shouting over the other to
| drown the opponent out.
| rectang wrote:
| Between miserable latency, not-so-great fidelity, and the
| fecklessness of phone companies in the face of the robocall
| epidemic, I have come to hate phone calls.
|
| I'm rooting for something to replace phone communications.
| Any chance that Matrix can do better on any of those fronts?
| Especially on fidelity and latency since they're germane to
| the high-level subject of this discussion.
| Aloha wrote:
| A cell phone tbh, is about the same latency as a landline in
| most end to end call circumstances. Latency only really is
| noticeable when greater than 600ms. (And only a real problem
| over 1000)
| ubercow13 wrote:
| Imagine trying to have a face-to-face conversation with
| 600ms of latency...
| lynndotpy wrote:
| I think latency is noticeable at even lower values. As a
| basic example, try to sing a song with someone over a voice
| call. Consider using Airpods or similar bluetooth
| headphones to make it more apparent.
| kwindla wrote:
| I disagree with these numbers, in general. Though of course
| "noticeable" is subjective and varies by use case as well
| as by person.
|
| For many people, end-to-end audio latency in a 1:1
| conversation becomes noticeable/annoying at 200ms. And in a
| multi-participant conversation, talking over each other
| becomes noticeably more common even at 100ms compared to
| 50ms.
| kevin_thibedeau wrote:
| Just disable VoIP or disconnect from WiFi.
| chrisseaton wrote:
| Isn't the actual phone network packet-switched and running
| over fibre optics now anyway? I don't think you can get a
| literally analog phone call anymore can you?
| kwindla wrote:
| This is a fantastic question. I agree with you that we're
| slowly boiling the frog (and the frog is ourselves) in
| accepting more and more latency in our real-time
| communications.
|
| I think the answer for Lyra is that latency is a concern, but
| maybe at this stage not as much of a concern as it could be.
| I'm only guessing, though based on this [0]:
|
| > The basic architecture of the Lyra codec is quite simple.
| Features, or distinctive speech attributes, are extracted from
| speech every 40ms and are then compressed for transmission.
|
| That sounds like the minimum frame size for Lyra is 40ms. For
| Opus (the audio codec used for most WebRTC applications), the
| default frame size is 20ms [1], and most implementations
| support frame sizes of 10ms [2].
|
| Of course, your favorite web browser might not default to 20ms
| frames for Opus. And by "most implementations" I meant Google
| Chrome. :-)
|
| [0] https://ai.googleblog.com/2021/02/lyra-new-very-low-bitrate-...
|
| [1] https://tools.ietf.org/html/rfc7587#section-6.1
|
| [2] https://chromium.googlesource.com/external/webrtc/+/HEAD/mod...
| toomim wrote:
| Google Chrome has a latency of 20ms to just repeat back audio
| _on the local device_.
|
| That is, with no networking, and no processing, it takes 20ms
| for any information to get from microphone back out to speakers.
| boneitis wrote:
| I found the Lyra codecs in both examples easily the most
| difficult to comprehend, even compared against the scratchy
| Speex.
|
| Am I the only one? It is a little odd to me to see the praise
| here and on the previous discussion.
|
| To be fair, I am convinced I have APD (and, to be fair again, I
| have never got it checked out).
|
| E: Just realized there is a third example. Perhaps it is not as
| strong a statement due to Opus' doubled bitrate, but it is still
| far scratchier. Yet, it is more decipherable than the Lyra codec
| to me.
| formerly_proven wrote:
| I have to agree... Lyra has a tape flutter like effect and
| requires conscious effort to decipher. I can see how by an SNR
| and similar metric Lyra might outperform the others, but that
| then just goes to show that it's not a good proxy metric for
| perceived quality.
|
| Edit: In the third example it's a little bit closer, but Lyra
| again sounds like flutter and a mumbling speaker, while Opus
| sounds like a clear speaker with lots of noise.
| bscphil wrote:
| I personally wouldn't say "difficult to comprehend". I would
| say that the Lyra audio is "cleaner" but that the artifacts
| that there are, are louder and more annoying in Lyra. There's a
| very bad ringing effect and some flutter. If you personally
| find these artifacts distracting or confusing, I could very
| easily see the Lyra examples being harder to understand.
|
| I'm almost certain that Lyra has increased the volume on the
| first sample too. It's quite audible, although I haven't
| confirmed this with Audacity.
|
| Through good quality headphones, I actually find the Lyra
| artifacts rather piercing and think I'd pretty quickly get
| fatigued through having it in my ears over a long conversation.
| Maybe they would handle this better with a bit of a lowpass
| filter added.
| tyingq wrote:
| If the demos are actually representative, it does seem
| impressive. Could save a lot of bandwidth for VoIP if it replaced
| 8kb/s G729.
| lxgr wrote:
| Isn't VoIP at such low data rates already dominated by the
| overhead of UDP, IP and whatever lower layer? Multiplexing it
| with a low-bandwidth video stream would be possible, though.
|
| I was thinking this could be most relevant for something like
| digital wireless transmissions.
| nousermane wrote:
| To say the least, yeah. At 3kbps and 20ms framing, it's only
| 7.5 bytes of payload per frame.
|
| RTP, UDP, IP, and Ethernet overhead are what - 60-ish bytes?
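|
| A quick sketch of that arithmetic, assuming minimal headers: RTP
| 12 bytes (no CSRC list or extensions), UDP 8, IPv4 20 (no
| options), Ethernet 14 plus a 4-byte FCS. Preamble, inter-frame
| gap, minimum-frame padding and PPPoE are ignored, and the function
| names are made up for illustration.

```python
# Minimal per-packet header sizes in bytes (assumed, see lead-in).
HEADERS = {"rtp": 12, "udp": 8, "ipv4": 20, "ethernet": 14 + 4}
OVERHEAD_BYTES = sum(HEADERS.values())


def payload_bytes(codec_bps: int, frame_ms: int) -> float:
    """Codec payload carried in one packet holding one frame."""
    return codec_bps * frame_ms / 8000


def on_wire_bps(codec_bps: int, frame_ms: int) -> float:
    """Bitrate on the wire once per-packet headers are included."""
    packet_bytes = payload_bytes(codec_bps, frame_ms) + OVERHEAD_BYTES
    return packet_bytes * 8 * 1000 / frame_ms


for name, bps, ms in [("3 kbps, 20 ms", 3000, 20),
                      ("3 kbps, 40 ms", 3000, 40),
                      ("8 kbps, 20 ms", 8000, 20)]:
    print(f"{name}: payload {payload_bytes(bps, ms)} B, "
          f"wire {on_wire_bps(bps, ms) / 1000:.1f} kbps")
```

| Under these assumptions the headers come to 58 bytes per packet,
| so at 20ms framing the 7.5-byte payload is a small fraction of
| what actually goes over the wire.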
| rjsw wrote:
| You might have PPPoE on top of that with another 8 bytes.
| zamadatix wrote:
| 60ish sounds right, though with Ethernet it's going to be
| padded to a minimum 64 bytes regardless. Might not matter
| depending what your bottleneck link actually uses though.
| tyingq wrote:
| G.729 is 21-30kbps with transport overhead, depending on a
| few factors. So shaving off 5kbps would still be meaningful.
| Or better quality at the same bandwidth might enable in-band
| DTMF or fax, neither works on G.729 now.
| lxgr wrote:
| In-band fax will certainly not work over a lossy voice
| codec, unless your fax modem is able to mimic human speech
| patterns.
| wrongdonf wrote:
| We are getting to the point where compression is so good, you
| aren't actually hearing the other person. Wild
| moonbug wrote:
| you'll never know if you're speaking to the Blight.
| Vadoff wrote:
| How come the clean reference wav file is 168KB, while the clean
| Lyra (@3kbps) wav file is significantly larger at 328KB?
| walrus01 wrote:
| Since a browser can't play lyra I think they took the lyra
| output and put it inside something lossless like a 44kHz stereo
| wav so that people can listen to it.
| bscphil wrote:
| When they converted back to wav from Lyra, they used a 32 bit
| 16 kHz wav instead of 16 bit 16 kHz wav like the source. The
| size of the Lyra file is almost exactly 2x as big as the
| reference.
|
| Note that this isn't cheating in any way, the source is the
| source, so it's just a quirk from their conversion process.
| Probably the tooling around Lyra is pretty rudimentary and the
| decoder could only output a 32 bit file.
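|
| The 2x figure follows directly from how uncompressed PCM WAV
| sizes scale with sample width. A minimal sketch, assuming mono
| audio and a plain 44-byte header (real files may carry extra
| chunks); the function name and the 5-second duration are
| illustrative, not measured from the actual samples.

```python
def wav_size_bytes(duration_s: float, sample_rate_hz: int,
                   bits_per_sample: int, channels: int = 1,
                   header_bytes: int = 44) -> int:
    """Approximate size of an uncompressed PCM WAV file."""
    frames = int(duration_s * sample_rate_hz)
    return header_bytes + frames * channels * bits_per_sample // 8


size_16bit = wav_size_bytes(5, 16000, 16)
size_32bit = wav_size_bytes(5, 16000, 32)

print(size_16bit, size_32bit, round(size_32bit / size_16bit, 3))
```

| Doubling the sample width doubles the audio data while the header
| stays the same size, which matches "almost exactly 2x".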
| eznzt wrote:
| Is it any good for languages other than English?
| wmf wrote:
| "As with any ML based system, the model must be trained to make
| sure that it works for everyone. We've trained Lyra with
| thousands of hours of audio with speakers in over 70 languages
| using open-source audio libraries and then verifying the audio
| quality with expert and crowdsourced listeners. ... Lyra trains
| on a wide dataset, including speakers in a myriad of languages,
| to make sure the codec is robust to any situation it might
| encounter."
| kristofferR wrote:
| The Lyra examples are way harder to understand than the rest of
| them though. It injects sounds that weren't there.
|
| It sounds like he said "Someuve" instead of the "Some" clearly
| audible in the other versions.
| walrus01 wrote:
| the 3kbps example here with the 'bread with chocolate filling
| inside' video is frankly amazing, how good it is compared to the
| original.
|
| https://ai.googleblog.com/2021/02/lyra-new-very-low-bitrate-...
|
| It is unfortunate that for now this appears to be proprietary,
| closed source and being treated as a google competitive advantage
| over others, unlike opus which is fully open.
| londons_explore wrote:
| If they add it to WebRTC as they suggest, it will get auto-
| included in nearly all videoconferencing applications (most use
| webrtc under the covers, and a simple git pull will get it
| included in the next release).
| walrus01 wrote:
| They briefly mention the existence and current role of the
| webrtc codecs, but I don't see where they suggest they intend
| to contribute it or open it up as a library others can use.
___________________________________________________________________
(page generated 2021-03-01 23:00 UTC)