[HN Gopher] Lyra audio codec enables high-quality voice calls at...
       ___________________________________________________________________
        
       Lyra audio codec enables high-quality voice calls at 3 kbps bitrate
        
       Author : homarp
       Score  : 109 points
       Date   : 2021-03-01 06:30 UTC (16 hours ago)
        
 (HTM) web link (www.cnx-software.com)
 (TXT) w3m dump (www.cnx-software.com)
        
       | sophiebits wrote:
       | Previous discussion:
       | https://news.ycombinator.com/item?id=26279891
        
       | skyde wrote:
       | this is super impressive
        
       | walrus01 wrote:
       | Microsoft also just announced a 6kbps advanced audio codec:
       | 
       | https://techcommunity.microsoft.com/t5/microsoft-teams-blog/...
        
       | trevorishere wrote:
       | I'd be curious to know how this compares to Microsoft's Satin
       | codec, now used in Teams, which is ML-driven.
        
       | londons_explore wrote:
       | This is blogspam that doesn't add meaningfully to the
       | original[1].
       | 
       | [1]: https://ai.googleblog.com/2021/02/lyra-new-very-low-
       | bitrate-...
        
       | jimktrains2 wrote:
       | I wonder how this compares with codec2[0] which is decent at
       | around 3kbps and can go even lower.
       | 
       | [0] https://en.m.wikipedia.org/wiki/Codec_2
        
       | motiejus wrote:
       | Nothing about licensing or patents. I assume the worst (read:
       | unusable for small businesses)?
       | 
       | 10+ years ago I worked in a small voip shop, where we had very
       | high quality (low jitter), but low bandwidth connection. I
       | researched many codecs of the time (2010-ish).
       | 
       | We liked speex, because it can be used "without strings
       | attached". Also, I can choose the quality depending on the
       | bandwidth. Although for low bandwidth g729 was better. Which we
       | couldn't use because of royalties (but allowed myself to test
       | it).
       | 
       | We chose alaw/ulaw when bandwidth was not a concern, and speex
       | when it was.
       | 
       | Since it does not mention usability outside of google, I also
       | find this comparison unfair or incomplete: if you are comparing a
       | proprietary codec, compare it to g729. If you are comparing a
       | codec to speex, it should be open/free.
       | 
       | Edit: grammar
        
         | lights0123 wrote:
         | I wouldn't be as pessimistic--they invented VP9 and were
         | important to AV1, and continue to promote both and use them
         | everywhere.
        
         | marcodiego wrote:
         | With good enough licensing it could possibly replace speex.
         | 
         | It is sad that we have to think about licensing and patents of
         | technologies instead of only how good or advanced they are.
        
         | 1996 wrote:
         | > Nothing about licensing or patents. I assume the worst (read:
         | unusable for small businesses)?
         | 
         | If there's a free software implementation, and a company
         | offering the service based in the EU (or shop around and find
         | any other jurisdiction where software patents don't matter),
         | it's often YOLO - but call that "legal arbitrage" if you want
         | to sound fancy :)
        
         | kreetx wrote:
         | What are the current choices for "CD quality" speech
         | compression (lossy but indiscernible) at the moment? Just had a
         | discussion with a friend of keeping an always-on speech
         | recorder on and wondered about disk space consumption.
        
         | Aloha wrote:
         | I'm over here preferring g711/ulaw because I prefer the hard
         | roll off at 4kc.
        
         | bscphil wrote:
         | These days the correct comparison would be to Opus, which is
         | similarly unencumbered and performs fantastically at low
         | bitrates (and has a speech specific mode for even lower
         | bitrates, because it's a hybrid of two codecs). It's also
         | extremely low latency, so there's now no reason to accept
         | trade-offs. (For the same bitrate as alaw/ulaw, you can get
         | high quality full band music with Opus.)
         | 
         | These days it's more or less the standard for realtime voice.
         | WebRTC uses it, most of the popular realtime voice applications
         | use it as well, as does Signal.
        
       | etaioinshrdlu wrote:
       | Important to remember that this type of codec can be used as a
       | backup for higher-bandwidth codecs. You don't necessarily need to
       | hear its artifacts all the time. The higher level codec also
       | only needs to encode the differences between the prediction and
       | groundtruth. The same thinking applies to video especially of
       | faces. Neural nets are a huge leap forward for this type of data
       | compression and will likely be used pretty much everywhere in the
       | future with great success.
        
       | hatsunearu wrote:
       | http://www.rowetel.com/wordpress/?page_id=452
       | 
       | Take a look at this too. Also runnable on low power devices. And
       | there was some work of using AI to enhance the codec2 encoded
       | bits too.
        
         | ddevault wrote:
         | Yeah, with Codec 2 setting the gold standard, I don't find this
         | very impressive. I find this more intelligible at one-third of
         | the Lyra bitrate:
         | 
         | http://www.rowetel.com/downloads/codec2/hts2a_1300.wav
         | 
         | Codec 2 does a better job of isolating the parts of sound which
         | are most necessary to intelligible speech, without necessarily
         | caring too much about preserving the original qualities of the
         | speaker's voice or environment.
         | 
         | Fun fact: Codec 2 can be used to transmit voice over IRC:
         | 
         | https://github.com/asiekierka/voirc
        
           | londons_explore wrote:
           | I had to listen to your wav sample 4 times before
           | understanding what it was saying... To me, that isn't
           | intelligible... Perhaps with practice one could learn to
           | understand it, but that isn't really what I want from my
           | audio codec.
        
             | vidarh wrote:
             | The biggest challenge with evaluating all of these is that
             | once you've listened to a comprehensible version of one of
             | these samples, they all sound more intelligible. I had
             | problems with the example too. After hearing the original
             | it's now easy. It makes it really hard to properly assess
             | the intelligibility for developers without decent sized
             | panels of people to help evaluate them.
        
           | alvarlagerlof wrote:
           | Completely unrelated but damn do they need to update the
           | illustration at the top of that page. It's hideous.
        
         | [deleted]
        
       | faebi wrote:
       | Maybe in the future, all we need is a speech example, some AI and
       | the continuous transmission of text for low data voice
       | transmission?
        
         | dheera wrote:
         | I think this will be the rough direction, but not exactly text,
         | rather some other efficient, machine-readable embedding of
         | speech that is also able to carry tone and rhythm effectively
         | and pronunciation accurately and unambiguously.
        
           | cecja wrote:
           | Why speak then?
        
           | wmf wrote:
           | Basically yes. "Features, or distinctive speech attributes,
           | are extracted from speech every 40ms and are then compressed
           | for transmission. The features themselves are log mel
           | spectrograms, a list of numbers representing the speech
           | energy in different frequency bands, which have traditionally
           | been used for their perceptual relevance because they are
           | modeled after human auditory response."
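
A quick back-of-the-envelope check of what the quoted 40ms framing implies, assuming the headline 3 kbps figure is a constant payload bitrate. The function name is illustrative and not taken from Lyra itself.

```python
def bits_per_frame(bitrate_bps: int, frame_ms: int) -> float:
    """Bits available to describe one frame of speech features."""
    return bitrate_bps * frame_ms / 1000


# Assumption: 3 kbps payload bitrate, one feature frame every 40 ms.
budget = bits_per_frame(3000, 40)
print(f"{budget:.0f} bits ({budget / 8:.0f} bytes) per 40 ms frame")
```

Under those assumptions each set of features has to fit in roughly 120 bits, which is why the compression step matters as much as the feature extraction.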
        
         | viraptor wrote:
         | Only if you're ready to kill all the intonation nuance. If
         | you're ok with that, why not stick to just reading text? At
         | least we can use emotes in there.
        
       | temp-dude-87844 wrote:
       | When Google's announcement [1] was posted a few days ago, I
       | listened to their samples and heard an odd effect in the
       | "chocolate bread" sample (the video chat example) [1], which is
       | not mirrored in this article.
       | 
       | On that sample, I felt [2] that the Lyra version exaggerates the
       | pronunciation of the phrase 'with chocolate' in a way that
       | meaningfully differs from the speaker's original. It weakens the
       | voiced 'th' to nothingness, and overshoots both the lead
       | consonant and first vowel of 'choc', and then proceeds to wash
       | the entire rest of the sentence with a peculiar brightened voice
       | that's high, lacks consonant definition, and is close to ringing.
       | 
       | I'm guessing it's actually style transfer, because though the
       | result sounds not much like the speaker's original, the result is
       | reminiscent of the speech pattern and accent that people with
       | East Asian and Southeast Asian ancestry adopt when speaking
       | American English. It was surprising, given that the speaker
       | doesn't sound like that in the original. I wonder if others hear
       | this too.
       | 
       | While Lyra sounds richer and wider-band than Opus or Speex at
       | these bitrates, the degradations and artifacts of those codecs
       | are universally recognized (through years of familiarity with
       | telephones) as compression artifacts and not innate features of
       | the speaker themselves. Therefore listeners can be expected to be
       | sympathetic to the quality issues and not attribute the whole of
       | the sound to the speaker's person.
       | 
       | If AI-trained voice synthesizer codecs become the norm, and they
       | perform well on most speakers, that expectation will go away,
       | and the resulting audio will be attributed wholly to the speaker.
       | That increases the impact of mistakes and misrepresentations
       | introduced by the codec, unbeknownst to the speaker and listener.
       | 
       | [1] https://ai.googleblog.com/2021/02/lyra-new-very-low-
       | bitrate-...
       | 
       | [2] https://news.ycombinator.com/item?id=26282519
        
         | BugsJustFindMe wrote:
         | > _'with chocolate' in a way that meaningfully differs from
         | the speaker's original. It weakens the voiced 'th' to
         | nothingness_
         | 
         | I honestly don't hear a 'th' in the original.
         | 
         | > _It was surprising, given that the speaker doesn't sound
         | like that in the original._
         | 
         | I disagree. Note that the speaker says "these bread". The three
         | possibilities for those two words--"these bread", "thiiiis
         | bread", and "these breads" with a dropped "s"--would all be
         | weird things for a native English speaker to say for different
         | reasons relating to either wrong pronunciation of "this" or
         | "breads" or the fact that bread is its own collective noun and
         | therefore we typically require separate qualifiers like "these
         | buns" or "these loaves" when separating multiple individual
         | "pieces" (another) into a non-collective. We ask for "some
         | bread" or "a piece of bread", but we don't say "a bread" or
         | "some breads" unless we are discussing categorical types of
         | bread ("ciabatta and rye are breads") rather than instances of
         | such, and only one type of bread is represented in the video.
         | 
         | The Lyra reproduction has a band-pass filtered quality to it,
         | but I find it still remarkably representative of the reference.
        
           | cbdumas wrote:
           | I agree completely, I've listened to the reference sample
           | probably ten times now and I can only hear /wI/
        
         | scotty79 wrote:
         | "When a man looks for something beyond his reach ..."
         | 
         | The word "looks" sounds completely wrong for me with Lyra. To
         | the point of completely not understanding what this word is
         | supposed to be (first example with your [1] link).
        
           | est31 wrote:
           | For me "looks" sounds fine but the word before, "Man", sounds
           | like "Lan". So to me the opus sample sounds more
           | understandable. Even though the "quality" of Lyra is better,
           | that shouldn't be the score to optimize for, but fidelity of
           | the compression. It's not helpful if the compression
           | algorithm generates a beautiful flower from a flower image
           | but it's a red flower instead of a blue one like the
           | original. Gives me Xerox vibes...
        
         | ampdepolymerase wrote:
         | Are the speech models sufficiently generic across all
         | languages?
        
         | cityzen wrote:
         | I read your comment before I watched that video and I can't
         | stop laughing. It sounds ridiculous!
        
       | [deleted]
        
       | [deleted]
        
       | rectang wrote:
       | What I want to know is whether Lyra takes any longer to encode
       | than the alternatives.
       | 
       | Because as far as I can tell, nobody cares in the slightest about
       | latency.
       | 
       | Phone calls are getting to be like writing postcards to each
       | other. Speak in a whole paragraph. Wait several seconds for the
       | latency to clear. Then the other party responds with a whole
       | paragraph, waits several seconds for the latency to clear...
       | 
       | Improvements to fidelity are nice-to-have, but I would like some
       | real-time in my real-time communications, please.
        
         | kixiQu wrote:
         | In the article,
         | 
         | > This trick enables Lyra to not only run on cloud servers, but
         | also on-device on mid-range phones in real time (with a
         | processing latency of 90ms, which is in line with other
         | traditional speech codecs).
         | 
         | Does that not cover it?
        
           | regularfry wrote:
           | It covers it, but it's not exactly brilliant. 200ms is the
           | point at which conversation breaks down. If half that budget
           | has gone on the codec, not much has to happen on the wire for
           | it to be noticeable.
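
The budget argument above can be made concrete with a small sketch. The 200ms threshold and the 90ms codec figure are the numbers quoted in this subthread; the function names are made up for illustration.

```python
CONVERSATION_BUDGET_MS = 200  # threshold claimed in the comment above
LYRA_PROCESSING_MS = 90       # processing latency quoted from the article


def remaining_budget_ms(codec_ms: float, budget_ms: float = CONVERSATION_BUDGET_MS) -> float:
    """Latency left for network, jitter buffer and audio I/O."""
    return budget_ms - codec_ms


def codec_share(codec_ms: float, budget_ms: float = CONVERSATION_BUDGET_MS) -> float:
    """Fraction of the end-to-end budget consumed by the codec alone."""
    return codec_ms / budget_ms


print(remaining_budget_ms(LYRA_PROCESSING_MS))  # ms left for everything else
print(codec_share(LYRA_PROCESSING_MS))          # fraction used by the codec
```

With these inputs the codec takes a little under half of the budget, leaving about 110ms for the wire, buffering and the audio stack.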
        
           | jtsiskin wrote:
           | The "mid-range phones" has me suspicious, I wish they defined
           | that better. And 90ms is much higher than what Opus is
           | supposed to achieve
        
         | reaperducer wrote:
         | I had the pleasure of using a real landline just before the
         | pandemic. Honest wire-to-wire connection between two ranches,
         | so no silly VOIP steps between.
         | 
         | It was fantastic.
         | 
         | You don't appreciate how much latency is destroying our ability
         | to communicate verbally until you go back to the old way.
         | 
         | One example is arguing. It's no wonder people used to be able
         | to argue with one another on a telephone. You could raise your
         | voice and still hear the other side and adjust your speech in
         | real time. Today it's just one party shouting over the other to
         | drown the opponent out.
        
           | rectang wrote:
           | Between miserable latency, not-so-great fidelity, and the
           | fecklessness of phone companies in the face of the robocall
           | epidemic, I have come to hate phone calls.
           | 
           | I'm rooting for something to replace phone communications.
           | Any chance that Matrix can do better on any of those fronts?
           | Especially on fidelity and latency since they're germane to
           | the high-level subject of this discussion.
        
           | Aloha wrote:
           | A cell phone tbh, is about the same latency as a landline in
           | most end to end call circumstances. Latency only really is
           | noticeable when greater than 600ms. (And only a real problem
           | over 1000)
        
             | ubercow13 wrote:
             | Imagine trying to have a face-to-face conversation with
             | 600ms of latency...
        
             | lynndotpy wrote:
             | I think latency is noticeable at even lower values. As a
             | basic example, try to sing a song with someone over a voice
             | call. Consider using Airpods or similar bluetooth
             | headphones to make it more apparent.
        
             | kwindla wrote:
             | I disagree with these numbers, in general. Though of course
             | "noticeable" is subjective and varies by use case as well
             | as by person.
             | 
             | For many people, end-to-end audio latency in a 1:1
             | conversation becomes noticeable/annoying at 200ms. And in a
             | multi-participant conversation, talking over each other
             | becomes noticeably more common even at 100ms compared to
             | 50ms.
        
         | kevin_thibedeau wrote:
         | Just disable VoIP or disconnect from WiFi.
        
           | chrisseaton wrote:
           | Isn't the actual phone network packet-switched and running
           | over fibre optics now anyway? I don't think you can get a
           | literally analog phone call anymore can you?
        
         | kwindla wrote:
         | This is a fantastic question. I agree with you that we're
         | slowly boiling the frog (and the frog is ourselves) in
         | accepting more and more latency in our real-time
         | communications.
         | 
         | I think the answer for Lyra is that latency is a concern, but
         | maybe at this stage not as much of a concern as it could be.
         | I'm only guessing, though, based on this [0]:
         | 
         | > The basic architecture of the Lyra codec is quite simple.
         | Features, or distinctive speech attributes, are extracted from
         | speech every 40ms and are then compressed for transmission.
         | 
         | That sounds like the minimum frame size for Lyra is 40ms. For
         | Opus (the audio codec used for most WebRTC applications), the
         | default frame size is 20ms [1], and most implementations
         | support frame sizes of 10ms [2].
         | 
         | Of course, your favorite web browser might not default to 20ms
         | frames for Opus. And by "most implementations" I meant Google
         | Chrome. :-)
         | 
         | [0] https://ai.googleblog.com/2021/02/lyra-new-very-low-
         | bitrate-...
         | 
         | [1] https://tools.ietf.org/html/rfc7587#section-6.1
         | 
         | [2]
         | https://chromium.googlesource.com/external/webrtc/+/HEAD/mod...
        
           | toomim wrote:
           | Google Chrome has a latency of 20ms to just repeat back audio
           | _on the local device_.
           | 
           | That is, with no networking, and no processing, it takes 20ms
           | for any information to get from microphone back out to speakers.
        
       | boneitis wrote:
       | I found the Lyra codecs in both examples easily the most
       | difficult to comprehend, even compared against the scratchy
       | Speex.
       | 
       | Am I the only one? It is a little odd to me to see the praise
       | here and on the previous discussion.
       | 
       | To be fair, I am convinced I have APD (and, to be fair again, I
       | have never got it checked out).
       | 
       | E: Just realized there is a third example. Perhaps it is not as
       | strong a statement due to Opus' doubled bitrate, but it is still
       | far scratchier. Yet, it is more decipherable than the Lyra codec
       | to me.
        
         | formerly_proven wrote:
         | I have to agree... Lyra has a tape flutter like effect and
         | requires conscious effort to decipher. I can see how by an SNR
         | and similar metric Lyra might outperform the others, but that
         | then just goes to show that it's not a good proxy metric for
         | perceived quality.
         | 
         | Edit: In the third example it's a little bit closer, but Lyra
         | again sounds like flutter and a mumbling speaker, while Opus
         | sounds like a clear speaker with lots of noise.
        
         | bscphil wrote:
         | I personally wouldn't say "difficult to comprehend". I would
         | say that the Lyra audio is "cleaner" but that the artifacts
         | that there are, are louder and more annoying in Lyra. There's a
         | very bad ringing effect and some flutter. If you personally
         | find these artifacts distracting or confusing, I could very
         | easily see the Lyra examples being harder to understand.
         | 
         | I'm almost certain that Lyra has increased the volume on the
         | first sample too. It's quite audible, although I haven't
         | confirmed this with Audacity.
         | 
         | Through good quality headphones, I actually find the Lyra
         | artifacts rather piercing and think I'd pretty quickly get
         | fatigued through having it in my ears over a long conversation.
         | Maybe they would handle this better with a bit of a lowpass
         | filter added.
        
       | tyingq wrote:
       | If the demos are actually representative, it does seem
       | impressive. Could save a lot of bandwidth for VoIP if it replaced
       | 8kb/s G729.
        
         | lxgr wrote:
         | Isn't VoIP at such low data rates already dominated by the
         | overhead of UDP, IP and whatever lower layer? Multiplexing it
         | with a low-bandwidth video stream would be possible, though.
         | 
         | I was thinking this could be most relevant for something like
         | digital wireless transmissions.
        
           | nousermane wrote:
           | To say the least, yeah. At 3kbps and 20ms framing, it's only
           | 7.5 bytes of payload per frame.
           | 
           | RTP, UDP, IP, and Ethernet overhead are what - 60-ish bytes?
        
             | rjsw wrote:
             | You might have PPPoE on top of that with another 8 bytes.
        
             | zamadatix wrote:
             | 60ish sounds right, though with Ethernet it's going to be
             | padded to a minimum 64 bytes regardless. Might not matter
             | depending on what your bottleneck link actually uses though.
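
The overhead estimate in this subthread can be reproduced with a short sketch. It assumes the minimal header sizes (12-byte RTP header with no CSRC entries, 8-byte UDP, 20-byte IPv4 without options, 14-byte Ethernet header plus 4-byte FCS) and ignores PPPoE, VLAN tags, preamble and inter-frame gap. The function names are illustrative.

```python
import math

RTP_HEADER = 12          # fixed RTP header, no CSRC list or extensions
UDP_HEADER = 8
IPV4_HEADER = 20         # no IP options
ETHERNET_OVERHEAD = 18   # 14-byte header + 4-byte FCS
ETHERNET_MIN_FRAME = 64  # shorter frames are padded up to this size


def payload_bytes(bitrate_bps: int, frame_ms: int) -> float:
    """Codec payload carried in one packet."""
    return bitrate_bps * frame_ms / 1000 / 8


def wire_bytes(bitrate_bps: int, frame_ms: int) -> int:
    """Bytes per packet on an Ethernet link, including padding."""
    payload = math.ceil(payload_bytes(bitrate_bps, frame_ms))
    frame = payload + RTP_HEADER + UDP_HEADER + IPV4_HEADER + ETHERNET_OVERHEAD
    return max(frame, ETHERNET_MIN_FRAME)


def wire_bitrate_bps(bitrate_bps: int, frame_ms: int) -> float:
    """Effective bitrate once per-packet overhead is included."""
    return wire_bytes(bitrate_bps, frame_ms) * 8 * 1000 / frame_ms


for name, rate, frame in [("3 kbps / 20 ms", 3000, 20),
                          ("3 kbps / 40 ms", 3000, 40),
                          ("G.729 8 kbps / 20 ms", 8000, 20)]:
    print(f"{name}: {wire_bytes(rate, frame)} B/packet, "
          f"{wire_bitrate_bps(rate, frame) / 1000:.1f} kbps on the wire")
```

Under these assumptions the 7.5-byte payload rides in a 66-byte frame, so a 3 kbps stream costs about 26.4 kbps at 20ms framing. Longer frames amortize the headers better, at the price of added latency.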
        
           | tyingq wrote:
           | G.729 is 21-30kbps with transport overhead, depending on a
           | few factors. So shaving off 5kbps would still be meaningful.
           | Or better quality at the same bandwidth might enable in-band
           | DTMF or fax, neither works on G.729 now.
        
             | lxgr wrote:
             | In-band fax will certainly not work over a lossy voice
             | codec, unless your fax modem is able to mimic human speech
             | patterns.
        
       | wrongdonf wrote:
       | We are getting to the point where compression is so good, you
       | aren't actually hearing the other person. Wild
        
         | moonbug wrote:
         | you'll never know if you're speaking to the Blight.
        
       | Vadoff wrote:
       | How come the clean reference wav file is 168KB, while the clean
       | Lyra (@3kbps) wav file is significantly larger at 328KB?
        
         | walrus01 wrote:
         | Since a browser can't play lyra I think they took the lyra
         | output and put it inside something lossless like a 44kHz stereo
         | wav so that people can listen to it.
        
         | bscphil wrote:
         | When they converted back to wav from Lyra, they used a 32 bit
         | 16 kHz wav instead of 16 bit 16 kHz wav like the source. The
         | size of the Lyra file is almost exactly 2x as big as the
         | reference.
         | 
         | Note that this isn't cheating in any way, the source is the
         | source, so it's just a quirk from their conversion process.
         | Probably the tooling around Lyra is pretty rudimentary and the
         | decoder could only output a 32 bit file.
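
The 2x explanation can be checked with a rough size calculation. This sketch assumes mono audio, a canonical 44-byte WAV header (real files may carry extra chunks) and a hypothetical 5-second clip; none of these details are confirmed by the thread.

```python
WAV_HEADER_BYTES = 44  # canonical PCM header; an assumption, real files vary


def pcm_wav_bytes(seconds: float, sample_rate_hz: int,
                  bits_per_sample: int, channels: int = 1) -> int:
    """Approximate size of an uncompressed WAV file."""
    samples = round(seconds * sample_rate_hz)
    return WAV_HEADER_BYTES + samples * channels * bits_per_sample // 8


def encoded_stream_bytes(seconds: float, bitrate_bps: int) -> int:
    """Size of the compressed stream itself, without any container."""
    return int(seconds * bitrate_bps) // 8


clip_seconds = 5  # hypothetical clip length
print(pcm_wav_bytes(clip_seconds, 16000, 16))    # 16-bit reference
print(pcm_wav_bytes(clip_seconds, 16000, 32))    # 32-bit decoder output
print(encoded_stream_bytes(clip_seconds, 3000))  # what 3 kbps would actually send
```

The bit depth of the decoded WAV says nothing about the codec's bitrate: the 3 kbps stream for the same clip would be on the order of a couple of kilobytes.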
        
       | eznzt wrote:
       | Is it any good for languages other than English?
        
         | wmf wrote:
         | "As with any ML based system, the model must be trained to make
         | sure that it works for everyone. We've trained Lyra with
         | thousands of hours of audio with speakers in over 70 languages
         | using open-source audio libraries and then verifying the audio
         | quality with expert and crowdsourced listeners. ... Lyra trains
         | on a wide dataset, including speakers in a myriad of languages,
         | to make sure the codec is robust to any situation it might
         | encounter."
        
       | kristofferR wrote:
       | The Lyra examples are way harder to understand than the rest of
       | them though. It injects sounds that weren't there.
       | 
       | It sounds like he said "Someuve" instead of the "Some" clearly
       | audible in the other versions.
        
       | walrus01 wrote:
       | the 3kbps example here with the 'bread with chocolate filling
       | inside' video is frankly amazing, how good it is compared to the
       | original.
       | 
       | https://ai.googleblog.com/2021/02/lyra-new-very-low-bitrate-...
       | 
       | It is unfortunate that for now this appears to be proprietary,
       | closed source and being treated as a google competitive advantage
       | over others, unlike opus which is fully open.
        
         | londons_explore wrote:
         | If they add it to WebRTC as they suggest, it will get auto-
         | included in nearly all videoconferencing applications (most use
         | webrtc under the covers, and a simple git pull will get it
         | included in the next release).
        
           | walrus01 wrote:
           | They briefly mention the existence and current role of the
           | webrtc codecs, but I don't see where they suggest they intend
           | to contribute it or open it up as a library others can use.
        
       ___________________________________________________________________
       (page generated 2021-03-01 23:00 UTC)