[HN Gopher] MLow: Meta's low bitrate audio codec
       ___________________________________________________________________
        
       MLow: Meta's low bitrate audio codec
        
       Author : mikece
       Score  : 333 points
       Date   : 2024-06-13 15:05 UTC (7 hours ago)
        
 (HTM) web link (engineering.fb.com)
 (TXT) w3m dump (engineering.fb.com)
        
       | PaulHoule wrote:
       | Sometimes it sounds great but there are moments I think I'm
       | listening to a harp and not somebody's voice.
        
         | plus wrote:
         | It's not exactly reasonable to expect super high fidelity audio
         | at the bitrate constraints they're targeting here, and it
         | certainly sounds a lot better than the Opus examples they're
         | comparing against.
        
           | cobbal wrote:
           | The more complicated the codec, the more fascinating the
           | failure modes. I love watching a digital TV with a bad
           | signal, because the motion tracking in the codec causes
           | people to wear previous, glitched frames as a skin while they
           | move.
        
             | ugjka wrote:
             | Look up datamoshing on youtube
        
             | cnity wrote:
             | Good observation, and probably part of what makes "glitchy"
             | AI generated video so captivating to watch.
        
           | 77pt77 wrote:
           | Are they comparing against opus using nolace?
           | 
           | Because that makes all the difference!
        
       | Tostino wrote:
       | That is a marked improvement compared to the other examples
        | provided. Nice to see it also requires fewer compute resources
        | for that higher-quality output.
        
       | zekica wrote:
       | Honest question: why do we need to optimize for <10kbps? It's
       | really impressive what they are able to achieve at 6kbps, but LTE
       | already supports >32kbps and there we have AMR-WB or Opus (Opus
       | even has in-band FEC at these bitrates so packet loss is not that
       | catastrophic). Maybe it's useful in satellite direct-to-phone
       | use-cases?
        
         | ThrowawayTestr wrote:
         | > why do we need to optimize for <10kbps?
         | 
         | Because some people have really slow internet
        
         | hokumguru wrote:
         | There exist a few billion people without LTE. Meta doesn't only
         | operate in the western world.
        
           | noprocrasted wrote:
           | Are there really many situations where a 10kbps connection
           | would actually be _stable_ enough to be usable? Usually when
           | you get these kinds of speeds it means the underlying
           | connection is well and truly compromised, and any kind of
            | real-time audio would fail anyway because you're drowning in
           | a sea of packet loss and retransmissions.
           | 
           | Even in cases where you do get a _stable_ 10kbps connection
           | from upstream, how are you going to manage getting any usable
           | traffic through it when everything nowadays wastes bandwidth
            | and competes with you (just look at any iOS device's
           | background network activity - and that's before running any
           | apps which usually embed dozens of malicious SDKs all
           | competing for bandwidth)?
        
             | bogwog wrote:
             | I don't know what you consider "stable enough", but the 30%
             | packet loss demo in the article is pretty impressive.
        
             | zeroxfe wrote:
             | > Are there really many situations where a 10kbps
             | connection would actually be stable enough to be usable?
             | 
             | Yes there are. We ran on stable low bandwidth connections
             | for a very long time before we had stable high bandwidth
             | connections. A large part of the underdeveloped world has
              | very low bandwidth and uses 5-10 kbps voice channels.
        
               | noprocrasted wrote:
               | > We ran on stable low bandwidth connections
               | 
               | Are you talking about the general "we" or your situation
               | in particular? For the former, yes sure we started with
               | dial-up, then DSL, etc, but back then software was built
               | with these limitations in mind.
               | 
               | Constant background traffic for "product improvement"
               | purposes would be completely unthinkable 20 years ago;
               | now it's the norm. All this crap (and associated TLS
               | handshakes) quickly adds up if all you've got is kilobits
               | per second.
        
               | dspillett wrote:
                | _> Are you talking about the general "we"_
               | 
                | I assume the general-ish "we", general to the likes of
                | you and me (and zeroxfe). There are likely many in the
                | world stuck at the end of connections run over tech that
                | this "general subset" would consider archaic, and I take
                | zeroxfe to be implying that their connections, while
                | slow, may be similarly stable to ours back then.
               | 
               | Also, a low bandwidth stable connection could be one of
               | many multiplexed through a higher bandwidth stable
               | connection.
        
               | zeroxfe wrote:
               | Let's not move the goalposts here :-) The context is an
               | audio codec, not heavyweight web applications, in
               | response to your question "Are there really many
               | situations where a 10kbps connection would actually be
               | stable enough to be usable?" And I'm saying yes, in that
               | context, there are many situations, like VoIP, where
               | 10kbps is usable.
               | 
               | Nobody here would argue that 10kbps is usable today for
               | the "typical" browser-based Internet use.
        
             | meindnoch wrote:
             | >Are there really many situations where a 10kbps connection
             | would actually be stable enough to be usable?
             | 
             | Scroll to this part of the article:
             | 
             | >Here are two audio samples at 14 kbps with heavy 30
             | percent receiver-side packet loss.
        
             | gorkish wrote:
             | Yes; backhaul connections in telephony applications are
             | often very stable and are already capacity managed by
             | tuning codec bandwidth. Say you are carrying 1000 calls
             | with uLaw (64kbps * 1000) over a pair of links and one
             | fails. Do you A) carry 500 calls on the remaining link B)
             | stuff all calls onto the same link and drop 50% of the
             | packets or C) Change to a 32kbps codec?
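              | 
              | (A rough sketch of that arithmetic; the per-link capacity
              | below is just the example's 1000 x 64kbps split over two
              | links:)
              | 
              |   # 1000 calls at 64 kbps (uLaw), spread over two links.
              |   link_kbps = 1000 * 64 // 2    # one link: 32,000 kbps
              |   print(link_kbps // 64)  # 500  -> option A: drop half
              |   print(link_kbps // 32)  # 1000 -> option C: keep them all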
             | 
              | It seems you may be imagining the failure case where your
             | "ISP is slow" or something like that due to congestion or
             | packet loss -- as I posted elsewhere in the thread the
             | bandwidth is only one aspect of how a "low bitrate" codec
             | may be expected to perform in a real world application. How
             | such a codec degrades when faced with bit errors or even
             | further reduced channel capacity is often more important in
             | the real application. These issues are normally solved with
             | things like FEC which can be incorporated as part of the
             | codec design itself or incorporated as part of the
             | modem/encoding/modulation of the underlying transport.
        
               | wmf wrote:
               | Facebook Messenger and WhatsApp don't run over TDM
               | though. If WhatsApp is only getting ~10 kbps, that's due
               | to extreme congestion.
        
               | gorkish wrote:
               | Yes; but what is your point? A congested network like you
                | describe isn't ever going to reliably carry realtime
               | communications anyway due to latency and jitter. All you
                | could reasonably do to 'punch through' that situation is
               | to use dirty tricks to give your client more than its
               | fair share of network resources.
               | 
               | 6kbps is 10x less data to transfer than 64kbps, so for
               | all the async aspects of Messenger or WhatsApp there is
               | still enormous benefit to smaller data.
        
             | dspillett wrote:
             | _> Are there really many situations where a 10kbps
             | connection would actually be stable enough to be usable?_
             | 
              | Yes (most likely: that was an intuited "yes", not one born
              | of actually checking facts!). There are many places still
              | running things over POTS rather than anything like (A)DSL;
              | line quality issues could push that down low, and even if
             | you have a stable 28kbit/s you might want to do something
             | with it at the same time as the audio comms.
             | 
             | Also, you may be trying to cram multiple channels over a
             | relatively slow (but stable) link. Given the quality of the
             | audio when calling some support lines I suspect this is
             | very common.
             | 
             | Furthermore, you might find a much faster _unstable_
             | connection with a packet-loss "correcting" transport
             | layered on top effectively producing a stable connection of
              | much lesser speed (though you might get periods of <10kbit
             | here due to prolonged dropouts and/or have to institute an
             | artificial delay if the resend latency is high).
        
           | treflop wrote:
            | Even in the Western world, you can appreciate low bandwidth
            | apps even when you are at a music festival or traveling
            | through relative wilderness.
        
         | gorkish wrote:
         | It's useful.
         | 
         | AMBE currently has a stranglehold in this area and by any and
         | every measurable metric, AMBE is terrible and should be burned
         | in the deepest fires of hell and obliterated from all of
         | history.
        
         | londons_explore wrote:
         | Internet connectivity tends to have a throughput vs latency
         | curve.
         | 
         | If you need reliable low latency, as you want for a phone call,
         | you get very little throughput.
         | 
         | Examples of such connections are wifi near the end of the
         | range, or LTE connections with only one signal bar.
         | 
         | In those cases, a speedtest might say you have multiple
         | megabits available, but you probably only have kilobits of
         | bandwidth if you want reliable low latency.
        
           | zekica wrote:
            | Yes, but it doesn't have to be. Have you looked into Dave
            | Taht's crusade against bufferbloat?
        
             | lxgr wrote:
             | Correct buffer sizing isn't a good solution for
             | Bufferbloat: The ideal size corresponds to the end-to-end
             | bandwidth-delay product, but since one buffer can handle
             | multiple flows with greatly varying latencies/delays, that
             | number does not necessarily converge.
             | 
             | Queueing aware scheduling algorithms are much more
             | effective, are readily available in Linux (tc_codel and
             | others), and are slowly making their way into even consumer
             | routers (or at least I hope).
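              | 
              | (A rough illustration of why a single "ideal" size doesn't
              | exist; the link speed and RTTs below are made-up numbers:)
              | 
              |   # Ideal buffer ~= bandwidth-delay product, per flow.
              |   link_bps = 50_000_000            # 50 Mbit/s bottleneck
              |   for rtt_ms in (10, 80, 300):     # LAN, WAN, satellite
              |       bdp = link_bps * (rtt_ms / 1000) / 8
              |       print(rtt_ms, "ms RTT ->", round(bdp / 1024), "KiB")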
        
           | lxgr wrote:
           | Load ratios of > 0.5 are definitely achievable without
           | entering Bufferbloat territory, and even more is possible
           | using standing queue aware schedulers such as CoDel.
           | 
           | Also, Bufferbloat is usually not (only) caused by you, but by
           | people sharing the same chokepoint as you in either or both
           | directions. But if you're lucky, the router owning the
           | chokepoint has at least some rudimentary per-flow or per-IP
           | fair scheduler, in which case sending less yourself can
           | indeed help.
           | 
           | Still, to have that effect result in a usable data rate of
           | kilobits on a connection that can otherwise push megabits
           | (disregarding queueing delay), the chokepoint would have to
           | be severely overprovisioned and/or extremely poorly
           | scheduled.
        
         | cornstalks wrote:
         | There's a section ("Our motivation for building a new codec")
         | in the article that directly addresses this. Assuming you have
         | >32 kbps bandwidth available is a bad assumption.
        
           | nicce wrote:
            | The best assumption would be that you either have a
            | connection available or you don't.
            | 
            | Then, if it is available, what is the minimal data rate for
            | connections that are available in general? If we do a
            | statistical analysis of that, is it lower than 32 kbps? How
            | significantly?
            | 
            | For some reason, I would assume that if you have a
            | connection, it is faster than 2G these days.
        
             | zamadatix wrote:
              | The question isn't really the minimal bandwidth of the PHY
              | rate; it's about the goodput for a given reliability.
              | Regardless of your radio, there will always be some point
              | where someone is at the edge of a connection and goodput is
              | less than the minimal PHY bandwidth. The call then turns
              | choppy, or into a time-stretched robot you only get every
              | other syllable from. The less data you need to transmit and
              | the more FEC you can fit in the goodput, the better that
              | situation becomes.
             | 
             | Not to mention "just because I have some minimal baseline
             | of $x kbps doesn't mean I want $y to use all of it the
             | entire time I'm on a call if it doesn't have to".
        
             | sangnoir wrote:
             | > For some reason, I would assume that if you have
             | connection, it is faster than 2G these days.
             | 
             | That assumption does not hold for a sizable chunk of Meta's
              | 3.98B-strong userbase. The list of countries that have
              | switched off 2G is surprisingly short.
        
               | nicce wrote:
                | Now that you mention it, Wikipedia seems to have an
                | interesting list about that. It seems that by 2030 most
                | of them will only start switching off.
               | 
               | https://en.wikipedia.org/wiki/2G
        
         | hateful wrote:
         | It's not only about the end that's receiving, it's also the end
         | that's transmitting 10kbps * thousands of users.
        
         | sogen wrote:
         | I'm assuming they'll just re-encode everything, for every user,
         | to a lower bitrate using this codec.
         | 
         | So, with their huge user base they'll be saving a gazillion
         | terabytes hourly, that's what I concluded from their "2 years
         | in the making" announcement.
        
           | ajb wrote:
            | If you mean for storage, real time codecs are actually pretty
            | inefficient for that use case because they don't make much
            | use of temporal redundancy. I'm not actually aware of a
            | non-real-time audio codec specialised for voice, though. They
            | probably exist in Cheltenham and Maryland, but for Meta this
            | likely isn't a big enough part of their storage costs to
            | bother.
        
         | lxgr wrote:
          | Meta's use case is OTT applications on the Internet, which are
         | usually billed per byte transmitted. Reducing the bitrate for
         | the audio codec used lets people talk longer per month on the
         | same data plan.
         | 
         | That said, returns are diminishing in that space due to the
         | overhead of RTP, UDP and IP; see my other comment for details
         | on that.
        
           | evandrofisico wrote:
           | More than that, in developing countries, such as my own, Meta
           | has peering agreements with telephony companies which allow
           | said companies to offer basic plans where traffic to Meta
           | applications (mostly whatsapp) is not billed. This would
           | certainly reduce their costs immensely, considering that
           | people use whatsapp as THE communications service.
        
         | kylehotchkiss wrote:
         | Maybe something like this would be helpful for Apple to
         | implement voice messages over satellite. Also a LOT of people
         | in developing countries use WhatsApp voice messages with slow
         | network speeds or expensive data. It's too easy to forget how
         | big an audience Meta has outside the western world
        
       | gorkish wrote:
       | The lack of any reference or comparison to Codec2 immediately
       | leads me to question the real value and motivation of this work.
       | The world doesn't need another IP-encumbered audio codec in this
       | space.
        
         | muizelaar wrote:
         | They also don't compare with Lyra
         | (https://github.com/google/lyra)
        
           | cvg wrote:
           | Nice. Google's soundstream already has some great quality.
           | Some examples at 6kbps here: https://google-
           | research.github.io/seanet/soundstream/example...
        
           | gorkish wrote:
           | Or speex narrowband or others. I think the tendency to pick
           | Opus is just because it has a newer date on it -- its design
           | goals were not necessarily to optimize for low bitrate; Opus
           | just happened to still sound OK when the knob was turned down
           | that far.
           | 
           | One other point I intended to make that is not reflected in
           | many listening/comparison tests offered by these
           | presentations -- in the typical applications of low bitrate
           | codecs, they absolutely must be able to gracefully degrade.
            | We see MLow performing at 6kbps here; how does it perform
            | with 5% bit errors? Can it be tuned for lower bitrates like
            | 3kbps? A codec with a 6kbps floor that garbles into nonsense
            | with a single bit flip would be dead-on-arrival for most
            | real-world applications. If you have to double the bitrate
            | with FEC to make it reliable, have you really designed a low
            | bitrate codec? The only example we heard of MLow was 30% loss
            | on a 14kbps stream = 9.8kbps. Getting 6kbps through such a
           | channel is a trivial exercise.
        
             | DragonStrength wrote:
             | My understanding was Opus was specifically developed with
             | the idea of replacing both Speex and Vorbis. "Better
             | quality than Speex" is literally one of their selling
             | points, so I'd be interested to hear more details.
        
         | Dwedit wrote:
         | There's also the LPCNet Codec (2019), which does wideband
         | speech at 1.6kb/s by using a recurrent neural network.
         | 
         | https://jmvalin.ca/demo/lpcnet_codec/
        
       | nickels wrote:
       | Could it be used for voice over satellite, ie Emergency SOS via
       | satellite on iPhones?
        
         | lxgr wrote:
         | iPhones use Globalstar, which theoretically supports voice
         | bitrates of (I believe) 9.6 kbps, although only using dedicated
         | satphones with large, external antennas.
         | 
         | Apple's current solution requires several seconds to transmit a
          | location stamp of only a handful of bytes, so I think we're
          | either some iPhone or some satellite upgrades away from
          | real-time voice communication over that.
         | 
         | Starlink has demonstrated a direct-to-device video call
         | already, though, so we seem to be quickly approaching that
         | point! My strong suspicion is that Apple has bigger plans for
         | Globalstar than just text messaging.
        
           | zekica wrote:
           | Starlink is in a better position as their satellites are in a
           | low earth orbit - 30 times closer than geostationary. It
           | correlates to 1000 times (30dB) stronger signal on both
           | sides.
        
             | lxgr wrote:
             | Globalstar is LEO as well, although a bit higher (~1400 km)
             | than Iridium (~780 km) and Starlink (below Iridium; various
             | altitudes). In terms of SNR, they're very comparable.
             | 
             | Newer GEO direct-to-device satellites also have huge
             | reflectors and often much higher transmit power levels that
             | can compensate for the greater distance somewhat. Terrestar
             | and Thuraya have had quite small phones available since the
             | late 2000s already, and they're both (large) GEO.
        
             | ianburrell wrote:
             | Iridium and Globalstar aren't geostationary. They are LEO
             | not much higher than Starlink.
             | 
             | Starlink is doing direct-to-cell. Talking to existing
             | phones requires a large antenna. The bandwidth for each
             | device is slow, not enough for mobile data, but better than
             | Iridium. I think they recently showed off voice calls.
        
       | chronogram wrote:
        | No mention of NoLACE makes the comparison samples a bit less
       | useful: https://opus-codec.org/demo/opus-1.5/
        
         | jamal-kumar wrote:
         | That does sound very nice
        
         | sitkack wrote:
         | This is really cool and I very very very much appreciate that
         | xiph puts so much work into standardization.
         | https://datatracker.ietf.org/wg/mlcodec/documents/
         | 
         | It would be nice if Meta donated this to the world so we have
          | fewer anchors for patent trolls and can transition to the
          | future we deserve.
        
       | 77pt77 wrote:
       | Where is the source code?
        
       | lxgr wrote:
       | All these new, low-bitrate codecs are amazing, but ironically I
       | suspect that they won't actually be very useful in most of the
       | scenarios Meta is using them:
       | 
       | To keep latency low in real-time communications, the packet rate
       | needs to be relatively high, and at some point the overhead of
       | UDP, IP, and lower layers starts dominating over the actual
       | payload.
       | 
       | As an example, consider (S)RTP (over UDP and IP): RTP adds at
       | least 12 bytes of overhead (let's ignore the SRTP authentication
        | tag for now); UDP adds 8 bytes, and IPv4 adds 20, for a total of
        | 40. At a typical packet rate of 50 per second (for a
       | serialization delay of 1/50 = 20ms), that's 16 kbps of overhead
       | alone!
       | 
       | It might still be acceptable to reduce the packet rate to 25 per
       | second, which would cut this in half for an overhead of 8 kbps,
       | but the overhead would still be dominating the total transmission
       | rate.
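        | 
        | (As a rough sketch of that arithmetic, in Python; the 40-byte
        | figure assumes plain RTP over UDP over IPv4, no SRTP tag:)
        | 
        |   # Per-packet header overhead for RTP + UDP + IPv4.
        |   HEADER_BYTES = 12 + 8 + 20   # = 40 bytes per packet
        | 
        |   def overhead_kbps(packets_per_second: int) -> float:
        |       return HEADER_BYTES * 8 * packets_per_second / 1000
        | 
        |   print(overhead_kbps(50))  # 16.0 kbps at 20 ms per packet
        |   print(overhead_kbps(25))  # 8.0 kbps at 40 ms per packet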
       | 
       | Where codecs like this can really shine, though, is circuit-
       | switched communication (some satphones use bitrates of around 2
       | kbps, which currently sound awful!), or protocol-aware VoIP
       | systems that can employ header compression such as that used by
       | LTE and 5G in IMS (most of the 40 bytes per frame are extremely
       | predictable).
        
         | tgtweak wrote:
          | I think this is likely incorrect based on how much voice/audio
          | distribution meta does today with facebook (and facebook live),
          | instagram and whatsapp - more so with whatsapp voice messages
          | and calling, given its considerable market share in countries
          | with intermittent and low-reliability network connectivity. The
          | fact it is more packet-loss robust and jitter-robust means that
          | you can rely on protocols that have less error correction,
          | segmenting and receive-reply overhead as well.
         | 
         | I don't think it's unreasonable to assume this could reduce
         | their total audio-sourced bandwidth consumption by a
         | considerable amount while maintaining/improving reliability and
         | perceived "quality".
         | 
          | Looking at a wireshark review of whatsapp on an active call,
          | there were around 380 UDP packets sent from source to
          | recipient during a 1 minute call, and a handful of TCP packets
          | to whatsapp's servers. That would yield a transmission
          | overhead of about 2.2kbps.
         | 
         | quick edit to clarify why this is: you can see starting ptime
         | (audio size per packet) set to 20ms here, but maxptime set to
         | 150ms, which the clients can/will use opportunistically to
         | reduce the number of packets being sent taking into
         | consideration the latency between parties and bandwidth
         | available.
         | 
         | (image): https://www.twilio.com/content/dam/twilio-
         | com/global/en/blog...
        
           | lxgr wrote:
           | What part of that calculation is incorrect in your view?
           | 
           | > 380 UDP packets sent from source to recipient during a 1
           | minute call, and a handful of TCP packets to whatsapp's
           | servers. That would yield a transmission overhead of about
           | 2.2kbps.
           | 
           | That sounds like way too many packets! 380 packets per
           | second, at 40 bytes of overhead per packet, would be almost
           | 120 kbps.
           | 
            | My calculation assumes just 50, and that's already a quite
            | high packet rate.
           | 
           | > you can rely on protocols that have less error correction
           | 
           | You could, but there's no way to get a regular smartphone IP
           | stack running over Wi-Fi or mobile data to actually expose
           | that capability to you. Even just getting the OS's UDP stack
           | (to say nothing of middleboxes) to ignore UDP checksums and
           | let you use those extra four bytes for data can be tricky.
           | 
           | Non-IP protocols, or even just IP or UDP header compression,
           | are completely out of reach for an OTT application. (Networks
           | might transparently do it; I'm pretty sure they'd still
           | charge based on the gross data rate though, and as soon as
           | the traffic leaves their core network, it'll be back to
           | regular RTP over UDP over IP).
           | 
           | What they could do (and I suspect they might already be
           | doing) is to compress RTP headers (or use something other
           | than RTP) and/or pick even lower packet rates.
           | 
           | > I don't think it's unreasonable to assume this could reduce
           | their total audio-sourced bandwidth consumption by a
           | considerable amount while maintaining/improving reliability
           | and perceived "quality".
           | 
           | I definitely don't agree on the latter assertion - packet
           | loss resilience is a huge deal for perceived quality! I'm
           | just a bit more pessimistic on the former, unless they do the
           | other optimizations mentioned above.
        
             | roman-holovin wrote:
             | I read it as in 380 packets per whole call, which was a
             | minute long, not 380 packets per second during 1 minute.
        
               | mikepavone wrote:
               | That's about 160 ms of audio per packet. That's a lot of
               | latency to add before you even hit the network
        
               | ant6n wrote:
               | Assuming continuous sound. You don't need to send many
               | packets for silence.
        
               | lxgr wrote:
               | Voice activity detection and comfort noise have been
               | available in VoIP since the very beginning, but now I
               | wonder if there's some clever optimization that could be
               | done based on a semantic understanding of conversational
               | patterns:
               | 
               | During longer monologues, decrease packet rates; for
               | interruptions, send a few early samples of the
               | interrupter to notify the speaker, and at the same time
               | make the (former) speaker's stack flush its cache to
               | allow "acknowledgement" of the interruption through
               | silence.
               | 
               | In other words, modulate the packet rate in proportion to
               | the instantaneous interactivity of a dialogue, which
               | allows spending the "overhead budget" where it matters
               | most.
        
             | markus92 wrote:
             | I think you're misreading OP, as he says 380 packets per
             | minute, not second. That would give you an overhead of 253
             | bytes per second, sounds a lot more reasonable.
        
               | lxgr wrote:
               | Wow, that would be an extremely low packet rate indeed!
               | 
               | That would definitely increase the utility of low bitrate
               | codecs by a lot, at the expense of some latency (which is
               | probably ok, if the alternative is not having the call at
               | all).
        
               | tgtweak wrote:
                | Yes, 380/min = ~6/s, which is a very open ptime of
                | >100ms. This can also be dynamic and change on the fly.
                | It ultimately comes down to how big the packet can be
                | before it gets split, which is a function of MTU.
               | 
               | If you have 50ms of latency between parties, and you are
               | sending 150ms segments, you'll have a perceived latency
               | of ~200ms which is tolerable for voice conversations.
               | 
               | One other note is that this is ONLY for live voice
               | communication like calling where two parties need to hear
                | and respond within a reasonable delay - for downloading of
               | audio messages or audio on videos, including one-way
               | livestreams for example, this ptime is irrelevant and
               | you're not encapsulating with SRTP - that is just for
               | voip-like live audio.
               | 
                | There is a reality in what OP posted, which is that there
                | are diminishing returns in actual gains as you get lower
                | in the bitrate, but modern voice implementations in apps
               | like whatsapp are using dynamic ptime and are very smart
               | about adapting the voice stream to account for latency,
               | packet loss and bandwidth.
        
             | newobj wrote:
             | pretty sure they said 380 packets total in the 1 minute
             | call (~6-7/s)
        
         | vel0city wrote:
         | Another interesting use case for these kinds of ultra-low
          | bitrate voice compression systems is digital radio systems.
         | AMBE+2 and similar common voice codecs used on radio systems
         | sound pretty miserable and don't handle dropped packets nearly
         | as gracefully as compared to these newer codecs.
        
         | toast0 wrote:
         | Latency is the mind killer, but if available bandwidth is low,
         | you save a ton of overhead by bundling 2-5 of your 20ms
         | samples. Enough that the codec savings start to make sense,
          | even though 100ms packets add a ton of latency. Fancier
          | systems can adapt codecs and samples per packet based on
          | current conditions. The one I work on uses a static codec and
          | 60 ms of audio per packet, which isn't ideal, but allows us to
          | run in low bandwidth much better than 20 ms per packet.
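          | 
          | (A sketch of how the header share shrinks as you bundle more
          | audio per packet; the 6 kbps payload and 40-byte RTP/UDP/IPv4
          | headers are assumptions:)
          | 
          |   CODEC_KBPS, HEADER_BYTES = 6, 40
          |   for frame_ms in (20, 60, 100):
          |       pps = 1000 / frame_ms
          |       oh_kbps = HEADER_BYTES * 8 * pps / 1000
          |       share = 100 * oh_kbps / (CODEC_KBPS + oh_kbps)
          |       print(frame_ms, "ms:", round(CODEC_KBPS + oh_kbps, 1),
          |             "kbps total,", round(share), "% headers")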
         | 
         | Edit to add: Meta can also afford to add a bit more sampling
         | delay, because they've got very wide distribution of forwarding
         | servers (they can do forwarding in their content appliances
         | embedded in many ISPs), which reduces network delay vs
         | competing services that have limited ability to host forwarding
         | around the globe. Peer to peer doesn't always work and isn't
         | always lower delay than going through a nearby forwarding
         | server.
        
         | lukevp wrote:
         | Why would you need 50 packets per second vs 10? Is 100ms not
         | acceptable but 20ms is?
        
           | tgtweak wrote:
            | The default configuration for SIP used to be 20ms; the
            | rationale behind it was that most SIP was done on LANs and
            | inter-campus WANs, which had generally high bitrate
            | connectivity and low latency. The lower the packet time
            | window, the sooner the recipient could "hear" your voice,
           | and if there were to be packet loss, there would be less of
           | an impact if that packet were dropped - you'd only lose 20ms
           | of audio vs 100ms. The same applies for high bitrate but high
           | latency (3g for example) connectivity - you want to take
           | advantage of the bandwidth to mitigate some of the network
           | level latency that would impact the audio delay - being
           | "wasteful" to ensure lower latency and higher packet loss
           | tolerance.
           | 
            | Pointedly - if you had 75ms of one-way latency (150ms RTT)
            | between two parties, and you used a 150ms audio segment
            | length (ptime), you'd be getting close to the 250ms generally
            | accepted max audio delay for smooth two-way communication:
            | the recipient is hearing your first millisecond of audio
            | 226ms later at best. If any packet does get lost, the
           | recipient would lose 150ms of your message vs 20ms.
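            | 
            | (The same trade-off as a toy calculation; jitter buffer and
            | codec delay are ignored here:)
            | 
            |   # Rough mouth-to-ear delay: ptime + one-way latency.
            |   def delay_ms(ptime_ms: int, one_way_ms: int) -> int:
            |       return ptime_ms + one_way_ms
            | 
            |   print(delay_ms(20, 75))   # 95 ms  - comfortable
            |   print(delay_ms(150, 75))  # 225 ms - near the ~250 ms limit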
           | 
           | Modern voice apps and voip use dynamic ptime (usually via
           | "maxptime" which specifies the highest/worst case) in their
           | protocol for this reason - it allows the clients to optimize
           | for all combinations of high/low bandwidth, high/low latency
           | and high/low packet loss in realtime - as network conditions
           | can often change during the course of a call especially while
           | driving around or roaming between wifi and cellular.
        
             | lxgr wrote:
             | > the rationale behind it was actually sourced in the fact
             | that most SIP was done on LANs and inter-campus WAN which
             | had generally high bitrate connectivity and low latency
             | 
             | In addition to that, early VoIP applications mostly used
             | uncompressed G.711 audio, both for interoperability with
             | circuit switched networks and because efficient voice
             | compression codecs weren't yet available royalty-free.
             | 
             | G.711 is 64 kbps, so 12 kbps of overhead are less than 25%
             | - not much point in cutting that down to, say, 10% at the
             | expense of doubling effective latency in a LAN use case.
        
           | crazygringo wrote:
           | > _Is 100ms not acceptable but 20ms is?_
           | 
           | Yup pretty much. Doubling it for round-trip, 200 ms is a
           | fifth of a second which is definitely noticeable in
           | conversation.
           | 
           | 40 ms is a twenty-fifth of a second, or approximately a
           | single frame of a motion picture. That's not going to be
            | noticeable in conversation at all.
           | 
           | Of course both of these are on top of other sources of
           | latencies, too.
        
           | NavinF wrote:
           | Yes 100ms feels horrible. People constantly interrupting each
           | other because they start talking at around the same time and
           | then both say "you go first". Discord has decent latency and
           | IMO it's a major reason behind their success
        
         | yalok wrote:
          | this codec is for RTC comms - it supports 20ms frames. They
         | did mention it's launched in their calling products:
         | 
         | "We have already fully launched MLow to all Instagram and
         | Messenger calls and are actively rolling it out on WhatsApp--
         | and we've already seen incredible improvement in user
         | engagement driven by better audio quality."
        
         | saurik wrote:
         | I don't know of any setups which would support muxing in
         | exactly the way I am thinking, but another interesting use case
         | is if you have multiple incoming audio streams which you don't
         | want to be mixed by the server -- potentially because they are
         | end-to-end encrypted -- and so a single packet can contain the
         | data from multiple streams. Doing end-to-end encrypted audio
         | calls is finally becoming pretty widespread, and I could see
         | Facebook being in a good position for their products to do
         | custom muxing.
        
         | sjacob wrote:
         | What skills/concepts would one need to acquire in order to
         | fully grasp all that you've detailed here?
        
         | dan-robertson wrote:
         | I'm not totally certain about your argument for the specific
         | amount of overhead (if the receiver/sender are on mobile
         | networks, maybe something happens to the packet headers for the
         | first/last legs before the real internet). But doesn't the OP
         | already give an example where the low bit-rate codec is good:
         | if you can compress things more then you have more of an
         | opportunity to add forward error correction, which greatly
         | improves the quality of calls on lossy connections. I wonder if
         | smaller packets are less likely to be lost, and if there are
         | important cases where multiple streams may be sent like group-
         | calls.
        
       | dgmdoug wrote:
       | They also don't do a comparison with Pied Piper.
        
         | mig39 wrote:
         | It might have a Weissman score in the fives, but I haven't seen
         | a real-world implementation. Does it use middle-out
         | compression?
        
           | byteknight wrote:
           | MIDDLE OUT!
        
       | barbazoo wrote:
       | That's great, now they can reach even more developing countries
       | and do damage the way they did for example in Myanmar [1].
       | 
       | [1] https://www.amnesty.org/en/latest/news/2022/09/myanmar-
       | faceb...
        
       | thrtythreeforty wrote:
       | Are they releasing this or is this just engineering braggadocio?
       | I can't find any other references to MLow other than this blog
       | post.
       | 
       | Facebook/Meta AI Research does cool stuff, and releases a
       | substantial portion of it (I dislike Facebook but I can admit
       | they are highly innovative in the AI space).
        
         | sllabres wrote:
          | If you think about 'implementing the algorithm in a product'
         | it seems so: (From the article) "We are really excited about
         | what we have accomplished in just the last two years--from
         | developing a new codec to successfully shipping it to billions
         | of users around the globe"
        
       | animanoir wrote:
       | Nice technology, tho Opus adds that warm sound I love...
        
       | aidenn0 wrote:
       | Only slightly OT:
       | 
        | ELI5: Why is a typical phone call today less intelligible than
        | an 8kHz 8-bit u-law call with ADPCM from the '90s was?
       | 
       | [edit]
       | 
       | s/sound worse/less intelligible/
        
         | skygazer wrote:
         | Does decrease in intelligibility correlate with the instance
         | count of concert seats in front of the loud speakers back in
         | the oughts?
        
         | toast0 wrote:
         | Depends on your call; u-law has poor frequency response and
         | reasonable dynamic range. Not great for music, but ok enough
         | for voice, and it's very consistent. 90s calls were almost all
         | circuit switched in the last mile, and multiplexed per sample
         | on digital lines (T1 and up). This means very low latency and
         | zero jitter; there would be a measurable but actually
         | imperceptible delay versus an end to end analog circuit
         | switched call; but digital sampling near the ends means there
         | would be a lot less noise. Circuit switching also means you'd
          | never get dropped samples --- the connection is made or it's
         | not, although sometimes only one-way.
         | 
         | Modern calls are typically using 20 ms samples, over packet
         | switched networks, so you're adding sampling delay, and jitter
         | and jitter buffers. The codecs themselves have encode/decode
          | delay, because they're doing more than an ADC/DAC with a
         | logarithm. Most of the codecs are using significantly fewer
         | bits for the samples than u-law, and that's not for free
         | either.
         | 
         | HD Voice (g.722.2 AMR-Wide Band) has a much larger frequency
         | pass band, and sounds much better than GSM or OPUS or most of
         | these other low bandwidth codecs. There's still delay though;
         | even if people will tell you 20-100ms delay is imperceptible,
         | give someone an a/b call with 0 and 20 ms delay and they'll
         | tell you the 0 ms delay call is better.
        
           | rylittle wrote:
            | could you explain a little more, in a more ELI5 way, please?
        
           | Dylan16807 wrote:
           | > HD Voice (g.722.2 AMR-Wide Band) has a much larger
           | frequency pass band, and sounds much better than GSM or OPUS
           | or most of these other low bandwidth codecs.
           | 
           | At what bitrate, for the comparison to Opus?
           | 
           | And is this Opus using LACE/NoLACE as introduced in version
           | 1.5?
           | 
           | ...and is Meta using it in their comparison? It makes a huge
           | difference.
        
             | toast0 wrote:
             | Yeah, I probably shouldn't have included Opus; I'm past the
             | edit window or I'd remove it with a note. I haven't done
             | enough comparison with Opus to really declare that part,
             | and I don't think the circumstances were even. But I'm
             | guessing the good HD Voice calls are at full bandwidth of ~
             | 24 kbps, and I'm comparing with a product that was said to
             | be using opus at 20 kbps. Opus at 32kbps sounds pretty
             | reasonable. And carrier supported HD voice probably has
             | prioritization and other things going on that mean less
             | loss and probably less jitter. Really the big issue my ear
             | has with Opus is when there's loss.
             | 
             | I don't think I've been on calls with Opus 1.5 with
             | lace/no-lace, released 3 months ago, so no, I haven't
             | compared it with HD voice that my carrier deployed a decade
             | ago. Seems a reasonable thing for Meta to test with, but it
             | might be too new to be included in their comparison as
             | well.
        
               | Dylan16807 wrote:
               | > Really the big issue my ear has with Opus is when
               | there's loss.
               | 
               | That would definitely complicate things. Going by the
               | test results that got cited on Wikipedia, Opus has an
               | advantage at 20-24, but that's easy enough to overwhelm.
               | 
               | And the Opus encoder got some other major improvements up
               | through 2018, so I'd be interested in updated charts.
               | 
               | Oh and 1.5 also adds a better packet loss mechanism.
        
         | sva_ wrote:
         | Hearing ability deteriorates with age.
        
           | aidenn0 wrote:
           | Yes, but it doesn't deteriorate in such a way as to cause
           | someone speaking to sound like gibberish and/or random
           | medium-frequency tones, which happens in nearly every single
           | cell phone conversation I have that lasts more than 5
           | minutes.
           | 
           | My experience is that phone calls nowadays alternate between
           | a much wider-band (and thus often better sounding) experience
           | and "WTF was that just now?"
        
         | hot_gril wrote:
         | Phone ear speakers are quieter than they used to be, so if the
         | other person isn't talking clearly into the mic, you can't
         | crank it up. I switched from a flip phone to an iPhone in 2013,
         | huge difference. I had to immediately switch to using earbuds
         | or speakerphone. Was in my teens at the time.
        
       | iamnotsure wrote:
       | Please stop lossy compression.
        
         | cheema33 wrote:
         | Lossy compression has its practical uses. Under ideal
         | circumstances nobody is going to stop you from using FLAC.
        
         | GrantMoyer wrote:
         | Have you ever looked at the size of losslessly compressed
         | video? It's huge. Lossy compression is the only practical way
         | to store and stream video, since it's typically less than 1% of
         | the size of uncompressed video. Lossless compression typically
         | only gets down to about 50% of the size. It's amazing how much
         | information you can throw out from a video, and barely be able
         | to tell the difference.
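          | 
          | (A quick sanity check on those ratios, assuming 1080p at 30
          | fps and 24-bit color:)
          | 
          |   raw_bps = 1920 * 1080 * 24 * 30   # uncompressed 1080p30
          |   print(raw_bps / 1e6)              # ~1493 Mbit/s raw
          |   print(raw_bps * 0.50 / 1e6)       # ~747 Mbit/s lossless
          |   print(raw_bps * 0.01 / 1e6)       # ~15 Mbit/s lossy (~1%)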
        
       | amelius wrote:
       | Can't we have an audio codec that first sends a model of the
       | particular voice, and then starts to send bits corresponding to
       | the actual speech?
        
         | roywiggins wrote:
         | You need a bunch of bandwidth upfront for that, which you might
         | not have, and enough compute available at the other end to
         | reconstruct it, which you really might not have.
        
           | amelius wrote:
           | Regarding your first point, how about reserving a small
           | percentage of the bandwidth for a model that improves
           | incrementally?
        
             | wildzzz wrote:
             | You're adding more complexity to both the transmitter and
             | receiver. I'd be pretty pissed if I had to endure
             | unintelligible speech for a few minutes until the model was
             | downloaded enough to be able to hear my friend. I'd also be
             | a little pissed if I had to store massive models for
             | everyone in my call log. Also both devices need to be able
             | to run this model. If you are regularly talking over a shit
             | signal, you're probably not going to be able to afford the
             | latest flagship phone that has the hardware necessary to
             | run it (which is exactly what the article touches on). The
             | ideal codec takes up almost no bandwidth, sounds like
             | you're sitting next to the caller, and runs on the average
             | budget smartphone/PC. The issue is that you aren't going to
             | be able to get one of these things so you choose a codec
             | that best balances complexity, quality, and bandwidth given
             | the situation. Link quality improves? Improve your voice
             | quality by increasing the codec bitrate or switching to
             | another less complex one to save battery. If both devices
             | are capable of running a ML codec, then use that to improve
             | quality and fit within the given bandwidth.
        
         | neckro23 wrote:
         | This is actually an old idea, minus the AI angle (1930s). It's
         | what voders and vocoders were originally designed for, before
         | Kraftwerk et al. found out you can use them to make cool robot
         | voices.
        
       | hubraumhugo wrote:
       | Is it just my perception or has Meta become cool again by sharing
       | a ton of research and open source (or open weights) work?
       | 
       | Facebook's reputation was at the bottom, but now it seems like
       | they made up for it.
        
         | mrguyorama wrote:
         | How the hell does releasing one audio codec undo years and
         | years of privacy nightmare, being a willing bystander in an
         | actual genocide, experimenting with the emotions of depressed
         | kids, and collusion to depress wages?
        
           | pt_PT_guy wrote:
           | they also did release LLM models, and zstd, and mold, and and
           | and... a lot of stuff
        
             | visarga wrote:
             | React and Pytorch
             | 
             | compare that to Angular and TensorFlow, such a difference
             | in culture
        
               | hot_gril wrote:
               | Easy vs tons-o-boilerplate.
        
             | cheema33 wrote:
             | Don't forget React. The most popular frontend stack at the
             | moment. Been that way for some time.
             | 
             | And GraphQL, Relay, Stylex...
        
               | XlA5vEKsMISoIln wrote:
               | >React
               | 
               | Ah yes, the <body id="app"></body> websites.
        
               | cztomsik wrote:
               | How is that specific to React? And who would use webapp
               | technology for a website?
        
             | ComputerGuru wrote:
             | zstd and mold are personal projects regardless of employer.
             | That said, I didn't know mold was written by a meta guy.
        
               | lucb1e wrote:
               | Zstd is a personal project? Surely it's not by accident
               | in the Facebook GitHub organization? And that you need to
               | sign a contract on code.facebook.com before they'll
               | consider merging any contributions? That seems like an
               | odd claim, unless it _used to be_ a personal project and
               | Facebook took it over
               | 
                | (https://github.com/facebook/zstd/blob/dev/CONTRIBUTING.md#co...)
        
           | risho wrote:
            | it isn't just one audio codec. They also released and
            | continue to release the best self-hostable large language
            | model weights, and they have authored many open source
            | projects that are staples today, such as zstandard, react,
            | and pytorch.
        
           | stuxnet79 wrote:
           | You will need to provide citations on the last point as
           | Facebook are widely known to have broken the gentleman's
           | agreement between Apple and Google that was suppressing tech
           | pay in the early 2010s.
        
             | giraffe_lady wrote:
             | OK sure even if they didn't do that we're still left with
             | "knowingly abetted a genocide" which no amount of open
             | source work can ever balance out.
        
               | rylittle wrote:
               | context?
        
               | _whiteCaps_ wrote:
               | https://www.amnesty.org/en/latest/news/2022/09/myanmar-
               | faceb...
        
               | robertlagrant wrote:
               | This article seems to not really mention the "knowingly"
               | or the "abetted". If there are people killing other
               | people, I wouldn't say that a communication method was to
               | blame. In Scream, Sidney didn't sue the phone company who
               | let the killer call her from inside the house. The idea
               | that some news feed posts whipped people up into a
               | killing frenzy just sounds absurd.
               | 
               | I wish the author could see that, and if the case is
               | valid, to provide it, instead of some pretty tenuous
               | claims of connection strung together to lead up to a
               | demand for money.
               | 
               | I did try to go to the link that evidenced the "multiple"
               | times Facebook was contacted in a 5 year period, but I
               | couldn't get through. How many times was it, for anyone
               | who can?
        
               | giraffe_lady wrote:
               | The full report is linked to in the first paragraph.
               | These points are all addressed in detail there.
        
           | yard2010 wrote:
           | You forgot selling the 2016 US elections to Putin for 100k[0]
           | 
           | Good luck undoing that releasing codecs haha
           | 
           | [0] https://time.com/4930532/facebook-russian-
           | accounts-2016-elec...
        
         | nine_k wrote:
         | I have the same impression.
         | 
          | Facebook the social network's reputation may not be shiny, but
          | Meta the engineering company's reputation is pretty high, to
          | my mind.
          | 
          | It's somewhat similar to IBM, who may not look stellar as a
          | hardware or software solutions provider, but still has quite
         | cool research and microelectronics branches.
        
         | danuker wrote:
         | I don't think they made up for it. They are training AIs off of
          | personal data. The open stuff is a desperate red herring.
         | 
         | https://www.theregister.com/2024/06/10/meta_ai_training/
        
           | jhallenworld wrote:
           | So this is one argument- another is that I'm impressed that
           | they got their own LLM running and integrated into facebook
           | messenger.
           | 
           | I ran across this interesting graphic recently:
           | 
           | https://www.theverge.com/2023/12/4/23987953/the-gpu-haves-
           | an...
           | 
           | Suddenly facebook is useful as a search engine..
           | 
           | Also interesting is that Meta AI is much faster than ChatGPT
           | from end-user's point of view, but results not quite as good.
           | Here is a comparison:
           | 
           | https://www.youtube.com/watch?v=1vLvLN5wxS0
        
           | jen729w wrote:
           | > a desperate red herring.
           | 
           |  _Or_ you can recognise that 'Meta' isn't a conscious entity,
           | and that it's perfectly likely that there are some people
           | _over there_ doing amazing open-source work, and different
           | people _over there_ making ethically dubious decisions when
           | building their LLMs.
        
         | RIMR wrote:
         | I would say that I have a very favorable opinion of Meta in
         | terms of how they share their research and open source their
         | software.
         | 
         | I would say that I have a very unfavorable opinion of Meta in
         | terms of their commitment to privacy, security, and social
         | responsibility.
        
         | lucb1e wrote:
         | research department != product department
         | 
         | Microsoft Research also puts out some really cool stuff, but
         | that does not mean the "same" Microsoft can't show ads in their
         | OS' start menu for people's constant enjoyment. I noticed this
         | interesting discrepancy in Microsoft some years ago as a
         | teenager; it does not surprise me at all that Facebook has a
         | division doing cool things (zstandard, etc.) and a completely
         | separate set of people working towards completely different
         | goals simultaneously. Probably most companies larger than a
         | couple hundred people have such departmental discrepancies.
        
         | yard2010 wrote:
         | Don't fall for it. They will find a way to let some powerful
         | actors exploit the users for dimes.
        
         | chefandy wrote:
         | The marketing is def working. I'm sure we'd be pretty depressed
         | by some of the projects that didn't make the blog cut.
        
       | theoperagoer wrote:
       | Was hoping this would have a GitHub link ...
        
       | mcoliver wrote:
       | Maybe SiriusXM can pick this up. The audio quality is generally
       | awful but especially on news/talk channels like Bloomberg and
       | CNBC. There is no depth or richness to the voices.
        
         | tgtweak wrote:
         | It actually comes down to the SiriusXM receiver being used -
         | I've heard the built-in SiriusXM on the latest GM platform (a
         | $100,000+ Cadillac) sound like AM radio, then sat in an older
         | Lexus a few minutes later and gotten a better-than-Apple-
         | streaming quality rendition of the exact same channel...
         | 
         | The portable XM receivers (iPod-like devices) that they used
         | to sell also had very good quality; I never noticed any
         | shortcomings even with good headphones.
         | 
         | I think the "high" quality stream is 256kbps/16k, which is
         | fairly high compared to most streaming services that come in
         | at around 128/160kbps.
        
           | sitkack wrote:
           | I am archiving some music at 40kbps using Opus and the
           | quality is pretty amazing. I think once things get over
           | 20kbps or so, all the codecs start sounding pretty good
           | (relative to these low bitrates).
           | 
           | I still prefer flac if possible.
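           | 
           | For anyone wanting to reproduce that kind of archive, a
           | minimal batch-conversion sketch (assuming an ffmpeg build
           | with libopus; the file and folder names are made up):
           | 
           |   import pathlib
           |   import subprocess
           | 
           |   # Re-encode every FLAC in ./music to 40 kbps Opus.
           |   for flac in pathlib.Path("music").glob("*.flac"):
           |       out = flac.with_suffix(".opus")
           |       subprocess.run(
           |           ["ffmpeg", "-y", "-i", str(flac),
           |            "-c:a", "libopus", "-b:a", "40k", str(out)],
           |           check=True)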
        
             | theoperagoer wrote:
             | Opus is fantastic!
        
           | wildzzz wrote:
           | My old Sirius portable receiver sounded like garbage despite
           | the marketing material saying "Crystal Clear". My 2006
           | Infiniti Sirius receiver didn't sound any better despite
           | being a massive space heater in the trunk. The later cars
           | I've used it in sound good, at least good enough to sound as
           | clear as FM radio or even HD Radio. I think some of the
           | channels, like the news channels, are still at lower
           | bitrates; they've always sounded bad. I've also read
           | something about SiriusXM using terrestrial transmitters,
           | which may improve the signal, whereas the satellite link may
           | have lower bandwidth.
        
       | victorp13 wrote:
       | Does anyone happen to know if ChatGPT's voice feature uses audio
       | compression similar to Opus? Especially the "heavy 30 percent
       | receiver-side packet loss" example sounds a LOT like the
       | experience I have sometimes.
        
       | dsign wrote:
       | Can I use this to make music?
       | 
       | A little bit on a tangent, a technique called Linear Predictive
       | Coding, which was developed by telecoms in the sixties and
       | seventies, has a calculated bandwidth of 2.5 kbit/s. The sound
       | quality is not any good, and telephone companies of the time
       | didn't use it for calls, but the paper I read describing the
       | technique says the decoded speech is "understandable". LPC found
       | its way into musical production, in a set of instruments called
       | "vocoders" used to distort a singer's voce. There are, for
       | example, variations of it in something called "Orange Vocoder
       | IV".
       | 
       | So, now I'm wondering, can MLow be used to purposefully introduce
       | interesting distortions in speech? Or even change a singer's
       | voice color?
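       | 
       | MLow itself isn't available to play with, but to show how far
       | even crude LPC-style resynthesis goes as an effect, here is a
       | toy "robot voice" sketch (assuming NumPy/SciPy and a 16-bit
       | mono 16 kHz WAV; every name and constant here is made up). It
       | swaps each frame's excitation for a fixed-pitch impulse train,
       | which is roughly what the classic vocoder-style effects do:
       | 
       |   import numpy as np
       |   from scipy.io import wavfile
       |   from scipy.linalg import solve_toeplitz
       |   from scipy.signal import lfilter
       | 
       |   ORDER, FRAME, PITCH_HZ = 12, 320, 110   # 20 ms at 16 kHz
       | 
       |   rate, x = wavfile.read("voice.wav")     # mono 16 kHz speech
       |   x = x.astype(np.float64) / 32768.0
       |   out = np.zeros_like(x)
       |   period = int(rate / PITCH_HZ)
       | 
       |   for s in range(0, len(x) - FRAME, FRAME):
       |       frame = x[s:s + FRAME] * np.hamming(FRAME)
       |       # Autocorrelation at lags 0..ORDER.
       |       r = np.correlate(frame, frame, "full")
       |       r = r[FRAME - 1:FRAME + ORDER]
       |       if r[0] < 1e-9:
       |           continue
       |       # Predictor coefficients from the Toeplitz
       |       # normal equations (autocorrelation method).
       |       a = solve_toeplitz(r[:ORDER], r[1:])
       |       gain = np.sqrt(max(r[0] - a @ r[1:], 1e-12))
       |       # Drive the all-pole filter 1/A(z) with a
       |       # fixed-pitch impulse train -> "robot" voice.
       |       pulses = np.zeros(FRAME)
       |       pulses[::period] = 1.0
       |       out[s:s + FRAME] = lfilter(
       |           [gain], np.r_[1.0, -a], pulses)
       | 
       |   peak = max(np.max(np.abs(out)), 1e-9)
       |   wavfile.write("robot.wav", rate,
       |                 (out / peak * 32767).astype(np.int16))
       | 
       | It won't sound like MLow, but it produces the same family of
       | interesting distortions you're describing.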
        
         | WalterSear wrote:
         | Just use Digitalis :)
         | 
         | https://www.youtube.com/watch?v=bA23ysR2hAo
        
       | dbcurtis wrote:
       | What is the license? I searched but could not find anything.
        
         | varenc wrote:
         | It appears to be entirely closed source at the moment. This is
         | just the announcement of development of in-house tech. Nothing
         | public yet that could even be licensed.
        
       | annoyingnoob wrote:
       | I wonder how it sounds compared to G.729.
       | 
       | I worked for a company 20 years ago that had a modified G.729
       | codec that could go below 8kbps but sounded decent. We used this
       | for VoIP over dial-up Internet, talk about low bandwidth.
       | 
       | Turns out some of the more interesting bits were in the jitter
       | buffer and the ways of managing it. Scrappy connections deliver
       | packets when they can, and there is an art to managing the
       | difference between the network experience and the user
       | experience. For communications, you really need to manage the
       | user experience.
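       | 
       | As a rough illustration of that network-vs-user-experience gap,
       | here is a minimal adaptive jitter buffer sketch (not taken from
       | any real stack; the names are invented, and the jitter estimate
       | follows RFC 3550-style smoothing):
       | 
       |   import heapq
       | 
       |   # Toy playout buffer: its depth adapts to observed jitter.
       |   class JitterBuffer:
       |       def __init__(self, frame_ms=20, safety=3.0):
       |           self.frame_ms = frame_ms
       |           self.safety = safety      # slack, in "jitters"
       |           self.jitter = 0.0         # smoothed |transit delta|
       |           self.last_transit = None
       |           self.heap = []            # (seq, payload)
       | 
       |       def push(self, seq, payload, send_ms, recv_ms):
       |           transit = recv_ms - send_ms
       |           if self.last_transit is not None:
       |               d = abs(transit - self.last_transit)
       |               self.jitter += (d - self.jitter) / 16.0
       |           self.last_transit = transit
       |           heapq.heappush(self.heap, (seq, payload))
       | 
       |       def target_depth(self):
       |           # Hold enough frames to ride out expected jitter.
       |           ms = self.safety * self.jitter
       |           return max(1, int(ms / self.frame_ms) + 1)
       | 
       |       def pop(self):
       |           # Called once per frame by the playout clock.
       |           if len(self.heap) < self.target_depth():
       |               return None   # underrun: play concealment
       |           return heapq.heappop(self.heap)[1]
       | 
       | Real implementations also conceal losses, time-stretch audio to
       | shrink the buffer back down, and cap the added latency, which
       | is where most of that art lives.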
        
       | eelioss wrote:
       | I am curious about encoding times compared with standard codecs
       | using ffmpeg.
        
         | lucb1e wrote:
         | They claim it's 10% lower than Opus, specifically for the
         | decoder IIRC, but since they speak of the 10-year-old hardware
         | used to make millions of WhatsApp calls daily, the encoder
         | can't be computationally complex either.
         | 
         | But, yeah, some actual data (if they're not willing to provide
         | running code) would have been a welcome addition to this PR
         | overview.
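         | 
         | Since there's no code to benchmark, the closest substitute is
         | timing the codecs ffmpeg already ships at comparable bitrates.
         | A rough wall-clock sketch (the input file is made up, and the
         | libvo_amrwbenc encoder is often not compiled in):
         | 
         |   import subprocess
         |   import time
         | 
         |   # Encode-only timings; "-f null -" discards the output.
         |   CODECS = {
         |       "opus-6k":   ["-c:a", "libopus", "-b:a", "6k"],
         |       "opus-16k":  ["-c:a", "libopus", "-b:a", "16k"],
         |       "amrwb-6k6": ["-c:a", "libvo_amrwbenc",
         |                     "-b:a", "6600"],
         |   }
         | 
         |   for name, args in CODECS.items():
         |       t0 = time.perf_counter()
         |       subprocess.run(
         |           ["ffmpeg", "-y", "-i", "speech.wav",
         |            "-ar", "16000", "-ac", "1", *args,
         |            "-f", "null", "-"],
         |           check=True, capture_output=True)
         |       print(f"{name}: {time.perf_counter() - t0:.2f} s")
         | 
         | These numbers say nothing about MLow itself, and they include
         | ffmpeg's own startup and decode overhead, but they at least
         | put Opus and AMR-WB on the same scale.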
        
       | therealmarv wrote:
       | Does anybody know if this is better than whatever Google Meet is
       | using? With choppy, near-unusable, slow Internet, Google Meet
       | still fulfils its purpose for audio calls where all other
       | competitors fail (tested e.g. while on a remote island in the
       | Philippines with very bad internet). However, Google Meet's tech
       | is not published anywhere AFAIK.
        
         | lucb1e wrote:
         | We can hardly try that out when this PR piece does not contain
         | any code. We can judge it only as well as you can from the
         | couple of examples they showed off.
        
       | mckirk wrote:
       | Maybe it's just me (or maybe I've invested too much money into
       | headphones), but I actually liked the Opus sound better at 6
       | kbps. The MLow samples had these... harsh and unnatural
       | artifacts, whereas the Opus sound (though it sounded like it
       | came from a tin-can-and-string telephone and lacked all top
       | end) at least was 'smooth'. But I'm pretty sure that's because
       | they are demonstrating here the very edge of what their codec
       | can do; at higher bitrates the choice would probably be a lot
       | clearer.
        
       ___________________________________________________________________
       (page generated 2024-06-13 23:00 UTC)