[HN Gopher] MLow: Meta's low bitrate audio codec
___________________________________________________________________
MLow: Meta's low bitrate audio codec
Author : mikece
Score : 333 points
Date : 2024-06-13 15:05 UTC (7 hours ago)
(HTM) web link (engineering.fb.com)
(TXT) w3m dump (engineering.fb.com)
| PaulHoule wrote:
| Sometimes it sounds great but there are moments I think I'm
| listening to a harp and not somebody's voice.
| plus wrote:
| It's not exactly reasonable to expect super high fidelity audio
| at the bitrate constraints they're targeting here, and it
| certainly sounds a lot better than the Opus examples they're
| comparing against.
| cobbal wrote:
| The more complicated the codec, the more fascinating the
| failure modes. I love watching a digital TV with a bad
| signal, because the motion tracking in the codec causes
| people to wear previous, glitched frames as a skin while they
| move.
| ugjka wrote:
| Look up datamoshing on youtube
| cnity wrote:
| Good observation, and probably part of what makes "glitchy"
| AI generated video so captivating to watch.
| 77pt77 wrote:
| Are they comparing against Opus using NoLACE?
|
| Because that makes all the difference!
| Tostino wrote:
| That is a marked improvement compared to the other examples
| provided. Nice to see that it also requires fewer compute
| resources for that higher quality output.
| zekica wrote:
| Honest question: why do we need to optimize for <10kbps? It's
| really impressive what they are able to achieve at 6kbps, but LTE
| already supports >32kbps and there we have AMR-WB or Opus (Opus
| even has in-band FEC at these bitrates so packet loss is not that
| catastrophic). Maybe it's useful in satellite direct-to-phone
| use-cases?
| ThrowawayTestr wrote:
| > why do we need to optimize for <10kbps?
|
| Because some people have really slow internet
| hokumguru wrote:
| There exist a few billion people without LTE. Meta doesn't only
| operate in the western world.
| noprocrasted wrote:
| Are there really many situations where a 10kbps connection
| would actually be _stable_ enough to be usable? Usually when
| you get these kinds of speeds it means the underlying
| connection is well and truly compromised, and any kind of
| real-time audio would fail anyway because you're drowning in
| a sea of packet loss and retransmissions.
|
| Even in cases where you do get a _stable_ 10kbps connection
| from upstream, how are you going to manage getting any usable
| traffic through it when everything nowadays wastes bandwidth
| and competes with you (just look at any iOS device's
| background network activity - and that's before running any
| apps which usually embed dozens of malicious SDKs all
| competing for bandwidth)?
| bogwog wrote:
| I don't know what you consider "stable enough", but the 30%
| packet loss demo in the article is pretty impressive.
| zeroxfe wrote:
| > Are there really many situations where a 10kbps
| connection would actually be stable enough to be usable?
|
| Yes there are. We ran on stable low bandwidth connections
| for a very long time before we had stable high bandwidth
| connections. A large part of the underdeveloped world has
| very low bandwidth and uses 5-10 kbps voice channels.
| noprocrasted wrote:
| > We ran on stable low bandwidth connections
|
| Are you talking about the general "we" or your situation
| in particular? For the former, yes sure we started with
| dial-up, then DSL, etc, but back then software was built
| with these limitations in mind.
|
| Constant background traffic for "product improvement"
| purposes would be completely unthinkable 20 years ago;
| now it's the norm. All this crap (and associated TLS
| handshakes) quickly adds up if all you've got is kilobits
| per second.
| dspillett wrote:
| _> Are you talking about the general "we"_
|
| I assume the general-ish "we", where it is general to the
| likes of you and I (and that zeroxfe). There are likely
| many in the world stuck at the end of connections run
| over tech that this "general subset" would consider
| archaic, and that zeroxfe was implying their connections,
| while slow, may be similarly stable to ours back then.
|
| Also, a low bandwidth stable connection could be one of
| many multiplexed through a higher bandwidth stable
| connection.
| zeroxfe wrote:
| Let's not move the goalposts here :-) The context is an
| audio codec, not heavyweight web applications, in
| response to your question "Are there really many
| situations where a 10kbps connection would actually be
| stable enough to be usable?" And I'm saying yes, in that
| context, there are many situations, like VoIP, where
| 10kbps is usable.
|
| Nobody here would argue that 10kbps is usable today for
| the "typical" browser-based Internet use.
| meindnoch wrote:
| >Are there really many situations where a 10kbps connection
| would actually be stable enough to be usable?
|
| Scroll to this part of the article:
|
| >Here are two audio samples at 14 kbps with heavy 30
| percent receiver-side packet loss.
| gorkish wrote:
| Yes; backhaul connections in telephony applications are
| often very stable and are already capacity managed by
| tuning codec bandwidth. Say you are carrying 1000 calls
| with uLaw (64kbps * 1000) over a pair of links and one
| fails. Do you A) carry 500 calls on the remaining link B)
| stuff all calls onto the same link and drop 50% of the
| packets or C) Change to a 32kbps codec?
|
| It seems you may be imagining the failure case where your
| "ISP is slow" or something like that due to congestion or
| packet loss -- as I posted elsewhere in the thread the
| bandwidth is only one aspect of how a "low bitrate" codec
| may be expected to perform in a real world application. How
| such a codec degrades when faced with bit errors or even
| further reduced channel capacity is often more important in
| the real application. These issues are normally solved with
| things like FEC which can be incorporated as part of the
| codec design itself or incorporated as part of the
| modem/encoding/modulation of the underlying transport.
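The capacity-management choice in the comment above can be sketched with some quick arithmetic (the call count and codec rates are the ones from the example; the link sizing is an assumption):

```python
# Two equal links, each sized for half of 1000 uLaw (64 kbps) calls.
# One link fails; compare the options from the comment above.
calls = 1000
ulaw_kbps = 64
half_rate_kbps = 32

link_capacity_kbps = calls * ulaw_kbps // 2   # one surviving link

# Option A: keep uLaw, drop calls
calls_kept_ulaw = link_capacity_kbps // ulaw_kbps       # 500 calls

# Option C: switch every call to a 32 kbps codec
calls_kept_half = link_capacity_kbps // half_rate_kbps  # all 1000 calls

print(calls_kept_ulaw, calls_kept_half)
```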
| wmf wrote:
| Facebook Messenger and WhatsApp don't run over TDM
| though. If WhatsApp is only getting ~10 kbps, that's due
| to extreme congestion.
| gorkish wrote:
| Yes; but what is your point? A congested network like you
| describe isn't ever going to reliably carry realtime
| communications anyway due to latency and jitter. All you
| could reasonably do to 'punch through' that situation is
| to use dirty tricks to give your client more than its
| fair share of network resources.
|
| 6kbps is 10x less data to transfer than 64kbps, so for
| all the async aspects of Messenger or WhatsApp there is
| still enormous benefit to smaller data.
| dspillett wrote:
| _> Are there really many situations where a 10kbps
| connection would actually be stable enough to be usable?_
|
| Yes (most likely: that was an intuited "yes" not one born
| of actually checking facts!). There are many places still
| running things over POTS rather than anything like (A)DSL,
| line quality issues could push that down low and even if
| you have a stable 28kbit/s you might want to do something
| with it at the same time as the audio comms.
|
| Also, you may be trying to cram multiple channels over a
| relatively slow (but stable) link. Given the quality of the
| audio when calling some support lines I suspect this is
| very common.
|
| Furthermore, you might find a much faster _unstable_
| connection with a packet-loss "correcting" transport
| layered on top effectively producing a stable connection of
| much lesser speed (though you might get periods of <10kbit
| here due to prolonged dropouts and/or have to institute an
| artificial delay if the resend latency is high).
| treflop wrote:
| Even in the Western world, you can appreciate low bandwidth
| apps when you are at a music festival or traveling through
| relative wilderness.
| gorkish wrote:
| It's useful.
|
| AMBE currently has a stranglehold in this area and by any and
| every measurable metric, AMBE is terrible and should be burned
| in the deepest fires of hell and obliterated from all of
| history.
| londons_explore wrote:
| Internet connectivity tends to have a throughput vs latency
| curve.
|
| If you need reliable low latency, as you want for a phone call,
| you get very little throughput.
|
| Examples of such connections are wifi near the end of the
| range, or LTE connections with only one signal bar.
|
| In those cases, a speedtest might say you have multiple
| megabits available, but you probably only have kilobits of
| bandwidth if you want reliable low latency.
| zekica wrote:
| Yes, but it doesn't have to be. Have you looked into Dave
| Taht's crusade against buffers?
| lxgr wrote:
| Correct buffer sizing isn't a good solution for
| Bufferbloat: The ideal size corresponds to the end-to-end
| bandwidth-delay product, but since one buffer can handle
| multiple flows with greatly varying latencies/delays, that
| number does not necessarily converge.
|
| Queueing aware scheduling algorithms are much more
| effective, are readily available in Linux (fq_codel and
| others), and are slowly making their way into even consumer
| routers (or at least I hope).
| lxgr wrote:
| Load ratios of > 0.5 are definitely achievable without
| entering Bufferbloat territory, and even more is possible
| using standing queue aware schedulers such as CoDel.
|
| Also, Bufferbloat is usually not (only) caused by you, but by
| people sharing the same chokepoint as you in either or both
| directions. But if you're lucky, the router owning the
| chokepoint has at least some rudimentary per-flow or per-IP
| fair scheduler, in which case sending less yourself can
| indeed help.
|
| Still, to have that effect result in a usable data rate of
| kilobits on a connection that can otherwise push megabits
| (disregarding queueing delay), the chokepoint would have to
| be severely overprovisioned and/or extremely poorly
| scheduled.
| cornstalks wrote:
| There's a section ("Our motivation for building a new codec")
| in the article that directly addresses this. Assuming you have
| >32 kbps bandwidth available is a bad assumption.
| nicce wrote:
| The best assumption would be that you either have a
| connection available or not.
|
| Then, if it is available, what is the minimal data rate for
| connections which are available in general? If we do
| statistical analysis for that, is it lower than 32 kbps? How
| significantly?
|
| For some reason, I would assume that if you have a
| connection, it is faster than 2G these days.
| zamadatix wrote:
| The question isn't really the minimal bandwidth of the PHY
| rate; it's about the goodput for a given reliability.
| Regardless of your radio there will always be some point
| where someone is at the edge of a connection and goodput is
| less than the minimal PHY bandwidth. The call then turns
| choppy, or into a time-stretched robot voice from which you
| get only every other syllable. The less data you need to
| transmit + the
| more FEC you can fit in the goodput then the better that
| situation becomes.
|
| Not to mention "just because I have some minimal baseline
| of $x kbps doesn't mean I want $y to use all of it the
| entire time I'm on a call if it doesn't have to".
| sangnoir wrote:
| > For some reason, I would assume that if you have a
| connection, it is faster than 2G these days.
|
| That assumption does not hold for a sizable chunk of Meta's
| 3.98B-strong userbase. The list of countries that switched
| off 2G is surprisingly short.
| nicce wrote:
| Now that you mention it, Wikipedia has an interesting list
| about that. It seems most will only start switching off
| around 2030.
|
| https://en.wikipedia.org/wiki/2G
| hateful wrote:
| It's not only about the end that's receiving, it's also the end
| that's transmitting 10kbps * thousands of users.
| sogen wrote:
| I'm assuming they'll just re-encode everything, for every user,
| to a lower bitrate using this codec.
|
| So, with their huge user base they'll be saving a gazillion
| terabytes hourly, that's what I concluded from their "2 years
| in the making" announcement.
| ajb wrote:
| If you mean for storage, real time codecs are actually pretty
| inefficient for that use case because they don't get much use
| of temporal redundancy. Although I'm not actually aware of a
| non-real time audio codec specialised for voice. They
| probably exist in Cheltenham and Maryland but for Meta this
| likely doesn't make a big enough part of their storage costs
| to bother.
| lxgr wrote:
| Meta's use case are OTT applications on the Internet, which are
| usually billed per byte transmitted. Reducing the bitrate for
| the audio codec used lets people talk longer per month on the
| same data plan.
|
| That said, returns are diminishing in that space due to the
| overhead of RTP, UDP and IP; see my other comment for details
| on that.
| evandrofisico wrote:
| More than that, in developing countries, such as my own, Meta
| has peering agreements with telephony companies which allow
| said companies to offer basic plans where traffic to Meta
| applications (mostly whatsapp) is not billed. This would
| certainly reduce their costs immensely, considering that
| people use whatsapp as THE communications service.
| kylehotchkiss wrote:
| Maybe something like this would be helpful for Apple to
| implement voice messages over satellite. Also a LOT of people
| in developing countries use WhatsApp voice messages with slow
| network speeds or expensive data. It's too easy to forget how
| big an audience Meta has outside the western world.
| gorkish wrote:
| The lack of any reference or comparison to Codec2 immediately
| leads me to question the real value and motivation of this work.
| The world doesn't need another IP-encumbered audio codec in this
| space.
| muizelaar wrote:
| They also don't compare with Lyra
| (https://github.com/google/lyra)
| cvg wrote:
| Nice. Google's SoundStream already has some great quality.
| Some examples at 6kbps here: https://google-
| research.github.io/seanet/soundstream/example...
| gorkish wrote:
| Or speex narrowband or others. I think the tendency to pick
| Opus is just because it has a newer date on it -- its design
| goals were not necessarily to optimize for low bitrate; Opus
| just happened to still sound OK when the knob was turned down
| that far.
|
| One other point I intended to make that is not reflected in
| many listening/comparison tests offered by these
| presentations -- in the typical applications of low bitrate
| codecs, they absolutely must be able to gracefully degrade.
| We see MLow performing at 6kbps here; how does it perform
| with 5% bit errors? Can it be tuned for lower bitrates like
| 3 kbps? A codec with a 6kbps floor that garbles into nonsense
| with a single bit flip would be dead-on-arrival for most real
| world applications. If you have to double the bitrate with
| FEC to make it reliable, have you really designed a low
| bitrate codec? The only example we heard of MLow was 30% loss
| on a 14kbps stream = 9.8kbps. Getting 6kbps through such a
| channel is a trivial exercise.
| DragonStrength wrote:
| My understanding was Opus was specifically developed with
| the idea of replacing both Speex and Vorbis. "Better
| quality than Speex" is literally one of their selling
| points, so I'd be interested to hear more details.
| Dwedit wrote:
| There's also the LPCNet Codec (2019), which does wideband
| speech at 1.6kb/s by using a recurrent neural network.
|
| https://jmvalin.ca/demo/lpcnet_codec/
| nickels wrote:
| Could it be used for voice over satellite, ie Emergency SOS via
| satellite on iPhones?
| lxgr wrote:
| iPhones use Globalstar, which theoretically supports voice
| bitrates of (I believe) 9.6 kbps, although only using dedicated
| satphones with large, external antennas.
|
| Apple's current solution requires several seconds to transmit a
| location stamp of only a handful of bytes, so I think we're
| either some iPhone or satellite upgrades away from real-time
| voice communication over that.
|
| Starlink has demonstrated a direct-to-device video call
| already, though, so we seem to be quickly approaching that
| point! My strong suspicion is that Apple has bigger plans for
| Globalstar than just text messaging.
| zekica wrote:
| Starlink is in a better position as their satellites are in a
| low earth orbit - 30 times closer than geostationary. It
| correlates to 1000 times (30dB) stronger signal on both
| sides.
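The ~30 dB figure follows from the inverse-square law; a quick sanity check (distance ratio rounded as in the comment):

```python
import math

# Free-space path loss scales with the square of distance, so being
# ~30x closer gives a ~900x power gain, i.e. roughly 30 dB.
distance_ratio = 30
power_gain = distance_ratio ** 2          # inverse-square law: 900x
gain_db = 10 * math.log10(power_gain)     # ~29.5 dB

print(round(gain_db, 1))
```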
| lxgr wrote:
| Globalstar is LEO as well, although a bit higher (~1400 km)
| than Iridium (~780 km) and Starlink (below Iridium; various
| altitudes). In terms of SNR, they're very comparable.
|
| Newer GEO direct-to-device satellites also have huge
| reflectors and often much higher transmit power levels that
| can compensate for the greater distance somewhat. Terrestar
| and Thuraya have had quite small phones available since the
| late 2000s already, and they're both (large) GEO.
| ianburrell wrote:
| Iridium and Globalstar aren't geostationary. They are LEO
| not much higher than Starlink.
|
| Starlink is doing direct-to-cell. Talking to existing
| phones requires a large antenna. The bandwidth for each
| device is slow, not enough for mobile data, but better than
| Iridium. I think they recently showed off voice calls.
| chronogram wrote:
| No mention of NoLACE makes the comparison samples a bit less
| useful: https://opus-codec.org/demo/opus-1.5/
| jamal-kumar wrote:
| That does sound very nice
| sitkack wrote:
| This is really cool and I very very very much appreciate that
| xiph puts so much work into standardization.
| https://datatracker.ietf.org/wg/mlcodec/documents/
|
| It would be nice if Meta donated this to the world so we have
| fewer anchors for patent trolls and can transition to the
| future we deserve.
| 77pt77 wrote:
| Where is the source code?
| lxgr wrote:
| All these new, low-bitrate codecs are amazing, but ironically I
| suspect that they won't actually be very useful in most of the
| scenarios Meta is using them:
|
| To keep latency low in real-time communications, the packet rate
| needs to be relatively high, and at some point the overhead of
| UDP, IP, and lower layers starts dominating over the actual
| payload.
|
| As an example, consider (S)RTP (over UDP and IP): RTP adds at
| least 12 bytes of overhead (let's ignore the SRTP authentication
| tag for now); UDP adds 8 bytes, and IPv4 adds 20, for a total
| of 40. At a typical packet rate of 50 per second (for a
| serialization delay of 1/50 = 20ms), that's 16 kbps of overhead
| alone!
|
| It might still be acceptable to reduce the packet rate to 25 per
| second, which would cut this in half for an overhead of 8 kbps,
| but the overhead would still be dominating the total transmission
| rate.
|
| Where codecs like this can really shine, though, is circuit-
| switched communication (some satphones use bitrates of around 2
| kbps, which currently sound awful!), or protocol-aware VoIP
| systems that can employ header compression such as that used by
| LTE and 5G in IMS (most of the 40 bytes per frame are extremely
| predictable).
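The overhead arithmetic in the comment above, as a small sketch (header sizes are the ones assumed there: RTP 12 B, UDP 8 B, IPv4 20 B, SRTP auth tag ignored):

```python
# Per-packet header overhead at a given packet rate.
HEADER_BYTES = 12 + 8 + 20  # RTP + UDP + IPv4 = 40 bytes per packet

def overhead_kbps(packets_per_second: int) -> float:
    """Header overhead in kbps, independent of the codec payload."""
    return HEADER_BYTES * 8 * packets_per_second / 1000

print(overhead_kbps(50))  # 16.0 kbps at 20 ms frames
print(overhead_kbps(25))  # 8.0 kbps at 40 ms frames
```

At 6 kbps of codec payload, even the halved overhead still exceeds the payload itself, which is the point being made.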
| tgtweak wrote:
| I think this is likely incorrect based on how much voice/audio
| distribution meta does today with facebook (and facebook live),
| instagram and whatsapp - more so with whatsapp voice messages
| and calling given its considerable market share in countries with
| intermittent and low-reliability network connectivity. The fact
| it is more packet-loss robust and jitter-robust means that you
| can rely on protocols that have less error correction,
| segmenting and receive-reply overhead as well.
|
| I don't think it's unreasonable to assume this could reduce
| their total audio-sourced bandwidth consumption by a
| considerable amount while maintaining/improving reliability and
| perceived "quality".
|
| Looking at a Wireshark capture of whatsapp on an active call,
| there were around 380 UDP packets sent from source to recipient
| during a 1 minute call, and a handful of TCP packets to whatsapp's
| servers. That would yield a transmission overhead of about
| 2.2kbps.
|
| quick edit to clarify why this is: you can see starting ptime
| (audio size per packet) set to 20ms here, but maxptime set to
| 150ms, which the clients can/will use opportunistically to
| reduce the number of packets being sent taking into
| consideration the latency between parties and bandwidth
| available.
|
| (image): https://www.twilio.com/content/dam/twilio-
| com/global/en/blog...
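The ~2.2 kbps figure from the capture above can be roughly reproduced; this sketch assumes the same 40 bytes of RTP/UDP/IPv4 headers per packet used elsewhere in the thread (the capture itself wasn't broken down by header):

```python
# Overhead implied by ~380 packets observed over a one-minute call.
packets_per_minute = 380
header_bytes = 40  # assumed RTP + UDP + IPv4

overhead_kbps = packets_per_minute * header_bytes * 8 / 60 / 1000
print(round(overhead_kbps, 1))  # ~2.0 kbps, close to the ~2.2 quoted
```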
| lxgr wrote:
| What part of that calculation is incorrect in your view?
|
| > 380 UDP packets sent from source to recipient during a 1
| minute call, and a handful of TCP packets to whatsapp's
| servers. That would yield a transmission overhead of about
| 2.2kbps.
|
| That sounds like way too many packets! 380 packets per
| second, at 40 bytes of overhead per packet, would be almost
| 120 kbps.
|
| My calculation assumes just 50, and that's already quite a
| high packet rate.
|
| > you can rely on protocols that have less error correction
|
| You could, but there's no way to get a regular smartphone IP
| stack running over Wi-Fi or mobile data to actually expose
| that capability to you. Even just getting the OS's UDP stack
| (to say nothing of middleboxes) to ignore UDP checksums and
| let you use those extra four bytes for data can be tricky.
|
| Non-IP protocols, or even just IP or UDP header compression,
| are completely out of reach for an OTT application. (Networks
| might transparently do it; I'm pretty sure they'd still
| charge based on the gross data rate though, and as soon as
| the traffic leaves their core network, it'll be back to
| regular RTP over UDP over IP).
|
| What they could do (and I suspect they might already be
| doing) is to compress RTP headers (or use something other
| than RTP) and/or pick even lower packet rates.
|
| > I don't think it's unreasonable to assume this could reduce
| their total audio-sourced bandwidth consumption by a
| considerable amount while maintaining/improving reliability
| and perceived "quality".
|
| I definitely don't agree on the latter assertion - packet
| loss resilience is a huge deal for perceived quality! I'm
| just a bit more pessimistic on the former, unless they do the
| other optimizations mentioned above.
| roman-holovin wrote:
| I read it as in 380 packets per whole call, which was a
| minute long, not 380 packets per second during 1 minute.
| mikepavone wrote:
| That's about 160 ms of audio per packet. That's a lot of
| latency to add before you even hit the network.
| ant6n wrote:
| Assuming continuous sound. You don't need to send many
| packets for silence.
| lxgr wrote:
| Voice activity detection and comfort noise have been
| available in VoIP since the very beginning, but now I
| wonder if there's some clever optimization that could be
| done based on a semantic understanding of conversational
| patterns:
|
| During longer monologues, decrease packet rates; for
| interruptions, send a few early samples of the
| interrupter to notify the speaker, and at the same time
| make the (former) speaker's stack flush its cache to
| allow "acknowledgement" of the interruption through
| silence.
|
| In other words, modulate the packet rate in proportion to
| the instantaneous interactivity of a dialogue, which
| allows spending the "overhead budget" where it matters
| most.
| markus92 wrote:
| I think you're misreading OP, as he says 380 packets per
| minute, not second. That would give you an overhead of 253
| bytes per second, sounds a lot more reasonable.
| lxgr wrote:
| Wow, that would be an extremely low packet rate indeed!
|
| That would definitely increase the utility of low bitrate
| codecs by a lot, at the expense of some latency (which is
| probably ok, if the alternative is not having the call at
| all).
| tgtweak wrote:
| Yes 380/min = ~6/s which is a very open ptime of >100ms,
| this can also be dynamic and change on the fly. It
| ultimately comes down to how big the packet can be before
| it gets split which is a function of MTU.
|
| If you have 50ms of latency between parties, and you are
| sending 150ms segments, you'll have a perceived latency
| of ~200ms which is tolerable for voice conversations.
|
| One other note is that this is ONLY for live voice
| communication like calling where two parties need to hear
| and respond within a reasonable delay - for downloading of
| audio messages or audio on videos, including one-way
| livestreams for example, this ptime is irrelevant and
| you're not encapsulating with SRTP - that is just for
| voip-like live audio.
|
| There is a reality in what OP posted which is that there
| is diminishing returns in actual gains as you get lower
| in the bitrate, but modern voice implementations in apps
| like whatsapp are using dynamic ptime and are very smart
| about adapting the voice stream to account for latency,
| packet loss and bandwidth.
| newobj wrote:
| pretty sure they said 380 packets total in the 1 minute
| call (~6-7/s)
| vel0city wrote:
| Another interesting use case for these kinds of ultra-low
| bitrate voice compression systems are digital radio systems.
| AMBE+2 and similar common voice codecs used on radio systems
| sound pretty miserable and don't handle dropped packets nearly
| as gracefully as compared to these newer codecs.
| toast0 wrote:
| Latency is the mind killer, but if available bandwidth is low,
| you save a ton of overhead by bundling 2-5 of your 20ms
| samples. Enough that the codec savings start to make sense,
| even though 100ms packets adds a ton of latency. Fancier
| systems can adapt codecs and samples per packet based on
| current conditions. The one I work on is a static codec and 60
| ms of audio per packet, which isn't ideal, but allows us to run
| in low bandwidth much better than 20 ms per packet.
|
| Edit to add: Meta can also afford to add a bit more sampling
| delay, because they've got very wide distribution of forwarding
| servers (they can do forwarding in their content appliances
| embedded in many ISPs), which reduces network delay vs
| competing services that have limited ability to host forwarding
| around the globe. Peer to peer doesn't always work and isn't
| always lower delay than going through a nearby forwarding
| server.
| lukevp wrote:
| Why would you need 50 packets per second vs 10? Is 100ms not
| acceptable but 20ms is?
| tgtweak wrote:
| Default configuration for SIP used to be 20ms, the rationale
| behind it was actually sourced in the fact that most SIP was
| done on LANs and inter-campus WAN which had generally high
| bitrate connectivity and low latency. The lower the packet
| time window the sooner the recipient could "hear" your voice,
| and if there were to be packet loss, there would be less of
| an impact if that packet were dropped - you'd only lose 20ms
| of audio vs 100ms. The same applies for high bitrate but high
| latency (3g for example) connectivity - you want to take
| advantage of the bandwidth to mitigate some of the network
| level latency that would impact the audio delay - being
| "wasteful" to ensure lower latency and higher packet loss
| tolerance.
|
| Pointedly - if you had a 75ms of one-way latency (150ms RTT)
| between two parties, and you used a 150ms audio segment
| length (ptime) you'd be getting close to the 250ms generally
| accepted max audio delay for smooth two-way communication.
| The recipient is hearing your first millisecond of audio
| 225ms later at best. If any packet does get lost, the
| recipient would lose 150ms of your message vs 20ms.
|
| Modern voice apps and voip use dynamic ptime (usually via
| "maxptime" which specifies the highest/worst case) in their
| protocol for this reason - it allows the clients to optimize
| for all combinations of high/low bandwidth, high/low latency
| and high/low packet loss in realtime - as network conditions
| can often change during the course of a call especially while
| driving around or roaming between wifi and cellular.
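The delay arithmetic in the comment above can be captured in a minimal model (codec and jitter-buffer delays are deliberately ignored, so this is a lower bound):

```python
# The sender must buffer a full ptime window of audio before a packet
# can leave; the network then adds its one-way latency on top.
def perceived_delay_ms(ptime_ms: int, one_way_latency_ms: int) -> int:
    return ptime_ms + one_way_latency_ms

print(perceived_delay_ms(20, 75))   # 95 ms
print(perceived_delay_ms(150, 75))  # 225 ms, near the ~250 ms limit
```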
| lxgr wrote:
| > the rationale behind it was actually sourced in the fact
| that most SIP was done on LANs and inter-campus WAN which
| had generally high bitrate connectivity and low latency
|
| In addition to that, early VoIP applications mostly used
| uncompressed G.711 audio, both for interoperability with
| circuit switched networks and because efficient voice
| compression codecs weren't yet available royalty-free.
|
| G.711 is 64 kbps, so 12 kbps of overhead are less than 25%
| - not much point in cutting that down to, say, 10% at the
| expense of doubling effective latency in a LAN use case.
| crazygringo wrote:
| > _Is 100ms not acceptable but 20ms is?_
|
| Yup pretty much. Doubling it for round-trip, 200 ms is a
| fifth of a second which is definitely noticeable in
| conversation.
|
| 40 ms is a twenty-fifth of a second, or approximately a
| single frame of a motion picture. That's not going to be
| noticeable in conversation at all.
|
| Of course both of these are on top of other sources of
| latencies, too.
| NavinF wrote:
| Yes 100ms feels horrible. People constantly interrupting each
| other because they start talking at around the same time and
| then both say "you go first". Discord has decent latency and
| IMO it's a major reason behind their success
| yalok wrote:
| this codec is for RTC comms - it supports 20ms frames. They
| did mention it's launched in their calling products:
|
| "We have already fully launched MLow to all Instagram and
| Messenger calls and are actively rolling it out on WhatsApp--
| and we've already seen incredible improvement in user
| engagement driven by better audio quality."
| saurik wrote:
| I don't know of any setups which would support muxing in
| exactly the way I am thinking, but another interesting use case
| is if you have multiple incoming audio streams which you don't
| want to be mixed by the server -- potentially because they are
| end-to-end encrypted -- and so a single packet can contain the
| data from multiple streams. Doing end-to-end encrypted audio
| calls is finally becoming pretty widespread, and I could see
| Facebook being in a good position for their products to do
| custom muxing.
| sjacob wrote:
| What skills/concepts would one need to acquire in order to
| fully grasp all that you've detailed here?
| dan-robertson wrote:
| I'm not totally certain about your argument for the specific
| amount of overhead (if the receiver/sender are on mobile
| networks, maybe something happens to the packet headers for the
| first/last legs before the real internet). But doesn't the OP
| already give an example where the low bit-rate codec is good:
| if you can compress things more then you have more of an
| opportunity to add forward error correction, which greatly
| improves the quality of calls on lossy connections. I wonder if
| smaller packets are less likely to be lost, and if there are
| important cases where multiple streams may be sent like group-
| calls.
| dgmdoug wrote:
| They also don't do a comparison with Pied Piper.
| mig39 wrote:
| It might have a Weissman score in the fives, but I haven't seen
| a real-world implementation. Does it use middle-out
| compression?
| byteknight wrote:
| MIDDLE OUT!
| barbazoo wrote:
| That's great, now they can reach even more developing countries
| and do damage the way they did for example in Myanmar [1].
|
| [1] https://www.amnesty.org/en/latest/news/2022/09/myanmar-
| faceb...
| thrtythreeforty wrote:
| Are they releasing this or is this just engineering braggadocio?
| I can't find any other references to MLow other than this blog
| post.
|
| Facebook/Meta AI Research does cool stuff, and releases a
| substantial portion of it (I dislike Facebook but I can admit
| they are highly innovative in the AI space).
| sllabres wrote:
| If you think about 'implementing the algorithm in a product'
| it seems so: (From the article) "We are really excited about
| what we have accomplished in just the last two years--from
| developing a new codec to successfully shipping it to billions
| of users around the globe"
| animanoir wrote:
| Nice technology, tho Opus adds that warm sound I love...
| aidenn0 wrote:
| Only slightly OT:
|
| ELI5: Why is a typical phone call today less intelligible than
| an 8kHz 8-bit u-law call with ADPCM was in the '90s?
|
| [edit]
|
| s/sound worse/less intelligible/
| skygazer wrote:
| Does decrease in intelligibility correlate with the instance
| count of concert seats in front of the loud speakers back in
| the oughts?
| toast0 wrote:
| Depends on your call; u-law has poor frequency response and
| reasonable dynamic range. Not great for music, but ok enough
| for voice, and it's very consistent. 90s calls were almost all
| circuit switched in the last mile, and multiplexed per sample
| on digital lines (T1 and up). This means very low latency and
| zero jitter; there would be a measurable but actually
| imperceptible delay versus an end to end analog circuit
| switched call; but digital sampling near the ends means there
| would be a lot less noise. Circuit switching also means you'd
| never get dropped samples --- the connection is made or it's
| not, although sometimes only one-way.
|
| Modern calls typically use 20 ms frames over packet switched
| networks, so you're adding framing delay, plus jitter and
| jitter buffers. The codecs themselves have encode/decode
| delay, because they're doing more than an ADC/DAC with a
| logarithm. Most of the codecs use significantly fewer bits
| per frame than u-law, and that's not for free either.
|
| HD Voice (g.722.2 AMR-Wide Band) has a much larger frequency
| pass band, and sounds much better than GSM or OPUS or most of
| these other low bandwidth codecs. There's still delay though;
| even if people will tell you 20-100ms delay is imperceptible,
| give someone an a/b call with 0 and 20 ms delay and they'll
| tell you the 0 ms delay call is better.
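The "ADC/DAC with a logarithm" remark above refers to mu-law companding. A minimal continuous-form sketch, assuming the standard G.711 constant mu = 255 but skipping the 8-bit quantization a real codec performs:

```python
import math

def mulaw_encode(x, mu=255):
    """Compand a linear sample in [-1, 1]; quiet samples get
    disproportionately more of the output range."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mulaw_decode(y, mu=255):
    """Invert the companding back to a linear sample."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

# A quiet sample is expanded to use far more of the coding range:
print(mulaw_encode(0.01))  # ~0.23
```

This is why 8-bit u-law sounds like roughly 13-bit linear PCM for speech: the logarithm allocates resolution where quiet voice signals actually live.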
| rylittle wrote:
| could you explain a little more, in a more ELI5 way please?
| Dylan16807 wrote:
| > HD Voice (g.722.2 AMR-Wide Band) has a much larger
| frequency pass band, and sounds much better than GSM or OPUS
| or most of these other low bandwidth codecs.
|
| At what bitrate, for the comparison to Opus?
|
| And is this Opus using LACE/NoLACE as introduced in version
| 1.5?
|
| ...and is Meta using it in their comparison? It makes a huge
| difference.
| toast0 wrote:
| Yeah, I probably shouldn't have included Opus; I'm past the
| edit window or I'd remove it with a note. I haven't done
| enough comparison with Opus to really declare that part,
| and I don't think the circumstances were even. But I'm
| guessing the good HD Voice calls are at full bandwidth of ~
| 24 kbps, and I'm comparing with a product that was said to
| be using opus at 20 kbps. Opus at 32kbps sounds pretty
| reasonable. And carrier supported HD voice probably has
| prioritization and other things going on that mean less
| loss and probably less jitter. Really the big issue my ear
| has with Opus is when there's loss.
|
| I don't think I've been on calls with Opus 1.5 with
| lace/no-lace, released 3 months ago, so no, I haven't
| compared it with HD voice that my carrier deployed a decade
| ago. Seems a reasonable thing for Meta to test with, but it
| might be too new to be included in their comparison as
| well.
| Dylan16807 wrote:
| > Really the big issue my ear has with Opus is when
| there's loss.
|
| That would definitely complicate things. Going by the
| test results that got cited on Wikipedia, Opus has an
| advantage at 20-24, but that's easy enough to overwhelm.
|
| And the Opus encoder got some other major improvements up
| through 2018, so I'd be interested in updated charts.
|
| Oh and 1.5 also adds a better packet loss mechanism.
| sva_ wrote:
| Hearing ability deteriorates with age.
| aidenn0 wrote:
| Yes, but it doesn't deteriorate in such a way as to cause
| someone speaking to sound like gibberish and/or random
| medium-frequency tones, which happens in nearly every single
| cell phone conversation I have that lasts more than 5
| minutes.
|
| My experience is that phone calls nowadays alternate between
| a much wider-band (and thus often better sounding) experience
| and "WTF was that just now?"
| hot_gril wrote:
| Phone ear speakers are quieter than they used to be, so if the
| other person isn't talking clearly into the mic, you can't
| crank it up. I switched from a flip phone to an iPhone in 2013,
| huge difference. I had to immediately switch to using earbuds
| or speakerphone. Was in my teens at the time.
| iamnotsure wrote:
| Please stop lossy compression.
| cheema33 wrote:
| Lossy compression has its practical uses. Under ideal
| circumstances nobody is going to stop you from using FLAC.
| GrantMoyer wrote:
| Have you ever looked at the size of losslessly compressed
| video? It's huge. Lossy compression is the only practical way
| to store and stream video, since it's typically less than 1% of
| the size of uncompressed video. Lossless compression typically
| only gets down to about 50% of the size. It's amazing how much
| information you can throw out from a video, and barely be able
| to tell the difference.
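To put rough numbers on "huge" — an illustrative back-of-the-envelope, assuming 8-bit RGB with no chroma subsampling:

```python
# One minute of raw 1080p30 video at 3 bytes per pixel (8-bit RGB)
width, height, fps, bytes_per_pixel = 1920, 1080, 30, 3
raw_bytes_per_min = width * height * bytes_per_pixel * fps * 60
print(raw_bytes_per_min / 1e9)         # ~11.2 GB per minute, uncompressed
print(raw_bytes_per_min * 0.5 / 1e9)   # ~5.6 GB at a ~50% lossless ratio
print(raw_bytes_per_min * 0.01 / 1e6)  # ~112 MB at a ~1% lossy ratio
```

At those sizes, lossless video is untouchable for streaming, while a 1% lossy ratio brings a minute of 1080p down to something an ordinary connection can carry.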
| amelius wrote:
| Can't we have an audio codec that first sends a model of the
| particular voice, and then starts to send bits corresponding to
| the actual speech?
| roywiggins wrote:
| You need a bunch of bandwidth upfront for that, which you might
| not have, and enough compute available at the other end to
| reconstruct it, which you really might not have.
| amelius wrote:
| Regarding your first point, how about reserving a small
| percentage of the bandwidth for a model that improves
| incrementally?
| wildzzz wrote:
| You're adding more complexity to both the transmitter and
| receiver. I'd be pretty pissed if I had to endure
| unintelligible speech for a few minutes until the model was
| downloaded enough to be able to hear my friend. I'd also be
| a little pissed if I had to store massive models for
| everyone in my call log. Also both devices need to be able
| to run this model. If you are regularly talking over a shit
| signal, you're probably not going to be able to afford the
| latest flagship phone that has the hardware necessary to
| run it (which is exactly what the article touches on). The
| ideal codec takes up almost no bandwidth, sounds like
| you're sitting next to the caller, and runs on the average
| budget smartphone/PC. The issue is that you aren't going to
| be able to get one of these things so you choose a codec
| that best balances complexity, quality, and bandwidth given
| the situation. Link quality improves? Improve your voice
| quality by increasing the codec bitrate or switching to
| another less complex one to save battery. If both devices
| are capable of running a ML codec, then use that to improve
| quality and fit within the given bandwidth.
| neckro23 wrote:
| This is actually an old idea, minus the AI angle (1930s). It's
| what voders and vocoders were originally designed for, before
| Kraftwerk et al. found out you can use them to make cool robot
| voices.
| hubraumhugo wrote:
| Is it just my perception or has Meta become cool again by sharing
| a ton of research and open source (or open weights) work?
|
| Facebook's reputation was at the bottom, but now it seems like
| they made up for it.
| mrguyorama wrote:
| How the hell does releasing one audio codec undo years and
| years of privacy nightmare, being a willing bystander in an
| actual genocide, experimenting with the emotions of depressed
| kids, and collusion to depress wages?
| pt_PT_guy wrote:
| they also did release LLM models, and zstd, and mold, and and
| and... a lot of stuff
| visarga wrote:
| React and Pytorch
|
| compare that to Angular and TensorFlow, such a difference
| in culture
| hot_gril wrote:
| Easy vs tons-o-boilerplate.
| cheema33 wrote:
| Don't forget React. The most popular frontend stack at the
| moment. Been that way for some time.
|
| And GraphQL, Relay, Stylex...
| XlA5vEKsMISoIln wrote:
| >React
|
| Ah yes, the <body id="app"></body> websites.
| cztomsik wrote:
| How is that specific to React? And who would use webapp
| technology for a website?
| ComputerGuru wrote:
| zstd and mold are personal projects regardless of employer.
| That said, I didn't know mold was written by a meta guy.
| lucb1e wrote:
| Zstd is a personal project? Surely it's not by accident
| in the Facebook GitHub organization? And that you need to
| sign a contract on code.facebook.com before they'll
| consider merging any contributions? That seems like an
| odd claim, unless it _used to be_ a personal project and
| Facebook took it over
|
| (https://github.com/facebook/zstd/blob/dev/CONTRIBUTING.m
| d#co...)
| risho wrote:
| it isn't just one audio codec. they also released and
| continue to release the best self hostable large language
| model weights, they have authored many open source projects
| that are staples today such as zstandard, react, pytorch
| stuxnet79 wrote:
| You will need to provide citations on the last point as
| Facebook are widely known to have broken the gentleman's
| agreement between Apple and Google that was suppressing tech
| pay in the early 2010s.
| giraffe_lady wrote:
| OK sure even if they didn't do that we're still left with
| "knowingly abetted a genocide" which no amount of open
| source work can ever balance out.
| rylittle wrote:
| context?
| _whiteCaps_ wrote:
| https://www.amnesty.org/en/latest/news/2022/09/myanmar-
| faceb...
| robertlagrant wrote:
| This article seems to not really mention the "knowingly"
| or the "abetted". If there are people killing other
| people, I wouldn't say that a communication method was to
| blame. In Scream, Sidney didn't sue the phone company who
| let the killer call her from inside the house. The idea
| that some news feed posts whipped people up into a
| killing frenzy just sounds absurd.
|
| I wish the author could see that, and if the case is
| valid, to provide it, instead of some pretty tenuous
| claims of connection strung together to lead up to a
| demand for money.
|
| I did try to go to the link that evidenced the "multiple"
| times Facebook was contacted in a 5 year period, but I
| couldn't get through. How many times was it, for anyone
| who can?
| giraffe_lady wrote:
| The full report is linked to in the first paragraph.
| These points are all addressed in detail there.
| yard2010 wrote:
| You forgot selling the 2016 US elections to Putin for 100k[0]
|
| Good luck undoing that releasing codecs haha
|
| [0] https://time.com/4930532/facebook-russian-
| accounts-2016-elec...
| nine_k wrote:
| I have the same impression.
|
| Facebook the social network reputation may be not shiny, but
| Meta the engineering company reputation is pretty high, to my
| mind.
|
| It's somehow similar to IBM, who may look not stellar as a
| hardware or software solutions provider, but still have quite
| cool research and microelectronics branches.
| danuker wrote:
| I don't think they made up for it. They are training AIs off of
| personal data. The open stuff are a desperate red herring.
|
| https://www.theregister.com/2024/06/10/meta_ai_training/
| jhallenworld wrote:
| So this is one argument- another is that I'm impressed that
| they got their own LLM running and integrated into facebook
| messenger.
|
| I ran across this interesting graphic recently:
|
| https://www.theverge.com/2023/12/4/23987953/the-gpu-haves-
| an...
|
| Suddenly facebook is useful as a search engine..
|
| Also interesting is that Meta AI is much faster than ChatGPT
| from end-user's point of view, but results not quite as good.
| Here is a comparison:
|
| https://www.youtube.com/watch?v=1vLvLN5wxS0
| jen729w wrote:
| > a desperate red herring.
|
| _Or_ you can recognise that 'Meta' isn't a conscious entity,
| and that it's perfectly likely that there are some people
| _over there_ doing amazing open-source work, and different
| people _over there_ making ethically dubious decisions when
| building their LLMs.
| RIMR wrote:
| I would say that I have a very favorable opinion of Meta in
| terms of how they share their research and open source their
| software.
|
| I would say that I have a very unfavorable opinion of Meta in
| terms of their commitment to privacy, security, and social
| responsibility.
| lucb1e wrote:
| research department != product department
|
| Microsoft Research also puts out some really cool stuff, but
| that does not mean the "same" Microsoft can't show ads in their
| OS' start menu for people's constant enjoyment. I noticed this
| interesting discrepancy in Microsoft some years ago as a
| teenager; it does not surprise me at all that Facebook has a
| division doing cool things (zstandard, etc.) and a completely
| separate set of people working towards completely different
| goals simultaneously. Probably most companies larger than a
| couple hundred people have such departmental discrepancies
| yard2010 wrote:
| Don't fall for it. They will find a way to let some powerful
| actors exploit the users for dimes.
| chefandy wrote:
| The marketing is def working. I'm sure we'd be pretty depressed
| by some of the projects that didn't make the blog cut.
| theoperagoer wrote:
| Was hoping this would have a GitHub link ...
| mcoliver wrote:
| Maybe SiriusXM can pick this up. The audio quality is generally
| awful but especially on news/talk channels like Bloomberg and
| CNBC. There is no depth or richness to the voices.
| tgtweak wrote:
| It actually comes down to the SiriusXM receiver being used -
| I've witnessed the built-in SiriusXM on the latest GM platform
| (a $100,000+ Cadillac) sounding like AM radio, then immediately
| sat in an older Lexus and heard a better-than-Apple-streaming
| quality rendition of the exact same channel a few minutes
| apart...
|
| The mobile XM receivers (like iPods) that they used to sell
| also had very good quality and I never noticed any quality
| shortcomings even with good headphones.
|
| I think the "high" quality stream is 256kbps/16k which is
| fairly high compared to most streaming services that come in
| around 128/160.
| sitkack wrote:
| I am archiving some music at 40kbps using Opus and the
| quality is pretty amazing. I think once things get over
| 20+kbps all the codecs start sounding pretty good (relative
| to these low bitrates).
|
| I still prefer flac if possible.
| theoperagoer wrote:
| Opus is fantastic!
| wildzzz wrote:
| My old Sirius portable receiver sounded like garbage despite
| the marketing material saying "Crystal Clear". My 2006
| Infiniti Sirius receiver didn't sound any better despite
| being a massive space heater in the trunk. The later cars
| I've used it in sound good, at least good enough to sound as
| clear as FM radio or even HD radio. I think some of the
| channels are still in lower bitrates like the news channels
| for example, they've always sounded bad. There's something
| I've read about SiriusXM using terrestrial transmitters which
| may improve the signal whereas the satellite link may be of
| lower bandwidth.
| victorp13 wrote:
| Does anyone happen to know if ChatGPT's voice feature uses audio
| compression similar to Opus? Especially the "heavy 30 percent
| receiver-side packet loss" example sounds a LOT like the
| experience I have sometimes.
| dsign wrote:
| Can I use this to make music?
|
| A little bit on a tangent, a technique called Linear Predictive
| Coding, which was developed by telecoms in the sixties and
| seventies, has a calculated bandwidth of 2.5 kbit/s. The sound
| quality is not any good, and telephone companies of the time
| didn't use it for calls, but the paper I read describing the
| technique says the decoded speech is "understandable". LPC found
| its way into musical production, in a set of instruments called
| "vocoders" used to distort a singer's voce. There are, for
| example, variations of it in something called "Orange Vocoder
| IV".
|
| So, now I'm wondering, can MLow be used to purposefully introduce
| interesting distortions in speech? Or even change a singer's
| voice color?
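The analysis half of LPC mentioned above is compact enough to sketch: solve the Yule-Walker equations for an all-pole predictor via the Levinson-Durbin recursion (a textbook formulation, not anything from MLow or any shipping codec):

```python
def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for prediction coefficients.

    r: autocorrelation values r[0..order]. Returns (a, err) where
    a[0] == 1 and a[1..order] are the prediction-error filter taps."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                 # reflection coefficient
        a_prev = a[:]                  # the update uses the previous taps
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k             # remaining prediction error power
    return a, err

# Exact autocorrelation of an AR(1) process x[n] = 0.9*x[n-1] + e[n]
a, err = levinson_durbin([1.0, 0.9, 0.81], 2)
print(a)  # taps close to [1, -0.9, 0]: the generating filter is recovered
```

A vocoder-style effect then filters a carrier (or residual) through the predicted filter, which is exactly the "interesting distortion" knob: swap in a different excitation and the speech envelope stays while the voice color changes.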
| WalterSear wrote:
| Just use Digitalis :)
|
| https://www.youtube.com/watch?v=bA23ysR2hAo
| dbcurtis wrote:
| What is the license? I searched but could not find anything.
| varenc wrote:
| It appears to be entirely closed source at the moment. This is
| just the announcement of development of in-house tech. Nothing
| public yet that could even be licensed.
| annoyingnoob wrote:
| I wonder how it sounds compared to G.729.
|
| I worked for a company 20 years ago that had a modified G.729
| codec that could go below 8kbps but sounded decent. We used this
| for VoIP over dial-up Internet, talk about low bandwidth.
|
| Turns out some of the more interesting bits were in the jitter
| buffer and ways to manage the buffer. Scrappy connections deliver
| packets when they can and there is an art to managing the
| difference between the network experience and the user
| experience. For communications, you really need to manage the
| user experience.
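The jitter-buffer management described above can be sketched as follows. All the constants here — the base delay, the clamp range, the 1/16 smoothing factor borrowed from RFC 3550's jitter estimator — are illustrative assumptions, not from any real product:

```python
class JitterBuffer:
    """Toy adaptive jitter buffer: hold packets until a playout deadline
    that tracks the observed arrival jitter."""

    def __init__(self, base_delay_ms=40.0):
        self.delay_ms = base_delay_ms   # current playout delay target
        self.jitter_ms = 0.0            # smoothed inter-arrival jitter
        self.last_transit = None
        self.packets = {}               # seq -> payload

    def push(self, seq, send_ts_ms, recv_ts_ms, payload):
        transit = recv_ts_ms - send_ts_ms
        if self.last_transit is not None:
            # RFC 3550-style exponentially smoothed jitter estimate
            d = abs(transit - self.last_transit)
            self.jitter_ms += (d - self.jitter_ms) / 16.0
        self.last_transit = transit
        # Hold packets a few jitter-multiples, clamped to sane bounds
        self.delay_ms = min(200.0, max(20.0, 20.0 + 3.0 * self.jitter_ms))
        self.packets[seq] = payload

    def pop(self, seq):
        """Fetch the packet due for playout; None means conceal the loss."""
        return self.packets.pop(seq, None)
```

The art the parent describes lives in the `delay_ms` policy: grow it and you trade latency for fewer concealments; shrink it and the call feels snappier but stutters on scrappy connections.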
| eelioss wrote:
| I am curious about encoding times compared with standard
| codecs using ffmpeg.
| lucb1e wrote:
| They claim its computational cost is 10% lower than Opus's,
| specifically for the decoder iirc, but since they speak of the
| 10-year-old hardware used to make millions of WhatsApp calls
| daily, the encoder can't be computationally complex either.
|
| But, yeah, some actual data (if they're not willing to provide
| running code) would have been a welcome addition to this PR
| overview
| therealmarv wrote:
| Does anybody know if this is better than whatever Google Meet
| is using? With choppy, near-unusable slow Internet, Google Meet
| still fulfils its purpose for audio calls where all other
| competitors fail (tested e.g. while being in the Philippines on
| a remote island with very bad internet). However Google Meet's
| tech is not published anywhere afaik.
| lucb1e wrote:
| Can hardly try that out if this PR piece does not contain any
| code. We can judge it as well as you can from the couple
| examples they showed off
| mckirk wrote:
| Maybe it's just me (or maybe I've invested too much money into
| headphones), but I actually liked the Opus sound better at 6
| kbps. The MLow samples had these... harsh and unnatural
| artifacts, whereas the Opus sound (though sounding like it came
| from a tin-can-and-string telephone and lacking all top-end)
| least was 'smooth'. But I'm pretty sure that's because they are
| demonstrating here the very edge of what their codec can do, at
| higher bitrates the choice would probably be a lot clearer.
___________________________________________________________________
(page generated 2024-06-13 23:00 UTC)