[HN Gopher] WebSockets cost us $1M on our AWS bill
       ___________________________________________________________________
        
       WebSockets cost us $1M on our AWS bill
        
       Author : tosh
       Score  : 197 points
       Date   : 2024-11-06 18:50 UTC (4 hours ago)
        
 (HTM) web link (www.recall.ai)
 (TXT) w3m dump (www.recall.ai)
        
       | handfuloflight wrote:
       | Love the transparency here. Would also love if the same
       | transparency was applied to pricing for their core product.
       | Doesn't appear anywhere on the site.
        
         | lawrenceduk wrote:
         | It's ok, it's now a million dollars/year cheaper when your
         | renewal comes up!
         | 
         | Jokes aside though, some good performance sleuthing there.
        
         | DrammBA wrote:
         | I use that as a litmus test when deciding whether to use a
         | service: if I can't find a prominently linked pricing page on
         | the homepage, I'm out.
        
       | hipadev23 wrote:
        | What was the actual cost? CPU?
        
         | cynicalsecurity wrote:
         | They are desperately trying to blame anyone except themselves.
        
       | cosmotic wrote:
       | Why decode to then turn around and re-encode?
        
         | ketzo wrote:
         | I had the same question, but I imagine that the "media
         | pipeline" box with a line that goes directly from "compositor"
         | to "encoder" is probably hiding quite a lot of complexity
         | 
         | Recall's offering allows you to get "audio, video, transcripts,
         | and metadata" from video calls -- again, total conjecture, but
         | I imagine they do need to decode into raw format in order to
         | split out all these end-products (and then re-encode for a
         | video recording specifically.)
        
         | pavlov wrote:
         | Reading their product page, it seems like Recall captures
         | meetings on whatever platform their customers are using: Zoom,
         | Teams, Google Meet, etc.
         | 
         | Since they don't have API access to all these platforms, the
         | best they can do to capture the A/V streams is simply to join
         | the meeting in a headless browser on a server, then capture the
         | browser's output and re-encode it.
        
           | MrBuddyCasino wrote:
           | They're already hacking Chromium. If the compressed video
           | data is unavailable in JS, they could change that instead.
        
             | moogly wrote:
             | They did what every other startup does: put the PoC in
             | production.
        
             | pavlov wrote:
             | If you want to support every meeting platform, you can't
             | really make any assumptions about the data format.
             | 
             | To my knowledge, Zoom's web client uses a custom codec
             | delivered inside a WASM blob. How would you capture that
             | video data to forward it to your recording system? How do
             | you decode it later?
             | 
             | Even if the incoming streams are in a standard format,
             | compositing the meeting as a post-processing operation from
             | raw recorded tracks isn't simple. Video call participants
              | have gaps and network issues and layer changes; you can't
              | assume much of anything about the samples as you would with
             | typical video files. (Coincidentally this is exactly what
             | I'm working on right now at my job.)
        
         | Szpadel wrote:
          | My guess is either that the video they get uses some
          | proprietary encoding format (JS might do some magic on the
          | feed), or that it's a latency-optimized stream that consumes a
          | lot of bandwidth.
        
       | a_t48 wrote:
       | Did they consider iceoryx2? From the outside, it feels like it
       | fits the bill.
        
       | ComputerGuru wrote:
       | I don't mean to be dismissive, but this would have been caught
       | very early on (in the planning stages) by anyone that had/has
       | experience in system-level development rather than full-stack web
       | js/python development. Quite an expensive lesson for them to
       | learn, even though I'm assuming they _do_ have the talent
        | somewhere on the team if they're able to maintain a fork of
       | Chromium.
       | 
       | (I also wouldn't be surprised if they had even more memory copies
       | than they let on, marshalling between the GC-backed JS runtime to
       | the GC-backed Python runtime.)
       | 
       | I was coming back to HN to include in my comment a link to
       | various high-performance IPC libraries, but another commenter
       | already beat me linking to iceoryx2 (though of course they'd need
       | to use a python extension).
       | 
        | SHM for IPC has been well-understood as the better option for
        | high-bandwidth payloads since the 1990s and is a staple of Win32
       | application development for communication between services
       | (daemons) and clients (guis).
        
         | Sesse__ wrote:
         | It's not even clear why they need a browser in the mix; most of
         | these services have APIs you can use. (Also, why fork Chromium
         | instead of using CEF?)
        
         | CharlieDigital wrote:
         | > I don't mean to be dismissive, but this would have been
         | caught very early on (in the planning stages) by anyone that
         | had/has experience in system-level development rather than
         | full-stack web js/python development
         | 
         | Based on their job listing[0], Recall is using Rust on the
         | backend.
         | 
         | [0] https://www.workatastartup.com/companies/recall-ai
        
         | diroussel wrote:
         | Sometimes it is more important to work on proving you have a
         | viable product and market to sell it in before you optimise.
         | 
         | On the outside we can't be sure. But it's possible that they
         | took the right decision to go with a naive implementation
         | first. Then profile, measure and improve later.
         | 
          | But yes, the whole idea of running a headless web browser to
          | run JavaScript to get access to a video stream is a bit crazy.
         | But I guess that's just the world we are in.
        
         | whatever1 wrote:
          | Wouldn't something like Redis also be an alternative?
        
         | randomdata wrote:
          | _> rather than full-stack web js/python development._
         | 
         | The product is not a full-stack web application. What makes you
         | think that they brought in people with that kind of experience
         | just for this particular feature?
         | 
         | Especially when they claim that they chose that route because
         | it was what was most convenient. While you might argue that
         | wasn't the right tradeoff, it is a common tradeoff developers
         | of all kinds make. "Make It Work, Make It Right, Make It Fast"
         | has become pervasive in this industry, for better or worse.
        
       | jgalt212 wrote:
       | > But it turns out that if you IPC 1TB of video per second on AWS
       | it can result in enormous bills when done inefficiently.
       | 
       | As a point of comparison, how many TB per second of video does
       | Netflix stream?
        
         | ffsm8 wrote:
         | I don't think that number is as easy to figure out as most
         | people think.
         | 
          | Netflix has hardware that ISPs can get so they can serve their
          | content without saturating the ISPs' lines.
         | 
         | There is a statistic floating around that Netflix was
          | responsible for 15% of global traffic in 2022/2023, and
          | YouTube 12%. If that number is real... that'd be _a lot_ more.
        
       | CyberDildonics wrote:
       | Actual reality beyond the fake title:
       | 
       | "using WebSockets over loopback was ultimately costing us
       | $1M/year in AWS spend"
       | 
       | then
       | 
       | "and the quest for an efficient high-bandwidth, low-latency IPC"
       | 
       | Shared memory. It has been there for 50 years.
        
       | renewiltord wrote:
       | That's a good write-up with a standard solution in some other
       | spaces. Shared memory buffers are very fast too. It's interesting
       | to see them being used here. Nice write up. It wasn't what I
       | expected: that they were doing something dumb with API Gateway
       | Websockets. This is actual stuff. Nice.
        
       | OptionOfT wrote:
       | Did they originally NOT run things on the same machine? Otherwise
       | the WebSocket would be local and incur no cost.
        
         | jgauth wrote:
         | Did you read the article? It is about the CPU cost of using
         | WebSockets to transfer data over loopback.
        
           | kunwon1 wrote:
           | I read the entire article and that wasn't my takeaway. After
           | reading, I assumed that AWS was (somehow) billing for
           | loopback bandwidth, it wasn't apparent (to me) from the
           | article that CPU costs were the sticking point
        
             | DrammBA wrote:
             | > We set a goal for ourselves to cut this CPU requirement
             | in half, and thereby cut our cloud compute bill in half.
             | 
             | From the article intro before they dive into what exactly
             | is using the CPU.
        
         | magamanlegends wrote:
          | Our WebSocket traffic is roughly 40% of recall.ai's, and our
          | bill was $150 USD this month using a high-memory VPS.
        
         | nemothekid wrote:
         | > _WebSocket would be local and incur no cost._
         | 
          | The memcopies are the cost that they were paying, even if it was
         | local.
        
       | akira2501 wrote:
       | > A single 1080p raw video frame would be 1080 * 1920 * 1.5 =
       | 3110.4 KB in size
       | 
       | They seem to not understand the fundamentals of what they're
       | working on.
       | 
       | > Chromium's WebSocket implementation, and the WebSocket spec in
       | general, create some especially bad performance pitfalls.
       | 
       | You're doing bulk data transfers into a multiplexed short
       | messaging socket. What exactly did you expect?
       | 
       | > However there's no standard interface for transporting data
       | over shared memory.
       | 
       | Yes there is. It's called /dev/shm. You can use shared memory
       | like a filesystem, and no, you should not be worried about
       | user/kernel space overhead at this point. It's the obvious
       | solution to your problem.
       | 
       | > Instead of the typical two-pointers, we have three pointers in
       | our ring buffer:
       | 
       | You can use two back to back mmap(2) calls to create a ringbuffer
       | which avoids this.
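        | 
        | A minimal sketch of that two-mmap "mirrored" ring buffer, assuming
        | Linux and the Rust libc crate (names and the omitted error
        | handling are illustrative only):
        | 
        |     use std::ptr;
        | 
        |     // Map one shared-memory object twice, back to back, so any
        |     // read/write that wraps past the end is still one contiguous
        |     // slice.
        |     unsafe fn mirrored_ring(size: usize) -> *mut u8 {
        |         // `size` must be a multiple of the page size.
        |         let fd = libc::memfd_create(b"ring\0".as_ptr() as *const libc::c_char, 0);
        |         libc::ftruncate(fd, size as libc::off_t);
        | 
        |         // Reserve 2*size of contiguous address space.
        |         let base = libc::mmap(
        |             ptr::null_mut(),
        |             size * 2,
        |             libc::PROT_NONE,
        |             libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
        |             -1,
        |             0,
        |         ) as *mut u8;
        | 
        |         // Map the same object into both halves (MAP_FIXED replaces
        |         // the reservation in place).
        |         for i in 0..2 {
        |             libc::mmap(
        |                 base.add(i * size) as *mut libc::c_void,
        |                 size,
        |                 libc::PROT_READ | libc::PROT_WRITE,
        |                 libc::MAP_SHARED | libc::MAP_FIXED,
        |                 fd,
        |                 0,
        |             );
        |         }
        |         base // byte base[i] aliases byte base[i + size]
        |     }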
        
         | Scaevolus wrote:
          | It's pretty funny that they assumed memory copying was the
          | limiting factor, rather than the various WebSocket overheads,
          | when they're pushing a mere 150MB/s around, then jumped right
          | into over-engineering a zero-copy ring buffer. I get it, but
          | come on!
         | 
         | >50 GB/s of memory bandwidth is common nowadays[1], and will
         | basically never be the bottleneck for 1080P encoding. Zero copy
         | matters when you're doing something exotic, like Netflix
         | pushing dozens of GB/s from a CDN node.
         | 
          | [1]: https://lemire.me/blog/2024/01/18/how-much-memory-bandwidth-...
        
         | anonymous344 wrote:
          | Well, someone will feel like an idiot after reading your facts.
          | This is why education and experience are important, not just a
          | React/Rust course and then you're a full-stack senior :D
        
         | didip wrote:
         | I agree with you. The moment they said shared memory, I was
          | thinking /dev/shm. Lots of programming languages have libraries
          | for /dev/shm already.
         | 
          | And since it behaves like a filesystem, you can swap it for a
          | real filesystem during testing. Very convenient.
         | 
         | I am curious if they tried this already or not and if they did,
         | what problems did they encounter?
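          | 
          | For illustration, a minimal Rust sketch of treating /dev/shm as
          | a plain filesystem (the path and frame size here are just
          | examples):
          | 
          |     use std::fs;
          | 
          |     fn main() -> std::io::Result<()> {
          |         // One raw 1080p frame in a 4:2:0 format is ~3.1 MB.
          |         let frame = vec![0u8; 1920 * 1080 * 3 / 2];
          |         // Producer writes into tmpfs-backed shared memory...
          |         fs::write("/dev/shm/frame_0001.raw", &frame)?;
          |         // ...and a separate consumer process can read it back.
          |         let read_back = fs::read("/dev/shm/frame_0001.raw")?;
          |         assert_eq!(read_back.len(), frame.len());
          |         Ok(())
          |     }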
        
       | Dylan16807 wrote:
       | The title makes it sound like there was some kind of blowout, but
       | really it was a tool that wasn't the best fit for this job, and
       | they were using twice as much CPU as necessary, nothing crazy.
        
       | yapyap wrote:
       | > But it turns out that if you IPC 1TB of video per second on AWS
       | it can result in enormous bills when done inefficiently.
       | 
        | That's surprising to... almost no one? 1 TB/s is nothing to
        | scoff at.
        
         | blibble wrote:
         | in terms of IPC, DDR5 can do about 50GB/s per memory channel
         | 
         | assuming you're only shuffling bytes around, on bare metal this
         | would be ~20 DDR5 channels worth
         | 
         | or 2 servers (12 channels/server for EPYC)
         | 
         | you can get an awful lot of compute these days for not very
         | much money
         | 
         | (shipping your code to the compressed video instead of the
         | exact opposite would probably make more sense though)
        
           | pyrolistical wrote:
           | Terabits vs gigabytes
        
       | turtlebits wrote:
       | Is this really an AWS issue? Sounds like you were just burning
       | CPU cycles, which is not AWS related. WebSockets makes it sound
       | like it was a data transfer or API gateway cost.
        
         | VWWHFSfQ wrote:
         | > Is this really an AWS issue?
         | 
         | I doubt they would have even noticed this outrageous cost if
         | they were running on bare-metal Xeons or Ryzen colo'd servers.
         | You can rent real 44-core Xeon servers for like, $250/month.
         | 
         | So yes, it's an AWS issue.
        
           | JackSlateur wrote:
           | You can rent real 44-core Xeon servers for like, $250/month.
           | 
           | Where, for instance ?
        
             | Faaak wrote:
             | Hetzner for example. An EPYC 48c (96t) goes for 230 euros
        
               | dilyevsky wrote:
                | Hetzner's network is a complete dog. They also sell you
                | machines that should long since have been EOL'ed. No
                | serious business should be using them.
        
               | dijit wrote:
                | What CPU do you think your workload is using on AWS?
               | 
                | GCP exposes their CPU models, and they have some Haswell
                | and Broadwell lithographies in service.
               | 
                | That's a 10+ year old part, for those paying attention.
        
               | dilyevsky wrote:
                | Most GCP and some AWS instances will migrate to another
                | node when the host is faulty. Also, the disk is virtual.
                | None of this applies to bare-metal Hetzner.
        
               | dijit wrote:
               | Why is that relevant to what I said?
        
               | dilyevsky wrote:
               | Only relevant if you care about reliability
        
               | dijit wrote:
               | AWS was working "fine" for about 10 years without live
               | migration, and I've had several individual machines
               | running without a reboot or outage for quite literally
                | half a decade. Enough to hit bugs like this:
                | https://support.hpe.com/hpesc/public/docDisplay?docId=a00092...
               | 
               | Anyway, depending on individual nodes to always be up for
               | reliability is incredibly foolhardy. Things can happen,
               | cloud isn't magic, I've had instances become
               | unrecoverable. Though it is rare.
               | 
               | So, I still don't understand the point, that was not
               | exactly relevant to what I said.
        
               | tsimionescu wrote:
               | I think they meant that Hetzner is offering specific
               | machines they know to be faulty and should have EOLd to
               | customers, not that they use deprecated CPUs.
        
               | dijit wrote:
                | That's scary if true. Any sources? My google-fu is failing
               | me. :/
        
               | akvadrako wrote:
               | It's not scary, it's part of the value proposition.
               | 
                | I used to work for a company that rented lots of Hetzner
                | boxes. Consumer-grade hardware with frequent disk
                | failures was just what we accepted for saving a buck.
        
               | speedgoose wrote:
               | I know serious businesses using Hetzner for their
               | critical workloads. I wouldn't unless money is tight, but
               | it is possible. I use them for my non critical stuff, it
               | costs so much less.
        
               | blibble wrote:
               | I just cat'ed /proc/cpuinfo on my Hetzner and AWS
               | machines
               | 
               | AWS: E5-2680 v4 (2016)
               | 
               | Hetzner: Ryzen 5 (2019)
        
             | GauntletWizard wrote:
              | Hetzner: https://www.hetzner.com/dedicated-rootserver/#cores_threads_...
        
             | VWWHFSfQ wrote:
             | There are many colos that offer dedicated server
             | rental/hosting. You can just google for colos in the region
             | you're looking for. I found this one
             | 
             | https://www.colocrossing.com/server/dedicated-servers/
        
               | petcat wrote:
               | I don't know anything about Colo Crossing (are they a
               | reseller?) but I would bet their $60 per month 4-core
               | Intel Xeons would outperform a $1,000 per month "compute
               | optimized" EC2 server.
        
               | fragmede wrote:
               | What benchmark would you like to use?
        
               | petcat wrote:
               | This blog is about doing video processing on the CPU, so
               | something akin to that.
        
               | phonon wrote:
               | For $1000 per month you can get a c8g.12xlarge (assuming
               | you use some kind of savings plan).[0] That's 48 cores,
               | 96 GB of RAM and 22.5+ Gbps networking. Of course you
               | still need to pay for storage, egress etc., but you seem
               | to be exaggerating a bit....they do offer a 44 core
               | Broadwell/128 GB RAM option for $229 per month, so AWS is
               | more like a 4x markup[1]....the C8g would likely be much
               | faster at single threaded tasks though[2][3]
               | 
                | [0] https://instances.vantage.sh/aws/ec2/c8g.12xlarge?region=us-...
                | [1] https://portal.colocrossing.com/register/order/service/480
                | [2] https://browser.geekbench.com/v6/cpu/8305329
                | [3] https://browser.geekbench.com/processors/intel-xeon-e5-2699-...
        
               | petcat wrote:
               | > That's 48 cores
               | 
               | That's not dedicated 48 cores, it's 48 "vCPUs". There are
               | probably 1,000 other EC2 instances running on those cores
               | stealing all the CPU cycles. You might get 4 cores of
               | actual compute throughput. Which is what I was saying
        
               | phonon wrote:
               | That's not how it works, sorry. (Unless you use burstable
               | instances, like T4g) You can run them at 100% as long as
               | you like, and it has the same performance (minus a small
               | virtualization overhead).
        
               | petcat wrote:
               | Are you telling me that my virtualized EC2 server is the
               | only thing running on the physical hardware/CPU? There
               | are no other virtualized EC2 servers sharing time on that
               | hardware/CPU?
        
         | brazzy wrote:
         | Neither the title nor the article are painting it as an AWS
         | issue, but as a websocket issue, because the protocol
         | implicitly requires all transferred data to be copied multiple
         | times.
        
           | turtlebits wrote:
            | If you call out your vendor, the issue usually lies with
            | something specific to them or their service. The title
            | obviously states AWS.
           | 
            | If I said "childbirth cost us $5000 on our <hospital name>
            | bill", you'd assume the issue is with the hospital.
        
           | bigiain wrote:
           | I disagree. Like @turtlebits, I was waiting for the part of
           | the story where websocket connections between their AWS
           | resources somehow got billed at Amazon's internet data egress
           | rates.
        
           | anitil wrote:
           | I didn't know this - why is this the case?
        
       | londons_explore wrote:
       | They are presumably using the GPU for video encoding....
       | 
       | And the GPU for rendering...
       | 
       | So they should instead just be hooking into Chromium's GPU
       | process and grabbing the pre-composited tiles from the
       | LayerTreeHostImpl[1] and dealing with those.
       | 
       | [1]:
       | https://source.chromium.org/chromium/chromium/src/+/main:cc/...
        
         | isoprophlex wrote:
         | You'd think so but nope, they deliberately run on CPU, as per
         | the article...
        
           | yjftsjthsd-h wrote:
           | > We do our video processing on the CPU instead of on GPU, as
           | GPU availability on the cloud providers has been patchy in
           | the last few years.
           | 
           | I dunno, when we're playing with millions of dollars in costs
           | I hope they're at least regularly evaluating whether they
            | could at least run _some_ of the workload on GPUs for better
            | perf/$.
        
             | londons_explore wrote:
             | And their workload is rendering and video encoding. Using
              | GPUs should have been where they started, even if it
             | limits their choice of cloud providers a little.
        
         | mbb70 wrote:
         | They are very explicit in the article that they run everything
         | on CPUs.
        
         | orf wrote:
         | One of the first parts of the post explains how they are using
         | CPUs only
        
       | cogman10 wrote:
       | This is such a weird way to do things.
       | 
       | Here they have a nicely compressed stream of video data, so they
       | take that stream and... decode it. But they aren't processing the
       | decoded data at the source of the decode, so instead they forward
       | that decoded data, uncompressed(!!), to a different location for
       | processing. Surprisingly, they find out that moving uncompressed
       | video data from one location to another is expensive. So, they
       | compress it later (Don't worry, using a GPU!)
       | 
       | At so many levels this is just WTF. Why not forward the
       | compressed video stream? Why not decompress it where you are
       | processing it instead of in the browser? Why are you writing it
       | without any attempt at compression? Even if you want lossless
        | compression there are well-known and fast algorithms like FFV1
       | for that purpose.
       | 
       | Just weird.
        
         | isoprophlex wrote:
         | Article title should have been "our weird design cost us $1M".
         | 
         | As it turns out, doing something in Rust does not absolve you
         | of the obligation to actually think about what you are doing.
        
           | dylan604 wrote:
            | TFA's opening paragraph: "But it turns out that if you IPC
            | 1TB of video per second on AWS it can result in enormous
            | bills when done inefficiently."
        
         | rozap wrote:
         | Really strange. I wonder why they omitted this. Usually you'd
         | leave it compressed until the last possible moment.
        
           | dylan604 wrote:
           | > Usually you'd leave it compressed until the last possible
           | moment.
           | 
           | Context matters? As someone working in production/post, we
           | want to keep it uncompressed until the last possible moment.
           | At least as far as no more compression than how it was
           | acquired.
        
             | DrammBA wrote:
             | > Context matters?
             | 
             | It does, but you just removed all context from their
             | comment and introduced a completely different context
             | (video production/post) for seemingly no reason.
             | 
             | Going back to the original context, which is grabbing a
             | compressed video stream from a headless browser, the
             | correct approach to handle that compressed stream is to
             | leave it compressed until the last possible moment.
        
               | pavlov wrote:
               | Since they aim to support every meeting platform, they
               | don't necessarily even have the codecs. Platforms like
               | Zoom can and do use custom video formats within their web
               | clients.
               | 
               | With that constraint, letting a full browser engine
               | decode and composite the participant streams is the only
               | option. And it definitely is an expensive way to do it.
        
         | tbarbugli wrote:
         | Possibly because they capture the video from xvfb or similar
         | (they run a headless browser to capture the video) so at that
         | point the decoding already happened (webrtc?)
        
         | bri3d wrote:
         | I think the issue with compression is that they're scraping the
         | online meeting services rather than actually reverse
         | engineering them, so the compressed video stream is hidden
         | inside some kind of black box.
         | 
         | I'm pretty sure that feeding the browser an emulated hardware
         | decoder (ie - write a VAAPI module that just copies compressed
         | frame data for you) would be a good semi-universal solution to
         | this, since I don't think most video chat solutions use DRM
         | like Widevine, but it's not as universal as dumping the
         | framebuffer output off of a browser session.
         | 
         | They could also of course one-off reverse each meeting service
         | to get at the backing stream.
         | 
         | What's odd to me is that even with this frame buffer approach,
         | why would you not just recompress the video at the edge? You
         | could even do it in Javascript with WebCodecs if that was the
         | layer you were living at. Even semi-expensive compression on a
         | modern CPU is going to be way cheaper than copying raw video
         | frames, even just in terms of CPU instruction throughput vs
         | memory bandwidth with shared memory.
         | 
         | It's easy to cast stones, but this is a weird architecture and
         | making this blog post about the "solution" is even stranger to
         | me.
        
       | dbrower wrote:
       | How much did the engineering time to make this optimization cost?
        
       | thadk wrote:
       | Could Arrow be a part of the shared memory solution in another
       | context?
        
       | bauruine wrote:
       | FWIW: The MTU of the loopback interface on Linux is 64KB by
       | default
        
       | beoberha wrote:
       | Classic Hacker News getting hung up on the narrative framing.
       | It's a cool investigation! Nice work guys!
        
       | marcopolo wrote:
       | Masking in the WebSocket protocol is kind of a funny and sad fix
       | to the problem of intermediaries trying to be smart and helpful,
       | but failing miserably.
       | 
        | The linked section of the RFC is worth the read:
        | https://www.rfc-editor.org/rfc/rfc6455#section-10.3
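        | 
        | For reference, the masking the spec mandates is just a per-byte
        | XOR with a 4-byte key, which forces at least one full pass over
        | every payload byte. A minimal sketch in Rust:
        | 
        |     /// RFC 6455 section 5.3: octet i of the payload is XORed
        |     /// with octet (i mod 4) of the masking key.
        |     fn apply_mask(payload: &mut [u8], mask: [u8; 4]) {
        |         for (i, byte) in payload.iter_mut().enumerate() {
        |             *byte ^= mask[i % 4];
        |         }
        |     }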
        
       | jazzyjackson wrote:
       | I for one would like to praise the company for sharing their
       | failure, hopefully next time someone Googles "transport video
        | over websocket" they'll find this thread.
        
       | pier25 wrote:
       | Why were they using websockets to send video in the first place?
       | 
       | Was it because they didn't want to use some multicast video
       | server?
        
       | trollied wrote:
       | >In a typical TCP/IP network connected via ethernet, the standard
       | MTU (Maximum Transmission Unit) is 1500 bytes, resulting in a TCP
       | MSS (Maximum Segment Size) of 1448 bytes. This is much smaller
       | than our 3MB+ raw video frames.
       | 
       | > Even the theoretical maximum size of a TCP/IP packet, 64k, is
       | much smaller than the data we need to send, so there's no way for
       | us to use TCP/IP without suffering from fragmentation.
       | 
       | Just highlights that they do not have enough technical knowledge
       | in house. Should spend the $1m/year saving on hiring some good
       | devs.
        
         | karamanolev wrote:
          | I fail to see how TCP/IP fragmentation really affects this use
          | case. I don't know why it's mentioned, or why it would cause
          | issues given that there aren't multiple network devices with
          | different MTUs. Am I right? Is that the lack of technical
          | knowledge you're referring to, or am I missing something?
        
           | drowsspa wrote:
           | Sounds weird that apparently they expected to send 3 MB in a
           | single TCP packet
        
             | bcrl wrote:
             | Modern NICs will do that for you via a feature called TSO
             | -- TCP Segmentation Offload.
             | 
             | More shocking to me is that anyone would attempt to run
             | network throughput oriented software inside of Chromium.
             | Look at what Cloudflare and Netflix do to get an idea what
             | direction they should really be headed in.
        
         | maxmcd wrote:
         | Please explain?
        
         | hathawsh wrote:
         | Why do you say that? Their solution of using shared memory
         | (structured as a ring buffer) sounds perfect for their use
         | case. Bonus points for using Rust to do it. How would you do
         | it?
         | 
         | Edit: I guess perhaps you're saying that they don't know all
         | the networking configuration knobs they could exercise, and
         | that's probably true. However, they landed on a more optimal
         | solution that avoided networking altogether, so they no longer
         | had any need to research network configuration. I'd say they
         | made the right choice.
        
           | maxmcd wrote:
           | Yes, maybe they're talking about this:
           | https://en.wikipedia.org/wiki/TCP_window_scale_option
        
         | adamrezich wrote:
         | This reminds me of when I was first starting to learn "real
         | game development" (not using someone else's engine)--I was
         | using C#/MonoGame, and, while having no idea what I was doing,
         | decided I wanted vector graphics. I came across libcairo,
         | figured out how to use it, set it all up correctly and
         | everything... and then found that, whoops, sending 1920x1080x4
         | bytes to your GPU to render, 60 times a second, doesn't exactly
         | work--for reasons that were incredibly obvious, in retrospect!
         | At least it didn't cost me a million bucks to learn from my
         | mistake.
        
         | lttlrck wrote:
          | The article reads like a personal "learn by doing" blog post.
        
       | cperciva wrote:
       | _We use atomic operations to update the pointers in a thread-safe
       | manner_
       | 
       | Are you sure about that? Atomics are not locks, and not all
       | systems have strong memory ordering.
        
         | jpc0 wrote:
         | > ... update the pointers ...
         | 
          | Pretty sure the ARM and x86 chips you would be seeing on AWS
          | have strong memory ordering, and have atomic operations that
          | operate on something the size of a single register...
        
           | cperciva wrote:
           | Graviton has weaker memory ordering than amd64. I know this
           | because FreeBSD had a ring buffer which was buggy on
           | Graviton...
        
         | Sesse__ wrote:
         | Rust atomics, like C++ atomics, include memory barriers (the
         | programmer chooses how strong, the compiler/CPU is free to give
         | stronger).
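          | 
          | A minimal illustration of the usual single-producer /
          | single-consumer pattern in Rust (a real ring buffer would wrap
          | the storage in UnsafeCell or raw pointers; this only shows the
          | ordering):
          | 
          |     use std::sync::atomic::{AtomicUsize, Ordering};
          | 
          |     static WRITE_POS: AtomicUsize = AtomicUsize::new(0);
          | 
          |     // Producer: write the payload first, then publish the new
          |     // index with Release.
          |     fn produce(buf: &mut [u8], byte: u8) {
          |         let pos = WRITE_POS.load(Ordering::Relaxed);
          |         buf[pos] = byte;
          |         WRITE_POS.store(pos + 1, Ordering::Release);
          |     }
          | 
          |     // Consumer: observe the index with Acquire before touching
          |     // the data, so this stays correct even on weakly ordered
          |     // CPUs such as ARM/Graviton.
          |     fn consume(buf: &[u8], read_pos: usize) -> Option<u8> {
          |         if read_pos < WRITE_POS.load(Ordering::Acquire) {
          |             Some(buf[read_pos])
          |         } else {
          |             None
          |         }
          |     }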
        
       | gwbas1c wrote:
       | Classic story of a startup taking a "good enough" shortcut and
       | then coming back later to optimize.
       | 
       | ---
       | 
       | I have a similar story: Where I work, we had a cluster of VMs
       | that were always high CPU and a bit of a problem. We had a lot of
       | fire drills where we'd have to bump up the size of the cluster,
       | abort in-progress operations, or some combination of both.
       | 
       | Because this cluster of VMs was doing batch processing that the
       | founder believed should be CPU intense, everyone just assumed
       | that increasing load came with increasing customer size; and that
       | this was just an annoyance that we could get to after we made one
       | more feature.
       | 
       | But, at one point the bean counters pointed out that we spent
       | disproportionately more on cloud than a normal business did.
       | After one round of combining different VM clusters (that really
       | didn't need to be separate servers), I decided that I could take
       | some time to hook up this very CPU intense cluster up to a
       | profiler.
       | 
       | I thought I was going to be in for a 1-2 week project and would
       | follow a few worms. Instead, the CPU load was because we were
       | constantly loading an entire table, that we never deleted from,
       | into the application's process. The table had transient data that
       | should only last a few hours at most.
       | 
       | I quickly deleted almost a decade's worth of obsolete data from
       | the table. After about 15 minutes, CPU usage for this cluster
       | dropped to almost nothing. The next day we made the VM cluster a
       | fraction of its size, and in the next release, we got rid of the
       | cluster and merged the functionality into another cluster.
       | 
       | I also made a pull request that introduced a simple filter to the
       | query to only load 3 days of data; and then introduced a
       | background operation to clean out the table periodically.
        
         | alsetmusic wrote:
         | As much as you can say (perhaps not hard numbers, but as a
         | percentage), what was the savings to the bottom line / cloud
         | costs?
        
           | gwbas1c wrote:
           | Probably ~5% of cloud costs. Combined with the prior round of
           | optimizations, it was substantial.
           | 
           | I was really disappointed when my wife couldn't get the night
           | off from work when the company took everyone out to a fancy
           | steak house.
        
             | chgs wrote:
             | So you saved the company $10k a month and got a $200 meal
             | in gratitude? Awesome.
        
         | antisthenes wrote:
         | 99% of the time, it's either a quadratic (or exponential)
         | algorithm or a really bad DB query.
        
       | wiml wrote:
       | > One complicating factor here is that raw video is surprisingly
       | high bandwidth.
       | 
       | It's weird to be living in a world where this is a _surprise_ but
       | here we are.
       | 
        | Nice write-up though. WebSockets has a number of nonsensical
        | design decisions, but I wouldn't have expected that _this_ is the
        | one that would be chewing up all your CPU.
        
         | handfuloflight wrote:
         | > It's weird to be living in a world where this is a surprise
         | but here we are.
         | 
         | I think it's because the cost of it is so abstracted away with
         | free streaming video all across the web. Once you take a look
         | at the egress and ingress sides you realize how quickly it adds
         | up.
        
         | arccy wrote:
         | I think it's just rare for a lot of people to be handling raw
         | video. Most people interact with highly efficient (lossy)
         | codecs on the web.
        
         | carlhjerpe wrote:
          | I was surprised when calculating and sizing the shared memory
          | for my gaming VM for use with "Looking-Glass". At 165 Hz, 2K
          | HDR is many gigabytes per second; that's why HDMI and
          | DisplayPort are specced really high.
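          | 
          | (Rough, illustrative arithmetic, assuming 2560x1440 at ~4 bytes
          | per pixel: 2560 * 1440 * 4 * 165 is roughly 2.4 GB/s of raw
          | frames, before any blanking or metadata overhead.)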
        
         | sensanaty wrote:
         | I always knew video was "expensive", but my mark for what
         | expensive meant was a good few orders of magnitude off when I
         | researched the topic for a personal project.
         | 
         | I can easily imagine the author being in a similar boat,
         | knowing that it isn't cheap, but then not realizing that
         | expensive in this context truly does mean expensive until they
         | actually started seeing the associated costs.
        
       | IX-103 wrote:
       | Chromium already has a zero-copy IPC mechanism that uses shared
       | memory built-in. It's called Mojo. That's how the various browser
       | processes talk to each other. They could just have passed
       | mojo::BigBuffer messages to their custom.process and not had to
       | worry about platform-specific code.
       | 
       | But writing a custom ring buffer implementation is also nice, I
       | suppose...
        
       | cyberax wrote:
       | Egress fees strike again.
        
       | sfink wrote:
       | ...and this is why I will never start a successful business.
       | 
       | The initial approach was shipping _raw video_ over a _WebSocket_.
       | I could not imagine putting something like that together and
       | selling it. When your first computer came with 64KB in your
        | entire machine, some of which you can't use at all and some you
       | can't use without bank switching tricks, it's really really hard
       | to even conceive of that architecture as a possibility. It's a
       | testament to the power of today's hardware that it worked at all.
       | 
       | And yet, it did work, and it served as the basis for a successful
       | product. They presumably made money from it. The inefficiency
       | sounds like it didn't get in the way of developing and iterating
       | on the rest of the product.
       | 
       | I can't do it. Premature optimization may be the root of all
       | evil, but I can't work without having _some_ sense for how much
       | data is involved and how much moving or copying is happening to
       | it. That sense would make me immediately reject that approach. I
       | 'd go off over-architecting something else before launching, and
       | somebody would get impatient and want their money back.
        
       | apitman wrote:
       | I've been toying around with a design for a real-time chat
       | protocol, and was recently in a debate of WebSockets vs HTTP long
       | polling. This should give me some nice ammunition.
        
         | pavlov wrote:
         | No, this story is about interprocess communication on a single
         | computer, it has practically nothing to do with WebSockets vs
         | something else over an IP network.
        
       ___________________________________________________________________
       (page generated 2024-11-06 23:00 UTC)