[HN Gopher] Migrating Millions of Concurrent WebSockets to Envoy
___________________________________________________________________
Migrating Millions of Concurrent WebSockets to Envoy
Author : jbredeche
Score : 93 points
Date   : 2021-03-16 14:12 UTC (1 day ago)
(HTM) web link (slack.engineering)
(TXT) w3m dump (slack.engineering)
| endisneigh wrote:
| I wonder if Slack has considered using webrtc to do peer to peer
| chats on the client side and then gathering up the chat metadata
| and having each client periodically send their version of the
| history and reconciling it server side.
|
| This would also have the effect of allowing slack to peer more or
| less normally even if Slack was down (of course bots, search, etc
| wouldn't work).
| ryanianian wrote:
| I suspect there may be regulatory restrictions about allowing
| text-based communications that aren't available during an
| audit.
| ssss11 wrote:
| What regulation do you think would apply? And how/why would
| this regulation differ for e2e encrypted chat products like
| Signal, Telegram, WhatsApp etc that can't access text based
| chat messages?
| zonotope wrote:
| IANAL, but the enterprise companies that make up Slack's
| customer base are often under regulations to preserve their
| employees' official communications in case they are needed
| for future investigations. Those same regulations prevent
| them from using the products you listed as official
| communication channels.
| [deleted]
| toomuchtodo wrote:
| FINRA recordkeeping and retention requirements, as well as
| SEC statute around records and reporting requirements
| (finance industry specific).
| detaro wrote:
| Companies that have such audit requirements do not use
| Signal et al either for internal comms.
| lovedswain wrote:
| It's possible to implement all of this without inheriting the
| additional infrastructure and networking complexity WebRTC
| brings along with it, not forgetting WebRTC still relies on
|   centralized components to coordinate. Don't use WebRTC unless
|   you really need the features it offers: routers in many
|   scenarios hate it, and even where they allow it, the
|   combinatorial explosion in possible configurations to support
|   and diagnose between peers is a problem nobody should willingly
|   invite unless they can't achieve a solution any other way.
|
| With WebRTC you give up the nice ultra-low-common-denominator
| "outbound port 443/TCP needs to work" requirement and replace
| it with "UDP networking generally healthy, possible to
| establish port mappings, possible to maintain stable port
| mappings over time, possible to not have mappings go away due
| to lack of traffic" etc etc
| SahAssar wrote:
| It sounds like the only thing you did was signaling, not STUN
| and TURN.
|
| If you do both STUN and TURN it works on most networks. I've
| worked at really restricted work sites, and while STUN fails
| at those if you have a TURN server then it almost always
| works.
|
|     These sorts of comments are why people think webRTC is
| unstable while the same people use slack calls which
| literally use webRTC.
|
| I might be wrong, but please don't talk about network
| reliability in webRTC without specifying if you have a
| working STUN and/or TURN setup.
| meheleventyone wrote:
| Hah, this is so true. Am building a little hobby project to
| try out WebRTC for game development. On my ISP provided
| router a Mac and Windows computer can't see each other over
|     WiFi due to some mDNS issue, likely the router's support for
|     multicast. With Chrome flags used to turn off mDNS they can
|     connect fine, but that obviously exposes internal IPs. Wire one of
| the machines and mDNS works. TURN is essentially a necessity
| but then why not use a server (particularly for a chat app).
| SahAssar wrote:
| Sounds like you mean STUN, not TURN.
| meheleventyone wrote:
| No, I'm using a STUN server. This issue is unrelated and
| due to the local IPs being masked by mDNS addresses so
| that local network topology isn't leaked to the world at
|         large, and to my router's handling of mDNS. Which is why
| everything works over the local network if I disable mDNS
| use in Chrome. TURN is the ultimate fallback to being
| unable to NAT punch.
|
| Ironically getting machines connected across the internet
| with WebRTC has so far been relatively smooth sailing.
| littlestymaar wrote:
| There's some truth in what you said, but also a few
| exaggerations.
|
| First of all, while WebRTC has its share of complexity when
| using it for videoconferencing, here we are talking about
| using the DataChannel, which is really straightforward to use
| and doesn't need additional infrastructure.
|
| > not forgetting WebRTC still relies on centralized
| components to coordinate
|
|     It needs a centralized component to _set up_ the connection
|     (signaling); if it fails later, your communication channel is
| still up. And the good thing if you have a websocket-based
| chat service, is that you can directly use it for the
| signaling purpose with zero modifications on the back-end
| side.
|
| > routers in many scenarios hate it and even where they allow
| it, the combinatorial explosion in possible configurations to
| support and diagnose between peers is a problem nobody should
| willingly invite unless they can't achieve a solution any
| other way
|
|     When using the DataChannel, your failure mode is _can't
| establish a connection_, not some hard to understand
| Heisenbug. All you need is to provide a centralized fallback
| for clients who cannot establish a connection. This fallback
| will depend on the centralized service being up, but in case
| of failure you'll keep most of your users without disturbance
| (at least in the first world, the network is not as WebRTC
| friendly in other places of the world). And because the
| DataChannel's API is close to the WebSocket's one,
| implementing the fallback is straightforward.
|
| Though, in Slack's situation there is a good reason not to
| use WebRTC: they can have several thousands of people in the
|     same channel (IIRC IBM uses Slack and has most of its
|     employees in a shared channel for official announcements).
|     You won't be able to do that with WebRTC[1] if a user needs
|     to establish a connection with every other user in the
|     channel (there are just not enough ports available). And even
|     worse, back in 2016, Chrome's implementation of the
|     DataChannel was so poor, you could not establish more than a
|     handful of PeerConnections before feeling the browser
|     become sluggish (this wasn't the case in Firefox, so maybe
|     Google has fixed that since then).
|
| Also, Slack's users are likely to be in some enterprise
|     network, which makes WebRTC more likely to fail than when your
|     customers are at home, which reduces the opportunity.
|
| Main takeaway: WebRTC-based chat is probably not a great fit
| for Slack, but don't be afraid of using it: it's not hard, it
| combines well with your already existing centralized
|     infrastructure, and can massively reduce the load on it.
|
| [1] unless you want to build some fancy sparse mesh network,
| but _this_ is likely overengineering.
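
To put a rough number on the full-mesh scaling concern in the comment above, here is a minimal sketch of the connection counts when every channel member connects directly to every other member. The function name and the sample channel sizes are illustrative assumptions, not figures from Slack.

```python
from __future__ import annotations


def full_mesh_links(members: int) -> tuple[int, int]:
    """Return (connections per client, total connections) for a full mesh."""
    if members < 1:
        raise ValueError("members must be at least 1")
    per_client = members - 1                 # one link to every other member
    total = members * (members - 1) // 2     # each link is shared by two peers
    return per_client, total


if __name__ == "__main__":
    # Sample sizes are arbitrary; they only show how quickly the mesh grows.
    for members in (10, 1_000, 5_000):
        per_client, total = full_mesh_links(members)
        print(f"{members} members: {per_client} per client, {total} total")
```

The per-client count grows linearly and the total grows quadratically, which is why a sparse mesh or a central relay becomes attractive for large channels.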
| ex3ndr wrote:
| I am curious about the backend part: is ws still ws at the
| services? Why? For example, why have thousands of connections
| instead of a single one (or a handful) that simply forwards
| websocket packets tagged with some "connection id"?
|
| This way you could restart a service without killing ws
| connections, and move all the overhead of handling millions of
| connections to the lb.
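
The "connection id" forwarding idea in the comment above could look roughly like the following sketch: the load balancer keeps the client WebSockets open and relays each message to the backend over one shared stream, prefixed with the id of the client connection it belongs to. The frame layout (8-byte id, 4-byte length) and the function names are assumptions made for illustration; this is not Slack's or Envoy's wire format.

```python
import struct

# Hypothetical header: 8-byte connection id + 4-byte payload length,
# both unsigned and in network byte order (no padding).
HEADER = struct.Struct("!QI")


def encode_frame(conn_id: int, payload: bytes) -> bytes:
    """Wrap one client message so it can share a single backend stream."""
    return HEADER.pack(conn_id, len(payload)) + payload


def decode_frames(buffer: bytes):
    """Split a byte buffer into (conn_id, payload) frames.

    Returns the complete frames plus any trailing bytes that do not yet
    form a whole frame, so the caller can prepend them to the next read.
    """
    frames = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        conn_id, length = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # payload not fully received yet
        frames.append((conn_id, buffer[offset + HEADER.size:end]))
        offset = end
    return frames, buffer[offset:]
```

With a scheme like this the backend can restart and re-read the shared stream while the client-facing sockets stay open at the load balancer, at the cost of the load balancer having to track per-connection state.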
| jeffbee wrote:
| When you control both client and server it seems like hot restart
| is just a complicated stunt you don't need. Isn't it fine to just
| stop accepting connections, tell all your clients to reconnect,
| and do a normal restart? The frontend load balancer that stands
| between you and Gmail doesn't know how to restart hot but you
| probably never noticed.
| hermanradtke wrote:
| > stop accepting connections, tell all your clients to
| reconnect
|
| This "drain" pattern is great for maintenance, upgrades, etc
| too.
|
| The only caveat is that the clients need to be given time to
| migrate. How long that is depends on how well the clients
| behave. A hot restart may be much faster.
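
The "drain" pattern described above can be sketched as a small in-memory state machine: stop accepting, tell connected clients to move, then restart once they have left or a grace period has expired. `DrainingServer`, its method names, and the grace period are hypothetical; a real server would do this over actual sockets.

```python
class DrainingServer:
    """Toy model of a server that can be drained before a restart."""

    def __init__(self):
        self.accepting = True
        self.clients = set()

    def connect(self, client_id) -> bool:
        """Register a client; refused once draining has started."""
        if not self.accepting:
            return False
        self.clients.add(client_id)
        return True

    def disconnect(self, client_id) -> None:
        self.clients.discard(client_id)

    def start_drain(self):
        """Stop accepting and return the clients that must be told to reconnect."""
        self.accepting = False
        return sorted(self.clients)

    def safe_to_restart(self, elapsed_seconds: float, grace_seconds: float) -> bool:
        """Restart when everyone has left, or when the grace period is over."""
        return not self.clients or elapsed_seconds >= grace_seconds
```

The grace period is the caveat from the comment: well-behaved clients leave quickly, while slow ones hold the restart until the timeout cuts them off.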
| jsiepkes wrote:
| > Isn't it fine to just stop accepting connections, tell all
| your clients to reconnect, and do a normal restart?
|
|   Depends on how many config changes you need (per day).
|
|   Besides, Envoy supports it, and I would call it a bonus if you
|   can reload your configuration without client interruption. As
|   for complicating things, the implementation of hot reload isn't
|   terribly complicated in Envoy.
| jeffbee wrote:
| I'm mentally separating the hot restart part from the
| reloadable configs part, even though they are together in the
| article. To me, not having reloadable configs is too crazy to
| even imagine.
| mbyio wrote:
| I think Slack is different than Gmail because people are
| actively having conversations, so if you disconnect, it is much
| more likely to be noticeable and annoying.
|
| Reading between the lines, I think what they would need is a
| way to tell clients to move to a new websocket connection at
|   the _proxy_ layer. I don't think there is an easy built-in way
| to do this in the websocket protocol, so they would have to
| implement something custom in their application layer. This
| would also require triggering custom code in the client to make
| a new websocket connection, start using it, and then close the
| old connection.
|
| I feel like it would have been simpler to just have the client
| do a graceful reconnect every 5 minutes. But they probably
| decided to use envoy so they could have the other advantages
| too.
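
The client-side handover described above (open a new connection, start using it, then close the old one) might be sketched as follows. `FakeConnection` is a hypothetical stand-in for a real WebSocket object, and `Client.migrate` is an illustrative name, not part of any WebSocket library.

```python
class FakeConnection:
    """Hypothetical stand-in for a WebSocket connection."""

    def __init__(self, name: str):
        self.name = name
        self.is_open = True
        self.sent = []

    def send(self, message: str) -> None:
        if not self.is_open:
            raise ConnectionError(f"{self.name} is closed")
        self.sent.append(message)

    def close(self) -> None:
        self.is_open = False


class Client:
    """Sends through whichever connection is currently active."""

    def __init__(self, connection: FakeConnection):
        self.active = connection

    def send(self, message: str) -> None:
        self.active.send(message)

    def migrate(self, new_connection: FakeConnection) -> FakeConnection:
        """Switch to an already-open connection, then close the old one."""
        old = self.active
        self.active = new_connection  # new traffic uses the new connection
        old.close()                   # the old one is closed only afterwards
        return old
```

Because the new connection is open before the old one closes, there is no window in which the client has nowhere to send; a real implementation would also have to handle messages still in flight on the old connection.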
| zemo wrote:
| maybe it could work, but in practice it's often not as easy as
|   you'd like it to be. disconnecting everyone at or around the
|   same time can easily create a thundering herd or
| a TCP global synchronization problem, so "just ask everyone to
| reconnect" has its own set of complications.
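
One common way to soften the thundering herd mentioned above is for each client to wait a random, exponentially capped delay before reconnecting. The sketch below uses the "full jitter" variant; the function name and the default `base` and `cap` values are assumptions chosen for illustration.

```python
import random


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Return a randomized wait (seconds) before reconnect attempt `attempt`.

    The upper bound doubles with each attempt up to `cap`, and the actual
    delay is drawn uniformly from [0, bound] so clients spread out.
    """
    attempt = max(0, min(attempt, 32))       # clamp to avoid huge exponents
    bound = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, bound)


if __name__ == "__main__":
    rng = random.Random()  # pass a seeded Random for reproducible runs
    for attempt in range(5):
        print(f"attempt {attempt}: wait {reconnect_delay(attempt, rng=rng):.2f}s")
```

Jitter spreads out reconnects that would otherwise arrive together, but it does not remove the extra load of re-establishing every session, so the comment's point about complications still stands.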
| jayd16 wrote:
| The Gmail load balancer has to do a cold restart to add or
| remove an instance? That's the requirement they placed on
| themselves because they do not trust the runtime HAProxy api.
| theflyinghorse wrote:
| If they do not trust HAProxy runtime API then why are they
| using HAProxy at all?
| vad_ wrote:
| Haproxy itself is a solid piece of software. The runtime
| API is something they added on top of it because of
| competitors (envoy).
| [deleted]
| forgotmypw17 wrote:
| Is there a plan to migrate to an open protocol or non-crappy
| client?
___________________________________________________________________
(page generated 2021-03-17 23:00 UTC)