[HN Gopher] Corrosion
___________________________________________________________________
Corrosion
Author : cgb_
Score : 156 points
Date : 2025-10-23 11:21 UTC (4 days ago)
(HTM) web link (fly.io)
(TXT) w3m dump (fly.io)
| soamv wrote:
| > New nullable columns are kryptonite to large Corrosion tables:
| cr-sqlite needs to backfill values for every row in the table
|
| Is this a typo? Why does it backfill values for a nullable
| column?
| andrewaylett wrote:
| I assume it _would_ backfill values for any column, as a side-
| effect of propagating values for any column. But nullable
| columns are the only type you can add to a table that already
| contains rows, and mean that every row immediately has an
| update that needs to be sent.
| ricardobeat wrote:
| It seems to be a quirk of cr-sqlite: it wants to keep track of
| clock values for the new column. It's not backfilling the field
| values as far as I understand. There is a comment mentioning it
| could be optimized away:
|
| https://github.com/vlcn-io/cr-sqlite/blob/891fe9e0190dd20917...
| throwaway290 wrote:
| I guess all designers at fly were replaced by ai because this
| article is using gray bold font for the whole text. I remember
| these guys had a good blog some time ago
| foofoo12 wrote:
| It's totally unreadable.
| davidham wrote:
| Looks like it always has, to me.
| dewey wrote:
| Not sure if that was changed since then, but it's not bold for
| me and also readable. Maybe browser rendering?
| ceigey wrote:
| Also not bold for me (Safari). Variable font rendering issue?
| throwaway290 wrote:
| stock safari on ios 26 for me. is it another of 37366153
| regressions of ios 26?
| iviv wrote:
| Looks normal to me on iOS 26.0.1
| throwaway290 wrote:
| stock safari on ios
|
| and I think the intended webfont is loaded because the font
| is clearly weird-ish and non-standard, and the text is
| invisible for a good 2 seconds at first while it loads :)
| mcny wrote:
| Please try the article mode in your web browser. Firefox has a
| pretty good one but I understand all major browsers have this
| now.
| throwaway290 wrote:
| I only use article mode in exceptional cases. I hold fly to
| a higher standard than that.
| tptacek wrote:
| D'awwwwww.
| tptacek wrote:
| The design hasn't changed in years. If someone has a screenshot
| and a browser version we can try to figure out why it's coming
| out fucky for you.
| kg wrote:
| Looking at the css, there's a .text-gray-600 CSS style that
| would cause this, and it's overridden by some other style in
| order to achieve the actual desired appearance. Maybe the
| override style isn't loading - perhaps the GP has javascript
| disabled?
| tptacek wrote:
| Thanks! Relayed.
| throwaway290 wrote:
| javascript is enabled but I don't see the problem on
| another phone, so yeah seems related
| jjtheblunt wrote:
| latest macos firefox and safari both show grey on white -
| legible, if somewhat lacking in contrast, but rendered
| properly as grey on white.
| bananapub wrote:
| in case people don't read all the way to the end, the important
| takeaway is "you simply can't afford to do instant global state
| distribution" - you can formal method and Rust and test and
| watchdog yourself as much as you want, but you simply have to
| stop doing that or the unknown unknowns will just keep taking you
| down.
| tptacek wrote:
| I mean, the thing we're saying is that instant global state
| _with database-style consensus_ is unworkable. Instant state
| distribution though is kind of just... necessary? for a
| platform like ours. You bring up an app in Europe, proxies in
| Asia need to know about it to route to it. So you say, "ok,
| well, they can wait a minute to learn about the app, not the
| end of the world". Now: that same European instance _goes
| down_. Proxies in Asia need to know about that, right away, and
| this time you can't afford to wait.
| __turbobrew__ wrote:
| > Proxies in Asia need to know about that, right away, and
| this time you can't afford to wait.
|
| Did you ever consider envoy xDS?
|
| There are a lot of really cool things in envoy like outlier
| detection, circuit breakers, load shedding, etc...
| tptacek wrote:
| Nope. Talk a little about how Envoy's service discovery
| would scale to millions of apps in a global network?
| There's no way we found the only possible point in the
| solution space. Do they do something clever here?
|
| What we (think we) know won't work is a topologically
| centralized database that uses distributed consensus
| algorithms to synchronize. Running consensus
| transcontinentally is very painful, and keeping the servers
| central, so that update proposals are local and the
| protocol can run quickly, subjects large portions of the
| network to partition risk. The natural response (what I
| think a lot of people do, in fact) is just to run multiple
| consensus clusters, but our UX includes a global namespace
| for customer workloads.
| hedgehog wrote:
| Is it actually necessary to run transcontinental
| consensus? Apps in a given location are not movable so it
| would seem for a given app it's known which part of the
| network writes can come from. That would require
| partitioning the namespace but, given that apps are not
| movable, does that matter? It feels like there are other
| areas like docs and tooling that would benefit from
| relatively higher prioritization.
| tptacek wrote:
| Apps in a given location are _extremely_ movable! That's
| the point of the service!
| hedgehog wrote:
| We unfortunately lost our location with not a whole lot
| of notice and the migration to a new one was not
| seamless, on top of things like the GitHub actions being
| out of date (only supporting the deprecated Postgres
| service, not the new one).
| __turbobrew__ wrote:
| I haven't personally worked on envoy xds, but it is what
| I have seen several BigCo's use for routing from the edge
| to internal applications.
|
| > Running consensus transcontinentally is very painful
|
| You don't necessarily have to do that: you can keep your
| quorum nodes (let's assume we are talking about etcd) far
| enough apart to be in separate failure domains (fires,
| power loss, natural disasters) but close enough that
| network latency isn't unbearably high between the
| replicas.
|
| I have seen the following scheme work for millions of
| workloads:
|
| 1. Etcd quorum across 3 close, but independent regions
|
| 2. On startup, the app registers itself under a prefix
| that all other app replicas also register under
|
| 3. All clients to that app issue etcd watches for that
| prefix and will be notified almost instantly when there
| is a change. This is baked in as a plugin within grpc
| clients.
|
| 4. A custom grpc resolver is used to do lookups by
| service name
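|
| For concreteness, a minimal sketch of that register-and-watch
| pattern using the Rust etcd-client crate (endpoint, key names,
| TTL, and addresses are made up for illustration, not anyone's
| production setup):
|
|     use etcd_client::{Client, PutOptions, WatchOptions};
|
|     #[tokio::main]
|     async fn main() -> Result<(), etcd_client::Error> {
|         let mut client =
|             Client::connect(["http://127.0.0.1:2379"], None).await?;
|
|         // Step 2: register this replica under the app's prefix,
|         // tied to a lease so the key vanishes if the process
|         // stops renewing it.
|         let lease = client.lease_grant(10, None).await?;
|         client
|             .put(
|                 "/services/my-app/replica-1",
|                 "10.0.0.5:8080",
|                 Some(PutOptions::new().with_lease(lease.id())),
|             )
|             .await?;
|
|         // Step 3: clients watch the prefix and hear about changes
|         // almost immediately; a custom resolver (step 4) would
|         // turn these events into an updated address list.
|         let (_watcher, mut stream) = client
|             .watch(
|                 "/services/my-app/",
|                 Some(WatchOptions::new().with_prefix()),
|             )
|             .await?;
|         while let Some(resp) = stream.message().await? {
|             for event in resp.events() {
|                 println!("membership change: {:?}", event.event_type());
|             }
|         }
|         Ok(())
|     }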
| tptacek wrote:
| I'm thrilled to have people digging into this, because I
| think it's a super interesting problem, but: no, keeping
| quorum nodes close-enough-but-not-too-close doesn't solve
| our problem, because we support a unified customer
| namespace that runs from Tokyo to Sydney to Sao Paulo to
| Northern Virginia to London to Frankfurt to Johannesburg.
|
| Two other details that are super important here:
|
| This is a public cloud. There is no real correlation
| between apps/regions and clients. Clients are public
| Internet users. When you bring an app up, it just needs
| to work, for completely random browsers on completely
| random continents. Users can and do move their instances
| (or, more likely, reallocate instances) between regions
| with no notice.
|
| The second detail is that no matter what DX compromise
| you make to scale global consensus up, you still _need_
| reliable realtime update of instances going down. Not
| knowing about a new instance that just came up isn't
| that big a deal! You just get less optimal routing for
| the request. Not knowing that an instance went _down_ is
| a very big deal: you end up routing requests to dead
| instances.
|
| The deployment strategy you're describing is in fact what
| we used to do! We had a Consul cluster in North America
| and ran the global network off it.
| __turbobrew__ wrote:
| > I'm thrilled to have people digging into this, because
| I think it's a super interesting problem
|
| Yes, somehow this is a problem all the big companies
| have, but it seems like there is no standard solution and
| nobody has open sourced their stuff (except you)!
|
| Taking a step back, and thinking about the AWS outage
| last week which was caused by a buggy bespoke system
| built on top of DNS, it seems like we need an IETF
| standard for service discovery. DNS++ if you will. I have
| seen lots of (ab)use of DNS for dynamic service discovery
| and it seems like we need a better solution which is
| either push based or gossip based to more quickly
| disseminate service discovery updates.
| otterley wrote:
| I work for AWS; opinions are my own and I'm not
| affiliated with the service team in question.
|
| That a DNS record was deleted is tangential to the
| proximate cause of the incident. It was a latent bug in
| the _control plane_ that updated the records, not the
| data plane. If the discovery protocol were DNS++ or
| /etc/hosts files, the same problem could have happened.
|
| DNS has a lot of advantages: it's a dirt cheap protocol
| to serve (both in terms of bytes over the wire and CPU
| utilization), is reasonably flexible (new RR types are
| added as needs warrant), isn't filtered by middleboxes,
| has separate positive and negative caching, and server
| implementations are very robust. If you're going to
| replace DNS, you're going to have a steep hill to climb.
| __turbobrew__ wrote:
| > you still need reliable realtime update of instances
| going down
|
| The way I have seen this implemented is through a cluster
| of service watchers that ping all services once every X
| seconds and deregister the service when the pings fail.
|
| Additionally you can use grpc with keepalives which will
| detect on the client side when a service goes down and
| automatically remove it from the subset. Grpc also has
| client side outlier detection so the clients can also
| automatically remove slow servers from the subset as
| well. This only works for grpc though, so not generally
| useful if you are creating a cloud for HTTP servers...
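|
| As a concrete illustration of the keepalive part, here is a
| hedged sketch of how a tonic (Rust gRPC) client channel might be
| configured; the address and timings are invented, and a real
| deployment would tune them:
|
|     use std::time::Duration;
|     use tonic::transport::{Channel, Endpoint};
|
|     #[tokio::main]
|     async fn main() -> Result<(), tonic::transport::Error> {
|         // Hypothetical backend address; the interesting part is
|         // the keepalive configuration, which lets the client
|         // notice a dead server within seconds instead of waiting
|         // for a request to time out.
|         let channel: Channel =
|             Endpoint::from_static("http://10.0.0.5:50051")
|                 .http2_keep_alive_interval(Duration::from_secs(5))
|                 .keep_alive_timeout(Duration::from_secs(2))
|                 .keep_alive_while_idle(true)
|                 .connect_timeout(Duration::from_secs(1))
|                 .connect()
|                 .await?;
|         // `channel` would then be handed to a generated client,
|         // e.g. MyServiceClient::new(channel).
|         let _ = channel;
|         Ok(())
|     }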
| tptacek wrote:
| _Detecting_ that the service went down is easy. Notifying
| every proxy in the fleet that it's down is not. Every
| proxy in the fleet cannot directly probe every
| application on the platform.
| JoachimSchipper wrote:
| (Hopping in here because the discussion is interesting...
| feel very free to ignore.)
|
| Thanks for writing this up! It was a very interesting
| read about a part of networking that I don't get to
| seriously touch.
|
| That said: I'm sure you guys have thought about this a
| lot and that I'm just missing something, but "why can't
| every proxy probe every [worker, not application]?" was
| exactly one of the questions I had while reading.
|
| Having the workers be the source of truth about
| applications is a nicely resilient design, and
| bruteforcing the problem by having, say, 10k proxies each
| retrieve the state of 10k workers every second... may not
| be obviously impossible? Somewhat similar to
| sending/serving 10k DNS requests/s/worker? That's not
| trivial, but maybe not _that_ hard? (You've been working
| on modern Linux servers a lot more than I, but I'm
| thinking of e.g. https://blog.cloudflare.com/how-to-
| receive-a-million-packets...)
|
| I did notice the sentence about "saturating our uplinks",
| but... assuming 1KB=8Kb of compressed critical state per
| worker, you'd end up with a peak bandwidth demand of
| about 80 Mbps of data per worker / per proxy; that may
| not be obviously impossible? (One could reduce _average_
| bandwidth a lot by having the proxies mostly send some
| kind of "send changes since <...>" or "send all data
| unless its hash is <...>" query.)
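|
| (Spelling out that 80 Mbps figure, with all inputs taken from
| the assumptions above:)
|
|     fn main() {
|         // Assumed inputs from the paragraph above: 10k workers,
|         // 1 KB (~8 Kb) of compressed critical state per worker,
|         // polled once per second.
|         let workers = 10_000u64;
|         let bits_per_worker = 8_000u64;
|         let polls_per_second = 1u64;
|
|         let bps = workers * bits_per_worker * polls_per_second;
|         // 10_000 * 8_000 = 80_000_000 b/s = 80 Mb/s per proxy
|         // (and symmetrically per worker, if 10k proxies poll it).
|         println!("{} Mb/s", bps / 1_000_000);
|     }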
|
| (Obviously, bruteforcing the routing table does not get
| you out of doing _something_ more clever than that to
| tell the proxies about new workers joining/leaving the
| pool, and probably a hundred other tasks that I'm
| missing; but, as you imply, not all tasks are equally
| timing-critical.)
|
| The other question I had while reading was why you need
| one failure/replication domain (originally, one global;
| soon, one per-region); if you shard worker state over 100
| gossip (SWIM Corrosion) instances, obviously your proxies
| do need to join every sharded instance to build the
| global routing table - but bugs in replication per se
| should only take down 1/100th of your fleet, which would
| hit fewer customers (and, depending on the exact bug, may
| mean that customers with some redundancy and/or
| autoscaling stay up.) This wouldn't have helped in your
| exact case - perfectly replicating something that takes
| down your proxies - but might make a crash-stop of your
| consensus-ish protocol more tolerable?
|
| Both of the questions above might lead to a less
| convenient programming model, which would be enough reason on
| its own to scupper it; an article isn't necessarily
| improved by discussing every possible alternative; and
| again, I'm sure you guys have thought about this a lot
| more than I did (and/or that I got a couple of things
| embarrassingly wrong). But, well, if you happen to be
| willing to entertain my questions I would appreciate it!
| tptacek wrote:
| Hold up, I sniped Dov into answering this instead of me.
| :)
| DAlperin wrote:
| (I used to work at Fly, specifically on the proxy so my
| info may be slightly out of date, but I've spent a lot of
| time thinking about this stuff.)
|
| > why can't every proxy probe every [worker, not
| application]?
|
| There are several divergent issues with this approach
| (though it can have its place). First, you still need
| _some_ service discovery to tell you where the nodes are,
| though it's easy to assume this can be solved via some
| consul-esque system. Secondly, there is a lot more data at
| play here than you might think. A single
| proxy/host might have many thousands of VMs under its
| purview. That works out to a lot of data. As you point
| out there are ways to solve this:
|
| > One could reduce _average_ bandwidth a lot by having
| the proxies mostly send some kind of "send changes since
| <...>" or "send all data unless its hash is <...>" query.
|
| This is definitely an improvement. But we have a new
| issue. Let's say I have proxies A, B, and C. A and C lose
| connectivity. Optimally (and in fact fly has several
| mechanisms for this) A could send its traffic to C via
| B. But in this case it might not even know that there is
| a VM candidate on C at all! It wasn't able to sync data
| for a while.
|
| There are ways to solve this! We could make it possible
| for proxies to relay each other's state. To recap:
|
| - We have workers that poll each other
|
| - They exchange diffs rather than the full state
|
| - The state diffs can be relayed by other proxies
|
| We have in practice invented something quite close to a
| gossip protocol! If we continued drawing the rest of the
| owl you might end up with something like SWIM.
|
| As far as your second question I think you kinda got it
| exactly. A crash of a single corrosion does not generally
| affect anything else. But if something bad is replicated,
| or there is a gossip storm, isolating that failure is
| important.
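|
| To make the "exchange diffs, relay them onward" idea concrete,
| here's a toy Rust sketch (the types and versioning scheme are
| invented for illustration; real SWIM/Corrosion is far more
| involved):
|
|     use std::collections::HashMap;
|
|     // One entry of routing state, versioned per origin node.
|     #[derive(Clone, Debug)]
|     struct Entry {
|         origin: String,  // node that produced this fact
|         version: u64,    // monotonically increasing per origin
|         payload: String, // e.g. "app X has an instance at 10.0.0.5"
|     }
|
|     #[derive(Default)]
|     struct Node {
|         // Highest version seen per origin: a simple version vector.
|         seen: HashMap<String, u64>,
|         store: Vec<Entry>,
|     }
|
|     impl Node {
|         // Apply entries learned from a peer; anything newer than
|         // what we've seen is kept and becomes eligible for
|         // relaying onward.
|         fn merge(&mut self, diff: Vec<Entry>) -> Vec<Entry> {
|             let mut accepted = Vec::new();
|             for e in diff {
|                 let seen = self.seen.entry(e.origin.clone()).or_insert(0);
|                 if e.version > *seen {
|                     *seen = e.version;
|                     self.store.push(e.clone());
|                     accepted.push(e); // relay these to other peers later
|                 }
|             }
|             accepted
|         }
|
|         // "Send changes since <version vector>": only ship what
|         // the peer lacks.
|         fn diff_for(&self, peer_seen: &HashMap<String, u64>) -> Vec<Entry> {
|             self.store
|                 .iter()
|                 .filter(|e| e.version > *peer_seen.get(&e.origin).unwrap_or(&0))
|                 .cloned()
|                 .collect()
|         }
|     }
|
|     fn main() {
|         let mut a = Node::default();
|         let mut b = Node::default();
|         a.merge(vec![Entry {
|             origin: "worker-1".into(),
|             version: 1,
|             payload: "app X up at 10.0.0.5".into(),
|         }]);
|         // B asks A for everything newer than what B has seen.
|         let diff = a.diff_for(&b.seen);
|         let relay = b.merge(diff);
|         println!("B accepted {} entries to relay onward", relay.len());
|     }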
| JoachimSchipper wrote:
| Thanks a lot for your response!
| __turbobrew__ wrote:
| I believe it is possible within envoy to detect a bad
| backend and automatically remove it from the load
| balancing pool, so why can the proxy not determine that
| certain backend instances are unavailable and remove them
| from the pool? No coordination is needed, and it also handles
| other cases where the backend is bad, such as overload or
| deadlock.
|
| It also seems like part of your pain point is that there
| is an any-to-any relationship between proxy and backend,
| but that doesn't necessarily need to be the case: a cell-
| based architecture with shuffle sharding of backends
| between cells can help alleviate that fundamental pain.
| Part of the advantage of this is that config and code
| changes can then be rolled out cell by cell, which is much
| safer: if your code/configs cause a fault in a cell, it
| will only affect a subset of infrastructure. And if you
| did shuffle sharding correctly, it should have a
| negligible effect when a single cell goes down.
| tptacek wrote:
| Ok, again: this isn't a cluster of load balancers in
| front of a discrete collection of app servers in a data
| center. It's thousands of load balancers handling
| millions of applications scattered all over the world,
| with instances going up and down constantly.
|
| The interesting part of this problem isn't _noticing that
| an instance is down_. Any load balancer can do that. The
| interesting problem is noticing that and then _informing
| every proxy in the world_.
|
| I feel like a lot of what's happening in these threads is
| people using a mental model that they'd use for hosting
| one application globally, or, if not one, then a
| collection of applications they manage. These are
| customer applications. We can't assume anything about
| their request semantics.
| otterley wrote:
| Out of curiosity, what's your upper bound latency SLO for
| propagating this state? (I assume this actually conforms
| to a percentile histogram and isn't a single value.)
| __turbobrew__ wrote:
| > The interesting problem is noticing that and then
| informing every proxy in the world.
|
| Yes, and that is why I suggested that your any-to-any
| relationship of proxy to application is a decision you
| have made, and part of the pain point that caused you
| to come up with this solution. The fact that any proxy
| box can proxy to any backend is a choice which was made,
| and that choice created the structure and mental model
| you are working within. You could batch your proxies
| into, say, 1024 cells and then assign a customer app to,
| say, 4/1024
| cells using shuffle sharding. Then that decomposes the
| problem into maintaining state within a cell instead of
| globally.
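|
| A rough sketch of what shuffle sharding an app onto 4 of 1024
| cells could look like (the hash choice and selection loop are
| purely illustrative, not anyone's real scheme):
|
|     use std::collections::hash_map::DefaultHasher;
|     use std::hash::{Hash, Hasher};
|
|     // Deterministically pick `k` distinct cells out of `n_cells`
|     // for an app by hashing (app, attempt) until we have k unique
|     // picks. DefaultHasher is fine for a sketch; a real system
|     // would use a hash that is stable across builds.
|     fn shuffle_shard(app: &str, n_cells: u64, k: usize) -> Vec<u64> {
|         let mut cells = Vec::with_capacity(k);
|         let mut attempt = 0u64;
|         while cells.len() < k {
|             let mut h = DefaultHasher::new();
|             (app, attempt).hash(&mut h);
|             let cell = h.finish() % n_cells;
|             if !cells.contains(&cell) {
|                 cells.push(cell);
|             }
|             attempt += 1;
|         }
|         cells
|     }
|
|     fn main() {
|         // Each app lands on 4 of 1024 proxy cells; two apps rarely
|         // share all four cells, which is what limits the blast
|         // radius of a single bad cell or a single bad tenant.
|         println!("{:?}", shuffle_shard("my-app", 1024, 4));
|         println!("{:?}", shuffle_shard("other-app", 1024, 4));
|     }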
|
| I'm not saying what you did was wrong or dumb, I am saying
| you are working within a framework that maybe you are not
| even consciously aware of.
| tptacek wrote:
| Again: it's the premise of the platform. If you're saying
| "you picked a hard problem to work on", I guess I agree.
|
| We cannot in fact assign our customers apps to 0.3% of
| our proxies! When you deploy an app in Chicago on Fly.io,
| it has to work from a Sydney edge. I mean, that's part of
| the DX; there are deeper reasons why it would have to
| work that way (due to BGP4), but we don't even get there
| before becoming a different platform.
| justinparus wrote:
| The solutions across different BigCorp Clouds vary
| depending on the SLA from their underlying network. Doing
| this on top of the public internet is very different than on
| redundant subsea fiber with dedicated BigCorp bandwidth!
| otterley wrote:
| Lots of solutions appear to work in a steady-state
| scenario--which, admittedly, is most of the time. The key
| question is how resilient to failure they are, not just
| under blackout conditions but brownouts as well.
|
| Many people will read a comment like this and cargo-cult
| an implementation ("millions of workloads", you say?!)
| without knowing how they are going to handle the many
| different failure modes that can result, or even at what
| scale the solution will break down. Then, when the
| inevitable happens, panic and potentially data loss will
| ensue. Or, the system will eventually reach scaling
| limits that will require a significant architectural
| overhaul to solve.
|
| TL;DR: There isn't a one-size-fits-all solution for most
| distributed consensus problems, especially ones that
| require global consistency and fault tolerance, _and_ on
| top of that have established upper bounds on information
| propagation latency.
| vlovich123 wrote:
| > Now: that same European instance goes down. Proxies in Asia
| need to know about that, right away, and this time you can't
| afford to wait.
|
| But they have to. Physically no solution will be
| instantaneous because that's not how the speed of light nor
| relativity works - even two events next to each other cannot
| find out about each other instantaneously. So then the
| question is "how long can I wait for this information". And
| that's the part that I feel isn't answered - eg if the app
| dies, the TCP connections die and in theory that information
| travels as quickly as anything else you send. It's not
| reliably detectable but conceivably you could have an eBPF
| program monitoring death and notifying the proxies. Thats the
| part that's really not explained in the article which is why
| you need to maintain an eventually consistent view of the
| connectivity. I get maybe why that could be useful but
| noticing app connectivity death seems wrong considering I
| believe you're more tracking machine and cluster health
| right? I.e. not noticing an app instance goes down but noticing
| all app instances on a given machine are gone and consensus
| deciding globally where the new app instance will be as
| quickly as possible?
| tptacek wrote:
| A request routed to a dead instance doesn't fall into a
| black hole: our proxies reroute it. But that's very slow;
| to deliver acceptable service quality you need to minimize
| the number of times that happens. So you can't accept a
| solution that leaves large windows of time within which
| _every_ instance that has gone down has a stale entry.
| Remember: instances coming up and down happens all the time
| on this platform! It's part of the point.
| kflansburg wrote:
| > an if let expression over an RWLock assumed (reasonably, but
| incorrectly) in its else branch that the lock had been released.
| Instant and virulently contagious deadlock.
|
| I believe this behavior is changing in the 2024 edition:
| https://doc.rust-lang.org/edition-guide/rust-2024/temporary-...
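|
| A contrived sketch of the hazard (not Fly's actual code): under
| the 2021 edition, the temporary read guard from the `if let`
| scrutinee lives to the end of the whole `if let`/`else`
| expression, so the `else` branch tries to take the write lock
| while the read lock is still held; under the 2024 edition the
| guard is dropped before the `else` body runs.
|
|     use std::collections::HashMap;
|     use std::sync::RwLock;
|
|     fn get_or_insert(map: &RwLock<HashMap<String, u32>>, key: &str) -> u32 {
|         if let Some(v) = map.read().unwrap().get(key) {
|             *v
|         } else {
|             // Edition 2021: the read guard is still alive here, so
|             // this write() self-deadlocks (or panics, depending on
|             // the lock implementation). Edition 2024: the guard was
|             // already dropped, and this is fine.
|             let mut w = map.write().unwrap();
|             *w.entry(key.to_string()).or_insert(0)
|         }
|     }
|
|     fn main() {
|         let map = RwLock::new(HashMap::new());
|         println!("{}", get_or_insert(&map, "answer"));
|     }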
| kibwen wrote:
| _> I believe this behavior is changing_
|
| Past tense, the 2024 edition stabilized in (and has been the
| default edition for `cargo new` since) Rust 1.85.
| kflansburg wrote:
| Yes, I've already performed the upgrade for my projects, but
| since they hit this bug, I'm guessing they haven't.
| kibwen wrote:
| They may have upgraded by now; their source links to a
| thread from a year ago, prior to the 2024 edition, which
| may be when they encountered that particular bug.
| kflansburg wrote:
| I see now that this incident happened in September 2024
| as well.
| ricardobeat wrote:
| > Like an unattended turkey deep frying on the patio, truly
| global distributed consensus promises deliciousness while
| yielding only immolation
|
| Their writing is so good, always a fun and enlightening read.
| mrbluecoat wrote:
| For the TL;DR folks: https://github.com/superfly/corrosion
| blinkingled wrote:
| > The bidding model is elegant, but it's insufficient to route
| network requests. To allow an HTTP request in Tokyo to find the
| nearest instance in Sydney, we really do need some kind of global
| map of every app we host.
|
| So is this a case of wanting to deliver a differentiating feature
| before the technical maturity is there and validated? It's an
| acceptable strategy if you are building a lesser product but if
| you are selling Public Cloud maybe having a better strategy than
| waiting for problems to crop up makes more sense? Consul, missing
| watchdogs, certificate expiry, CRDT backfilling nullable columns
| - sure, in a normal case these are not very unexpected or to-be-
| ashamed-of problems, but for a product that claims to be Public
| Cloud you want to think of these things and address them before
| day 1. Cert expiry, for example - you should be giving your users
| tools to never have a cert expire - not fixing it for your stuff
| after the fact! (Most CAs offer APIs to automate all this - no
| excuse for it.)
|
| I don't mean to be dismissive or disrespectful; the problem is
| challenging and the work is great. I'm merely thinking of loss of
| customer trust - people are never going to trust a newcomer that
| has issues like this, and for that reason "move fast, break
| things, and fix what you find" isn't a good fit for this kind of
| a product.
| tptacek wrote:
| It's not a "differentiating feature"; it eliminated a scaling
| bottleneck. It's also a decision that long predates Corrosion.
| blinkingled wrote:
| I was referring to the "HTTP request in Tokyo to find the
| nearest instance in Sydney" part which felt to me like a
| differentiating feature - no other cloud provider seems to
| have bidding or HTTP-request-level cross-regional lookup or
| whatever.
|
| The "decision that long predates Corrosion" is precisely the
| point I was trying to make - was it made too soon before
| understanding the ramifications and/or having a validated
| technical solution ready? IOW maybe the feature that required
| solving this problem could have come later? (I don't know much
| about fly.io and its features, so apologies if some of this
| is unclear/wrongly assumes things.)
| tptacek wrote:
| That's literally the premise of the service and always has
| been.
| x0x0 wrote:
| fwiw, I'm happily running a company and some contract
| work on fly: it's literally "aws, but what if it weren't the
| most massively complex pile of shit you've ever seen."
|
| I have a couple reasonably sized, understandable toml
| files and another 100 lines of ruby that runs long-
| running rake tasks as individual fly machines. The whole
| thing works really nicely.
| conradev wrote:
| > To ensure every instance arrives at the same "working set"
| > picture, we use cr-sqlite, the CRDT SQLite extension.
|
| Cool to see cr-sqlite used in production!
| mosura wrote:
| Someone needs to read about ant colony optimization.
| https://en.wikipedia.org/wiki/Ant_colony_optimization_algori...
|
| This blog is not impressive for an infra company.
| tucnak wrote:
| I respect Fly, and it does sound like a nice place to work, but
| honestly, you're onto something. You would expect an ostensibly
| Public Cloud provider to have a more solid grasp on networking.
| Instead, we're discovering how they're learning about things
| like OSPF!
|
| Makes you think, that's all.
| tptacek wrote:
| What a weird thing to say. I wrote my first OSPF
| implementation in 1999. The point is that we noticed the
| solution we'd settled on owes more to protocols like OSPF
| than to distributed consensus databases, which are the
| mainstream solution to this problem. It's not "OMG we just
| discovered this neat protocol called OSPF". We don't actually
| _run_ OSPF. We don't even do a graph->tree reduction. We're
| routing HTTP requests, not packets.
| mosura wrote:
| Look at one of the other comments:
|
| > in case people don't read all the way to the end, the
| important takeaway is "you simply can't afford to do
| instant global state distribution"
|
| This is what people saw as the key takeaway. If that
| takeaway is news to you then I don't know what you are
| doing writing distributed systems.
|
| While this message may not be what was intended, it was what
| was broadcast.
| akerl_ wrote:
| It seems weird to take an inaccurate paraphrase from a
| commenter and then use it to paint the authors with your
| desired brush.
| mosura wrote:
| Not sure the replies to that comment help the cause at
| all.
| nodesocket wrote:
| Anybody used rqlite[1] in production? I'm exploring how to make
| my application fault-tolerant using multiple app vm instances.
| The problem of course is the SQLite database on disk. Using a
| network file system like NFS is a no-go with SQLite (this
| includes Amazon Elastic File System (EFS)).
|
| I was thinking I'll just have to bite the bullet and migrate to
| PostgreSQL, but perhaps rqlite can work.
|
| [1] https://rqlite.io
| otoolep wrote:
| rqlite creator here. Right there on the rqlite homepage[1] are
| listed two production users: replicated.com[2] and
| textgroove.com.
|
| [1] https://rqlite.io/
|
| [2] https://www.replicated.com/blog/app-manager-with-rqlite
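|
| For the parent's use case, a hedged sketch of what talking to a
| single rqlite node over its HTTP API could look like from Rust
| (local address, table, and consistency level are illustrative;
| check the rqlite docs for the authoritative API):
|
|     use serde_json::json;
|
|     fn main() -> Result<(), reqwest::Error> {
|         let client = reqwest::blocking::Client::new();
|
|         // Writes go to /db/execute as a JSON array of statements;
|         // rqlite replicates them via Raft, so the app keeps its
|         // SQLite dialect without sharing a file over NFS/EFS.
|         client
|             .post("http://localhost:4001/db/execute")
|             .json(&json!([
|                 "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)",
|                 "INSERT INTO users(name) VALUES('fiona')"
|             ]))
|             .send()?
|             .error_for_status()?;
|
|         // Reads go to /db/query; `level` picks the read-consistency
|         // mode.
|         let body = client
|             .get("http://localhost:4001/db/query")
|             .query(&[("q", "SELECT * FROM users"), ("level", "weak")])
|             .send()?
|             .text()?;
|         println!("{body}");
|         Ok(())
|     }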
| tucnak wrote:
| What's this obsession with SQLite? For all intents and purposes,
| what they'd accomplished is effectively a Type 2 table with extra
| steps. CRDT is totally overkill in this situation. You can
| implement this in Postgres easily with very few changes to
| your access patterns... DISTINCT ON. Maybe this kind of
| "solution" is impressive for Rust programmers, I'm not sure
| what the deal is exactly, but all it tells me is Fly ought to hire
| actual networking professionals, maybe even compute-in-network
| guys with FPGA experience like everyone else, and develop their
| own routers that way--if only to learn more about networking.
| tptacek wrote:
| What part of this problem do you think FPGAs would help with?
|
| In what sense do you think we need specialty _routers_?
|
| How would you deploy Postgres to address these problems?
| jimmyl02 wrote:
| always wondered at what scale gossip / SWIM breaks down and you
| need a hierarchy / partitioning. fly's use of corrosion seems to
| imply it's good enough for a single region, which is pretty
| surprising because iirc Uber's ringpop was said to face problems
| at around 3K nodes.
|
| it would be super cool to learn more about how the world's
| largest gossip systems work :)
| tptacek wrote:
| SWIM is probably going to scale pretty much indefinitely. The
| issue we have with a single global SWIM broadcast domain isn't
| that the scale is breaking down; it's just that the blast
| radius for bugs (both in Corrosion itself, and in the services
| that depend on Corrosion) is too big.
|
| We're actually keeping the global Corrosion cluster! We're just
| stripping most of the data out of it.
| chucky_z wrote:
| Back-of-napkin math I've done previously: it breaks down around
| 2 million members with Hashicorp's defaults. The defaults are
| quite aggressive though and if you can tolerate seconds of
| latency (called out in the article) you could reach billions
| without a lot of trouble.
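|
| For anyone wanting to redo that kind of napkin math, the standard
| epidemic estimate is that state gossiped to f peers every t
| seconds reaches N members in roughly log_f(N) rounds. The fanout
| and interval below are assumptions for illustration, not
| HashiCorp's actual defaults, and in practice the limits tend to
| come from per-node state and sync bandwidth rather than raw
| propagation latency:
|
|     fn main() {
|         // Epidemic-style estimate: with fanout f and gossip
|         // interval t, new state reaches N members in ~log_f(N)
|         // rounds. Values here are assumptions for illustration.
|         let fanout = 3.0_f64;
|         let interval_ms = 200.0_f64;
|
|         for n in [10_000.0_f64, 2_000_000.0, 1_000_000_000.0] {
|             let rounds = n.ln() / fanout.ln();
|             let latency_s = rounds * interval_ms / 1000.0;
|             println!("N = {:>13}: ~{:.0} rounds, ~{:.1}s to converge",
|                      n as u64, rounds.ceil(), latency_s);
|         }
|     }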
| tptacek wrote:
| It's also frequency of changes and granularity of state, when
| sizing workloads. My understanding is that most Hashi shops
| would federate workloads of our size/global distribution; it
| would be weird to try to run one big cluster to capture
| everything.
| chucky_z wrote:
| From a literal conversation I'm having right now, 'try to
| run one big cluster to capture everything' is our active
| state. I've brought up federation a bunch of times and it's
| fallen on deaf ears. :)
|
| We are probably past the size of the entirety of fly.io for
| reference, and maintenance is very painful. It works
| because we are doing really strange things with Consul
| (batch txn cross-cluster updates of static entries) on
| really, really big servers (4 Gbps+ filesystems, 1 TB memory,
| 100s of big and fast cores, etc).
| anentropic wrote:
| blog posts should have a date at the top
| chrisweekly wrote:
| YES. THIS. ALWAYS!
|
| Huge pet peeve. At least this one has a date somewhere (at the
| bottom, "last updated Oct 22, 2025").
| natebrennand wrote:
| > Finally, let's revisit that global state problem. After the
| contagious deadlock bug, we concluded we need to evolve past a
| single cluster. So we took on a project we call
| "regionalization", which creates a two-level database scheme.
| Each region we operate in runs a Corrosion cluster with fine-
| grained data about every Fly Machine in the region. The global
| cluster then maps applications to regions, which is sufficient to
| make forwarding decisions at our edge proxies.
|
| This tiered approach makes a lot of sense to mitigate the scaling
| limit per corrosion node. Can you share how much data you wind up
| tracking in each tier in practice?
|
| How compact is each entry in the application -> [regions] table?
| Does the constraint of running this on every node mean that this
| creates a global limit on the number of applications? It also
| seems like the region-level database would have a regional limit
| on the number of Fly Machines too?
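|
| For readers skimming, a toy model of the two-level lookup
| described in that quote (types, names, and the routing policy are
| invented; the real thing is replicated Corrosion/SQLite state,
| not in-memory maps):
|
|     use std::collections::HashMap;
|
|     // Global tier: coarse map of app -> regions that host it.
|     type GlobalIndex = HashMap<String, Vec<String>>;
|     // Regional tier: fine-grained map of app -> machine addresses,
|     // one index per region.
|     type RegionIndex = HashMap<String, Vec<String>>;
|
|     // An edge proxy only needs the global tier to pick a region,
|     // then that region's index to pick a concrete Fly Machine.
|     fn route(
|         app: &str,
|         global: &GlobalIndex,
|         regions: &HashMap<String, RegionIndex>,
|     ) -> Option<String> {
|         let candidate_regions = global.get(app)?;
|         // Toy policy: first listed region; a real proxy would weigh
|         // latency, load, instance health, and so on.
|         let region = candidate_regions.first()?;
|         regions.get(region)?.get(app)?.first().cloned()
|     }
|
|     fn main() {
|         let mut global = GlobalIndex::new();
|         global.insert("my-app".into(), vec!["syd".into(), "ord".into()]);
|
|         let mut syd = RegionIndex::new();
|         syd.insert("my-app".into(), vec!["[fdaa::5]:8080".into()]);
|
|         let mut regions = HashMap::new();
|         regions.insert("syd".to_string(), syd);
|
|         println!("{:?}", route("my-app", &global, &regions));
|     }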
___________________________________________________________________
(page generated 2025-10-27 23:01 UTC)