[HN Gopher] Corrosion
       ___________________________________________________________________
        
       Corrosion
        
       Author : cgb_
       Score  : 156 points
       Date   : 2025-10-23 11:21 UTC (4 days ago)
        
 (HTM) web link (fly.io)
 (TXT) w3m dump (fly.io)
        
       | soamv wrote:
       | > New nullable columns are kryptonite to large Corrosion tables:
       | cr-sqlite needs to backfill values for every row in the table
       | 
       | Is this a typo? Why does it backfill values for a nullable
       | column?
        
         | andrewaylett wrote:
         | I assume it _would_ backfill values for any column, as a side-
         | effect of propagating values for any column. But nullable
         | columns are the only type you can add to a table that already
          | contains rows, so adding one means every row immediately has an
         | update that needs to be sent.
        
         | ricardobeat wrote:
          | It seems to be a quirk of cr-sqlite: it wants to keep track of
         | clock values for the new column. It's not backfilling the field
         | values as far as I understand. There is a comment mentioning it
         | could be optimized away:
         | 
         | https://github.com/vlcn-io/cr-sqlite/blob/891fe9e0190dd20917...
        
       | throwaway290 wrote:
       | I guess all designers at fly were replaced by ai because this
       | article is using gray bold font for the whole text. I remember
        | these guys had a good blog some time ago
        
         | foofoo12 wrote:
         | It's totally unreadable.
        
           | davidham wrote:
           | Looks like it always has, to me.
        
         | dewey wrote:
         | Not sure if that was changed since then, but it's not bold for
         | me and also readable. Maybe browser rendering?
        
           | ceigey wrote:
           | Also not bold for me (Safari). Variable font rendering issue?
        
             | throwaway290 wrote:
             | stock safari on ios 26 for me. is it another of 37366153
             | regressions of ios 26?
        
               | iviv wrote:
               | Looks normal to me on iOS 26.0.1
        
           | throwaway290 wrote:
           | stock safari on ios
           | 
           | and I think the intended webfont is loaded because the font
            | is clearly weird-ish and non-standard, and the text is
            | invisible for a good 2 seconds at first while it loads :)
        
         | mcny wrote:
         | Please try the article mode in your web browser. Firefox has a
         | pretty good one but I understand all major browsers have this
         | now.
        
           | throwaway290 wrote:
           | I only use article mode in exceptional cases. I hold fly to
            | a higher standard than that.
        
             | tptacek wrote:
             | D'awwwwww.
        
         | tptacek wrote:
         | The design hasn't changed in years. If someone has a screenshot
         | and a browser version we can try to figure out why it's coming
         | out fucky for you.
        
           | kg wrote:
           | Looking at the css, there's a .text-gray-600 CSS style that
           | would cause this, and it's overridden by some other style in
           | order to achieve the actual desired appearance. Maybe the
           | override style isn't loading - perhaps the GP has javascript
           | disabled?
        
             | tptacek wrote:
             | Thanks! Relayed.
        
             | throwaway290 wrote:
             | javascript is enabled but I don't see the problem on
             | another phone, so yeah seems related
        
         | jjtheblunt wrote:
          | latest macOS Firefox and Safari both show grey on white:
          | legible, though contrast is somewhat lacking, but rendered
          | properly.
        
       | bananapub wrote:
       | in case people don't read all the way to the end, the important
       | takeaway is "you simply can't afford to do instant global state
       | distribution" - you can formal method and Rust and test and
       | watchdog yourself as much as you want, but you simply have to
       | stop doing that or the unknown unknowns will just keep taking you
       | down.
        
         | tptacek wrote:
         | I mean, the thing we're saying is that instant global state
         | _with database-style consensus_ is unworkable. Instant state
         | distribution though is kind of just... necessary? for a
         | platform like ours. You bring up an app in Europe, proxies in
         | Asia need to know about it to route to it. So you say,  "ok,
         | well, they can wait a minute to learn about the app, not the
         | end of the world". Now: that same European instance _goes
         | down_. Proxies in Asia need to know about that, right away, and
          | this time you can't afford to wait.
        
           | __turbobrew__ wrote:
           | > Proxies in Asia need to know about that, right away, and
           | this time you can't afford to wait.
           | 
           | Did you ever consider envoy xDS?
           | 
           | There are a lot of really cool things in envoy like outlier
           | detection, circuit breakers, load shedding, etc...
        
             | tptacek wrote:
              | Nope. Talk a little about how Envoy's service discovery
             | would scale to millions of apps in a global network?
             | There's no way we found the only possible point in the
             | solution space. Do they do something clever here?
             | 
             | What we (think we) know won't work is a topologically
             | centralized database that uses distributed consensus
             | algorithms to synchronize. Running consensus
              | transcontinentally is very painful, and keeping the servers
             | central, so that update proposals are local and the
             | protocol can run quickly, subjects large portions of the
             | network to partition risk. The natural response (what I
             | think a lot of people do, in fact) is just to run multiple
             | consensus clusters, but our UX includes a global namespace
             | for customer workloads.
        
               | hedgehog wrote:
               | Is it actually necessary to run transcontinental
               | consensus? Apps in a given location are not movable so it
               | would seem for a given app it's known which part of the
               | network writes can come from. That would require
               | partitioning the namespace but, given that apps are not
                | movable, does that matter? It feels like there are other
               | areas like docs and tooling that would benefit from
               | relatively higher prioritization.
        
               | tptacek wrote:
                | Apps in a given location are _extremely_ movable! That's
               | the point of the service!
        
               | hedgehog wrote:
               | We unfortunately lost our location with not a whole lot
               | of notice and the migration to a new one was not
               | seamless, on top of things like the GitHub actions being
               | out of date (only supporting the deprecated Postgres
               | service, not the new one).
        
               | __turbobrew__ wrote:
               | I haven't personally worked on envoy xds, but it is what
                | I have seen several BigCos use for routing from the edge
               | to internal applications.
               | 
               | > Running consensus transcontinentally is very painful
               | 
               | You don't necessarily have to do that, you can keep your
                | quorum nodes (let's assume we are talking about etcd) far
               | enough apart to be in separate failure domains (fires,
               | power loss, natural disasters) but close enough that
               | network latency isn't unbearably high between the
               | replicas.
               | 
               | I have seen the following scheme work for millions of
               | workloads:
               | 
               | 1. Etcd quorum across 3 close, but independent regions
               | 
                | 2. On startup, the app registers itself under a prefix
                | that all other replicas of the app also register under
               | 
                | 3. All clients of that app issue etcd watches for that
               | prefix and almost instantly will be notified when there
               | is a change. This is baked as a plugin within grpc
               | clients.
               | 
               | 4. A custom grpc resolver is used to do lookups by
               | service name
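                | 
                | A rough sketch of steps 2 and 3 above in Rust, assuming
                | the etcd-client crate's put/watch API; the keys,
                | addresses, and error handling here are hypothetical:
                | 
                |     use etcd_client::{Client, WatchOptions};
                | 
                |     #[tokio::main]
                |     async fn main() -> Result<(), etcd_client::Error> {
                |         let mut c =
                |             Client::connect(["etcd:2379"], None).await?;
                |         // 2. register this replica under the app prefix
                |         c.put("/apps/web/replica-1", "10.0.0.5:80", None)
                |             .await?;
                |         // 3. clients watch the prefix; puts and deletes
                |         // show up almost immediately
                |         let opts = Some(WatchOptions::new().with_prefix());
                |         let (_watcher, mut stream) =
                |             c.watch("/apps/web/", opts).await?;
                |         while let Some(resp) = stream.message().await? {
                |             for ev in resp.events() {
                |                 println!("{:?} {:?}", ev.event_type(),
                |                     ev.kv().map(|kv| kv.key_str()));
                |             }
                |         }
                |         Ok(())
                |     }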
        
               | tptacek wrote:
               | I'm thrilled to have people digging into this, because I
               | think it's a super interesting problem, but: no, keeping
               | quorum nodes close-enough-but-not-too-close doesn't solve
               | our problem, because we support a unified customer
               | namespace that runs from Tokyo to Sydney to Sao Paulo to
               | Northern Virginia to London to Frankfurt to Johannesburg.
               | 
               | Two other details that are super important here:
               | 
               | This is a public cloud. There is no real correlation
               | between apps/regions and clients. Clients are public
               | Internet users. When you bring an app up, it just needs
               | to work, for completely random browsers on completely
               | random continents. Users can and do move their instances
               | (or, more likely, reallocate instances) between regions
               | with no notice.
               | 
               | The second detail is that no matter what DX compromise
               | you make to scale global consensus up, you still _need_
               | reliable realtime update of instances going down. Not
                | knowing about a new instance that just came up isn't
               | that big a deal! You just get less optimal routing for
               | the request. Not knowing that an instance went _down_ is
               | a very big deal: you end up routing requests to dead
               | instances.
               | 
               | The deployment strategy you're describing is in fact what
               | we used to do! We had a Consul cluster in North America
               | and ran the global network off it.
        
               | __turbobrew__ wrote:
               | > I'm thrilled to have people digging into this, because
               | I think it's a super interesting problem
               | 
               | Yes, somehow this is a problem all the big companies
               | have, but it seems like there is no standard solution and
               | nobody has open sourced their stuff (except you)!
               | 
               | Taking a step back, and thinking about the AWS outage
               | last week which was caused by a buggy bespoke system
               | built on top of DNS, it seems like we need an IETF
               | standard for service discovery. DNS++ if you will. I have
               | seen lots of (ab)use of DNS for dynamic service discovery
               | and it seems like we need a better solution which is
               | either push based or gossip based to more quickly
               | disseminate service discovery updates.
        
               | otterley wrote:
               | I work for AWS; opinions are my own and I'm not
               | affiliated with the service team in question.
               | 
               | That a DNS record was deleted is tangential to the
               | proximate cause of the incident. It was a latent bug in
               | the _control plane_ that updated the records, not the
               | data plane. If the discovery protocol were DNS++ or
               | /etc/hosts files, the same problem could have happened.
               | 
               | DNS has a lot of advantages: it's a dirt cheap protocol
               | to serve (both in terms of bytes over the wire and CPU
               | utilization), is reasonably flexible (new RR types are
               | added as needs warrant), isn't filtered by middleboxes,
               | has separate positive and negative caching, and server
                | implementations are very robust. If you're going to
               | replace DNS, you're going to have a steep hill to climb.
        
               | __turbobrew__ wrote:
               | > you still need reliable realtime update of instances
               | going down
               | 
               | The way I have seen this implemented is through a cluster
                | of service watchers that ping all services once every X
               | seconds and deregister the service when the pings fail.
               | 
               | Additionally you can use grpc with keepalives which will
               | detect on the client side when a service goes down and
               | automatically remove it from the subset. Grpc also has
               | client side outlier detection so the clients can also
               | automatically remove slow servers from the subset as
               | well. This only works for grpc though, so not generally
               | useful if you are creating a cloud for HTTP servers...
        
               | tptacek wrote:
               | _Detecting_ that the service went down is easy. Notifying
                | every proxy in the fleet that it's down is not. Every
               | proxy in the fleet cannot directly probe every
               | application on the platform.
        
               | JoachimSchipper wrote:
               | (Hopping in here because the discussion is interesting...
               | feel very free to ignore.)
               | 
               | Thanks for writing this up! It was a very interesting
               | read about a part of networking that I don't get to
               | seriously touch.
               | 
               | That said: I'm sure you guys have thought about this a
               | lot and that I'm just missing something, but "why can't
               | every proxy probe every [worker, not application]?" was
               | exactly one of the questions I had while reading.
               | 
               | Having the workers being the source-of-truth about
               | applications is a nicely resilient design, and
               | bruteforcing the problem by having, say 10k proxies each
               | retrieve the state of 10k workers every second... may not
               | be obviously impossible? Somewhat similar to
               | sending/serving 10k DNS requests/s/worker? That's not
               | trivial, but maybe not _that_ hard? (You've been working
               | on modern Linux servers a lot more than I, but I'm
               | thinking of e.g. https://blog.cloudflare.com/how-to-
               | receive-a-million-packets...)
               | 
               | I did notice the sentence about "saturating our uplinks",
               | but... assuming 1KB=8Kb of compressed critical state per
               | worker, you'd end up with a peak bandwidth demand of
               | about 80 Mbps of data per worker / per proxy; that may
               | not be obviously impossible? (One could reduce _average_
               | bandwidth a lot by having the proxies mostly send some
               | kind of "send changes since <...>" or "send all data
               | unless its hash is <...>" query.)
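                | 
                | (A back-of-the-envelope check of that figure, as a
                | quick Rust calculation; all numbers are the assumed
                | ones above, not measurements:)
                | 
                |     fn main() {
                |         let proxies: u64 = 10_000;   // polling proxies
                |         let state_bits: u64 = 8_000; // ~1 KB per worker
                |         // each proxy polls each worker once per second,
                |         // so each worker answers 10k requests per second
                |         let per_worker_bps = proxies * state_bits;
                |         println!("{} Mbps", per_worker_bps / 1_000_000);
                |         // => 80 Mbps served per worker, and the same
                |         //    80 Mbps received per proxy
                |     }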
               | 
               | (Obviously, bruteforcing the routing table does not get
               | you out of doing _something_ more clever than that to
               | tell the proxies about new workers joining/leaving the
               | pool, and probably a hundred other tasks that I'm
               | missing; but, as you imply, not all tasks are equally
               | timing-critical.)
               | 
               | The other question I had while reading was why you need
               | one failure/replication domain (originally, one global;
               | soon, one per-region); if you shard worker state over 100
               | gossip (SWIM Corrosion) instances, obviously your proxies
               | do need to join every sharded instance to build the
               | global routing table - but bugs in replication per se
               | should only take down 1/100th of your fleet, which would
               | hit fewer customers (and, depending on the exact bug, may
               | mean that customers with some redundancy and/or
               | autoscaling stay up.) This wouldn't have helped in your
               | exact case - perfectly replicating something that takes
               | down your proxies - but might make a crash-stop of your
               | consensus-ish protocol more tolerable?
               | 
               | Both of the questions above might lead to a less
                | convenient programming model, which would be enough reason on
               | its own to scupper it; an article isn't necessarily
               | improved by discussing every possible alternative; and
               | again, I'm sure you guys have thought about this a lot
               | more than I did (and/or that I got a couple of things
                | embarrassingly wrong). But, well, if you happen to be
               | willing to entertain my questions I would appreciate it!
        
               | tptacek wrote:
               | Hold up, I sniped Dov into answering this instead of me.
               | :)
        
               | DAlperin wrote:
               | (I used to work at Fly, specifically on the proxy so my
               | info may be slightly out of date, but I've spent a lot of
               | time thinking about this stuff.)
               | 
               | > why can't every proxy probe every [worker, not
               | application]?
               | 
               | There are several divergent issues with this approach
                | (though it can have its place). First, you still need
               | _some_ service discovery to tell you where the nodes are,
               | though it's easy to assume this can be solved via some
               | consul-esque system. Secondly, there is a lot more data
               | than you might be thinking at play here. A single
               | proxy/host might have many thousands of VMs under its
               | purview. That works out to a lot of data. As you point
               | out there are ways to solve this:
               | 
               | > One could reduce _average_ bandwidth a lot by having
               | the proxies mostly send some kind of "send changes since
               | <...>" or "send all data unless its hash is <...>" query.
               | 
               | This is definitely an improvement. But we have a new
                | issue. Let's say I have proxies A, B, and C. A and C lose
               | connectivity. Optimally (and in fact fly has several
                | mechanisms for this) A could send its traffic to C via
               | B. But in this case it might not even know that there is
               | a VM candidate on C at all! It wasn't able to sync data
               | for a while.
               | 
                | There are ways to solve this! We could make it possible
                | for proxies to relay each other's state. To recap:
                | 
                | - We have workers that poll each other
                | 
                | - They exchange diffs rather than the full state
                | 
                | - The state diffs can be relayed by other proxies
               | 
               | We have in practice invented something quite close to a
               | gossip protocol! If we continued drawing the rest of the
               | owl you might end up with something like SWIM.
               | 
               | As far as your second question I think you kinda got it
               | exactly. A crash of a single corrosion does not generally
               | affect anything else. But if something bad is replicated,
               | or there is a gossip storm, isolating that failure is
               | important.
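                | 
                | (To make the "send all data unless its hash is <...>"
                | step concrete, a toy version in Rust; everything here
                | is hypothetical, not Fly's code:)
                | 
                |     use std::collections::BTreeMap;
                |     use std::collections::hash_map::DefaultHasher;
                |     use std::hash::{Hash, Hasher};
                | 
                |     type State = BTreeMap<String, String>; // vm -> addr
                | 
                |     fn digest(s: &State) -> u64 {
                |         let mut h = DefaultHasher::new();
                |         s.hash(&mut h);
                |         h.finish()
                |     }
                | 
                |     // A proxy asks a worker: "send nothing if your
                |     // state still hashes to `seen`, otherwise send it
                |     // all", trading bandwidth for a round trip.
                |     fn poll(worker: &State, seen: u64) -> Option<State> {
                |         if digest(worker) == seen {
                |             None
                |         } else {
                |             Some(worker.clone())
                |         }
                |     }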
        
               | JoachimSchipper wrote:
               | Thanks a lot for your response!
        
               | __turbobrew__ wrote:
               | I believe it is possible within envoy to detect a bad
               | backend and automatically remove it from the load
               | balancing pool, so why can the proxy not determine that
               | certain backend instances are unavailable and remove them
               | from the pool? No coordination needed and it also handles
               | other cases where the backend is bad such as overload or
               | deadlock?
               | 
               | It also seems like part of your pain point is that there
               | is an any-to-any relationship between proxy and backend,
                | but that doesn't need to be the case necessarily; cell-
                | based architecture with shuffle sharding of backends
               | between cells can help alleviate that fundamental pain.
               | Part of the advantage of this is that config and code
                | changes can then be rolled out cell by cell, which is much
                | safer: if your code/configs cause a fault in a cell it
               | will only affect a subset of infrastructure. And if you
               | did shuffle sharding correctly, it should have a
                | negligible effect when a single cell goes down.
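                | 
                | (For illustration, a minimal shuffle-shard assignment
                | in Rust, picking 4 of 1024 cells per app; the numbers
                | and hashing are hypothetical:)
                | 
                |     use std::collections::hash_map::DefaultHasher;
                |     use std::hash::{Hash, Hasher};
                | 
                |     const CELLS: u64 = 1024;
                | 
                |     // Deterministically map an app name to 4 distinct
                |     // cells, so any two apps share few cells.
                |     fn shard(app: &str) -> Vec<u64> {
                |         let mut cells = Vec::new();
                |         let mut salt = 0u64;
                |         while cells.len() < 4 {
                |             let mut h = DefaultHasher::new();
                |             (app, salt).hash(&mut h);
                |             let cell = h.finish() % CELLS;
                |             if !cells.contains(&cell) {
                |                 cells.push(cell);
                |             }
                |             salt += 1;
                |         }
                |         cells
                |     }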
        
               | tptacek wrote:
               | Ok, again: this isn't a cluster of load balancers in
               | front of a discrete collection of app servers in a data
               | center. It's thousands of load balancers handling
               | millions of applications scattered all over the world,
               | with instances going up and down constantly.
               | 
               | The interesting part of this problem isn't _noticing that
               | an instance is down_. Any load balancer can do that. The
                | interesting problem is noticing that and then _informing
               | every proxy in the world_.
               | 
               | I feel like a lot of what's happening in these threads is
               | people using a mental model that they'd use for hosting
               | one application globally, or, if not one, then a
               | collection of applications they manage. These are
               | customer applications. We can't assume anything about
               | their request semantics.
        
               | otterley wrote:
               | Out of curiosity, what's your upper bound latency SLO for
               | propagating this state? (I assume this actually conforms
               | to a percentile histogram and isn't a single value.)
        
               | __turbobrew__ wrote:
                | > The interesting problem is noticing that and then
               | informing every proxy in the world.
               | 
                | Yes, and that is why I suggested that your any-to-any
                | relationship of proxy to application is a decision you
                | have made which is part of the pain point that caused you
                | to come up with this solution. The fact that any proxy
               | box can proxy to any backend is a choice which was made
               | which created the structure and mental model you are
               | working within. You could batch your proxies into say
               | 1024 cells and then assign a customer app to say 4/1024
               | cells using shuffle sharding. Then that decomposes the
               | problem into maintaining state within a cell instead of
               | globally.
               | 
                | I'm not saying what you did was wrong or dumb, I am saying
               | you are working within a framework that maybe you are not
               | even consciously aware of.
        
               | tptacek wrote:
               | Again: it's the premise of the platform. If you're saying
               | "you picked a hard problem to work on", I guess I agree.
               | 
                | We cannot in fact assign our customers' apps to 0.3% of
               | our proxies! When you deploy an app in Chicago on Fly.io,
               | it has to work from a Sydney edge. I mean, that's part of
               | the DX; there are deeper reasons why it would have to
               | work that way (due to BGP4), but we don't even get there
               | before becoming a different platform.
        
               | justinparus wrote:
               | The solutions across different BigCorp Clouds varies
               | depending on the SLA from their underlying network. Doing
                | this on top of the public internet is very different than on
               | redundant subsea fiber with dedicated BigCorp bandwidth!
        
               | otterley wrote:
               | Lots of solutions appear to work in a steady-state
               | scenario--which, admittedly, is most of the time. The key
               | question is how resilient to failure they are, not just
               | under blackout conditions but brownouts as well.
               | 
               | Many people will read a comment like this and cargo-cult
               | an implementation ("millions of workloads", you say?!)
               | without knowing how they are going to handle the many
               | different failure modes that can result, or even at what
               | scale the solution will break down. Then, when the
               | inevitable happens, panic and potentially data loss will
               | ensue. Or, the system will eventually reach scaling
               | limits that will require a significant architectural
               | overhaul to solve.
               | 
               | TL;DR: There isn't a one-size-fits-all solution for most
               | distributed consensus problems, especially ones that
               | require global consistency and fault tolerance, _and_ on
               | top of that have established upper bounds on information
               | propagation latency.
        
           | vlovich123 wrote:
           | > Now: that same European instance goes down. Proxies in Asia
           | need to know about that, right away, and this time you can't
           | afford to wait.
           | 
           | But they have to. Physically no solution will be
           | instantaneous because that's not how the speed of light nor
           | relativity works - even two events next to each other cannot
           | find out about each other instantaneously. So then the
           | question is "how long can I wait for this information". And
           | that's the part that I feel isn't answered - eg if the app
           | dies, the TCP connections die and in theory that information
           | travels as quickly as anything else you send. It's not
           | reliably detectable but conceivably you could have an eBPF
            | program monitoring death and notifying the proxies. That's the
           | part that's really not explained in the article which is why
           | you need to maintain an eventually consistent view of the
           | connectivity. I get maybe why that could be useful but
           | noticing app connectivity death seems wrong considering I
           | believe you're more tracking machine and cluster health
           | right? Ie not noticing an app instance goes down but noticing
           | all app instances on a given machine are gone and consensus
           | deciding globally where the new app instance will be as
           | quickly as possible?
        
             | tptacek wrote:
             | A request routed to a dead instance doesn't fall into a
             | black hole: our proxies reroute it. But that's very slow;
             | to deliver acceptable service quality you need to minimize
             | the number of times that happens. So you can't accept a
             | solution that leaves large windows of time within which
             | _every_ instance that has gone down has a stale entry.
             | Remember: instances coming up and down happens all the time
              | on this platform! It's part of the point.
        
       | kflansburg wrote:
       | > an if let expression over an RWLock assumed (reasonably, but
       | incorrectly) in its else branch that the lock had been released.
       | Instant and virulently contagious deadlock.
       | 
       | I believe this behavior is changing in the 2024 edition:
       | https://doc.rust-lang.org/edition-guide/rust-2024/temporary-...
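        | 
        | A minimal reproduction of the footgun (not Corrosion's actual
        | code): under the 2021 edition the temporary read guard in the
        | scrutinee lives through the else branch, so the write() below
        | can deadlock; the 2024 edition drops it before the else body
        | runs.
        | 
        |     use std::sync::RwLock;
        | 
        |     fn refresh(cache: &RwLock<Option<String>>) {
        |         if let Some(v) = cache.read().unwrap().as_deref() {
        |             println!("cached: {v}");
        |         } else {
        |             // Rust 2021: the read guard is still held here,
        |             // so taking the write lock can block forever.
        |             *cache.write().unwrap() = Some("fresh".into());
        |         }
        |     }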
        
         | kibwen wrote:
         | _> I believe this behavior is changing_
         | 
         | Past tense, the 2024 edition stabilized in (and has been the
         | default edition for `cargo new` since) Rust 1.85.
        
           | kflansburg wrote:
           | Yes, I've already performed the upgrade for my projects, but
           | since they hit this bug, I'm guessing they haven't.
        
             | kibwen wrote:
              | They may have upgraded by now; their source links to a
             | thread from a year ago, prior to the 2024 edition, which
             | may be when they encountered that particular bug.
        
               | kflansburg wrote:
               | I see now that this incident happened in September 2024
               | as well.
        
       | ricardobeat wrote:
       | > Like an unattended turkey deep frying on the patio, truly
       | global distributed consensus promises deliciousness while
       | yielding only immolation
       | 
       | Their writing is so good, always a fun and enlightening read.
        
       | mrbluecoat wrote:
       | For the TL;DR folks: https://github.com/superfly/corrosion
        
       | blinkingled wrote:
       | > The bidding model is elegant, but it's insufficient to route
       | network requests. To allow an HTTP request in Tokyo to find the
       | nearest instance in Sydney, we really do need some kind of global
       | map of every app we host.
       | 
       | So is this a case of wanting to deliver a differentiating feature
       | before the technical maturity is there and validated? It's an
       | acceptable strategy if you are building a lesser product but if
       | you are selling Public Cloud maybe having a better strategy than
       | waiting for problems to crop up makes more sense? Consul, missing
        | watchdogs, certificate expiry, CRDT backfilling nullable
       | - sure in a normal case these are not very unexpected or to-be-
       | ashamed-of problems but for a product that claims to be Public
       | Cloud you want to think of these things and address them before
       | day 1. Cert expiry for example - you should be giving your users
       | tools to never have a cert expire - not fixing it for your stuff
       | after the fact! (Most CAs offer API to automate all this - no
       | excuse for it.)
       | 
        | I don't mean to be dismissive or disrespectful; the problem is
        | challenging and the work is great - merely thinking of loss of
        | customer trust - people are never going to trust a newcomer that
        | has issues like this, and for that reason "move fast, break
        | things, and fix what you find" isn't a good fit for this kind of
        | a product.
        
         | tptacek wrote:
         | It's not a "differentiating feature"; it eliminated a scaling
         | bottleneck. It's also a decision that long predates Corrosion.
        
           | blinkingled wrote:
           | I was referring to the "HTTP request in Tokyo to find the
           | nearest instance in Sydney" part which felt to me like a
           | differentiating feature- no other cloud provider seems to
           | have bidding or HTTP request level cross regional lookup or
           | whatever.
           | 
           | The "decision that long predates Corrosion" is precisely the
           | point I was trying to make - was it made too soon before
           | understanding the ramifications and/or having a validated
            | technical solution ready? IOW, maybe the feature requiring
            | this solution could have come later? (I don't know much
           | about fly.io and its features, so apologies if some of this
           | is unclear/wrongly assumes things.)
        
             | tptacek wrote:
             | That's literally the premise of the service and always has
             | been.
        
               | x0x0 wrote:
               | fwiw, I'm happily running a company and some contract
                | work on fly. It's literally "aws, but what if it weren't
                | the most massively complex pile of shit you've ever seen".
               | 
               | I have a couple reasonably sized, understandable toml
               | files and another 100 lines of ruby that runs long-
               | running rake tasks as individual fly machines. The whole
               | thing works really nicely.
        
       | conradev wrote:
        | > To ensure every instance arrives at the same "working set"
        | > picture, we use cr-sqlite, the CRDT SQLite extension.
       | 
       | Cool to see cr-sqlite used in production!
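        | 
        | (For intuition, a toy last-writer-wins cell in Rust, roughly
        | the shape of the per-column merge that cr-sqlite's clocks
        | enable; this is an illustration, not its actual implementation:)
        | 
        |     #[derive(Clone, Debug)]
        |     struct LwwCell<T> {
        |         value: T,
        |         clock: u64, // logical clock of the last write
        |         site: u64,  // writer id, used as a tie-breaker
        |     }
        | 
        |     impl<T: Clone> LwwCell<T> {
        |         // Keep whichever write has the higher (clock, site);
        |         // merging in any order converges to the same value.
        |         fn merge(&mut self, other: &LwwCell<T>) {
        |             let newer = (other.clock, other.site)
        |                 > (self.clock, self.site);
        |             if newer {
        |                 self.value = other.value.clone();
        |                 self.clock = other.clock;
        |                 self.site = other.site;
        |             }
        |         }
        |     }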
        
       | mosura wrote:
       | Someone needs to read about ant colony optimization.
       | https://en.wikipedia.org/wiki/Ant_colony_optimization_algori...
       | 
       | This blog is not impressive for an infra company.
        
         | tucnak wrote:
         | I respect Fly, and it does sound like a nice place to work, but
          | honestly, you're onto something. You would expect an ostensibly
         | Public Cloud provider to have a more solid grasp on networking.
         | Instead, we're discovering how they're learning about things
         | like OSPF!
         | 
          | Makes you think, that's all.
        
           | tptacek wrote:
           | What a weird thing to say. I wrote my first OSPF
           | implementation in 1999. The point is that we noticed the
           | solution we'd settled on owes more to protocols like OSPF
           | than to distributed consensus databases, which are the
           | mainstream solution to this problem. It's not "OMG we just
           | discovered this neat protocol called OSPF". We don't actually
            | _run_ OSPF. We don't even do a graph->tree reduction. We're
           | routing HTTP requests, not packets.
        
             | mosura wrote:
             | Look at one of the other comments:
             | 
             | > in case people don't read all the way to the end, the
             | important takeaway is "you simply can't afford to do
             | instant global state distribution"
             | 
             | This is what people saw as the key takeaway. If that
             | takeaway is news to you then I don't know what you are
             | doing writing distributed systems.
             | 
              | While this message may not be what was intended, it was what
             | was broadcast.
        
               | akerl_ wrote:
               | It seems weird to take an inaccurate paraphrase from a
               | commenter and then use it to paint the authors with your
               | desired brush.
        
               | mosura wrote:
               | Not sure the replies to that comment help the cause at
               | all.
        
       | nodesocket wrote:
       | Anybody used rqlite[1] in production? I'm exploring how to make
       | my application fault-tolerant using multiple app vm instances.
       | The problem of course is the SQLite database on disk. Using a
       | network file system like NFS is a no-go with SQLite (this
       | includes Amazon Elastic File System (EFS)).
       | 
       | I was thinking I'll just have to bite the bullet and migrate to
       | PostgreSQL, but perhaps rqlite can work.
       | 
       | [1] https://rqlite.io
        
         | otoolep wrote:
          | rqlite creator here. Right there on the rqlite homepage[1],
          | two production users are listed: replicated.com[2] and
          | textgroove.com are both using it.
         | 
         | [1] https://rqlite.io/
         | 
         | [2] https://www.replicated.com/blog/app-manager-with-rqlite
        
       | tucnak wrote:
       | What's this obsession with SQLite? For all intents and purposes,
       | what they'd accomplished is effectively a Type 2 table with extra
       | steps. CRDT is totally overkill in this situation. You can
        | implement this in Postgres easily with very few changes to
       | your access patterns... DISTINCT ON. Maybe this kind of
       | "solution" is impressive for Rust programmers, I'm not sure
        | what the deal is exactly, but all it tells me is Fly ought to hire
       | actual networking professionals, maybe even compute-in-network
       | guys with FPGA experience like everyone else, and develop their
       | own routers that way--if only to learn more about networking.
        
         | tptacek wrote:
         | What part of this problem do you think FPGAs would help with?
         | 
         | In what sense do you think we need specialty _routers_?
         | 
         | How would you deploy Postgres to address these problems?
        
       | jimmyl02 wrote:
       | always wondered at what scale gossip / SWIM breaks down and you
       | need a hierarchy / partitioning. fly's use of corrosion seems to
       | imply it's good enough for a single region which is pretty
       | surprising because iirc Uber's ringpop was said to face problems
       | at around 3K nodes.
       | 
       | it would be super cool to learn more about how the world's
       | largest gossip systems work :)
        
         | tptacek wrote:
         | SWIM is probably going to scale pretty much indefinitely. The
         | issue we have with a single global SWIM broadcast domain isn't
         | that the scale is breaking down; it's just that the blast
         | radius for bugs (both in Corrosion itself, and in the services
         | that depend on Corrosion) is too big.
         | 
         | We're actually keeping the global Corrosion cluster! We're just
         | stripping most of the data out of it.
        
         | chucky_z wrote:
          | Back-of-napkin math I've done previously suggests it breaks down
          | around 2 million members with HashiCorp's defaults. The defaults are
         | quite aggressive though and if you can tolerate seconds of
         | latency (called out in the article) you could reach billions
         | without a lot of trouble.
        
           | tptacek wrote:
           | It's also frequency of changes and granularity of state, when
           | sizing workloads. My understanding is that most Hashi shops
           | would federate workloads of our size/global distribution; it
           | would be weird to try to run one big cluster to capture
           | everything.
        
             | chucky_z wrote:
              | From a literal conversation I'm having right now, 'try to
             | run one big cluster to capture everything' is our active
             | state. I've brought up federation a bunch of times and it's
             | fallen on deaf ears. :)
             | 
             | We are probably past the size of the entirety of fly.io for
             | reference, and maintenance is very painful. It works
             | because we are doing really strange things with Consul
             | (batch txn cross-cluster updates of static entries) on
             | really, really big servers (4gbps+ filesystems, 1tb memory,
             | 100s of big and fast cores, etc).
        
       | anentropic wrote:
       | blog posts should have a date at the top
        
         | chrisweekly wrote:
         | YES. THIS. ALWAYS!
         | 
         | Huge pet peeve. At least this one has a date somewhere (at the
         | bottom, "last updated Oct 22, 2025").
        
       | natebrennand wrote:
       | > Finally, let's revisit that global state problem. After the
       | contagious deadlock bug, we concluded we need to evolve past a
       | single cluster. So we took on a project we call
       | "regionalization", which creates a two-level database scheme.
       | Each region we operate in runs a Corrosion cluster with fine-
       | grained data about every Fly Machine in the region. The global
       | cluster then maps applications to regions, which is sufficient to
       | make forwarding decisions at our edge proxies.
       | 
        | This tiered approach makes a lot of sense to mitigate the scaling
       | limit per corrosion node. Can you share how much data you wind up
       | tracking in each tier in practice?
       | 
        | How concise is each entry in the application -> [regions] table?
        | Does the constraint of running this on every node mean that this
        | creates a global limit on the number of applications? It also
        | seems like the region-level database would have a regional limit
        | on the number of Fly Machines too?
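        | 
        | (As described, the two-level forwarding decision could be
        | sketched roughly like this; the types and names are guesses,
        | not Corrosion's actual schema:)
        | 
        |     use std::collections::HashMap;
        | 
        |     // Global cluster: app -> regions that run it.
        |     type Global = HashMap<String, Vec<String>>;
        |     // Regional cluster: app -> machine addresses there.
        |     type Regional = HashMap<String, Vec<String>>;
        | 
        |     fn route(app: &str, global: &Global,
        |              regions: &HashMap<String, Regional>)
        |              -> Option<String> {
        |         // Edge proxy: pick a region hosting the app (the
        |         // nearest one in practice), then resolve a machine
        |         // inside that region's fine-grained table.
        |         let region = global.get(app)?.first()?;
        |         regions.get(region)?.get(app)?.first().cloned()
        |     }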
        
       ___________________________________________________________________
       (page generated 2025-10-27 23:01 UTC)