[HN Gopher] Distributed transactions in Go: Read before you try
       ___________________________________________________________________
        
       Distributed transactions in Go: Read before you try
        
       Author : roblaszczak
       Score  : 91 points
       Date   : 2024-10-02 14:10 UTC (4 days ago)
        
 (HTM) web link (threedots.tech)
 (TXT) w3m dump (threedots.tech)
        
       | p10jkle wrote:
        | This is also a good use case for durable execution; see e.g.
        | https://restate.dev
        
       | liampulles wrote:
        | This does a good job of encapsulating the considerations of an
        | event-driven system.
       | 
        | I think the author is a little too easily dismissive of sagas
        | though - for starters, an event-driven system is also still going
        | to need to deal with compensating actions; it's just going to
        | result in a larger set of events that various systems potentially
        | need to handle.
       | 
        | The virtue of a saga or some command-driven orchestration
        | approach is that the story of what happens when a user does X is
        | plainly visible. The ability to dictate that upfront, and to
        | figure it out easily later when diagnosing issues, cannot be
        | overstated.
        
       | relistan wrote:
       | This is a good summary of building an evented system. Having
       | built and run one that scaled up to 130 services and nearly 60
       | engineers, I can say this solves a lot of problems. Our
       | implementation was a bit different but in a similar vein. When
       | the company didn't do well in the market and scaled down, 9
       | engineers were (are) able to operate almost all of that same
       | system. The decoupling and lack of synchronous dependencies means
       | failures are fairly contained and easy to rectify. Replays can
       | fix almost anything after the fact. Scheduled daily replays
       | prevent drift across the system, helping guarantee that things
       | are consistent... eventually.
        
         | latentsea wrote:
         | I would really hate to join a team of 9 engineers that owned
         | 130 services.
        
       | physicsguy wrote:
       | > And if your product backlog is full and people who designed the
       | microservices are still around, it's unlikely to happen.
       | 
       | Oh man I feel this
        
       | ekidd wrote:
       | Let's assume you're not a FAANG, and you don't have a billion
       | customers.
       | 
       | If you're gluing microservices together using distributed
       | transactions (or durable event queues plus eventual consistency,
       | or whatever), the odds are good that you've gone far down the
       | wrong path.
       | 
       | For many applications, it's easiest to start with a modular
       | monolith talking to a shared database, one that natively supports
       | transactions. When this becomes too expensive to scale, the next
       | step may be sharding your backend. (It depends on whether you
       | have a system where users mostly live in their silos, or where
       | everyone talks to everyone. If your users are siloed, you can
       | shard at almost any scale.)
       | 
       | Microservices make sense when they're "natural". A video encoder
       | makes a great microservice. So does a map tile generator.
       | 
        | Distributed systems are _expensive_ and complicated, and they
        | kill your team's development velocity. I've built several of
       | them. Sometimes, they turned out to be serious mistakes.
       | 
       | As a rule of thumb:
       | 
       | 1. Design for 10x your current scale, not 1000x. 10x your scale
       | allows for 3 consecutive years of 100% growth before you need to
       | rebuild. Designing for 1000x your scale usually means you're
       | sacrificing development velocity to cosplay as a FAANG.
       | 
       | 2. You will want transactions in places that you didn't expect.
       | 
        | 3. If you need transactions between two microservices, strongly
        | consider merging them and having them talk to the same database
        | (see the sketch at the end of this comment).
       | 
       | Sometimes you'll have no better choice than to use distributed
       | transactions or durable event queues. They are inherent in some
       | problems. But they should be treated as a giant flashing "danger"
       | sign.
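        | 
        | A minimal sketch of point 3 in Go's database/sql, assuming a
        | shared Postgres (the schema and names here are illustrative, not
        | from the article): once both services talk to one database, the
        | cross-service "transaction" is just a local one.
        | 
        |   import (
        |       "context"
        |       "database/sql"
        |   )
        |   
        |   // placeOrder records an order and awards loyalty points
        |   // atomically. Both writes sit in one local transaction, so
        |   // no 2PC, saga, or outbox is needed.
        |   func placeOrder(ctx context.Context, db *sql.DB, userID, amount int) error {
        |       tx, err := db.BeginTx(ctx, nil)
        |       if err != nil {
        |           return err
        |       }
        |       defer tx.Rollback() // no-op once Commit succeeds
        |   
        |       if _, err := tx.ExecContext(ctx,
        |           `INSERT INTO orders (user_id, amount) VALUES ($1, $2)`,
        |           userID, amount); err != nil {
        |           return err
        |       }
        |       if _, err := tx.ExecContext(ctx,
        |           `UPDATE users SET points = points + $1 WHERE id = $2`,
        |           amount, userID); err != nil {
        |           return err
        |       }
        |       return tx.Commit()
        |   }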
        
         | hggigg wrote:
          | I would add that just because you add these things, it does not
          | mean you _can_ scale afterwards. All microservices
          | implementations I've seen so far are bolted on top of some
          | existing layer of mud and serve only to turn what used to be
          | in-process function calls into network calls, with added
          | latency and other overheads. The end game is that aggregate
          | latency and cost increase, with no functional scalability
          | improvements.
         | 
          | Various engineering leads, happy with what went on their
          | resume, leave and tell everyone how they increased scalability,
          | persuading another generation into the same failures.
        
         | xyzzy_plugh wrote:
          | I frequently see folks fail to understand that when the unicorn
          | rocketship spends a month and ten of its hundreds of engineers
          | replacing the sharded MySQL that was being set ablaze daily by
          | overwhelming load, that is actually pretty close to the correct
          | time for that work. Sure, it may have been stressful, and
          | customers may have been impacted, but it's a good problem to
          | have. Conversely, not having that problem maybe doesn't mean
          | anything at all, but there's a good chance it means you were
          | solving these scaling problems prematurely.
         | 
         | It's a balancing act, but putting out the fires before they
         | even begin is often the wrong approach. Often a little fire is
         | good for growth.
        
           | AmericanChopper wrote:
            | You really have to be doing huge levels of throughput before
            | you start to struggle with scaling MySQL or Postgres. There
            | really aren't many workloads that actually require strict ACID
            | guarantees _and_ produce that level of throughput. 10-20
            | years ago I was running hundreds to thousands of transactions
            | per second on beefy Oracle and Postgres instances. The
            | workloads had to be especially big before we'd even consider
            | any fancy scaling strategies to be necessary, and there
            | wasn't some magic tipping point where we'd decide that some
            | instance had to go distributed all of a sudden.
           | 
            | Most of the distributed architectures I've seen have been led
            | by engineers' needs (to do something popular or interesting)
           | rather than an actual product need, and most of them have had
           | issues relating to poor attempts to replicate ACID
           | functionality. If you're really at the scale where you're
           | going to benefit from a distributed architecture, the chances
           | are eventual consistency will do just fine.
        
         | MuffinFlavored wrote:
         | > For many applications, it's easiest to start with a modular
         | monolith talking to a shared database, one that natively
         | supports transactions.
         | 
         | I don't think this handles "what if my app is a wrapper on
         | external APIs and my own database".
         | 
            | You don't get automatic rollbacks with API calls the way you
            | do with database transactions. What to do then?
        
           | natdempk wrote:
           | You have a distributed system on your hands at that point so
           | you need idempotent processes + reconciliation/eventual
           | consistency. Basically thinking a lot about failure,
           | resyncing data/state, patterns like transactional outboxes,
            | durable queues, two-phase commit, etc. It quickly gets into
            | the specifics of your task/system/APIs, so it's hard to give
            | general advice. Most apps do not solve these problems
           | super well for a long time unless they are in critical places
           | like billing, and even then it might just mean weird repair
           | jobs, manual APIs to resync stuff, audits, etc. Usually an
           | event-bus/queue or related DB table for the idempotent work +
           | some async process validating that table can go a long way
           | though.
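            | 
            | A rough sketch of that outbox-table idea in Go, with an
            | illustrative schema (the outbox table, topic name, and
            | helpers are assumptions, not from the thread): the business
            | write and the pending-work row commit together, and a
            | separate async loop publishes rows and marks them sent.
            | Consumers must be idempotent, since a crash between publish
            | and mark-sent re-publishes.
            | 
            |   import (
            |       "context"
            |       "database/sql"
            |   )
            |   
            |   // Commit the business change and the outbox row together.
            |   func saveWithOutbox(ctx context.Context, db *sql.DB,
            |       orderID int64, payload []byte) error {
            |       tx, err := db.BeginTx(ctx, nil)
            |       if err != nil {
            |           return err
            |       }
            |       defer tx.Rollback()
            |       if _, err := tx.ExecContext(ctx,
            |           `UPDATE orders SET status = 'paid' WHERE id = $1`,
            |           orderID); err != nil {
            |           return err
            |       }
            |       if _, err := tx.ExecContext(ctx,
            |           `INSERT INTO outbox (topic, payload)
            |            VALUES ('order.paid', $1)`, payload); err != nil {
            |           return err
            |       }
            |       return tx.Commit()
            |   }
            |   
            |   // One pass of the async publisher loop.
            |   func drainOutbox(ctx context.Context, db *sql.DB,
            |       publish func(topic string, payload []byte) error) error {
            |       rows, err := db.QueryContext(ctx,
            |           `SELECT id, topic, payload FROM outbox
            |            WHERE sent_at IS NULL ORDER BY id LIMIT 100`)
            |       if err != nil {
            |           return err
            |       }
            |       defer rows.Close()
            |       for rows.Next() {
            |           var (
            |               id      int64
            |               topic   string
            |               payload []byte
            |           )
            |           if err := rows.Scan(&id, &topic, &payload); err != nil {
            |               return err
            |           }
            |           if err := publish(topic, payload); err != nil {
            |               return err // row stays unsent; retried next pass
            |           }
            |           if _, err := db.ExecContext(ctx,
            |               `UPDATE outbox SET sent_at = now()
            |                WHERE id = $1`, id); err != nil {
            |               return err
            |           }
            |       }
            |       return rows.Err()
            |   }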
        
         | gustavoamigo wrote:
          | I agree with starting with a monolith and a shared database.
          | I've done that in the past quite successfully. I would just add
          | that if scaling becomes an issue, I wouldn't consider sharding
          | my first option; it's more of a last resort. I would prefer
          | scaling the shared database vertically and optimizing it as
          | much as possible. Also, another strategy I've adopted is
          | avoiding `JOIN` and `ORDER BY`, as they stress your
          | database's precious CPU and IO. `JOIN` also adds coupling
          | between tables, which I find hard to refactor once done.
        
           | vbezhenar wrote:
            | I don't understand how you avoid JOIN and ORDER BY.
            | 
            | Well, with ORDER BY, if your result set is not huge, sure,
            | you can just sort it on the client side, although sorting 100
            | rows on the database side isn't expensive. But if you need,
            | say, the latest 500 records out of a million (a very frequent
            | use case), you have to sort on the database side. Also, with
            | proper indices, the database can sometimes avoid any explicit
            | sort.
           | 
            | Do you just prefer to duplicate everything in every table
            | instead of JOINing them? I did some denormalization to
            | improve performance, but that was more like the last thing I
            | would do, when there's no other recourse, because it makes it
            | very possible that the database will contain logically
            | inconsistent data, and that causes lots of headaches. Fixing
            | bugs in software is easier. Fixing bugs in data is hard and
            | requires lots of analytic work and sometimes manual work.
        
             | danenania wrote:
             | I think a better maxim would be to never have an _un-
             | indexed_ ORDER BY or JOIN.
             | 
              | A big part of what many "nosql" databases that prioritize
              | scale are doing is simply preventing you from ever running
              | an ad-hoc un-indexed query.
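              | 
              | Concretely, for the "latest 500 out of a million" case in
              | the parent comment (illustrative schema): with an index
              | that matches the filter and sort, Postgres walks the index
              | in order and stops after 500 rows instead of sorting a
              | million.
              | 
              |   const createIdx = `
              |       CREATE INDEX events_user_created_idx
              |       ON events (user_id, created_at DESC)`
              |   
              |   // Served straight off the index above: no full sort.
              |   const latest500 = `
              |       SELECT id, kind, created_at FROM events
              |       WHERE user_id = $1
              |       ORDER BY created_at DESC
              |       LIMIT 500`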
        
         | vbezhenar wrote:
          | I wonder if anyone has tried long-lasting transactions passed
          | between services?
          | 
          | Imagine you have a Postgres pooler with an API that lets you
          | share one Postgres connection between several applications. Now
          | you can start a transaction in one application, pass its ID to
          | another application, and commit it there.
         | 
          | Implement queues using Postgres, use the same transaction for
          | both business data and queue operations, and some things will
          | become easier.
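          | 
          | A sketch of that last idea, with a plain Postgres table as the
          | queue (the jobs table and helpers are illustrative): the job
          | row commits in exactly the same transaction as the business
          | data, and workers claim jobs with SKIP LOCKED.
          | 
          |   import (
          |       "context"
          |       "database/sql"
          |   )
          |   
          |   // enqueue rides on the caller's transaction, so the job
          |   // exists if and only if the business write commits.
          |   func enqueue(ctx context.Context, tx *sql.Tx,
          |       kind string, payload []byte) error {
          |       _, err := tx.ExecContext(ctx,
          |           `INSERT INTO jobs (kind, payload) VALUES ($1, $2)`,
          |           kind, payload)
          |       return err
          |   }
          |   
          |   // claim pops one job inside the worker's transaction;
          |   // SKIP LOCKED lets workers poll concurrently without
          |   // blocking each other.
          |   func claim(ctx context.Context, tx *sql.Tx) (
          |       id int64, kind string, payload []byte, err error) {
          |       err = tx.QueryRowContext(ctx,
          |           `DELETE FROM jobs WHERE id = (
          |                SELECT id FROM jobs ORDER BY id
          |                LIMIT 1 FOR UPDATE SKIP LOCKED)
          |            RETURNING id, kind, payload`,
          |       ).Scan(&id, &kind, &payload)
          |       return
          |   }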
        
         | natdempk wrote:
         | Pretty great advice!
         | 
          | I think the one hard thing you can run into is that once you
          | want to support different datasets that fall outside the scope
          | of a transaction (think events/search/derived data, anything
          | that needs to read/write a system that is not your primary
          | transactional DB), you probably do want some sort of event
          | bus/queue type thing to get eventual consistency across all the
          | things. Otherwise you end up in impossible situations when you
          | try to manage things like a DB write + ES document update:
          | something has to fail, and then your state is desynced across
          | datastores and you're in velocity/bug hell. The other side of
          | this, though, is that once you introduce the event bus and
          | transactional outbox or whatever, you then have the problem of
          | writes/updates happening and not being reflected immediately. I
          | think the best things that solve this problem are stuff like
          | Meta's TAO that combine these concepts, but I have no idea
          | what's available to mere mortals/startups to best solve these
          | types of problems. Would love to know if anyone has killer
          | recommendations here.
        
           | ljm wrote:
            | I think the question is whether you need the entire system to
            | be strongly consistent, or just the core of it.
           | 
           | To use ElasticSearch as an example: do you need to add the
           | complexity of keeping the index up to date in realtime, or
           | can you live with periodic updates for search or a background
           | job for it?
           | 
           | As long as your primary DB is the source of truth, you can
           | use that to bring other less critical stores up to date
           | outside of the context of an API request.
        
             | natdempk wrote:
              | Well, the problem you run into is that you kind of want
              | different datastores for different use cases - for example,
              | search vs. specific page loads - and you want to make both
              | of those consistent, but you don't have a single DB that
              | can serve both use cases (oftentimes a primary DB +
              | ElasticSearch, for example). If you don't keep them
             | consistent, you have user-facing bugs where a user can
             | update a record but not search for it immediately, or if
             | you try to load everything from ES to provide consistent
             | views to a user, then updates can disappear on refresh. Or
             | if you try to write to both SQL + ES in an API request,
             | they can desync on failure writing to one or the other. The
             | problem is even less the complexity of keeping the index up
             | to date in realtime, and more that the ES index isn't even
              | consistent with the primary DB, and to a user they are just
              | different parts of your app that seem a little broken in
              | subtle, inconsistent ways. It would be great to be able to
              | have everything present a consistent view to users that
              | updates together on write.
        
               | misiek08 wrote:
                | The way I solved it once was to try to update ES
                | synchronously, and if it failed or timed out, queue an
                | event to index the doc. The timeout wasn't an issue,
                | because a double update wasn't harmful.
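                | 
                | Roughly like this in Go (the Indexer interface and retry
                | channel stand in for a real ES client and queue, and the
                | index write is assumed to be an idempotent upsert):
                | 
                |   import (
                |       "context"
                |       "time"
                |   )
                |   
                |   type Indexer interface {
                |       Upsert(ctx context.Context,
                |           id string, doc []byte) error
                |   }
                |   
                |   func indexWithFallback(es Indexer,
                |       retry chan<- string, id string, doc []byte) {
                |       ctx, cancel := context.WithTimeout(
                |           context.Background(), 500*time.Millisecond)
                |       defer cancel()
                |       if err := es.Upsert(ctx, id, doc); err != nil {
                |           // Failure or timeout: hand off to an async
                |           // re-index worker. A timed-out write may
                |           // still land, but the upsert makes a double
                |           // update harmless.
                |           retry <- id
                |       }
                |   }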
        
         | jrockway wrote:
         | I agree with this.
         | 
         | Having worked at FAANG, I'm always excited by the fanciness
         | that is required. Spanner is the best database I've ever used.
         | 
         | That said, my feeling is that in the real world, you just pick
         | Postgres and forget about it. Let's Encrypt issues every TLS
         | cert on the Internet with one beefy Postgres database.
         | Computers are HUGE these days. By the time a 128 core machine
         | isn't good enough for your app, you will have sold your shares
         | and will be living on your own private island or whatever. If
         | you want to wrap sqlite in raft over the weekend for some fun,
         | sure, do that. But don't put it in prod.
        
           | Andys wrote:
           | Agreed - A place I worked needed 24/7 uptime for financial
           | transactions, and we still managed to keep scaling a standard
           | MySQL database, despite it getting hammered, over the course
           | of 10 years up to something like 64 cores and 384GB of RAM on
           | a large EC2 instance.
           | 
           | We did have to move reporting off to a non-SQL solution
           | because it was too slow to do in realtime, but that was a
           | decision based on evidence of the need.
        
         | devjab wrote:
         | I disagree rather strongly with this advice. Mostly because
         | I've spent almost a decade earning rather lucrative money on
         | cleaning up after companies and organisations which did it.
         | Part of what you say is really good advice, if you're not
         | Facebook then don't build your infrastructure as though you
         | were. I think it's always a good idea to remind yourself that
         | StackOverflow ran on a few IIS servers for a long while doing
         | exactly what you're recommending that people do. (Well almost
         | anyway).
         | 
          | Using a single database always ends up being a mess. Ok, I
          | shouldn't say always, because it's technically possible for it
          | not to happen; I've just never seen the OOP people not utterly
          | fuck up the complexity in their models. It gets even worse when
          | they've decided to use stored procedures or some magical ORM
          | whose underlying workings not everyone understood. I
          | think you should definitely separate your data as much as
          | possible. Even small-scale companies will quickly struggle
          | scaling their DBs if they don't, and it'll quickly become
          | absolutely horrible if you have to remove parts of your
          | business. Maybe they are unneeded, maybe they get sold off,
          | whatever it is. With that said, however, I think you're
          | completely correct about not doing distributed transactions. I
          | think that both you and the author are completely right that if
          | you're doing this, you're building complexity you shouldn't
          | be building until you're Facebook (or maybe when you're almost
          | Facebook).
         | 
          | A good microservice is one that can live in total isolation.
          | It'll fulfill the needs of a specific business domain, and it
          | should contain all the data for this. If that leaves you with a
          | monolith and a single shared database, then that is perfectly
          | fine. If you can, split it up. Say you have solar plants which
          | are owned by companies, but as far as the business goes a solar
          | plant and a company can operate completely independently; then
          | you should absolutely build them as two services. If you don't,
          | then you're going to start building your mess once you need to
          | add wind plants or something different. Do note that I said
          | this depends on the business needs. If something like
          | individual banking accounts of company owners is relevant to
          | the greenfield workers and asset managers, then you probably
          | can't split up solar plants and companies. Keeping things
          | separate like this will also help you immensely as you add on
          | business intelligence and analytics.
         | 
          | If you keep everything in a single "model store", then you're
          | eventually going to end up with "oh, only John knows what that
          | data does" while needing to pay someone like me a ridiculous
          | amount of money to help your IT department get to a point where
          | it is no longer hindering your company's growth. Again, I'm
          | sure this doesn't have to be the case, and I probably should
          | just advise people to do exactly as you say. But in my
          | experience it's an imperfect world, and unless you keep things
          | as simple as possible with as few abstractions as possible,
          | you're going to end up with a mess.
        
           | mamcx wrote:
            | All that you say is true, and the people who do that are the
            | LEAST capable of rising to the MUCH harder challenges of
            | microservices.
            | 
            | I work in the ERP space and interact with dozens of such
            | systems, and I see the _horrors_ that some only know as
            | fairy tales.
            | 
            | Without exception, staying in an RDBMS is the best option of
            | all. I have seen the _cosmic horrors_ of what people who
            | struggle with an RDBMS do when moved to NoSQL and such, and
            | it is always much worse than before.
            | 
            | And all that you say, true as it is, hides the real
            | (practical) solution: learn how to use the RDBMS, use SQL,
            | remove complexity, and maybe put the thing in a bigger box.
            | 
            | All of that is symptoms that are barely related to the use of
            | a single database.
        
           | ekidd wrote:
           | I wanted to respond to you, because you had some excellent
           | points.
           | 
           | > _Mostly because I've spent almost a decade earning rather
           | lucrative money on cleaning up after companies and
           | organisations which did it._
           | 
           | For many companies, this is actually a pretty successful
           | outcome. They built an app, they earned a pile of money, they
           | kept adding customers, and now they have a mess. But they can
           | afford to pay you to fix their mess!
           | 
           | My rule of thumb of "design for 10x scale" is intended to be
           | used iteratively. When something is slow and miserable, take
           | the current demand, multiply it by 10, and design something
           | that can handle it. Sometimes, yeah, this means you need to
           | split stuff up or use a non-SQL database. But at least it's a
           | real need at that point. And there's no substitute for
           | engineering knowledge and good taste.
           | 
           | But as other people have pointed out, people who can't use an
           | RDBMS correctly are going to have a bad time implementing
           | distributed transactions across microservices.
           | 
           | So I'm going to stick with my advice to start with a single
           | database, and to only pull things out when there's a clear
           | need to scale something.
        
         | foobiekr wrote:
          | Great advice. Microservices also open the door to polyglot, so
          | you lose the ability to even ensure that everyone uses/has
          | access to/understands the things in a common libCompany that
          | make it possible for anyone to at least make sense of the code.
         | 
         | When I talk to people who did microservices, I ask them "why is
         | this a service separate from this?"
         | 
         | I have legitimately - and commonly - gotten the answer that the
         | dev I'm talking to wanted their own service.
         | 
         | It's malpractice.
        
           | danenania wrote:
           | > Microservices also open the door to polyglot
           | 
           | While I see your point about the downsides of involving too
           | many languages/technologies, I think the really key
           | distinction is whether a service has its own separate
           | database.
           | 
           | It's really not such a big problem to have "microservices"
           | that share the same database. This can bring many of the
           | benefits of microservices without most of the downsides.
           | 
           | Imo it would be good if we had some common terminology to
           | distinguish these approaches. It seems like a lot of people
           | are creating services with their own separate databases for
           | no reason other than "that's how you're supposed to do
           | microservices".
        
         | jayd16 wrote:
          | Sure, sure, sure
          | 
          | ...but microservices are used as a people-organization
          | technique.
          | 
          | Once you're there, you'll run into situations where you'll
          | have to do transactions. Might as well get good at it.
        
         | mekoka wrote:
         | _> 1. Design for 10x your current scale, not 1000x._
         | 
         | I'd even say that the advice counts double for early stage
         | startups. That is, at that scale, it should be _design for 5x_.
         | 
          | You could spend years building a well-architected, multi-
          | tenanted, microserviced system, whose focus on sound
         | engineering is actually distracting your team from building
         | core solutions that address your clients' real problems _now_.
         | Or, you could instead redirect that focus on first solving
         | those immediate problems with simplistic, suboptimal, but valid
         | engineering.
         | 
          | An early-stage solopreneur could literally just clone or
          | copy/paste/configure their monolith in a new directory and
          | spawn a new database every time they get a new client. They
          | could literally do this for their first 20+ clients in their
          | first year. When I say this, some people look at me in
          | disbelief - some of them having yet to make their first sale.
          | Instead, they're working on solutions to counter the
          | _anticipated_ scalability issues they'll have in two years,
          | when they finally start to sell and become a huge success.
         | 
         | For another few, copy/paste/createdb seems like a stroke of
         | genius. But I'm not a genius, I'm just oldish. Many companies
         | did basically this 20 years ago and it worked fine. The reason
         | it's not even considered anymore seems to be a cultural
         | amnesia/insanity that's made certain practices arcane, if not
         | taboo altogether. So we tend to spontaneously reach for the
         | nuclear reactor, when a few pieces of coal would suffice to
         | fuel our current momentum.
        
           | pkhuong wrote:
           | > spawn a new database every time they have a new client.
           | 
            | I've seen this work great for multitenancy, with sqlite
            | (again, a single beefy server goes a long way). At some point,
            | though, you hit niche scaling issues, and that's how you end
            | up with, e.g., arcane sqlite hacks. Hopefully these mostly
            | come from people who have found early success, or are looked
            | at by others who want reassurance that there is an escape
            | hatch that doesn't involve rewriting everything for web scale.
        
       | pjmlp wrote:
       | As usual, don't try to use the network boundary to do what
       | modules already offer in most languages.
       | 
       | Distributed systems spaghetti is much worse to deal with.
        
       | kunley wrote:
        | This is smart, but also: the overall design was so overengineered
        | in the first place...
        
       | atombender wrote:
       | I find the "forwarder" system here a rather awkward way to bridge
       | the database and Pub/Sub system.
       | 
        | A better way to do this, I think, is to ignore the term
        | "transaction," which is overloaded with too many concepts (such as
        | transactional isolation), and instead to consider the desired
       | behaviour, namely atomicity: You want two updates to happen
       | together, and (1) if one or both fail you want to retry until
       | they are both successful, and (2) if the two updates cannot both
       | be successfully applied within a certain time limit, they should
       | both be undone, or at least flagged for manual intervention.
       | 
       | A solution to both (1) and (2) is to bundle _both_ updates into a
        | single action that you retry. You can execute this with a queue-
        | based system. You don't need an outbox for this, because you
        | don't need to create a "bridge" between the database and the
       | following update. Just use Pub/Sub or whatever to enqueue an
       | "update user and apply discount" action. Using acks and nacks,
       | the Pub/Sub worker system can ensure the action is repeatedly
       | retried until both updates complete as a whole.
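        | 
        | As a sketch (the handler shape and the addPoints/applyDiscount
        | helpers are hypothetical): return nil to ack, return an error to
        | nack and be redelivered, and make each step idempotent, since the
        | handler can run more than once.
        | 
        |   import "context"
        |   
        |   // Invoked by the queue consumer on every (re)delivery of the
        |   // "update user and apply discount" action.
        |   func handleApplyDiscount(ctx context.Context, userID string) error {
        |       if err := addPoints(ctx, userID); err != nil { // idempotent
        |           return err // nack: the whole action is redelivered
        |       }
        |       if err := applyDiscount(ctx, userID); err != nil { // idempotent
        |           return err // points already applied; a retry re-runs safely
        |       }
        |       return nil // ack: both updates completed as a whole
        |   }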
       | 
       | You can build this from basic components like Redis yourself, or
       | you can use a system meant for this type of execution, such as
       | Temporal.
       | 
       | To achieve (2), you extend the action's execution with knowledge
       | about whether it should retry or undo its work. For such a simple
       | action as described above, "undo" means taking away the discount
       | and removing the user points, which are just the opposite of the
       | normal action. A durable execution system such as Temporal can
       | help you do that, too. You simply decide, on error, whether to
       | return a "please retry" error, or roll back the previous steps
       | and return a "permanent failure, don't retry" error.
       | 
       | To tie this together with an HTTP API that pretends to be
       | synchronous, have the API handler enqueue the task, then wait for
       | its completion. The completion can be a separate queue keyed by a
       | unique ID, so each API request filters on just that completion
       | event. If you're using Redis, you could create a separate Pub/Sub
       | per request. With Temporal, it's simpler: The API handler just
       | starts a workflow and asks for its result, which is a poll
       | operation.
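        | 
        | With Temporal's Go SDK, both halves look roughly like this
        | (activity names are illustrative and error handling is trimmed):
        | the workflow retries activities by default, compensates and
        | returns a non-retryable error on permanent failure, and the HTTP
        | handler simply starts the workflow and blocks on its result.
        | 
        |   import (
        |       "net/http"
        |       "time"
        |   
        |       "go.temporal.io/sdk/client"
        |       "go.temporal.io/sdk/temporal"
        |       "go.temporal.io/sdk/workflow"
        |   )
        |   
        |   func ApplyDiscountWorkflow(ctx workflow.Context, userID string) error {
        |       ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        |           StartToCloseTimeout: time.Minute, // activities retry by default
        |       })
        |       if err := workflow.ExecuteActivity(ctx, AddPoints,
        |           userID).Get(ctx, nil); err != nil {
        |           return err
        |       }
        |       if err := workflow.ExecuteActivity(ctx, ApplyDiscount,
        |           userID).Get(ctx, nil); err != nil {
        |           // Undo step one, then report a permanent failure.
        |           _ = workflow.ExecuteActivity(ctx, RemovePoints,
        |               userID).Get(ctx, nil)
        |           return temporal.NewNonRetryableApplicationError(
        |               "discount failed", "Permanent", err)
        |       }
        |       return nil
        |   }
        |   
        |   // Synchronous-looking HTTP handler over the durable workflow.
        |   func handle(c client.Client, w http.ResponseWriter, r *http.Request) {
        |       run, err := c.ExecuteWorkflow(r.Context(),
        |           client.StartWorkflowOptions{TaskQueue: "discounts"},
        |           ApplyDiscountWorkflow, r.URL.Query().Get("user"))
        |       if err == nil {
        |           err = run.Get(r.Context(), nil) // poll until it finishes
        |       }
        |       if err != nil {
        |           http.Error(w, err.Error(), http.StatusBadGateway)
        |           return
        |       }
        |       w.WriteHeader(http.StatusNoContent)
        |   }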
       | 
       | The outbox pattern is better in cases where you simply want to
       | bridge between two data processing systems, but where the
       | consumers aren't known. For example, you want all orders to
       | create a Kafka message. The outbox ensures all database changes
       | are eventually guaranteed to land in Kafka, but doesn't know
       | anything about what happens next in Kafka land, which could be
       | stuff that is managed by a different team within the same
       | company, or stuff related to a completely different part of the
        | app, like billing or ops telemetry. But if your app already
        | _knows_ specifically what should happen (because it's a single
        | app with a known data model), the outbox pattern is unnecessary,
        | I think.
        
         | wavemode wrote:
         | I think attempting to automatically "undo" partially failed
         | distributed operations is fraught with danger.
         | 
         | 1. It's not really safe, since another system could observe the
         | updated data B (and perhaps act on that information) before you
         | manage to roll it back to A.
         | 
         | 2. It's not really reliable, since the sort of failures that
         | prevent you from completing the operation, could also prevent
         | you from rolling back parts of it.
         | 
         | The best way to think about it is that if a distributed
         | operation fails, your system is now in an indeterminate state
         | and all bets are off. So if you really must coordinate updates
         | within two or more separate systems, it's best if either
         | 
         | a) The operation is designed so that nothing really happens
         | until the whole operation is done. One example of this pattern
         | is git (and, by extension, the GitHub API). To commit to a
         | branch, you have to create a new blob, associate that blob to a
         | tree, create a new commit based on that tree, then move the
         | branch tip to point to the new commit. As you can see, this
         | series of operations is perfectly fine to do in an eventually-
         | consistent manner, since a partial failure just leaves some
         | orphan blobs lying around, and doesn't actually affect the
         | branch (since updating the branch itself is the last step, and
          | is atomic). You can imagine applying this same sort of pattern
          | to problems like ordering or billing, where the last step is to
          | update the order or update the invoice (see the sketch at the
          | end of this comment).
         | 
         | b) The alternative is, as you say, flag for manual
         | intervention. Most systems in the world operate at a scale
         | where this is perfectly feasible, and so sometimes it just
         | makes the most sense (compared to trying to achieve perfect
         | automated correctness).
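          | 
          | A sketch of that ordering applied to the billing case, under an
          | illustrative schema (note there is deliberately no transaction;
          | each statement commits on its own): every earlier write creates
          | rows that nothing points to yet, and only the final single-row
          | update publishes them.
          | 
          |   import (
          |       "context"
          |       "database/sql"
          |   )
          |   
          |   type Line struct {
          |       SKU    string
          |       Amount int
          |   }
          |   
          |   func publishInvoice(ctx context.Context, db *sql.DB,
          |       orderID int64, lines []Line) error {
          |       // Steps 1..n: inert rows, like unreferenced git blobs.
          |       // A crash leaves harmless orphans; just retry later.
          |       var invID int64
          |       if err := db.QueryRowContext(ctx,
          |           `INSERT INTO invoices (order_id)
          |            VALUES ($1) RETURNING id`,
          |           orderID).Scan(&invID); err != nil {
          |           return err
          |       }
          |       for _, l := range lines {
          |           if _, err := db.ExecContext(ctx,
          |               `INSERT INTO invoice_lines (invoice_id, sku, amount)
          |                VALUES ($1, $2, $3)`,
          |               invID, l.SKU, l.Amount); err != nil {
          |               return err
          |           }
          |       }
          |       // Last step: the atomic "branch move" that makes the
          |       // invoice visible on the order.
          |       _, err := db.ExecContext(ctx,
          |           `UPDATE orders SET invoice_id = $1 WHERE id = $2`,
          |           invID, orderID)
          |       return err
          |   }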
        
           | atombender wrote:
            | Undo may not always be possible or appropriate, but you _do_
            | need to consider the edge case where an action cannot be
            | applied fully, in which case a decision must be made about
            | what to do. In OP's example, failure to roll back would
            | grant the user points but no discount, which isn't a nice
            | outcome.
           | 
           | Trying to achieve a "commit point" where changes become
           | visible only at the end of all the updates is worth
           | considering, but it's potentially much more complex to
            | achieve. Your entire data model (such as database tables,
            | including indexes) has to be adapted to support the kind of
            | "state swap" you need.
        
           | kgeist wrote:
           | >I think attempting to automatically "undo" partially failed
           | distributed operations is fraught with danger.
           | 
           | We once tried it, and decided against it because
           | 
           | 1) in practice rollbacks were rarely properly tested and were
           | full of bugs
           | 
           | 2) we had a few incidents when a rollback overwrote
           | everything with stale data
           | 
           | Manual intervention is probably the safest way.
        
       | cletus wrote:
        | To paraphrase [1]:
       | 
       | > Some people, when confronted with a problem, think "I know,
       | I'll use micro-services." Now they have two problems.
       | 
       | As soon as I read this example where there's users and orders
       | microservices, you've already made an error (IMHO). What happens
       | when the traffic becomes such an issue that you need to shard
       | your microservices? Now you've got session and load-balancing
       | issues. If you ignore them, you may break the read-your-write
       | guarantee and that's going to create a huge cost to development.
       | 
        | It goes like this: can you read uncommitted changes within your
        | transaction or request? Generally the answer should be "yes". But
        | imagine you need to speak to a sharded service: what happens when
        | you hit an instance that didn't perform the mutation, and the
        | mutation isn't committed yet?
       | 
       | A sharded data backend will take you as far as you need to go. If
       | it's good enough for Facebook, it's good enough for you.
       | 
       | When I worked at FB, I had a project where someone had come in
       | from Netflix and they fell into the trap many people do of trying
       | to reinvent Netflix architecture at Facebook. Even if the Netflix
       | microservices architecture is an objectively good idea (which I
       | honestly have no opinion on, other than having personally never
        | seen a good solution with microservices), that ship has sailed.
        | FB has embraced a different architecture, so even if it's
        | objectively good, you're going against established practice and
       | changing what any FB SWE is going to expect when they come across
       | your system.
       | 
       | FB has a write through in-memory graph database (called TAO) that
       | writes to sharded MySQL backends. You almost never speak to MySQL
       | directly. You don't even really talk to TAO directly most of the
       | time. There's a data modelling framework on top of it (that
       | enforces privacy and a lot of other things; talk to TAO directly
       | and you'll have a lot of explaining to do). Anyway, TAO makes the
       | read-your-write promise and the proposed microservices broke
       | that. This was pointed out from the very beginning, yet they
       | barreled on through.
       | 
       | I can understand putting video encoding into a "service" but I
       | tend to view those as "workers" more than a "service".
       | 
       | [1]: https://regex.info/blog/2006-09-15/247
        
         | codethief wrote:
         | > Even if the Netflix microservices architecture is an
         | objectively good idea (which I honestly have no opinion on
         | 
         | I have no opinion on that either, but at least this[0] story by
         | ThePrimeagen didn't make it sound all too great. (Watch this
         | classic[1] before for context, unless you already know Wingman,
         | Galactus, etc.)
         | 
         | [0]: https://youtu.be/s-vJcOfrvi0?t=319
         | 
         | [1]: https://m.youtube.com/watch?v=y8OnoxKotPQ
        
       | alphazard wrote:
       | The best advice (organizationally) is to just do everything in a
       | single transaction on top of Postgres or MySQL for as long as
       | possible. This produces no cognitive overhead for the developers.
       | 
       | Sometimes that doesn't deliver enough performance and you need to
       | involve another datastore (or the same datastore across multiple
       | transactions). At that point eventual consistency is a good
       | strategy, much less complicated than distributed transactions.
       | This adds a significant tax to all of your work though. Now
       | everyone has to think through all the states, and additionally
       | design a background process to drive the eventual consistency. Do
       | you have a process in place to ensure all your developers are
        | getting this right for every feature? Did you answer "code
        | review"? Are you sure there's always enough time to redo the
        | implementation, and that you'll never be forced to merge
        | inconsistent "good enough" behavior?
       | 
        | And the worst option (organizationally) is distributed
        | transactions, which basically means a small group of talented
        | engineers can't work on other things: they need to be consulted
        | for every new service and most new features, and they maintain
        | the clients and server for the thumbs up/down system.
       | 
       | If you make it hard to do stuff, then people will either 1. do
       | less stuff, or 2. do the same amount of stuff, but badly.
        
       | junto wrote:
       | This is giving me bad memories of MSDTC and Microsoft SQL Server
       | here.
        
       | kgeist wrote:
        | The main source of pain with eventual consistency is lots of
        | customer calls/emails saying "we did X but nothing happened". I'd
        | also add that you should make it clear to the user that the
        | action may not be instantaneous.
        
         | renegade-otter wrote:
          | That's another thing about a single database: if it's well-
          | tuned and your squeel is good, in many cases you don't even
          | need any kind of cache, removing an entire class of bugs.
        
       | Scubabear68 wrote:
        | If I had a nickel for all the clients I've seen with
        | microservices everywhere, where 90% of the code is replicating an
        | RDBMS with hand-coded in-memory joins.
       | 
        | What could have been a simple SQL query in a sane architecture
        | becomes N REST calls (possibly nested with others downstream)
        | plus manually stitching the results together.
       | 
       | And that is just the read only case. As the author notes updates
       | add another couple of levels of horror.
        
         | renegade-otter wrote:
         | In the good old days, if you did that, you would rightfully be
         | labeled as an "amateur".
        
       | revskill wrote:
       | You can have your cake and eat it too by allowing replication.
        
       | wwarner wrote:
        | AGREE! The author's point is very well argued. Beginning a
        | transaction is almost _never_ a good idea. Design your data model
        | so that if two pieces of data must be consistent, they are in the
        | same row, and allow associated rows to be missing, handling nulls
        | in the application. Inserts and updates should operate on a
        | single table, because in the case of failure nothing has changed,
        | and you have a simple error to deal with. In short, as explained
        | in the article, embrace eventual consistency. There was a great
        | post from the Github team about why they didn't allow
        | transactions in their Rails app, from around 2013, but I can't
        | find it for the life of me.
       | 
       | I realize that you're staring at me in disbelief right now, but
       | this is gospel!
        
       | latchkey wrote:
        | Back in the early 2000s, I was working for the largest hardcore
        | porn company in the world, serving tons of traffic. We built a
       | cluster of 3 Dell 2950 servers with JBoss4. We were using
       | Hibernate and EJB2 entities, with a MySQL backend. This was all
       | before "cloud" allowed porn on their own systems, so we had to do
       | it ourselves.
       | 
       | Once configured correctly and all the multicast networking was
       | set up, distributed 2PC transactions via jgroups worked
       | flawlessly for years. We actually only needed one server for all
       | the traffic, but used 3 for redundancy and rolling updates.
       | 
        | ¯\\_(ツ)_/¯, kids these days
        
         | ebiester wrote:
         | Different problems have different solutions.
         | 
         | You likely mostly had very simple business logic in 90% of your
         | system. If your system is automating systems for a cross-domain
         | sector (think payroll), you're likely to have a large number of
         | developers on a relatively small amount of data and speed is
         | secondary to managing the complexity across teams.
         | 
         | Microservices might not be a great solution, and distributed
         | monoliths will always be an anti-pattern, but there are reasons
         | for more complex setups to enable concurrent development.
        
           | latchkey wrote:
            | Due to the unwillingness of corporations to work with us, we
            | had to develop our own cross-TLD login framework, payments
            | system, affiliate tracker, micro-currency for live pay-per-
            | minute content, secure image/video serving across multiple
            | CDNs, and a whole ad-serving network. It took years to
            | build it all and was massively complicated.
           | 
            | The point I was making is that the tooling for all of this
            | has existed for ages. People keep reinventing it. Nothing
            | wrong with that, but these sorts of blog posts make it
            | entertaining to watch history repeat itself and HN argue
            | over the best way to do things.
        
       | misiek08 wrote:
        | Looks like a covert ad for the library, showing probably the
        | worst, most over-engineered method for "solving" transactions
        | in a more diverse environment. Eventual consistency is a big
        | tradeoff that is not acceptable in many payment- and stock-
        | related areas. Working in a company where all the described
        | problems exist and were solved in the worst way possible, I see
        | this article as very misleading. You don't want events instead
        | of transactions - if something has to be committed together, you
        | need to rearchitect the system and that's it. Of course the
        | people who were building this monster for years will block
        | anyone from doing this. Over-engineered AF, because most of the
        | parts where transactions are required could be handled by a
        | single database, even SQL, and currently are split between
        | dozens of separate Mongo clusters.
       | 
       | "Event based consistency" leaves us in state where you can't
       | restore system to a stable, safe, consistent state. And of course
       | you have a lot more fun in debugging and developing, because you
       | (we, here) can't test locally anything. Hundreds of mini-clones
       | of prod setup running, wasting resources and always out of sync
       | are ready to see the change and tell you a little more than
       | nothing. Great DevEx...
        
       ___________________________________________________________________
       (page generated 2024-10-06 23:02 UTC)