[HN Gopher] Async message-oriented architectures compared to syn...
___________________________________________________________________
Async message-oriented architectures compared to synchronous REST-
based systems
Author : stolsvik
Score : 102 points
Date : 2023-02-12 18:48 UTC (4 hours ago)
(HTM) web link (mats3.io)
(TXT) w3m dump (mats3.io)
| mdaniel wrote:
| Making your own license is the new JavaScript framework, I guess:
| https://github.com/centiservice/mats3/blob/v0.19.4-2023-02-1...
| Pet_Ant wrote:
| Not open source.
|
| > Noncompete
|
| > Any purpose is a permitted purpose, except for providing to
| others any product that competes with the software.
| stolsvik wrote:
| I do not say that it is _Open Source_ either.
|
| From the front page: "Free to use, source on github.
| Noncompete licensed - PolyForm Perimeter."
|
| Feel free to comment on this: Is this a complete deal breaker
| for all potential users?
| mixedCase wrote:
| Can't speak for all potential users, but the license is in
| fact a complete deal-breaker for me and any client I've
| worked with given the FOSS tools available in the
| ecosystem.
|
| But then, there's also the "Java-only" which is a complete
| deal-breaker in any client I've worked with doing
| {micro,}services.
|
| Then there's the "what the hell does this actually do"
| deal-breaker when trying to explain it to some decision
| makers, and the "we already have queues and K8s to solve
| all of those issues" deal-breaker when explaining it to
| most fellow SWEs/SREs.
| stolsvik wrote:
| Hahaha, that's rough! :-)
|
| I'll tell you one thing: "What the hell does this
| actually do?!" is extremely spot on! I am close to amazed
| to how hard it is to explain this library. It really does
| provide value, but it is evidently exceptionally hard to
| explain.
|
| I first and foremost believe that this is due to the
| massive prevalence of sync REST/RPC style coding, and
| that messaging is only pulled up as a solution when you
| get massive influxes of e.g. inbound reports - where you
| actually want the _queue_ aspect of a message broker. Not
| the async-ness.
|
| I've tried to lay this out multiple times, e.g. here:
| https://mats3.io/docs/message-oriented-rpc/, and in the
| link for this post itself.
| lazyasciiart wrote:
| That comment is quoting from the Polyform license. If it
| doesn't represent your position, you may have made a bad
| choice in license.
| stolsvik wrote:
| I was referring to the "not open source". I edited my
| comment to be more specific.
| jgilias wrote:
| Your license would not be a dealbreaker for me in an SME
| commercial setting. AGPL would be a dealbreaker.
| stolsvik wrote:
| Thanks a bunch! Seriously. And I agree that AGPL is
| pretty harsh - I have a feeling that this is typically
| used in a "try before you buy" situation, where there is
| a commercial license on the side.
| tinco wrote:
| Well yeah of course, it's a direct contradiction of the
| rising tide lifts all boats principle. Do you think
| Kubernetes would have any traction at all if it had a
| clause that it couldn't be used on AWS?
|
| If it can't be adopted by the industry as a whole, then it
| can't be considered an industry standard. It wouldn't fly
| at my organization anyway, not even looking up what
| PolyForm is.
| stavros wrote:
| As far as I can see, this doesn't say it can't be used on
| AWS, it only says Amazon can't launch its own service
| that uses this software to compete with itself. It's too
| short to really tell what "compete" entails, though.
| stolsvik wrote:
| You are correct, this is meant as an AWS/GCP/Azure
| preventor. ElasticSearch situation. That is, AFAIU, the
| intention of the license I adopted. The "examples" part
| spells it pretty directly out, as I also try to do here:
| https://centiservice.com/license/
|
| You may definitely use it anywhere you like.
| stavros wrote:
| I'm really in favor of something like that. AWS using
| your own FOSS software to choke your revenue stream is a
| blight on FOSS, so good for you for using that license.
| stolsvik wrote:
| Thank you!
| mejutoco wrote:
| I see your point, and could not avoid thinking if
| ElasticSearch would have more revenue if AWS could not
| offer it directly.
| mixedCase wrote:
| Do you believe the fat ugly monster that is ElasticSearch
| would've had anywhere near its current adoption rates if
| it had a non-OSI license from the start?
|
| It would've been completely overshadowed by some other
| Lucene-based wrapper or maybe some even better
| alternative would've come along earlier.
| indymike wrote:
| I built on Elastic early on over Solr and several others
| because it was open source and seemed to be better. I
| would have selected a different Lucene wrapper if I had
| known where Elastic was going.
| mejutoco wrote:
| Algolia does pretty well I believe. I could be wrong.
| jagged-chisel wrote:
| It's muddy/unknown enough that no one in a commercial
| enterprise can entertain shipping a service using your
| project.
| marginalia_nu wrote:
| Honestly I'm pretty annoyed by the "how dare you give the
| source away under other terms than the ones I would
| prefer"-type reactions that crop up from time to time. It's
| an incredibly entitled attitude and is not a good look for
| the open source community in general.
|
| Like by all means, share code with GPL or Apache or MIT or
| whatever, but don't get mad when someone selects another
| license, including non-free ones with weird
| incompatibilities.
| jagged-chisel wrote:
| Those kinds of complaints are indeed entitled. At the
| same time, there's no problem pointing out that fewer
| people and organizations can select a dependency with an
| unconventional, unknown license.
|
| You're welcome to license your projects however you see
| fit. But when you get to a point that no one is using
| your stuff, you have to be ready to hear "it's the
| license."
| lazyasciiart wrote:
| Your comment is "how dare you complain about licensing".
| What you are responding to is "huh, weird license, won't
| use, that's a shame".
| indymike wrote:
| Yes, it is a deal breaker for me.
| delusional wrote:
| Well computing is all about redefining problems in terms of
| other atoms. A messaging service is really just a series of
| ALU operations and memory writes, which is then a series of
| nand gates.
|
| It seems incredibly muddy to me what "competing" would mean
| in that sense. If I make something with this it could be
| argued that my system built on top of MATS is just
| immaterial configuration that was intended to be done by
| the user. That the authors intention was for the end user
| to use MATS themselves, and that I'm therefore in
| competition with the product.
|
| A non programming example would be hammers and houses. You
| could imagine that if I build you a house, you'd be less
| likely to need to buy a hammer (to build your own) making
| my house competition for the hammer.
|
| I wouldn't touch this at all.
| stolsvik wrote:
| Not entirely my own:
| https://polyformproject.org/licenses/perimeter/1.0.0/
|
| https://centiservice.com/license/
| [deleted]
| mrkeen wrote:
| > Transactionality: Each endpoint has either processed a message,
| done its work (possibly including changing something in a
| database), and sent a message, or none of it.
|
| Sounds too good to be true. Would love to hear more.
| stolsvik wrote:
| Well, okay, you're right! It is _nearly_ true, though. :) I 've
| written a bit about it here: https://mats3.io/using-
| mats/transactions-and-redelivery/
| revskill wrote:
| In our production apps, all network issues is resolved by simple
| rate-limiter.
| guhcampos wrote:
| Oh yes, another "this thing I sell is an actual silver bullet"
| post.
|
| Message busses are great. RPC is too. There are use cases for
| both. Saying one is "better" than the other is silly, and in this
| case, a shame.
|
| There are loads of message passing libraries out there, based on
| all kinds of backend, from RabbitMQ, to NATS, to Redis, to Kafka.
| This does not innovate over anything, it's just shameless
| marketing.
| stolsvik wrote:
| This is unfair. I made Mats so that I could use messaging in a
| simpler form. Nothing else.
|
| Mats is an API that can be implemented on top of any _queue-
| based_ message broker - which excludes Kafka. But it definitely
| includes ApacheMQ (which is what we use), Artemis and hence
| RedHat 's MQ (which the tests runs), and RabbitMQ (whose JMS
| implementation is too limited to directly be used, but I do
| hope to implement Mats on top of it at some point). Probably
| also NATS. Probably also Apache Pulsar, which I just recently
| realized have a JMS client.
|
| You could even implement it on top of ZeroMQ, and implement it
| on top of any database, particularly Postgres since it has
| those "queue extensions" NOTIFY and SKIP LOCKED.
|
| edit: I actually have an feature-issue exploring such an
| implementation: https://github.com/centiservice/mats3/issues/15
| latchkey wrote:
| > _ApacheMQ_
|
| I hope ActiveMQ Artemis is better than the 'classic' version
| and that is what you're using. The last time I used it
| probably a decade ago now, there were so many issues with it,
| that it was a complete train wreck at scale. I would be very
| hesitant to pick that one up again.
| jeffbee wrote:
| If you pretend that your message bus has zero producer impedance
| and costs nothing then this analysis makes great sense. If you
| have ever operated or paid for this type of scheme in the real
| world then you will have some doubts.
| stolsvik wrote:
| I guess you'd say the same about cloud functions and lambdas,
| then? To which I agree.
|
| Paying per message would require the message cost to be pretty
| small. Might want to evaluate setting up a broker yourself if
| the cost starts getting high.
| robertlagrant wrote:
| Having done a reasonable amount of messaging code in my time, I
| would say the final form of this sort of thing might look more
| like Cadence[0] than anything like this.
|
| [0] https://github.com/uber/cadence
| stolsvik wrote:
| Cadence is a workflow management system. As is Temporal, Apache
| Beam, Airbnb Airflow, Netflix Conductor, Spotify Luigi, and
| even things like Github Actions, Google Cloud Workflows, Azure
| Service Fabric, AWS SWF, Power Automate.
|
| A primary difference is that those are _external systems_ ,
| where you define the flows inside that system - the system then
| "calling out" to get pieces of the flow done.
|
| Mats is an "internal" system: You code your flows inside the
| service. It is meant to directly replace synchronously calling
| out to REST services, instead enabling async messaging but with
| the added bonus of being _as simple as_ using REST services.
|
| But yes, I see the point.
| MuffinFlavored wrote:
| Is GitHub Actions really similar enough to Temporal/Cadence
| to be included in the list?
| stolsvik wrote:
| Hmm. Maybe not. But they sure have much in common: You
| define a set of things that should be done, triggered by
| something - either a schedule, an event (oftentimes a
| repository event, but it doesn't have to), or from another
| Github action.
| eBombzor wrote:
| Why is this better over Kafka?
| stolsvik wrote:
| As far as I understand, Kafka is positioning itself to be the
| leading _Event Sourcing_ solution.
|
| I view event sourcing to be fundamentally different from
| message passing. For a long time I tried to love event
| sourcing, but I see way to many problems with it. The primary
| problem I see is that you then end up with a massive source of
| events, which any service can subscribe to as they see fit. How
| is this different from having one gigantic spaghetti database?
| Also, event migrations over time.
|
| RPC and messaging feels to me to be much clearer separated: I
| own the Accounts, and you own the Orders. We explicitly
| communicate when we need to.
|
| I see benefits on both sides, but have firmly landed on _not_
| event sourcing.
| hbrn wrote:
| 1. Anything that is connected to user interface should be
| synchronous by default.
|
| 2. You can't predict which parts of your system will be connected
| to user interface.
|
| 3. Here's the worst part: _async messaging is viral_. A service
| that depends on async service becomes async too.
|
| You should be very cautions introducing async messaging to your
| systems. The only parts that should be allowed to be async are
| the ones that can afford to fail.
|
| I spend good amount of time trying to work around these dumb
| enterprise patterns when building products on top of async APIs.
| You are literally forced to build inferior products just because
| someone thought that async messaging is so great. It's great for
| everybody, _except the final user_.
|
| Async processing is not a virtue, it's a necessity for high
| load/high throughput systems.
|
| The reason SOA failed many years ago is precisely the async
| message bus.
| stolsvik wrote:
| We clearly do not agree.
|
| Wrt. sync processing when using Mats:
| https://mats3.io/docs/sync-async-bridge/
|
| But my better solution is instead to pull the async-ness all
| the way out to the client: https://matssocket.io/
|
| Also, I have another take on the SOA failure, mentioned here:
| https://mats3.io/about/
|
| It was definitely not because of async, at least as I remember
| it.
| SpaghettiX wrote:
| I appreciate some events can be asynchronous for clients, for
| example: actions taken by other users, or events generated by
| the system. However, I do think implementation details (using
| async in the server) should be encapsulated from clients:
| when users save a new document, it's much easier for the
| client to receive a useful albeit delayed response, rather
| than "event submitted", wait for the result on a stream. Of
| course, other relevant clients may need to hear about that
| event too. The service architecture should not affect / make-
| life-harder for clients.
|
| Therefore I think disagree with both parent and grandparent
| comments. Use each when they make sense, not "synchronous by
| default" (grandparent comment, though I do think there are
| good points made), or "asynchronous based on service
| architecture" (parent comment).
|
| > But my better solution is to pull the async-ness all the
| way out to the client: https://matssocket.io/
|
| Is that a solution that you use? I took a look at matssocket
| https://www.npmjs.com/package/matssocket, it currently has 2
| weekly downloads. :thinking:.
| stolsvik wrote:
| To make a point out of it: This is not _event based_ in the
| event sourcing way of thinking. It is using messages. You
| put a message on a queue, someone else picks it up. Mats
| implements a request /reply paradigm on top ("messaging
| with a call stack").
|
| In the interactive, synchronous situation, you do not "wait
| for an event" per se. You wait for a specific reply. When
| using the MatsFuturizer (https://mats3.io/docs/sync-async-
| bridge/), it is _extremely_ close to how you would have
| used a HttpClient or somesuch.
|
| MatsSocket: The Dart/Flutter implementation is used in a
| production mobile app. For the Norwegian market only,
| though.
|
| The JS implementation is used in an internal solution.
|
| Would have been really nice with a bit more usage, yes. It
| is actually pretty nice, IMHO! ;-)
| toast0 wrote:
| > Async processing is not a virtue, it's a necessity for high
| load/high throughput systems.
|
| > 1. Anything that is connected to user interface should be
| synchronous by default.
|
| If everything UI is synchronous, you prevent users from
| acheiving high throughput. Sometimes that's fine, but sometimes
| it's not.
|
| It's simple to wait for a response to a request sent via
| asynchronous messaging. It's not simple to split a synchronous
| API into send and receive parts. However, REST is HTTP and
| there's lots of async HTTP libraries out there.
| samsquire wrote:
| Thanks for this.
|
| I love the idea of breaking up a flow into separately scheduled
| but still linear message flow.
|
| I wrote about a similar idea in ideas2
|
| https://github.com/samsquire/ideas2#84-communication-code-sl...
|
| The idea is that I enrich my code with comments and a transpiler
| schedules different parts of the code to different machines and
| inserts communication between blocks.
|
| I read about how Zookeeper algorithm for transactionality and
| robustness to messages being dropped, which is interesting
| reading.
|
| https://zookeeper.apache.org/doc/r3.4.13/zookeeperInternals....
|
| How does Mats compare?
|
| LMAX disruptor has a pattern where you split up each side of an
| IO request into two events, to avoid blocking in an handler. So
| you would always insert a new event to handle an IO response.
| derefr wrote:
| > Back-pressure (e.g. slowing down the entry-points) can easily
| be introduced if queues becomes too large.
|
| ...which presumably includes load-shedding to stop misbehaving
| components from overloading the queues; at which point, unless
| you want clients to just lose track of the things they wanted
| done when they get a "we're too busy to handle this right now"
| response, you've essentially circled back around to clients
| having to use a client with REST-like "synchronous/blocking
| requests with retry/backpressure" semantics -- just where the
| requests that are being synchronously-blocked on are "register
| this as a work-item and give me an ID to check on its status"
| rather than "do this entire job and tell me the result."
|
| And if you're doing that, why force the client to think in terms
| of async messaging at all? Just let them do REST, and hide the
| queue under the API layer of the receiver.
| Supermancho wrote:
| > you've essentially circled back around to clients having to
| use a client with REST-like "synchronous/blocking requests with
| retry/backpressure" semantics
|
| Yes, they both do the same thing. That's not even the starting
| point of the discussion. The implementation from HTTP to a
| message queue (mailbox system) is the discussion point.
|
| Having the caller (who needs work done) wait to be informed
| when the work is done (or not done) is less deterministic than
| telling the callee how long before the work doesn't matter
| anymore. The callee gives back a transaction ID/is provided a
| callerID or is unavailable, and the caller knows (very quickly)
| it's not going to get done or knows where to look for the work
| (or abandon it). Either way, it allows for optimization on both
| sides.
| tass wrote:
| This is where I always end up. You can have queues which give
| you certain benefits, but there'a a lot of stuff to be built on
| top to make it as operationally simple as http.
| stolsvik wrote:
| I will argue that this simplicity is exactly what Mats
| provides. At least that is the intention.
| revskill wrote:
| I don't see the code on webpage to explain things.
| Simplicity means you can explain complex things with simple
| code.
|
| Because English is ambigous and subjective. Just use code ?
| stolsvik wrote:
| There is code here: https://mats3.io/docs/message-
| oriented-rpc/ .. and here: https://mats3.io/docs/mats-
| flow-initiation/ .. and here: https://mats3.io/docs/sync-
| async-bridge/ .. and here:
| https://mats3.io/docs/springconfig/ .. and here:
| https://mats3.io/background/what-is-mats/ .. and here:
| https://mats3.io/using-mats/endpoints-and-initiations/
|
| .. and on the github page here:
| https://github.com/centiservice/mats3/blob/main/README.md
|
| .. and you are advised to explore the code here:
| https://mats3.io/docs/explore/
| cerved wrote:
| yes but there's sadly no code in what you posted
| charrondev wrote:
| The system I'm currently on is currently moving a lot of work
| into queues. Some operations, like "change the criteria of this
| rank" could be anywhere between 5 seconds (if the number of
| users of the criteria to evaluate are small) or 10+ hours if we
| need re-evaluate the rules against 10m+ users.
|
| In this case we write our jobs as generators that can be
| paused, serialized and picked up again later. We give the job 5
| seconds synchronously, then if it passes that time, queue and
| job and let the client know a job has been registered.
|
| The users account holds the IDs of the jobs as well as some
| basic information about the the tasks they have queued. There
| is a rest endpoint to return the current status of the jobs and
| information about them (what are they doing, what's their
| progress, how much work remains).
|
| The client will negotiate a web socket connection with a
| different service to be notified whenever progress is made on
| the job and the client can then check the endpoint for the
| latest status.
| latchkey wrote:
| That 5 seconds is going to bite you.
|
| There is going to be some sort of stall in the future that
| causes all of your jobs to hit that 5 seconds and everything
| is going to start to back up and cause other problems up the
| line that are really hard to test for in advance.
|
| You're better off designing a system that doesn't rely on
| some arbitrary number of seconds (why not 4 or 6 seconds?) to
| begin with.
| naasking wrote:
| Yes, non-determinism is the bane of distributed systems. It
| should be minimized whenever possible.
| naasking wrote:
| > And if you're doing that, why force the client to think in
| terms of async messaging at all? Just let them do REST, and
| hide the queue under the API layer of the receiver.
|
| Yes, exactly. And on top of that, async messaging implicitly
| introduces DoS vulnerabilities exactly because of the buffering
| required. At least with sync messaging exposing a queue in the
| API layer, you opt-into this vulnerability.
| stolsvik wrote:
| As mentioned here: https://mats3.io/background/system-of-
| services/
|
| .. Mats is meant to be an inter-service communication
| solution.
|
| It is explicitly _not_ meant to be your front-facing
| endpoints. If you are DoS 'ed, it would be from your own
| services. Of course, that might still happen, but then things
| would not have been much better if you used sync comms.
|
| It is true that you can bridge from sync to the async world
| of Mats using the MatsFuturizer (https://mats3.io/docs/sync-
| async-bridge/), but then you still have your e.g. Servlet
| Container as the front-facing entity.
|
| (Also check out https://matssocket.io/, though)
| stolsvik wrote:
| Well, yes - there is nothing with Mats that you cannot do with
| any other communication form, if you code it up. When you say
| "register this as a work-item and give me an ID to check on its
| status", you've implemented a queue, right?
|
| The intention is that Mats gives you an easy way to perform
| async message-oriented communications. Somewhat of a bonus, you
| can also use it for synchronous tasks, using the MatsFuturizer,
| or MatsSocket. A queue can handle transient peaks of load much
| better than direct synchronous code. It is also quite simple to
| scale out. But if you do get into problems of getting too much
| traffic for the system to process, you will have to handle that
| - and Mats does not currently have any magic for performing
| e.g. load shedding, so you're on your own. (I have several
| thoughts on this. E.g. monitor the queue sizes, and deny any
| further initiations if the queues are too large).
|
| Wrt. synchronous comms, Mats do provide a nice feature, where
| you can mark a Mats Flow as "interactive", meaning that some
| human is waiting for the result. This results in the flow
| getting priority on every stage it passes through - so that if
| it competes with internal, more batchy processes, it will cut
| the lines.
| derefr wrote:
| > A queue can handle transient peaks of load much better than
| direct synchronous code.
|
| Whether a workload is being managed upon creation using a
| work queue within the backend, has nothing to do with the
| semantics of the communications protocol used to talk about
| the state of said workload. You can arbitrarily combine these
| -- for example, DBMSes have the unusual combination of having
| a stateful connection-oriented protocol for scheduling
| blocking workloads, but also having the ability to introspect
| the state of those ongoing workloads with queries on other
| connections.
|
| My point is that clients in a distributed system can
| literally never do "fire and forget" messaging _anyway_ --
| which is the supposed advantage of an "asynchronous message-
| oriented communications" protocol over a REST-like one. Any
| client built to do "fire and forget" messaging, when used at
| scale, always, always ends up needing some sort of outbox-
| queue abstraction, where the outbox controller is internally
| doing synchronous blocking retries of RPC calls to get an
| acknowledgement that a message got safely pushed into the
| queue and can be locally forgotten.
|
| And that "outbox" is a _leaky abstraction_ , because in
| trying to expose "fire and forget" semantics to its caller,
| it has no way of imposing backpressure on its caller. So the
| client's outbox overflows. Every time.
|
| This is why Google famously switched every internal protocol
| they use _away_ from using message queues /busses with
| asynchronous "fire and forget" messaging, _toward_
| synchronous blocking RPC calls between services. With an
| explicitly-synchronous workload-submission protocol (which
| may as well just be over a request-oriented protocol like
| HTTP, as gRPC is), all operational errors and backpressure
| get bubbled back up from the workload-submission client
| library to its caller, where the caller can then have logic
| to decide the business-logic-level response that is most
| appropriate, for each particular fault, in each particular
| calling context.
|
| Message queues are the quintessential "smart pipe", trying to
| make the network handle all problems itself, so that the
| nodes (clients and backends) connected via such a network can
| be naive to some operational concerns. But this will never
| truly solve the problems it sets out to solve, as the _policy
| knowledge_ to properly drive the decision-making for the
| _mechanism_ that handles operational exigencies in message-
| handling, isn 't available "within the network"; it lives
| only at the edges, in the client and backend application code
| of each service. Those exigencies -- those failures and edge-
| case states -- must be pushed out to the client or backend,
| so that policy can be applied. And if you're doing that, you
| may as well move the mechanism to enforce the policy there,
| too. At which point you're back to a dumb pipe, with smart
| nodes.
| jgilias wrote:
| Is there something I can read about Google switching to
| sync RPC? Like a blog post or something like that?
|
| Thanks!
| stolsvik wrote:
| "Not everybody is Google"
|
| These concepts has worked surprisingly well for us for
| nearly a decade. We're not Google-sized, but this
| architecture should work well for a few more orders of
| magnitude traffic.
|
| Also, you can mix and match. If you have some parts of your
| system with absolutely massive traffic, then don't use this
| there, then.
|
| Note that we very seldom use "fire and forget" (aka
| "send(..)"). We use the request-replyTo paradigm much more.
| Which is basically the basic premise of Mats, as an
| abstraction over pure "forward-only" messaging.
| derefr wrote:
| > Note that very we use "fire and forget" very seldom
| (aka "send(..)"). We use the request-replyTo paradigm
| much more. Which is basically the basic premise of Mats,
| as an abstraction over pure "forward-only" messaging.
|
| That doesn't help one bit. You're still firing-and-
| forgetting the request itself. The reply (presumably with
| a timeout) ensures that the client doesn't sit around
| forever waiting for a lost message; but it does nothing
| prevent badly-written request logic from overloading your
| backend (or overloading the queue, or "bunging up" the
| queue such that it'll be ~forever before your backend
| finishes handling the request spike and gets back to
| processing normal workloads.)
|
| > If you have some parts of your system with absolutely
| massive traffic, then don't use this there, then.
|
| I'm not talking about massive _intended_ traffic. These
| problems come from _failures in the architecture of the
| system to inherently bound requests to the current scale
| of the system_ (where autoscaling changes the "current
| scale of the system" before such limits kick in.)
|
| So, for example, there might be an endpoint in your
| system that allows the caller to trigger logic that does
| O(MN) work (the controller for that endpoint calls
| service X O(M) times, and then for each response from X,
| calls service Y O(N) times); where it's fully expected
| that this endpoint takes 60+ seconds to return a
| response. The endpoint was designed to serve the need of
| some existing internal team, who calls it for reporting
| once per day, with a batch-size N=2. But, unexpectedly, a
| new team, building a new component, with a new use-case
| for the same endpoint, writes logic that begins calling
| the endpoint once every 20 seconds, with a batch-size of
| 20. Now the queues for the services X and Y called by
| this endpoint are filling faster than they're emptying.
|
| No DDoS is happening; the requests are quite small, and
| in networking terms, quite sparse. Everything is working
| as intended -- and yet it'll all fall over, because
| you've chosen yourself into a protocol where there's no
| _inherent, by-default_ mechanism for "the backend is
| overloaded" to apply backpressure to make _new requests
| from the frontend_ stop coming (as it would in a
| synchronous RPC protocol, where 1. you can 't submit a
| request on an open socket when it's in the "waiting for
| reply" state; and 2. you can't get a new open socket if
| the backend isn't calling accept(2)); and you didn't
| think that this endpoint would be one that gets called
| much, so you didn't bother to think about explicitly
| implementing such a mechanism.
| stolsvik wrote:
| Relying on the e.g. Servlet Container not being able to
| handle requests seems rather bad to me. That is a very
| rough error handling.
|
| We seem to have come to the exact opposite conclusions
| wrt. this. Your explanations are entirely in line with
| mine, but I found this "messy" error handling to be
| exactly what I wanted to avoid.
|
| There is one particular point where we might not be in
| line: I made Mats first and formost _not_ for the
| synchronus situation, where there is a user waiting. This
| is the "bonus" part, where you can actually do that with
| the MatsFuturizer, or the MatsSocket.
|
| I first and foremost made it for internal, batch-like
| processes like "we got a new price (NAV) for this fund,
| we now need to settle these 5000 waiting orders". In that
| case, the work is bounded, and an error situation with
| not-enough-threads would be extremely messy. Queues
| solves this 100%.
|
| I've written some about my thinking on the About page:
| https://mats3.io/about/
| derefr wrote:
| > Relying on the e.g. Servlet Container not being able to
| handle requests seems rather bad to me. That is a very
| rough error handling.
|
| It's one of those situations where the simplest "what you
| get by accident with a single-threaded non-evented
| server" solution, and the most fancy-and-complex
| solution, actually look alike from a client's
| perspective.
|
| What you actually want is that each of your backends
| monitors its own resource usage, and flags itself as
| unhealthy in its readiness-check endpoint when it's
| approaching its known per-backend maximum resource
| capacity along any particular dimension -- threads,
| memory usage, DB pool checked-out connections, etc.
| (Which can be measured quite predictably, because you're
| very likely running these backends in containers or VMs
| that enforce bounds on these resources, and then scaling
| the resulting predictable-consumption workload-runners
| horizontally.) This readiness-check failure then causes
| the backend to be removed from consideration as an
| upstream for your load-balancer / routing target for your
| k8s Service / etc; but existing connected flows continue
| to flow, gradually draining the resource consumption on
| that backend, until it's low enough that the backend
| begins reporting itself as healthy again.
|
| Meanwhile, if the load-balancer gets a request and finds
| that it currently has _no_ ready upstreams it can route
| to (because they 're all unhealthy, because they're all
| at capacity) -- then it responds with a 503. Just as if
| all those upstreams had crashed.
|
| > Your explanations are entirely in line with mine, but I
| found this "messy" error handling to be exactly what I
| wanted to avoid.
|
| Well, yes, but that's my point made above: this error
| handling is "messy" precisely because it's an _encoding
| of user intent_. It 's irreducible complexity, because
| it's something where you want to make the decision of
| what to do differently in each case -- e.g. a call from A
| to X might consider the X response critical (and so
| failures should be backoff-retried, and if retries
| exceeded, the whole job failed and rescheduled for
| later); while a call from B to X might consider the X
| response only a nice-to-have optimization over
| calculating the same data itself, and so it can try once,
| give up, and keep going.
|
| > I made Mats first and formost not for the synchronus
| situation, where there is a user waiting.
|
| I said nothing about users-as-in-humans. We're presumably
| both talking about a Service-Oriented Architecture here;
| perhaps even a microservice-oriented architecture. The
| "users" of Service X, above, are Service A and Service B.
| There's a Service X client library, that both Service A
| and Service B import, and make calls to Service X
| through. But these are still, necessarily, _synchronous_
| requests, since the further computations of Services A
| and B are _dependent on_ the response from Service X.
|
| Sure, you can queue the requests to Services A and B as
| long as you like; but _once they 're running_, they're
| going to sit around waiting on the response from Service
| X (because they have nothing better to be doing while the
| Service X response-promise resolves.) Whether or not the
| Service X request is synchronous or asynchronous doesn't
| matter to them; they have a synchronous (though not
| timely) _need_ for the data, within their own
| asynchronous execution.
|
| Is this not the common pattern you see for inter-service
| requests within your own architecture? If not, then what
| is?
|
| If what you're really talking about here is forward-only
| propagation of values -- i.e. never needing _a response_
| (timely or not) from most of the messages you send in the
| first place -- then you 're not really talking about a
| messaging protocol. You're talking about a dataflow
| programming model, and/or a distributed CQRS/ES event
| store -- both of which can and often are implemented on
| top of message queues to great effect, and neither of
| which purport to be sensible to use to build RPC request-
| response code on top of.
| stolsvik wrote:
| To your latter part: This is exactly the point: Using
| messages and queues makes the flows take whatever time it
| takes. Settling of the mentioned orders are not time
| critical - well, at least not in the way a request from a
| user sitting on his phone logging in to see his holdings
| is time critical. So therefore, if it takes 1 second, or
| 1 hour, doesn't matter all that much.
|
| The big point is that _none_ of the flows will fail. They
| will all pass through _as fast as possible_ , literally,
| and will never experience any failure mode resulting from
| randomly exhausted resources. You do not need to make any
| precautions for this - backpressure, failure handling,
| retries - as it is inherent in how a messaging-based
| system works.
|
| Also, if a user logs into the system, and one of the
| login-flows need a same service as the settling flows,
| then that flow will "cut the line" since they are marked
| "interactive".
| bcrosby95 wrote:
| RPC works for just non Google and Google scale. This is
| one of the times where, IMHO, you can skip the middle
| section. Novices resort to RPC, Google resorts to RPC,
| and in the mid tier you have something where messaging
| can step in.
|
| Why not skip it? Use RPC like a novice. If it becomes
| problematic, start putting in compensating measures.
| peoplefromibiza wrote:
| > And if you're doing that, why force the client to think in
| terms of async messaging at all? Just let them do REST, and
| hide the queue under the API layer of the receiver.
|
| because REST is stupid
|
| REST is request -> response, single connection, one direction
| only, which is a very limited way to model messaging .
|
| There is more than one communication mode and bidirectional
| messaging is a thing.
|
| REST also offers no control whatsoever over the communication
| channel, so you are stuck with the configuration set on the
| server side
|
| which might or might not be correct for your use case
|
| See RSocket for an example of a message driven protocol which
| solves most of the shortcomings of REST
|
| on the bright side REST is also stupid simple
|
| which is why is so widely deployed, it doesn't require thinking
|
| > response, you've essentially circled back around to clients
| having to use a client with REST-like
|
| no, because you did not block there waiting for the timeout
| which defaults to 30 seconds for HTTP
|
| and even if you abandon on the client side, the server will
| still process the request, there's no way to abort it once it's
| been started.
| naasking wrote:
| > REST is request -> response, single connection, one
| direction only, which is a very limited way to model
| messaging
|
| You seem to be implying that being limited is a bad thing.
| Constraints are important to keep problems tractable.
| klabb3 wrote:
| In my experience the amount of serialized (network
| blocking) calls needed under the request reply paradigm
| always grows over time as the application gets larger.
|
| At least this limitation can cause massive complexity once
| perf optimizations are needed. I think that's important to
| factor in when we're talking about the issues with large
| and resilient systems in either paradigm.
|
| Personally I like message passing because it's more true to
| the underlying protocol (TCP or UDP) and actually interops
| quite well (all things considered) with request-reply
| systems - it just requires two separate messages and a
| request id which is standard practice in request-response
| anyway. The inverse is not true though: we have like 10
| different Jacky solutions in the last decade for sending
| server initiated messages to clients.
| rkangel wrote:
| If I understand this right, this is basically the Erlang/Elixir
| OTP programming model, but across microservices rather than
| across a single (potentially distributed) VM. To be clear - that
| is a _good_ thing.
|
| One of the core concepts of OTP (effectively the Erlang standard
| library) is the GenServer. A GenServer processes incoming
| messages, mutates state if appropriate and sends responses. The
| OTP machinery means that this "send a message and wait for a
| response" is just a straight function call with return value to
| the caller. OTP takes care of all the edge cases (like when the
| process at the other end goes away half way through). This means
| that your code is just a straight series of synchronous function
| calls, which may be sending messages underneath to do things or
| get data, but you don't have to care. It's a lovely system to
| work in, and makes complicated systems feel simple.
|
| The elements communicating are in Erlang terminology 'processes'
| - but not OS processes, they are instead lightweight userspace
| schduled things - very lightweight to create. Erlang has built in
| distribution that allows you to connect multiple running
| machines, and then the same message passing works across network
| boundaries. You're still limited to the BEAM VM though. This is
| the 'full' microservice version of that.
| lp4vn wrote:
| I think this article is kind of misleading.
|
| You use messaging for asynchronous communication and REST for
| synchronous communication. The article makes me believe that
| using REST for synchronous communication is a kind of deprecated
| alternative in front of message passing.
| zmmmmm wrote:
| This reminds me more of Apache Camel[0] than other things it's
| being compared to.
|
| > The process initiator puts a message on a queue, and another
| processor picks that up (probably on a different service, on a
| different host, and in different code base) - does some
| processing, and puts its (intermediate) result on another queue
|
| This is almost exactly the definition of message routing (ie:
| Camel).
|
| I'm a bit doubtful about the pitch because the solution is
| presented as enabling you to maintain synchronous style
| programming while achieving benefits of async processing. This
| just isn't true, these are fundamental tradeoffs. If you need a
| synchronous answer back then no amount of queuing, routing,
| prioritisation, etc etc will save you when the fundamental
| resource providing that is unavailable, and the ultimate outcome
| that your synchronous client now hangs indefinitely waiting for a
| reply message instead of erroring hard and fast is not desirable
| at all. If you go into this ad hoc, and build in a leaky
| abstraction that asynchronous things are are actually synchronous
| and vice versa, before you know it you are going to have unstable
| behaviour or even worse, deadlocks all over your system and the
| worst part - the true state of the system is now hidden in which
| messages are pending in transient message queues everywhere.
|
| What really matters here is to fundamentally design things from
| the start with patterns that allow you to be very explicit about
| what needs to be synchronous vs async (building on principles of
| idempotency, immutability, coherence, to maximise the cases where
| async is the answer).
|
| The notion of Apache Camel is to make all these decisions a first
| class elements of your framework and then to extract out the
| routing layer as a dedicated construct. The fact it generalises
| beyond message queues (treating literally anything that can
| provide a piece of data as a message provider) is a bonus.
|
| [0] https://camel.apache.org/
| hummus_bae wrote:
| > The ultimate outcome that your synchronous client now hangs
| indefinitely waiting for a reply message instead of erroring
| hard and fast is not desirable at all.
|
| Async frameworks don't eliminate the possibility of long
| running processes, that continue to process long after
| responding a request - this is still possible with specific
| libraries/frameworks, they'll only take away the synchronous
| interface and provide as asychonous one instead.
|
| It is also important to note that error handling will be
| different between these 2 paradigms and it's important,
| whatever the most suitable one is, to acknowledge this since it
| forces us (developers) to handle the potential errors
| differently depending on the approach we choose.
| zmmmmm wrote:
| I think you're stating exactly my point?
|
| The pitch of MATS is that it let's:
|
| > developers code message-based endpoints that themselves may
| "invoke" other such endpoints, in a manner that closely
| resembles the familiar synchronous "straight down" linear
| code style
|
| In other words, they want to encourage you to feel like you
| are coding a synchronous workflow while actually coding an
| asynchronous one. You are pointing out that error handling
| needs to be different b/w these paradigms and you are
| correct, but that is only the start of it. A framework that
| papers over the differences is at very high risk of just
| creating a massive number of leaky abstractions that don't
| show up in the happy scenario but come back and bite you
| heavily when things go wrong.
|
| (I'm saying this as a long time user of Camel which models
| this exact concept heavily and also experiences many of these
| issues)
| stolsvik wrote:
| Hmm. I want to distance this library pretty far from Camel!
|
| Wrt. "papering over": Not really. I make it "feel like"
| you're coding straight down, sequential, linear, _as if_
| you 're coding synchronously.
|
| But if you look at the examples, e.g. at the very start of
| the Walkthrough: https://mats3.io/docs/message-oriented-
| rpc/, you'll understand that you are actually coding
| completely message-driven: Each stage is a completely
| separate little "server", picking up messages from one
| queue, and most often putting a new message onto another
| queue.
|
| It is true that the error handling is very different. Don't
| code errors! You cannot throw an exception back to the
| caller. You can however make for "error return" style DTOs,
| but otherwise, if you have an actual error, it'll "pop out"
| of the _Mats Fabric_ and end up on a DLQ. This is nice! It
| is not just a WARN or ERROR log-line in some log that no-
| one will see until way later, if ever: It immediately
| demands your attention.
|
| I wrote quite a long answer to something similar on a
| Reddit-thread a month ago: https://www.reddit.com/r/program
| ming/comments/1059jpv/messag...
| zmmmmm wrote:
| > Hmm. I want to distance this library pretty far from
| Camel!
|
| I'm curious what makes you distinguish it heavily from
| Camel? From everything you say it sounds to me like you
| are building Camel - or at least the routing part of it
| :-)
| ngrilly wrote:
| How is it different from NATS.io that solves most of the problems
| listed? (except the transactional aspect but I'm not convinced
| it's a good thing to have the same tool do everything)
| stolsvik wrote:
| I see references to NATS multiple times, but I fail to see how
| it solves what Mats aims to solve?
|
| Mats could be implemented on top of NATS, i.e. use NATS as a
| backend, instead of JMS. (We use ActiveMQ as the broker)
| adamckay wrote:
| The articles notes about async messaging architectures being
| superior to REST-based systems seems rather disingenuous, in my
| opinion, as it's seemingly only considering the most basic REST
| API deployed on a single node as the alternative.
|
| For example:
|
| > High Availability: For each queue, you can have listeners on
| several service instances on different physical servers, so that
| if one service instance or one server goes down, the others are
| still handling messages.
|
| This is negated in a REST-based system with the use of an API
| gateway / simple load balancer and multiple upstream nodes.
|
| > Location Transparency [Elastic systems need to be adaptive and
| continuously react to changes in demand, they need to gracefully
| and efficiently increase and decrease scale.]: Service Location
| Discovery is avoided, as messages only targets the logical queue
| name, without needing information about which nodes are currently
| consuming from that queue.
|
| Fair enough, service discovery is another challenge, but it's not
| hugely complex with modern API gateways and arguably no more
| complex with running and maintaining a message queue with
| associated workers. You've also got a risk in a distributed
| messaging system used by multiple teams that one service
| publishes messages into a queue that has been deprecated and has
| no consumers listening anymore.
|
| > Scalability / Elasticity: It is easy to increase the number of
| nodes (or listeners per node) for a queue, thereby increasing
| throughput, without any clients needing reconfiguration. This can
| be done runtime, thus you get elasticity where the cluster grows
| or shrinks based on the load, e.g. by checking the size of
| queues.
|
| Same as HA, solved with a load balancer.
|
| > Transactionality: Each endpoint has either processed a message,
| done its work (possibly including changing something in a
| database), and sent a message, or none of it.
|
| > Resiliency / Fault Tolerance: If a node goes down mid-way in
| processing, the transactional aspect kicks in and rolls back the
| processing, and another node picks up. Due to the automatic
| retry-mechanism you get in a message based system, you also get
| fault tolerance: If you get a temporary failure (database is
| restarted, network is reconfigured), or you get a transient error
| (e.g. a concurrency situation in the database), both the database
| change and the message reception is rolled back, and the message
| broker will retry the message.
|
| These seem to be arguing the same point, and perhaps this is
| solved in the Mats library but as a general advantage of async
| message queues over synchronous REST calls, the message broker
| retrying the message or messages being lost isn't a given -
| they're difficult to get entirely right in both architectures.
|
| > Monitoring: All messages pass by the Message Broker, and can be
| logged and recorded, and made statistics on, to whatever degree
| one wants.
|
| >Debugging: The messages between different parts typically share
| a common format (e.g. strings and JSON), and can be inspected
| centrally on the Message Broker.
|
| Centralising via an API gateway can also offer these.
| stolsvik wrote:
| Well. My point is that messaging _inherently_ have all these
| features, without needing any other tooling.
|
| The combination of transactionality and retrying is hard to
| achieve with REST, don't you think? It is actually pretty
| mesmerizing how our system handles screwups like a database
| going down, or some nodes crashing, or pretty much any failure:
| The flows might stop up for a few moments, but once things are
| back in place, all the flows just complete as if nothing
| happened. I shudder when thinking of how we would have handled
| such failures if we used sync processing.
|
| The one big deal is the concept of "state is on the wire": The
| process/flow "lives in the message" - not as a transient
| memory-bound concept on the stack of a thread.
| ay wrote:
| Makes me think of https://grugbrain.dev/:
|
| Microservices
|
| grug wonder why big brain take hardest problem, factoring system
| correctly, and introduce network call too
|
| seem very confusing to grug
| eterps wrote:
| "better than RPC" would be a more accurate title.
| stolsvik wrote:
| Well, I actually call Mats "Message-Oriented Async RPC".
| eikenberry wrote:
| Reminds me of the old Protocol vs. API debate.
|
| http://wiki.c2.com/?ApiVsProtocol=
| stolsvik wrote:
| Isn't that more of the difference between JMS and AMQP?
| weatherlight wrote:
| _chuckles in erlang_
| drkrab wrote:
| Yep
| weatherlight wrote:
| "Virding's First Rule of Programming:
|
| Any sufficiently complicated concurrent program in another
| language contains an ad hoc informally-specified bug-ridden slow
| implementation of half of Erlang." -- Robert Virding
| nitwit005 wrote:
| > Messaging naturally provides high availability, scalability,
| location transparency, prioritization, stage transactionality,
| fault tolerance, great monitoring, simple error handling, and
| efficient and flexible resource management.
|
| What is "stage transactionality"? If I do a Google search for it,
| I just find this page.
| stolsvik wrote:
| Hehe, okay. It was meant to mean "Each stage is processed in a
| transaction". Kinda hard to get down into a list. But my
| wording evidently didn't make anything clearer!
|
| If you read a few more pages, then it should hopefully become
| clearer. This page is specifically talking about it:
| https://mats3.io/using-mats/transactions-and-redelivery/ - but
| as it is one of the primary points of why I made Mats, it is
| mentioned multiple places, e.g. here:
| https://mats3.io/background/what-is-mats/
|
| This is not Mats-specific - it is directly using functionality
| provided by the message broker, via JMS.
___________________________________________________________________
(page generated 2023-02-12 23:00 UTC)