[HN Gopher] Reliable event dispatching using a transactional outbox
___________________________________________________________________
Reliable event dispatching using a transactional outbox
Author : losfair
Score : 27 points
Date : 2023-02-02 06:35 UTC (2 days ago)
(HTM) web link (blog.frankdejonge.nl)
(TXT) w3m dump (blog.frankdejonge.nl)
| svieira wrote:
| This is interesting, but you've not actually solved the problem,
| just moved it. You still need cross-service transactions to
| publish the event only-once. Consider the case of "Publish the
| event to the queue. Fail to update / delete the entry in the
| event-buffer table." This is the "bad" pattern of push-then-store
| (with one exception - if you are fine with at-least-once message
| delivery instead of only-once). Likewise the "good" pattern of
| store-then-push has the same failure mode "Delete the entry from
| the buffer. Fail to publish the entry to the queue".
|
| That said, this does decouple the two operations, which allows
| you to scale the publish side of the service separately from the
| produce side (which can help when your architecture can produce
| multiple messages per storage event).
| ryanjshaw wrote:
| Maybe you are using different terminology, because IMO you
| don't need "cross-service transactions to publish the event
| only-once"
|
| "Exactly-once" message _processing_ is what you typically want
| in business operations, and that is straightforward to achieve
| with "at-least-once" message _delivery_:
|
| 1. send(message with unique .id):
|    a. insert into DB queue where not exists message.id
|
| 2. dispatcher:
|    a. fetch messages from DB queue where unpublished = 1
|    b. publish message to integration queue [at-least-once
|       delivery]
|    c. update message set unpublished = 0 (and ideally record
|       some audit information, e.g. the sent timestamp + Kafka
|       message partition/offset or IBM MQ message ID)
|
| 3. receive(message):
|    a. check if message.id has been processed [exactly-once
|       processing]
|    b. perform operation, including linking it back to
|       message.id, in a single DB transaction
|
| If failure occurs:
|
| 1. between 2.a and 2.b, you fetch messages again and nothing
|    is lost
|
| 2. between 2.b and 2.c, you republish the same message; 3.a
|    ensures exactly-once *processing*
|
| 3. between 3.a and 3.b, you retry the message (with IBM MQ,
|    you use transactions or peek then read; with Kafka, you
|    don't update offsets until 3.b completes)
|
| 4. on the integration queue: set unpublished = 1 on all
|    messages published in the timeframe where the integration
|    queue lost data
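[Editor's note: the send/dispatch steps described above can be sketched as
follows. This is a minimal illustration using SQLite, with a plain Python
list standing in for the integration queue; the table and column names are
made up for the example, not taken from the comment.]

```python
import sqlite3

# In-memory stand-in for the integration queue (e.g. Kafka, IBM MQ).
integration_queue = []

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE outbox (
        id TEXT PRIMARY KEY,          -- unique message id (step 1.a)
        payload TEXT NOT NULL,
        unpublished INTEGER NOT NULL DEFAULT 1
    )
""")

def send(message_id, payload):
    # Step 1.a: insert only if the id does not already exist.
    db.execute(
        "INSERT OR IGNORE INTO outbox (id, payload) VALUES (?, ?)",
        (message_id, payload),
    )
    db.commit()

def dispatch():
    # Step 2.a: fetch unpublished messages.
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE unpublished = 1"
    ).fetchall()
    for message_id, payload in rows:
        # Step 2.b: publish (at-least-once; a crash after this line but
        # before the UPDATE commits means the message is republished).
        integration_queue.append((message_id, payload))
        # Step 2.c: mark as published.
        db.execute(
            "UPDATE outbox SET unpublished = 0 WHERE id = ?", (message_id,)
        )
        db.commit()

send("m1", "order-created")
send("m1", "order-created")   # duplicate send: ignored by step 1.a
dispatch()
dispatch()                    # nothing left to publish
print(integration_queue)      # [('m1', 'order-created')]
```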
|
| In practice, SQL DBs will surprise you with lock and
| concurrency semantics; I suggest using something battle-tested
| under high load, or an event store database.
|
| In a sense, this is a "cross-service transaction", but it is
| not a "transaction protocol", rather it is an eventually
| consistent design.
| bob1029 wrote:
| This reminds me of an old requirement we had to enforce
| printing of a specific document in our B2B product.
|
| The feature had to get halfway to production before someone had
| the gall to ponder what might happen if a printer was out of
| paper (or worse).
|
| Predicating business logic upon successful delivery of
| something over a "I don't give a fuck" wall is sketchy in my
| view.
| tsss wrote:
| With Kafka's idempotent producer config, which is the default,
| this is a non-problem.
| bastawhiz wrote:
| > You still need cross-service transactions to publish the
| event only-once. Consider the case of "Publish the event to the
| queue. Fail to update / delete the entry in the event-buffer
| table."
|
| I'm not sure this is the problem the author has set out to
| solve. Exactly once dispatching is really hard and requires
| much more than what's written about here. Even though the
| system the author wrote about is at-least-once, it guarantees
| that an event isn't dispatched for data that was never stored,
| and that data isn't stored with its event never dispatched.
|
| I do think the author is incorrect in one place:
|
| > While consumers should ideally be idempotent
|
| They _must_ be idempotent. For this very reason.
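[Editor's note: one way to get that idempotence is to record processed
message ids in the same DB transaction as the business operation. A minimal
sketch with SQLite; the balance table and amounts are made up for
illustration.]

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY);
    CREATE TABLE account_balance (id INTEGER PRIMARY KEY, balance INTEGER);
    INSERT INTO account_balance VALUES (1, 0);
""")

def handle(message_id, amount):
    # Recording the message id and applying the business effect happen in
    # one transaction, so a redelivered message is applied at most once.
    try:
        with db:  # commits on success, rolls back on exception
            db.execute(
                "INSERT INTO processed_messages (message_id) VALUES (?)",
                (message_id,),
            )
            db.execute(
                "UPDATE account_balance SET balance = balance + ? WHERE id = 1",
                (amount,),
            )
    except sqlite3.IntegrityError:
        pass  # already processed: drop the duplicate delivery

handle("m1", 100)
handle("m1", 100)  # at-least-once delivery: the duplicate is a no-op
print(db.execute("SELECT balance FROM account_balance").fetchone()[0])  # 100
```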
| groodt wrote:
| Any thoughts how to detect direct DML on the state table?
| Presumably allowing direct DML on the state table without the
| same on the Outbox table would lead to silent data corruption or
| lost updates.
| [deleted]
| mkleczek wrote:
| I must say, not mentioning two-phase commit and distributed
| transactions in this context seems strange.
|
| It looks like nowadays people have forgotten about XA, and that
| MQ and DBMS systems can participate in a distributed transaction.
| ryanjshaw wrote:
| When I hit commercial software development in the late 00s, at
| least in the Microsoft space, distributed transactions were
| considered an anti-pattern because the complexity of it all >>
| the benefit. "Application-level" distributed transactions (i.e.
| asynchronous, eventually consistent, business-transaction
| messages with exactly-once processing semantics) are best
| practice at all the places I've worked in the past 15 years.
| kgeist wrote:
| Oh, it took us around 2 years to get somewhat reliable event
| dispatching with transactional outboxes. We've run into so many
| edge cases under high load:
|
| - using the autoincrement ID as a pointer to the "last processed
| event", not knowing that MySQL reserves IDs non-atomically with
| inserts, so it's possible for the relay to "see" an event with a
| bigger ID before an event with a smaller ID becomes visible, thus
| skipping some events
|
| - implementing event handlers with "only once" semantics instead
| of "at least once" (retries on error corrupting data)
|
| - event queue processing completely halting for the entire
| application when an event handler throws an error due to a bug
| and so gets retried infinitely
|
| - some other race conditions in the implementation when the order
| of events gets messed up
|
| - too fine-grained events without batching overflowing the queue
| (takes too much time for an event to get processed)
|
| - the relay getting OOMs due to excessive batching
|
| - once we had a funny bug where code which updated the last
| processed ID of the current DB shard (each client has their own
| shard) wrote to the wrong shards, and so our relay started
| replaying thousands of events from years ago
|
| - some event handlers always sending mail as part of processing
| events, so when an event is retried on error or replayed (see the
| bug above), clients receive the same emails multiple times
|
| And we still hit weird bugs sometimes: about once a month, for
| some reason, we see a random event getting replayed in a deleted
| account; we're still tracking it down.
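[Editor's note: one mitigation for the "a buggy handler halts the whole
queue" failure mode listed above is to cap retries and park poisoned events
in a dead-letter store. A minimal in-memory sketch; the event shape, retry
count, and store are all illustrative.]

```python
from collections import deque

MAX_ATTEMPTS = 3
queue = deque()
dead_letters = []

def process(handler):
    # Drain the queue once; a persistently failing event is parked in
    # dead_letters after MAX_ATTEMPTS so it cannot block later events.
    for _ in range(len(queue)):
        event = queue.popleft()
        try:
            handler(event)
        except Exception:
            event["attempts"] = event.get("attempts", 0) + 1
            if event["attempts"] >= MAX_ATTEMPTS:
                dead_letters.append(event)
            else:
                queue.append(event)  # retry on a later pass

queue.extend([{"id": 1, "ok": False}, {"id": 2, "ok": True}])

handled = []
def handler(event):
    if not event["ok"]:
        raise RuntimeError("buggy handler")
    handled.append(event["id"])

for _ in range(MAX_ATTEMPTS):
    process(handler)

print(handled)            # [2]: the good event was processed
print(len(dead_letters))  # 1: the bad event was parked, not retried forever
```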
| ryanjshaw wrote:
| Nice list, brings back fond memories.
|
| On MS SQL you have a similar scenario IIRC - you can use READ
| COMMITTED to block the query when there are dirty writes in the
| range of rows, but in general you're also encouraged to
| actually use READ COMMITTED SNAPSHOT, which _won't_ block...
|
| And then you get the problem where a different thread is
| inserting a message into the queue table, but it's taking a
| while, and the dispatcher is reading the queue table, and
| misses the uncommitted message... and never sees it again
| because you're using "last processed offset" as your "resume
| point", and you've moved past that point by the time the
| concurrent message insert completes. That's why I use
| "IsUnpublished = 1" as my flag, which also makes replay very
| simple to do (just create a _filtered index_ on that column for
| performance reasons).
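[Editor's note: a sketch of the flag-based resume point described above,
using SQLite, whose partial indexes play the role of SQL Server's filtered
index. The schema and names are illustrative, not from the comment.]

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE queue (
        id INTEGER PRIMARY KEY,
        payload TEXT NOT NULL,
        is_unpublished INTEGER NOT NULL DEFAULT 1
    );
    -- Partial index: only unpublished rows are indexed, so the
    -- dispatcher's scan stays cheap as the table grows (SQLite's
    -- analogue of a SQL Server filtered index).
    CREATE INDEX ix_queue_unpublished
        ON queue (id) WHERE is_unpublished = 1;
""")

def dispatch(publish):
    # Scanning by flag (not by "last processed offset") means a row
    # that commits late is still picked up on the next pass.
    rows = db.execute(
        "SELECT id, payload FROM queue WHERE is_unpublished = 1 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(payload)
        db.execute(
            "UPDATE queue SET is_unpublished = 0 WHERE id = ?", (row_id,)
        )
    db.commit()

published = []
db.execute("INSERT INTO queue (payload) VALUES ('a')")
dispatch(published.append)
db.execute("INSERT INTO queue (payload) VALUES ('b')")  # a "late" insert
dispatch(published.append)
print(published)  # ['a', 'b']: the late row is not skipped
```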
|
| When I first started building applications like this, I
| eventually learned that the people discussing these patterns
| placed very little explicit emphasis on distinguishing between
| "application events" and "integration events" -- if you are
| struggling with too many events, maybe this is where your issue
| is.
|
| A final point - your _business logic_ event handler shouldn't
| send mail. You should produce an "operation completed (ID)"
| event and send the email off the back of that in an _email
| notification_ event handler.
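[Editor's note: that separation can be sketched like this. Handler names and
event shapes are made up; the point is that the email side is isolated and
idempotent, so a replayed business event doesn't resend mail.]

```python
events = []
emails_sent = []
notified = set()

def handle_business_event(event):
    # Business handler: never touches email; it only emits a
    # follow-up "operation completed" event.
    events.append({"type": "operation_completed", "operation_id": event["id"]})

def handle_notification(event):
    # Notification handler: idempotent on operation_id, so a replayed
    # event does not resend the email.
    if event["operation_id"] in notified:
        return
    notified.add(event["operation_id"])
    emails_sent.append(f"mail for operation {event['operation_id']}")

handle_business_event({"id": 42})
handle_business_event({"id": 42})   # replay of the same business event
for e in events:
    if e["type"] == "operation_completed":
        handle_notification(e)

print(emails_sent)  # ['mail for operation 42']: one email despite the replay
```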
| chucke wrote:
| On the topic, I created this ruby gem called tobox, essentially a
| transactional outbox framework: https://gitlab.com/os85/tobox
|
| It actually circumvents most of the limitations mentioned in the
| article. I've been successfully using it at work as an SNS relay
| for another app which isn't even written in Ruby.
___________________________________________________________________
(page generated 2023-02-04 23:01 UTC)