[HN Gopher] Reliable event dispatching using a transactional outbox
       ___________________________________________________________________
        
       Reliable event dispatching using a transactional outbox
        
       Author : losfair
       Score  : 27 points
       Date   : 2023-02-02 06:35 UTC (2 days ago)
        
 (HTM) web link (blog.frankdejonge.nl)
 (TXT) w3m dump (blog.frankdejonge.nl)
        
       | svieira wrote:
       | This is interesting, but you've not actually solved the problem,
       | just moved it. You still need cross-service transactions to
       | publish the event only-once. Consider the case of "Publish the
       | event to the queue. Fail to update / delete the entry in the
       | event-buffer table." This is the "bad" pattern of push-then-store
       | (with one exception - if you are fine with at-least-once message
       | delivery instead of only-once). Likewise the "good" pattern of
       | store-then-push has the same failure mode "Delete the entry from
       | the buffer. Fail to publish the entry to the queue".
       | 
        | That said, this does decouple the two operations, which allows
        | you to scale the publish side of the service separately from
        | the produce side (which can help when your architecture can
        | produce multiple messages per storage event).
        
         | ryanjshaw wrote:
         | Maybe you are using different terminology, because IMO you
         | don't need "cross-service transactions to publish the event
         | only-once"
         | 
         | "Exactly-once" message _processing_ is what you typically want
          | in business operations, and that is straightforward to achieve
          | with "at-least-once" message _delivery_:
          | 
          |     1. send(message with unique .id):
          |        a. insert into DB queue where not exists message.id
          |     2. dispatcher:
          |        a. fetch messages from DB queue where unpublished = 1
          |        b. publish message to integration queue
          |           [at-least-once delivery]
          |        c. update message set unpublished = 0 (and ideally
          |           record some audit information, e.g. the sent
          |           timestamp + Kafka message partition/offset or
          |           IBM MQ message ID)
          |     3. receive(message):
          |        a. check if message.id has been processed
          |           [exactly-once processing]
          |        b. perform operation, including linking it back to
          |           message.id, in a single DB transaction
         | 
          | If failure occurs:
          | 
          |     1. between 2.a and 2.b: you fetch messages again and
          |        nothing is lost
          |     2. between 2.b and 2.c: you republish the same message;
          |        3.a ensures exactly-once *processing*
          |     3. between 3.a and 3.b: you retry the message (with IBM
          |        MQ, you use transactions or peek-then-read; with
          |        Kafka, you don't update offsets until 3.b completes)
          |     4. on the integration queue: set unpublished = 1 on all
          |        messages published in the timeframe where the
          |        integration queue lost data
         | 
          | In practice, SQL DBs will surprise you with their lock and
          | concurrency semantics; I suggest using something battle-tested
          | under high load, or an event store database.
         | 
         | In a sense, this is a "cross-service transaction", but it is
         | not a "transaction protocol", rather it is an eventually
         | consistent design.
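          | 
          | For concreteness, a minimal sketch of the dispatcher loop
          | (step 2 above) in Python with sqlite3; the table, column, and
          | publisher names are illustrative, not a real API:
          | 
          |     import sqlite3
          | 
          |     def dispatch_batch(conn, publish):
          |         # publish: a callable that delivers to the integration
          |         # queue with at-least-once semantics
          |         rows = conn.execute(
          |             "SELECT id, payload FROM outbox "
          |             "WHERE unpublished = 1 ORDER BY id LIMIT 100"
          |         ).fetchall()
          |         for msg_id, payload in rows:
          |             publish(msg_id, payload)  # 2.b: may duplicate on retry
          |             conn.execute(
          |                 "UPDATE outbox SET unpublished = 0 WHERE id = ?",
          |                 (msg_id,),
          |             )
          |             conn.commit()  # 2.c: mark only after a successful publish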
        
         | bob1029 wrote:
         | This reminds me of an old requirement we had to enforce
         | printing of a specific document in our B2B product.
         | 
         | The feature had to get halfway to production before someone had
         | the gall to ponder what might happen if a printer was out of
         | paper (or worse).
         | 
          | Predicating business logic upon successful delivery of
          | something over an "I don't give a fuck" wall is sketchy in my
          | view.
        
         | tsss wrote:
         | With Kafka's idempotent producer config, which is the default,
         | this is a non-problem.
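          | 
          | (A minimal sketch in Python with confluent-kafka, setting the
          | flag explicitly since the default varies by client library:)
          | 
          |     from confluent_kafka import Producer
          | 
          |     # Idempotent producer: the broker de-duplicates producer
          |     # retries, so a retried send cannot write the record twice.
          |     p = Producer({
          |         "bootstrap.servers": "localhost:9092",
          |         "enable.idempotence": True,
          |     })
          |     p.produce("events", key=b"42", value=b"payload")
          |     p.flush()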
        
         | bastawhiz wrote:
         | > You still need cross-service transactions to publish the
         | event only-once. Consider the case of "Publish the event to the
         | queue. Fail to update / delete the entry in the event-buffer
         | table."
         | 
          | I'm not sure this is the problem the author has set out to
          | solve. Exactly-once dispatching is really hard and requires
          | much more than what's written about here. Even though the
          | system the author wrote about is at-least-once, it guarantees
          | that an event isn't dispatched for data that's not stored,
          | and that data isn't stored with an event that's never
          | dispatched.
         | 
         | I do think the author is incorrect in one place:
         | 
         | > While consumers should ideally be idempotent
         | 
         | They _must_ be idempotent. For this very reason.
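          | 
          | A minimal sketch of an idempotent consumer in Python with
          | sqlite3 (the processed_events table and the
          | apply_business_operation helper are hypothetical): record the
          | event id in the same transaction as the business operation,
          | and treat a duplicate id as an already-handled delivery.
          | 
          |     import sqlite3
          | 
          |     def handle(conn, event_id, payload):
          |         try:
          |             with conn:  # one transaction: dedup + side effect
          |                 conn.execute(
          |                     "INSERT INTO processed_events (id) VALUES (?)",
          |                     (event_id,),
          |                 )
          |                 apply_business_operation(conn, payload)
          |         except sqlite3.IntegrityError:
          |             pass  # duplicate delivery: already processed, skip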
        
       | groodt wrote:
        | Any thoughts on how to detect direct DML on the state table?
        | Presumably, allowing direct DML on the state table without the
        | same on the Outbox table would lead to silent data corruption
        | or lost updates.
        
       | [deleted]
        
       | mkleczek wrote:
        | I must say that not mentioning two-phase commit and distributed
        | transactions in this context seems strange.
        | 
        | It looks like nowadays people have forgotten about XA, and that
        | an MQ and a DBMS can participate in a distributed transaction.
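        | 
        | (An illustrative sketch of the two-phase-commit shape in
        | Python, using the DB-API 2.0 TPC extension as implemented by
        | psycopg2; a full XA setup would enlist the MQ under the same
        | transaction manager, which few Python MQ clients support:)
        | 
        |     import psycopg2
        | 
        |     # Requires max_prepared_transactions > 0 on the server.
        |     conn = psycopg2.connect("dbname=app")
        |     xid = conn.xid(42, "gtrid-1", "bqual-1")
        |     conn.tpc_begin(xid)
        |     cur = conn.cursor()
        |     cur.execute("INSERT INTO events (payload) VALUES (%s)",
        |                 ("e1",))
        |     conn.tpc_prepare()  # phase 1: vote / persist the branch
        |     conn.tpc_commit()   # phase 2: commit the prepared branch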
        
         | ryanjshaw wrote:
         | When I hit commercial software development in the late 00s, at
         | least in the Microsoft space, distributed transactions were
         | considered an anti-pattern because the complexity of it all >>
         | the benefit. "Application-level" distributed transactions (i.e.
         | asynchronous, eventually consistent, business-transaction
         | messages with exactly-once processing semantics) are best
         | practice at all the places I've worked in the past 15 years.
        
       | kgeist wrote:
        | Oh, it took us around 2 years to get to somewhat reliable event
        | dispatching with transactional outboxes. We've run into so many
        | edge cases under high load:
       | 
        | - using the autoincrement ID as a pointer to the "last
        | processed event", not knowing that MySQL reserves IDs
        | non-atomically with inserts, so it's possible for the relay to
        | "see" an event with a bigger ID before an event with a smaller
        | ID becomes visible, thus skipping some events
       | 
       | - implementing event handlers with "only once" semantics instead
       | of "at least once" (retries on error corrupting data)
       | 
        | - event queue processing completely halting for the entire
        | application when an event handler throws an error due to a bug
        | and so gets retried infinitely (see the retry-cap sketch at the
        | end of this comment)
       | 
       | - some other race conditions in the implementation when the order
       | of events gets messed up
       | 
        | - too fine-grained events without batching overflowing the
        | queue (it takes too much time for an event to get processed)
       | 
       | - the relay getting OOMs due to excessive batching
       | 
        | - once we had a funny bug where code which updated the last
        | processed ID of the current DB shard (each client has their own
        | shard) wrote to the wrong shards, and so our relay started
        | replaying thousands of events from years ago
       | 
        | - some event handlers always sending mail as part of processing
        | events, so when an event is retried on error or replayed (see
        | the bug above), clients receive the same emails multiple times
       | 
        | And we still sometimes see weird bugs: about once a month, for
        | some reason, a random event gets replayed in a deleted account.
        | We're still tracking it down.
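        | 
        | (For the infinite-retry halting problem above, a common
        | mitigation is a retry cap plus a dead-letter table; a minimal
        | sketch in Python with made-up table names, assuming
        | dead_letters has the same shape as outbox:)
        | 
        |     MAX_ATTEMPTS = 5
        | 
        |     def process_one(conn, handler, event):
        |         try:
        |             handler(event)
        |             conn.execute("DELETE FROM outbox WHERE id = ?",
        |                          (event["id"],))
        |         except Exception:
        |             attempts = event["attempts"] + 1
        |             if attempts >= MAX_ATTEMPTS:
        |                 # park the poison message so the queue moves on
        |                 conn.execute(
        |                     "INSERT INTO dead_letters "
        |                     "SELECT * FROM outbox WHERE id = ?",
        |                     (event["id"],))
        |                 conn.execute("DELETE FROM outbox WHERE id = ?",
        |                              (event["id"],))
        |             else:
        |                 conn.execute(
        |                     "UPDATE outbox SET attempts = ? WHERE id = ?",
        |                     (attempts, event["id"]))
        |         conn.commit()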
        
         | ryanjshaw wrote:
         | Nice list, brings back fond memories.
         | 
          | On MS SQL you have a similar scenario IIRC - you can use READ
          | COMMITTED to block the query when there are dirty writes in
          | the range of rows, but you're also encouraged in general to
          | actually use READ COMMITTED SNAPSHOT, which _won't_ block...
         | 
         | And then you get the problem where a different thread is
         | inserting a message into the queue table, but it's taking a
         | while, and the dispatcher is reading the queue table, and
         | misses the uncommitted message... and never sees it again
         | because you're using "last processed offset" as your "resume
         | point", and you've moved past that point by the time the
         | concurrent message insert completes. That's why I use
         | "IsUnpublished = 1" as my flag, which also makes replay very
         | simple to do (just create a _filtered index_ on that column for
         | performance reasons).
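          | 
          | (A sketch of that setup, with the T-SQL inline in Python for
          | context; table and index names are illustrative:)
          | 
          |     import pyodbc
          | 
          |     # Filtered index: only unpublished rows are indexed, so
          |     # the dispatcher's scan stays cheap as the table grows.
          |     DDL = """
          |     CREATE INDEX IX_Outbox_Unpublished
          |         ON Outbox (Id)
          |         WHERE IsUnpublished = 1;
          |     """
          | 
          |     FETCH = """
          |     SELECT Id, Payload FROM Outbox
          |     WHERE IsUnpublished = 1
          |     ORDER BY Id;
          |     """
          | 
          |     conn = pyodbc.connect("DSN=app")
          |     conn.execute(DDL)
          |     rows = conn.execute(FETCH).fetchall()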
         | 
         | When I first started building applications like this, I
         | eventually learned that the people discussing these patterns
         | placed very little explicit emphasis on distinguishing between
         | "application events" and "integration events" -- if you are
         | struggling with too many events, maybe this is where your issue
         | is.
         | 
          | A final point - your _business logic_ event handler shouldn't
          | send mail. You should produce an "operation completed (ID)"
          | event and send the email off the back of that in an _email
          | notification_ event handler.
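          | 
          | (Something like this shape, in Python; the handler and helper
          | names are made up:)
          | 
          |     def on_order_shipped(event, outbox):
          |         # business logic only: no mail, no external I/O
          |         mark_order_shipped(event["order_id"])
          |         outbox.append({"type": "operation_completed",
          |                        "id": event["order_id"]})
          | 
          |     def on_operation_completed(event, mailer):
          |         # the side-effectful notification lives in its own
          |         # handler, decoupled from the business transaction
          |         mailer.send(order_id=event["id"])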
        
       | chucke wrote:
        | On the topic: I created a Ruby gem called tobox, essentially a
        | transactional outbox framework: https://gitlab.com/os85/tobox
        | 
        | It actually circumvents most of the limitations mentioned in
        | the article. I've been successfully using it at work as an SNS
        | relay for another app which is not even Ruby.
        
       ___________________________________________________________________
       (page generated 2023-02-04 23:01 UTC)