[HN Gopher] No such thing as exactly-once delivery
       ___________________________________________________________________
        
       No such thing as exactly-once delivery
        
       Author : todsacerdoti
       Score  : 72 points
       Date   : 2024-09-30 16:00 UTC (7 hours ago)
        
 (HTM) web link (blog.sequinstream.com)
 (TXT) w3m dump (blog.sequinstream.com)
        
       | dondraper36 wrote:
       | It's admirable when vendors don't try to abuse theory to make
       | their product look better.
       | 
       | There was some blog post last week that supposedly proved the
       | impossibility of exactly once delivery wrong, but that was quite
       | expectedly another attempt to rename the standard terms.
        
         | convolvatron wrote:
         | if you employ persistence to keep sequence numbers, and/or
         | retransmissions and redundancy, you can drive the likelihood of
         | distributed consensus over time arbitrarily close to
         | probabilities where it doesn't matter anyways (like the earth
         | suddenly being coincident with the sun).
         | 
         | so what are we arguing about?
        
           | sethammons wrote:
           | there is a group of people that say that there is no such
           | thing as exactly-once-delivery and they are correct. There is
           | also a group of people who cannot hear nor read nor process
           | the word "delivery" in that term and instead insist that
           | exactly-once-processing is easily achievable via
           | [deduplication, idempotent actions, counters, ...]. They are
           | correct as long as they acknowledge "processing" vs
           | "delivery" -- and there are several who refuse to acknowledge
           | that and don't believe the distinction matters and they are
           | wrong.
        
             | rwiggins wrote:
             | Out of curiosity, how would you describe TCP in these
             | terms? Does the TCP stack's handling of sequence numbers
             | constitute processing (on the client and server both I
             | assume)? Which part(s) of a TCP connection could be
             | described as delivery?
        
               | wmf wrote:
               | From TCP's perspective it has delivered the data when
               | read() returns successfully. This is also the point at
               | which TCP frees the packet buffers. From the app's
               | perspective, TCP is an at-most-once system because data
               | can be lost in corner cases caused by failures.
               | 
               | (Of course plenty of people on the Internet use different
               | definitions to arrive at the opposite conclusion but they
               | are all wrong and I am right.)
        
               | YZF wrote:
               | I agree this is what you'd consider delivery. Also agree
               | everyone else is wrong an I am right ;)
               | 
               | The similar question in TCP is what happens when the
               | sender writes to their socket and then loses the
               | connection. At this point the sender doesn't know whether
               | the receiver has the data or not. Both are possible. For
               | the sender to recover it needs to re-connect and re-send
               | the data. Thus the data is potentially delivered more
               | than once and the receiver can use various strategies to
               | deal with that.
               | 
               | But sure, within the boundary of a single TCP connection
               | data (by definition) is never delivered twice.
        
             | wellpast wrote:
             | is there a real difference between delivery and processing?
             | 
             | In a streaming app framework (like Flink or Beam, or Kafka
             | Streams), I can get an experience of exactly once
             | "processing" by reverting to checkpoints on failure and re-
             | processing.
             | 
             | But what's the difference between doing this within the
             | internal stores of a streaming app or coordinating a
             | similar checkpointing/recovery/"exactly-once" mechanism
             | with an external "delivery destination"?
        
               | ithkuil wrote:
               | Yes there is a difference and it's quite crucial to
               | understanding how two groups of people can claim
               | seemingly opposite things and both be correct.
               | 
               | It's important to note that by processing here we don't
               | just mean any "computing" but the act of "committing the
               | side effects of the computation"
        
               | wellpast wrote:
               | A typical delivery target is a data store or cache.
               | Writing to such a delivery can be construed of as a "side
               | effect". But if you accept eventual consistency,
               | achieving exactly once semantics is possible.
               | 
               | Eg a streaming framework will let you tally an exactly
               | once counter. You can flush the value of that tally to a
               | data store. External observers will see an eventually
               | consistent exactly-once-delivery result.
        
             | andrewflnr wrote:
             | Not that I disagree with getting terms exactly right, but
             | if the "processing" that happens exactly once in your
             | messaging system is firing a "delivery" event to a higher
             | level of the stack (over a local channel), that's
             | observationally indistinguishable from exactly-once
             | delivery to that higher level, isn't it? What am I missing?
        
               | wmf wrote:
               | Nope, because _that_ event could also be lost
               | /duplicated.
        
               | Izkata wrote:
               | In the systems I'm used to, that "passing up" is a
               | function call within the same codebase. There isn't a
               | second delivery system for a message to be
               | lost/duplicated, though sometimes they obscure it by
               | making it look more like such a system.
        
             | ViewTrick1002 wrote:
             | The problems always stem from the side effects. You can
             | achieve exactly once like properties in a contained system.
             | 
             | But if the processing contains for example sending an email
             | then you need to apply said actions for all downstream
             | actions, which is where exactly once as seen by the outside
             | viewer gets hard.
        
           | SpicyLemonZest wrote:
           | What you _can 't_ do, and what naive consumers of distributed
           | systems often expect, is drive the probability arbitrarily
           | low solely from the producer side. The consumer has to be
           | involved, even if they consider themselves to be a single-
           | node process and don't want to bother with transactionality
           | or idempotency keys.
        
           | mgsouth wrote:
           | Vendor: "Exactly once is possible. We do it!"
           | 
           | No, as a vendor you're either deliberately lying, or
           | completely incompentent. Let's insert the words they don't
           | want to.
           | 
           | "We do exactly once good enough for your application!"
           | 
           | How could you possibly know that?
           | 
           | "We do exactly once good enough for all practical purposes!"
           | 
           | So are my purposes practical? _Am I_ a true scotsman?
           | 
           | "We do exactly once good enough for applications we're good
           | enough for!"
           | 
           | Finally, a truthful claim. I'm not holding my breath.
           | 
           | Edit: More directly address parent:
           | 
           | It's not possible to even guarentee at-least-once in a finite
           | amount of time, let alone exactly-once. For example, a high-
           | frequency trading system wants to deliver exactly one copy of
           | that 10-million-share order. And do it in under a
           | microsecond. If you've got a 1.5 microsecond round trip, it's
           | not possible to even get confirmation an order was received
           | in that time limit, much less attempt another delivery. Your
           | options are a) send once and hope. b) send lots of copies,
           | with some kind of unique identifier so the receiver can do
           | process-at-most-once, and hope you don't wind up sending to
           | two different receiving servers.
        
         | sidewndr46 wrote:
         | This didn't stop AWS from at one point declaring "none of the
         | above" as their delivery guarantees for S3 events:
         | 
         | https://www.hydrogen18.com/blog/aws-s3-event-notifications-p...
         | 
         | From what I can tell my blog post eventually motivated them to
         | change this behavior
        
       | tombert wrote:
       | Reminds me of this:
       | https://en.wikipedia.org/wiki/Fallacies_of_distributed_compu...
       | 
       | Part of the reason I find distributed and concurrent computing so
       | fascinating is that we lose so many assumptions about how
       | computers are supposed to work; ordering guarantees, data
       | actually arriving to destinations, fairness, etc.
       | 
       | While it is of course possible for data to get lost or mangled in
       | a single-threaded, non-distributed application, it's rare enough
       | to where you often don't need to plan much around it. Once you
       | add concurrency, and especially if you make it distributed, you
       | suddenly have to plan around the _eventuality_ of the messages
       | not arriving, or arriving in the wrong order, or arriving
       | multiple times.
       | 
       | Personally, I've gotten into the habit of PlusCal and TLA+ for
       | everything, and modeling in the assumption that stuff might never
       | arrive or arrive multiple times, which can work as a very good
       | sanity check (though it's of course not a silver bullet).
        
       | Animats wrote:
       | There's no such thing as finite-time arbitration for
       | multiprocessor requests to memory, but it works anyway. This is
       | similar. You can't guarantee time-bounded exactly once delivery,
       | but after enough interchanges the probability of being wrong
       | approaches zero.
       | 
       | This shows up explicitly in blockchain systems, where confidence
       | improves as the number of confirmation cycles increases.
        
         | YZF wrote:
         | This is the first time I'm hearing about the topic of
         | arbitrating multiprocessor access to memory. Isn't there a
         | trivial counterexample where the bandwidth is divided by the
         | number of processors and each processor gets a timeslot?
         | 
         | Processors, despite being a distributed system (well really
         | every ASIC is), don't typically suffer from your classical
         | distributed systems style issues because they have no failures,
         | or rather they only have certain catastrophic failure modes
         | that are low probability and well controlled for. Your typical
         | distributed system to contrast has constant failures.
        
           | Animats wrote:
           | > This is the first time I'm hearing about the topic of
           | arbitrating multiprocessor access to memory.
           | 
           | See [1]. This was a very real problem in early multiprocessor
           | computer design.
           | 
           | [1] https://en.wikipedia.org/wiki/Arbiter_%28electronics%29
        
           | rcxdude wrote:
           | It's more to do with the underlying electronics: digital is
           | an abstraction, everything is analog underneath, and logic
           | gates and registers are actually just very non-linear analog
           | circuits. What this means is that it's always in principle
           | possible for an asynchronous edge (i.e. one that is not
           | aligned to the other edges in the system) to be arbitrarily
           | non-determinate: never actually being decided as a 1 or a 0
           | as it propagates through the digital circuit. This isn't an
           | entirely theoretical issue, either: it can and will cause
           | very strange behaviour in a digital system because a marginal
           | or metastable value can be interpreted differently by
           | different parts of the circuit that should be seeing the same
           | value. It's possible in practice to push the chance of this
           | happening extremely low, but it comes at the cost of latency:
           | you need to spend more time to allow the circuit to actually
           | settle into one state or the other.
           | 
           | Within a fully synchronous system, it's possible to avoid
           | this, though. But most multi-processor systems are not
           | synchronous.
        
             | londons_explore wrote:
             | > But most multi-processor systems are not synchronous.
             | 
             | Is this actually true? A modern AMD or Intel CPU doesn't
             | have one quartz crystal per core, so therefore all per-core
             | clocks have to be derived from one clock source, meaning
             | the relationship between all the clock frequencies and
             | phases is deterministic, even if not specced in the
             | datasheet.
        
       | dang wrote:
       | Recent and related:
       | 
       |  _Yes, you can have exactly-once delivery_ -
       | https://news.ycombinator.com/item?id=41598006 - Sept 2024 (133
       | comments)
       | 
       | Also. Others?
       | 
       |  _You cannot have exactly-once delivery (2015)_ -
       | https://news.ycombinator.com/item?id=34986691 - March 2023 (217
       | comments)
       | 
       |  _Show HN: Exactly-once delivery counter using Watermill
       | messaging library_ -
       | https://news.ycombinator.com/item?id=26888303 - April 2021 (9
       | comments)
       | 
       |  _High-performance, exactly-once, failure-oblivious distributed
       | programming (2018)_ -
       | https://news.ycombinator.com/item?id=20562733 - July 2019 (26
       | comments)
       | 
       |  _Exactly-once Semantics: How Kafka Does it_ -
       | https://news.ycombinator.com/item?id=14670801 - June 2017 (39
       | comments)
       | 
       |  _Delivering Billions of Messages Exactly Once_ -
       | https://news.ycombinator.com/item?id=14664405 - June 2017 (133
       | comments)
       | 
       |  _You Cannot Have Exactly-Once Delivery_ -
       | https://news.ycombinator.com/item?id=9266725 - March 2015 (55
       | comments)
       | 
       |  _Exactly-Once Messaging in Kafka_ -
       | https://news.ycombinator.com/item?id=8880612 - Jan 2015 (8
       | comments)
       | 
       |  _Exactly-Once Delivery May Not Be What You Want_ -
       | https://news.ycombinator.com/item?id=8614264 - Nov 2014 (10
       | comments)
        
         | notfed wrote:
         | Tangent, but: this reminds me of many philosophical debates,
         | e.g. "do we have free will?".
         | 
         | What I've found is that usually what's happening is that the
         | "debaters" assume different definitions of the thing they're
         | debating about, which means they're really debating about two
         | entirely different things, with the same name, and that's the
         | real reason why they've come to different conclusions.
        
           | rzzzt wrote:
           | One recommendation I've read is that in such cases you should
           | treat the single label you and your opponent have attached to
           | two different ideas as forbidden. Describe it in more detail
           | to reveal the difference (don't just swap words in the
           | starting combination with their synonyms).
        
             | wmf wrote:
             | That's fine for discussions between experts but when a
             | hundred noobs enter the discussion it's just unpaid
             | tutoring. These threads are bringing out a lot of "I've
             | discovered something all the distributed systems experts
             | missed" energy.
        
           | aidenn0 wrote:
           | For me "do we have free will" debate is useless because if
           | the answer is "no" then that implies the result of the debate
           | is predetermined and all involved lack any agency to change
           | the outcome.
           | 
           | Therefore any adult arguing the "no free will" side is either
           | 
           | 1. Disingenuous
           | 
           | 2. Purposefully playing Candyland as an adult.
           | 
           | Either of which is so damaging to the Ethos that I lack any
           | desire to continue the debate.
        
             | yarg wrote:
             | That doesn't follow.
             | 
             | There's nothing disingenuous or moronic about believing
             | that physical processes might be fundamentally
             | deterministic.
        
               | aidenn0 wrote:
               | > There's nothing disingenuous or moronic about believing
               | that physical processes might be fundamentally
               | deterministic.
               | 
               | My argument is that a game of Candyland is fundamentally
               | deterministic in the same way.
        
       | thadt wrote:
       | > With that definition, SQS, Kafka, and Sequin are all systems
       | that guarantee exactly-once processing. The term processing
       | captures both the delivery of the message and the successful
       | acknowledgment of the message.
       | 
       | This terminology confuses my face. Maybe we need a better name
       | for "exactly-once processing"? When I say "processing" I'm
       | thinking of the actions that my program is going to take on said
       | message - the very processing that they're talking about can
       | fail. Can we not just say 'the message is held until
       | acknowledgement'?
        
         | theptip wrote:
         | The simple way to visualize this is that all the processing can
         | happen within a serializable transaction. If anything fails,
         | the transaction rolls back. Marking the message as delivered is
         | part of the same transaction.
        
           | thadt wrote:
           | Which means that the 'processing' in my application has to
           | have the semantics of a serializable transaction. And it
           | might not - as in the very example given in the post, where a
           | processing application sends an email.
           | 
           | Hence my complaint that making statements like "our system
           | offers exactly once processing" can be completely correct -
           | from the point of view of the _messaging system_. But it can
           | be misleading to the casual reader in understanding how it
           | fits into the broader solution, because the messaging system
           | cannot guarantee that my application processes it exactly
           | once. I 'm saying the terminology doesn't feel right.
        
       | griffgeorge wrote:
       | > That's why we don't say Sequin offers exactly-once delivery
       | anywhere. We say it offers at-least-once delivery and exactly-
       | once processing.
       | 
       | The author should note that the homepage at
       | https://sequinstream.com/ does in fact claim that Sequin supports
       | "Exactly-once delivery" under the "Key features" heading.
        
         | notfed wrote:
         | Well that doesn't seem consistent.
        
           | bhaney wrote:
           | It will be eventually
        
         | _acco wrote:
         | Aye, good catch! Honest mistake. Fixed.
        
           | notfed wrote:
           | Ack.
        
             | rzzzt wrote:
             | I didn't get this one.
        
       | 0xbadcafebee wrote:
       | I mean, you _can_ do exactly-once delivery, it 's just a more
       | complex state machine with a specific design requirement. Here it
       | is using mail:                 # Receive new mail message,
       | incoming id '1234'.       # No other writer knows about this
       | message.       $ echo "hello world" > incoming/1.1234.txt
       | # If files stay in incoming/ too long, it means they probably
       | failed writing,       # and are thus stale and can be removed.
       | # When the message is done being written, the writer moves it to
       | a new state.       $ mv incoming/1.1234.txt new/1.1234.txt
       | # The writer can then report to the sender that the message was
       | accepted.       # In mail, this is seen as optional, as the whole
       | point is to deliver the message,       # not to make sure you
       | inform the sender that you will deliver the message.
       | # Here's the exactly-once bit:       #   If you need assurance of
       | exactly-once delivery, split out the delivery portion        #
       | from the confirmation portion.       #          #   Have one
       | actor talk to the sender, and another actor queue the message in
       | another        #   temporary queue, like 'accepted/'. Once the
       | message is queued there, you tell the        #   client it is
       | accepted, wait for successful termination of the connection, and
       | then       #   move the message to the 'new/' queue.       #
       | #   This way, if there is an error between accepting the message
       | and confirming with       #   the sender, you can detect that and
       | leave the message in 'accepted/' where you can       #   later
       | decide what to do with it. (But it won't get picked up for
       | processing)              # The message is now ready to be picked
       | up for processing.       # To process the message, a processor
       | picks up the new message and moves it to a new state.       #
       | # The processor adds an identifier '5678', so the processor can
       | identify which file it's working on.       # It can have extra
       | logic to re-attempt processing on this ID if it fails.       $ mv
       | new/1.1234.txt process/1.1234.5678.txt              # If the
       | processor dies or takes too long, you can identify that (stale
       | file, processor tracking its work, etc)       # and this can be
       | moved back to 'new/' for another attempt. Moving it back to
       | 'new/' is still atomic so       # there is no danger of another
       | processor continuing to work on it.              # Once the
       | processor is done processing, it will move to the complete state.
       | $ mv process/1.1234.5678.txt done/1.txt              # If the
       | processing file no longer exists, it means it was either already
       | done,       # or previously died and was moved back to 'new/'.
       | 
       | This process all depends on a database [filesystem] using
       | synchronous atomic operations [mv]. If your distributed system
       | can't handle that, yeah, you're gonna have a hard time.
        
         | mgsouth wrote:
         | So that everyone's on the same page, perhaps you could clarify
         | what "delivery" means to you? For example:
         | 
         | * The human to whom the message is addressed reads it on their
         | screen.
         | 
         | * The email is inserted into that person's inbox. It's possible
         | something will happen to the message after than point, before
         | the addressee actually reads it, but that isn't what you're
         | talking about.
         | 
         | * The email has been accepted by a server operating as an
         | endpoint for the addressee (their company's Exchange server),
         | _and_ acknowledgement has been received by the sending server.
         | 
         | * The above, but no guarentees about the sending server getting
         | the acknowledgement.
         | 
         | etc.
         | 
         | [Edit: also, what "guarentee" means to you. 100%, 99.99999% is
         | good enough, and so on.]
        
         | bcrl wrote:
         | You have made a false assumption: directory operations across
         | directories are not atomic. The typical way around that is to
         | hardlink the file into the target directory and then sync the
         | target directory then delete the old name, but now you no
         | longer know if the file is new and has been processed or old
         | when the systems is interrupted with the intermediate state on
         | disk with the file in both directories. Which is exactly the
         | problem messaging systems have. There is always a point in the
         | state machines across multiple systems where a crash at the
         | right point in time can lead to replay of a message. I worked
         | on persistent messaging for a few years, and it is impossible
         | to resolve this in real world systems. The best you can do is
         | to narrow the window by reducing latency of the network and for
         | commits to persistent storage.
        
       | magicalhippo wrote:
       | This feels a bit like those relativity puzzles, for example the
       | barn-pole paradox[1].
       | 
       | That is, delivered exactly-once seems to be a relative statement
       | where one party can state they did deliver exactly-once while
       | another party can disagree, and they're both correct relative to
       | their local frames of reference.
       | 
       | So, like how GR has no global conservation of energy, just local,
       | could one say that distributed systems have no global
       | conservation of messages, only local?
       | 
       | Or am I just late night rambling again...
       | 
       | [1]: https://en.wikipedia.org/wiki/Ladder_paradox
        
       | socketcluster wrote:
       | The "exactly once delivery is impossible" narrative got so strong
       | that people feel that they have to redefine the word 'delivery'
       | to make it fit the premise.
       | 
       | If a message can be guaranteed to be received one or more times
       | but results in a single entry being inserted into a database
       | (e.g. in an idempotent way using UUIDs to enforce uniqueness),
       | then clearly, a consumer reading the records from that database
       | would only receive the message exactly once. If you consider the
       | intended recipient to be the process, not the host, then the
       | definition of 'delivery' allows you to claim that exactly-once
       | delivery is possible.
       | 
       | Of course this is a pragmatic, not mathematical argument. In
       | theory you can't mathematically guarantee even at least once
       | delivery because you can't guarantee that there's not going to be
       | a Carrington event which will bring down the internet and drive
       | humanity to extinction; thus preventing the consumer server from
       | being rebooted.
        
         | YZF wrote:
         | What about other messages though? The question isn't usually
         | limited to one message in isolation, it's related to the state
         | of the system as a whole. So sure, trivially, setting one bit
         | where that bit is never cleared can only be done once, so any
         | process setting that bit is also trivially (and not very
         | usefully?) "exactly once delivery".
        
           | mgsouth wrote:
           | On the contrary, that is an extremely useful process, and is
           | the basis for vast swathes of multiprocessing algorithms
           | (such as "has this thing finished initializing", or "has it
           | been freed"). However, it is "at most once", not "exactly
           | once". The bit may never be set.
        
       | benlivengood wrote:
       | Messages don't have to be delivered more than once to achieve
       | exactly-once processing; the receiver can offer a query interface
       | for delivery status. If every attempt at delivery checks for a
       | unique message ID and only sends the message if it has not been
       | received then you get exactly-once delivery of the _message_ , if
       | not the _metadata_.
       | 
       | Optimistically assuming the message has not been delivered yet
       | will sometimes reduce average latency (most databases), but for
       | very expensive messages it is better to check first (rsync)
        
       | dataflow wrote:
       | If there is no such thing as exactly-once delivery because the
       | recipient might not feel like deduplicating, then there can be no
       | such thing as at-least-once delivery either, because the
       | recipient might not feel like being endlessly available for
       | delivery either. Or the mailman might not feel like
       | (re-)attempting deliveries endlessly either. Or the sender might
       | not feel like re-sending things endlessly either.
       | 
       | Is this insightful? Surprising? What's the point of the debate?
        
       ___________________________________________________________________
       (page generated 2024-09-30 23:00 UTC)