[HN Gopher] Jepsen: NATS 2.12.1
___________________________________________________________________
Jepsen: NATS 2.12.1
Author : aphyr
Score : 207 points
Date : 2025-12-08 18:51 UTC (4 hours ago)
(HTM) web link (jepsen.io)
(TXT) w3m dump (jepsen.io)
| vrnvu wrote:
| Sort of related. Jepsen and Antithesis recently released a
| glossary of common terms which is a fantastic reference.
|
| https://jepsen.io/blog/2025-10-20-distsys-glossary
| merb wrote:
| > 3.4 Lazy fsync by Default
|
| Why? Why do some databases do that? To have better performance in
| benchmarks? It might be acceptable if the default were safer, or if
| it were at least documented prominently. But especially when you
| run stuff in a small cluster you get bitten by things like that.
| thinkharderdev wrote:
| > To have better performance in benchmarks
|
| Yes, exactly.
| millipede wrote:
| I always wondered why the fsync has to be lazy. It seems like the
| fsyncs can be bundled together, and the notification messages held
| for a few millis while the write completes. Similar to TCP corking.
| There doesn't need to be one fsync per consensus round.
| aphyr wrote:
| Yes, good call! You can batch up multiple operations into a
| single call to fsync. You can also tune the number of
| milliseconds or bytes you're willing to buffer before calling
| `fsync` to balance latency and throughput. This is how
| databases like Postgres work by default--see the
| `commit_delay` option here:
| https://www.postgresql.org/docs/8.1/runtime-config-wal.html
| to11mtm wrote:
| > This is how databases like Postgres work by default--see
| the `commit_delay` option here:
| https://www.postgresql.org/docs/8.1/runtime-config-wal.html
|
| I must note that the default for Postgres is that there is
| NO delay, which is a sane default.
|
| > You can batch up multiple operations into a single call
| to fsync.
|
| I've done this in various messaging implementations for
| throughput, and it's actually fairly easy to do in most
| languages.
|
| Basically, set up 1-N writers (depending on how you're storing
| data) that take items containing the data to be written
| alongside a TaskCompletionSource (a Promise, in Java terms).
| When something wants to write, it pushes the item onto that
| local queue; the worker(s) on the queue write out messages in
| batches, tuned for write size, number of records, etc., for both
| throughput and guaranteed forward progress. When the write
| completes, you either complete or fail the TCS/Promise.
|
| If you've got the right 'glue' with your language/libraries it's
| not that hard; this example [0] from Akka.NET's SQL persistence
| layer shows how simple the actual write processor's logic can
| be... Yeah, you have to think about queueing a little, but I've
| found this basic pattern _very_ adaptable (e.g. the queueing op
| can just send a bunch of ready-to-go bytes and you work off that
| for the threshold instead, add framing if needed, etc.).
|
| [0] https://github.com/akkadotnet/Akka.Persistence.Sql/blob/7bab...
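| A minimal Go sketch of this pattern (hypothetical names, not the
| linked Akka.NET code): callers enqueue a payload plus a completion
| channel standing in for the Promise/TCS, and a single worker
| drains the queue, writes the batch, issues one fsync, and then
| completes every caller:
|
|     // Hypothetical group-commit worker.
|     package wal
|
|     import "os"
|
|     type writeReq struct {
|         data []byte
|         done chan error // the "promise": signalled once on disk
|     }
|
|     // logWriter drains queued requests, writes them as one
|     // batch, issues a single fsync, then completes every request.
|     func logWriter(f *os.File, reqs <-chan writeReq) {
|         for first := range reqs {
|             batch := []writeReq{first}
|         drain:
|             for len(batch) < 1024 { // cap for forward progress
|                 select {
|                 case r := <-reqs:
|                     batch = append(batch, r)
|                 default:
|                     break drain
|                 }
|             }
|             var err error
|             for _, r := range batch {
|                 if err == nil {
|                     _, err = f.Write(r.data)
|                 }
|             }
|             if err == nil {
|                 err = f.Sync() // one fsync covers the whole batch
|             }
|             for _, r := range batch {
|                 r.done <- err // ack or fail every caller
|             }
|         }
|     }
|
| A caller makes `done := make(chan error, 1)`, sends
| `writeReq{data, done}` on the queue, and only acknowledges its own
| client after reading a nil error from `done`.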
| aphyr wrote:
| Ah, pardon me, spoke too quickly! I remembered that it
| fsynced by default, and offered batching, and forgot that
| the batch size is 0 by default. My bad!
| to11mtm wrote:
| Well, the write is still tunable, so you are still correct.
|
| Just wanted to clarify that the default is still at least
| safe, in case people perusing this for things to worry about
| were, well, thinking about worrying.
|
| Love all of your work and writings, thank you for all you
| do!
| senderista wrote:
| In practice, there must be a delay (from batching) if you
| fsync every transaction before acknowledging commit. The
| database would be unusably slow otherwise.
| aaronbwebber wrote:
| It's not just better performance on latency benchmarks, it
| likely improves throughput as well because the writes will be
| batched together.
|
| Many applications do not require true durability and it is
| likely that many applications benefit from lazy fsync. Whether
| it should be the default is a lot more questionable though.
| johncolanduoni wrote:
| It's like using a non-cryptographically-secure RNG: if you
| don't know enough to notice that fsync is off yourself, it's
| unlikely you know enough to evaluate the impact of durability
| on your application.
| traceroute66 wrote:
| > if you don't know enough to notice that fsync is off
| yourself,
|
| Yeah, it should use safe defaults.
|
| Then you can always go read the corners of the docs for the
| "go faster" mode.
|
| Just like Postgres's infamous "non-durable settings" page...
| https://www.postgresql.org/docs/18/non-durability.html
| senderista wrote:
| For transactional durability, the writes will definitely be
| batched ("group commit"), because otherwise throughput would
| collapse.
| dilyevsky wrote:
| Massively improves benchmark performance. Like 5-10x
| speedgoose wrote:
| /dev/null is even faster.
| formerly_proven wrote:
| /dev/null tends to lose a lot more data.
| onionisafruit wrote:
| Just wait until the jepsen report on /dev/null. It's
| going to be brutal.
| orthoxerox wrote:
| /dev/null works according to spec, can't accuse it of not
| doing something it has never promised
| mrkeen wrote:
| One of the perks of being distributed, I guess.
|
| The kind of failure that a system can tolerate with strict
| fsync but can't tolerate with lazy fsync (i.e. the software
| 'confirms' a write to its caller but then crashes) is probably
| not the kind of failure you'd expect to encounter on a majority
| of your nodes all at the same time.
| johncolanduoni wrote:
| It is if they're in the same physical datacenter. Usually the
| way this is done is to wait for at least M replicas to fsync,
| but only require the data to be in memory for the rest. It
| smooths out the tail latencies, which are quite high for
| SSDs.
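| A rough Go sketch of that shape (hypothetical interface, not any
| particular system's API): fan the write out to every replica, but
| acknowledge the client once at least m of them report a completed
| fsync; the remaining replicas may still only hold it in memory:
|
|     package repl
|
|     import "fmt"
|
|     // Replica is a hypothetical stand-in for one storage node.
|     type Replica interface {
|         // AppendDurable writes data and reports whether the node
|         // fsynced it (true) or only buffered it in memory.
|         AppendDurable(data []byte) (fsynced bool, err error)
|     }
|
|     // replicate acks once at least m replicas confirm an fsync.
|     func replicate(data []byte, replicas []Replica, m int) error {
|         type result struct {
|             fsynced bool
|             err     error
|         }
|         results := make(chan result, len(replicas))
|         for _, r := range replicas {
|             go func(r Replica) {
|                 ok, err := r.AppendDurable(data)
|                 results <- result{ok, err}
|             }(r)
|         }
|         durable := 0
|         for range replicas {
|             res := <-results
|             if res.err == nil && res.fsynced {
|                 durable++
|                 if durable >= m {
|                     return nil // safe to ack the client now
|                 }
|             }
|         }
|         return fmt.Errorf("%d/%d replicas fsynced", durable, m)
|     }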
| senderista wrote:
| You can push the safety envelope a bit further and wait for
| your data to only be in memory in N separate fault domains.
| Yes, your favorite ultra-reliable cloud service may be
| doing this.
| clemlesne wrote:
| NATS is a fantastic piece of software. But the docs are
| impractical and half-baked. It's a shame to have to
| reverse-engineer the software from GitHub just to understand
| the auth schemes.
| belter wrote:
| > NATS is a fantastic piece of software.
|
| - ACKed messages can be silently lost due to minority-node
| corruption.
|
| - Single-bit corruption can erase up to 78% of stored messages
| on some replicas.
|
| - Snapshot corruption may trigger full-stream deletion across
| the cluster.
|
| - Default lazy-fsync wipes minutes of acknowledged writes on
| crash.
|
| - Crash + delay can produce persistent split-brain and
| divergent logs.
|
| Are you the Mother? Because only a Mother could love such an
| ugly baby....
| hurturue wrote:
| do you have a better solution?
|
| as they would say, NATS is a terrible message bus system, but
| all the others are worse
| adhamsalama wrote:
| Are RabbitMQ's durable queues worse?
| johncolanduoni wrote:
| Pulsar can do most of what NATS can, but at a much higher
| cost in both compute and operations (though I haven't seen
| a head-to-head of each with durability turned on), along
| with some simply different characteristics (like NATS being
| suitable for sidecar deployment). NATS is fantastic for
| ephemeral messaging, but some of this report is really
| concerning when JetStream has been shipping for years.
| Thaxll wrote:
| "PostgreSQL used fsync incorrectly for 20 years"
|
| https://archive.fosdem.org/2019/schedule/event/postgresql_fs...
|
| It did not prevent people from using it. You won't find a
| database with perfect durability, ease of use, performance,
| etc. It's all about tradeoffs.
| dijit wrote:
| Realistically speaking, PostgreSQL wasn't handling a failed
| call to fsync, which is wrong, but materially different from a
| bad design or errors in logic stemming from many areas.
|
| PostgreSQL was able to fix its bug in 3 lines of code; how many
| would it take for the parent system?
|
| I understand your core thesis (sometimes durability guarantees
| aren't as needed as we think), but in PostgreSQL's case the
| edge was incredibly thin. It would have required a failed call
| to fsync and a system-level failure of the host _before another
| call to fsync_ (and fsync calls are reasonably common).
|
| It's far too apples-to-oranges a comparison to be meaningful to
| bring up, I'm afraid.
| Thaxll wrote:
| NATS allows you to fsync on every call; it's just not the
| default.
| cedws wrote:
| Interested to know if you found these issues yourself or from
| a source. Is Kafka any more robust?
| rockwotj wrote:
| Redpanda is https://jepsen.io/analyses/redpanda-21.10.1
| mring33621 wrote:
| NATS was originally made for simple, fast, ephemeral
| messaging.
|
| The persistence stuff is kinda new and it's not a surprise
| that there are limitations and bugs.
|
| You should see this report as a good thing, as it will add
| pressure for improvements.
| njuw wrote:
| > The persistence stuff is kinda new and it's not a
| surprise that there are limitations and bugs.
|
| It's not really that new. The precursor to JetStream was
| NATS Streaming Server [1], which was first tagged almost 10
| years ago [2].
|
| [1] https://github.com/nats-io/nats-streaming-server
|
| [2] https://github.com/nats-io/nats-streaming-server/releases/ta...
| KaiserPro wrote:
| NATS is ephemeral. If you can accept that, then you'll be
| fine.
| tptacek wrote:
| This is just a tl;dr of the article with a mean-spirited barb
| added.
| gostsamo wrote:
| Thanks, those reports are always a quiet pleasure to read even if
| one is a bit far from the domain.
| rdtsc wrote:
| > By default, NATS only flushes data to disk every two minutes,
| but acknowledges operations immediately. This approach can lead
| to the loss of committed writes when several nodes experience a
| power failure, kernel crash, or hardware fault concurrently--or
| in rapid succession (#7564).
|
| I am getting strong early MongoDB vibes. "Look how fast it is,
| it's web-scale!". Well, if you don't fsync, you'll go fast, but
| you'll go even faster piping customer data to /dev/null, too.
|
| Coordinated failures shouldn't be a novelty or a surprise any
| longer these days.
|
| I wouldn't trust a product that doesn't default to safest
| options. It's fine to provide relaxed modes of consistency and
| durability but just don't make them default. Let the user
| configure those themselves.
| CuriouslyC wrote:
| NATS data is ephemeral in many cases anyhow, so it makes a bit
| more sense here. If you wanted something fully durable with a
| stronger persistence story you'd probably use Kafka anyhow.
| nchmy wrote:
| Core NATS is ephemeral. JetStream is meant to be persistent,
| and is presented as a replacement for Kafka.
| petre wrote:
| So is MQTT, why bother with NATS then?
| KaiserPro wrote:
| MQTT doesn't have the same semantics.
| https://docs.nats.io/nats-concepts/core-nats/reqreply
| Request/reply is really useful if you need low latency but
| reasonably efficient queuing. (Make sure to mark your workers
| as busy while processing, otherwise you get latency spikes.)
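| For reference, request/reply with the official nats.go client
| looks roughly like this (a minimal sketch assuming the current
| nats.go API; URL and subject names are placeholders):
|
|     package main
|
|     import (
|         "fmt"
|         "time"
|
|         "github.com/nats-io/nats.go"
|     )
|
|     func main() {
|         nc, err := nats.Connect(nats.DefaultURL)
|         if err != nil {
|             panic(err)
|         }
|         defer nc.Drain()
|
|         // Responder: queue-group workers share the load.
|         nc.QueueSubscribe("greet", "workers", func(m *nats.Msg) {
|             m.Respond([]byte("hello, " + string(m.Data)))
|         })
|
|         // Requester: blocks until a reply or the timeout fires.
|         reply, err := nc.Request("greet", []byte("world"), time.Second)
|         if err != nil {
|             panic(err)
|         }
|         fmt.Println(string(reply.Data))
|     }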
| RedShift1 wrote:
| You can do request/reply with MQTT too, you just have to
| implement more bits yourself, whilst NATS has a nice API
| that abstracts that away for you.
| KaiserPro wrote:
| oh indeed, and clusters nicely.
| traceroute66 wrote:
| > NATS data is ephemeral in many cases anyhow, so it makes a
| bit more sense here
|
| Dude ... the guy was testing JetStream.
|
| Which, I quote from the first sentence of the first paragraph
| on the NATS website: "NATS has a built-in persistence engine
| called JetStream which enables messages to be stored and
| replayed at a later time."
| 0xbadcafebee wrote:
| Not flushing on every write is a very common tradeoff of speed
| over durability. Filesystems, databases, all kinds of systems
| do this. They have some hacks to prevent it from corrupting the
| entire dataset, but lost writes are accepted. You can often
| prevent this by enabling an option or tuning a parameter.
|
| > I wouldn't trust a product that doesn't default to safest
| options
|
| This would make most products suck, and require a crap-ton of
| manual fixes and tuning that most people would hate, if they
| even got the tuning right. You have to actually do some work
| yourself to make a system behave the way you require.
|
| For example, Postgres's isolation level is weak by default,
| leading to race conditions. You have to explicitly request
| serializable isolation to avoid them, which is a performance
| penalty.
| (https://martin.kleppmann.com/2014/11/25/hermitage-testing-th...)
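| Requesting the stricter level explicitly is a small change in
| most clients; for example, a sketch with Go's database/sql
| (driver and DSN are placeholders):
|
|     package main
|
|     import (
|         "context"
|         "database/sql"
|
|         _ "github.com/lib/pq" // any Postgres driver works here
|     )
|
|     func main() {
|         db, err := sql.Open("postgres", "postgres://localhost/app")
|         if err != nil {
|             panic(err)
|         }
|
|         // Ask for SERIALIZABLE explicitly; Postgres defaults to
|         // READ COMMITTED.
|         tx, err := db.BeginTx(context.Background(),
|             &sql.TxOptions{Isolation: sql.LevelSerializable})
|         if err != nil {
|             panic(err)
|         }
|         // ... do work; be prepared to retry on serialization
|         // failures (SQLSTATE 40001) ...
|         if err := tx.Commit(); err != nil {
|             panic(err)
|         }
|     }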
| zbentley wrote:
| I think "most people will have to turn on the setting to make
| things fast at the expense of durability" is a dubious
| assertion (plenty of system, even high-criticality ones, do
| not have a very high data rate and thus would not necessarily
| suffer unduly from e.g. fsync-every-write).
|
| Even if most users do turn out to want "fast_and_dangerous =
| true", that's not a particularly onerous burden to place on
| users: flip one setting, and hopefully learn from the setting
| name or the documentation consulted when learning about it
| that it poses operational risk.
| to11mtm wrote:
| In defense of PG: for better or worse, as far as I know, the
| 'what is the RDBMS default' question falls into two categories:
| - Read Committed default with MVCC (Oracle, Postgres,
| Firebird versions with MVCC, I -think- SQLite with WAL falls
| under this)
|
| - Read committed with write locks one way or another (MSSQL
| default, SQLite default, Firebird pre MVCC, probably Sybase
| given MSSQL's lineage...)
|
| I'm not aware of any RDBMS that treats 'serializable' as the
| default transaction level OOTB (I'd love to learn though!)
|
| ....
|
| All of that said, 'inconsistent read because you don't know
| your RDBMS and did not pay attention to the transaction model'
| has a very different blame direction than 'we YOLO fsync on a
| timer to improve throughput'.
|
| If anything, it scares me that there are no other tuning
| options involved, such as number of bytes or number of events.
|
| If I get a write-ack from a middleware I expect it to be
| written one way or another. Not 'It is written within X
| seconds'.
|
| AFAIK there's no RDBMS that will just 'lose a write' unless
| the disk happens to be corrupted (or, IDK, maybe someone
| YOLOing with chaos mode on DB2?)
| hansihe wrote:
| CockroachDB does Serializable by default
| TheTaytay wrote:
| > Filesystems, databases, all kinds of systems do this. They
| have some hacks to prevent it from corrupting the entire
| dataset, but lost writes are accepted.
|
| Woah, those are _really_ strong claims. "Lost writes are
| accepted"? Assuming we are talking about "acknowledged
| writes", which the article is discussing, I don't think it's
| true that this is a common default for databases and
| filesystems. Perhaps databases or K/V stores that are
| marketed as in-memory caches might have defaults like this,
| but I'm not familiar with other systems that do.
|
| I'm also getting MongoDB vibes from deciding not to flush
| except once every two minutes. Even deciding to wait a second
| would be pretty long, but two minutes? A lot happens in a
| busy system in 120 seconds...
| KaiserPro wrote:
| NATS is very upfront in that the only thing that is guaranteed
| is the cluster being up.
|
| I like that, and it allows me to build things around it.
|
| For us when we used it back in 2018, it performed well and was
| easy to administer. The multi-language APIs were also good.
| traceroute66 wrote:
| > NATS is very upfront in that the only thing that is
| guaranteed is the cluster being up.
|
| Not so fast.
|
| Their docs make some pretty bold claims about JetStream....
|
| They talk about JetStream addressing the _"fragility"_ of other
| streaming technology.
|
| And _"This functionality enables a different quality of service
| for your NATS messages, and enables fault-tolerant and
| high-availability configurations."_
|
| And one of their big selling points for JetStream is the whole
| "store and replay" thing. Which implies the storage bit should
| be trustworthy, no?
| KaiserPro wrote:
| Oh sorry, I was talking about NATS core, not JetStream. I'd be
| pretty sceptical about persistence.
| billywhizz wrote:
| the OP was specifically about jetstream so i guess you
| just didn't read it?
| KaiserPro wrote:
| just imagine I'm claude,
|
| _smoke bomb_
| gopalv wrote:
| > Well, if you don't fsync, you'll go fast, but you'll go even
| faster piping customer data to /dev/null, too.
|
| The trouble is that you need to specifically optimize for
| fsyncs, because usually it is either no brakes or hand-brake.
|
| The middle-ground of multi-transaction group-commit fsync seems
| to not exist anymore because of SSDs and massive IOPS you can
| pull off in general, but now it is about syscall context
| switches.
|
| Two minutes is a bit too much (also fdatasync vs fsync).
| Thaxll wrote:
| I don't think there is a modern database that has the safest
| options all turned on by default. For instance, the default
| transaction isolation level for PG is read committed, not
| serializable.
|
| One of the most used DBs in the world is Redis, and by default
| it fsyncs every second, not on every operation.
| hobs wrote:
| Pretty sure SQL Server won't acknowledge a write until it's in
| the WAL (you can go the opposite way and turn on delayed
| durability, though).
| lubesGordi wrote:
| I don't know about Jetstream, but redis cluster would only ack
| writes after replicating to a majority of nodes. I think there
| is some config on standalone redis too where you can ack after
| fsync (which apparently still doesn't guarantee anything
| because of buffering in the OS). In any case, understanding
| what the ack implies is important, and I'd be frustrated if
| jetstream docs were not clear on that.
| maxmcd wrote:
| > > You can force an fsync after each messsage [sic] with always,
| this will slow down the throughput to a few hundred msg/s.
|
| Is the performance warning in the NATS docs possible to improve
| on? Couldn't you still run fsync on an interval and queue up a
| certain number of writes to be flushed at once? I could imagine
| latency suffering, but batch throughput could be preserved to
| some extent?
| scottlamb wrote:
| > Is the performance warning in the NATS docs possible to
| improve on? Couldn't you still run fsync on an interval and
| queue up a certain number of writes to be flushed at once? I
| could imagine latency suffering, but batch throughput could be
| preserved to some extent?
|
| Yes, and you shouldn't even need a fixed interval. Just queue
| up any writes while an `fsync` is pending; then do all those in
| the next batch. This is the same approach you'd use for rounds
| of Paxos, particularly between availability zones or regions
| where latency is expected to be high. You wouldn't say "oh,
| I'll ack and then put it in the next round of Paxos", or "I'll
| wait until the next round in 2 seconds then ack"; you'd start
| the next batch as soon as the current one is done.
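| A compact Go sketch of that "no fixed interval" variant
| (hypothetical types; a real log would also handle rotation and
| shutdown): appends go straight to the file, and a single sync
| loop acknowledges, in one batch, everything appended since the
| previous fsync returned:
|
|     package wal
|
|     import (
|         "os"
|         "sync"
|     )
|
|     type Log struct {
|         mu      sync.Mutex
|         f       *os.File
|         waiters []chan error  // written but not yet fsynced
|         kick    chan struct{} // wakes the sync loop
|     }
|
|     func NewLog(f *os.File) *Log {
|         l := &Log{f: f, kick: make(chan struct{}, 1)}
|         go l.syncLoop()
|         return l
|     }
|
|     // Append writes data and returns a channel that is
|     // signalled once the data has been covered by an fsync.
|     func (l *Log) Append(data []byte) <-chan error {
|         done := make(chan error, 1)
|         l.mu.Lock()
|         if _, err := l.f.Write(data); err != nil {
|             l.mu.Unlock()
|             done <- err
|             return done
|         }
|         l.waiters = append(l.waiters, done)
|         l.mu.Unlock()
|         select {
|         case l.kick <- struct{}{}: // wake the sync loop if idle
|         default: // a wakeup is already pending
|         }
|         return done
|     }
|
|     // syncLoop runs one fsync at a time; writes that arrive
|     // while an fsync is in flight ride in the next batch.
|     func (l *Log) syncLoop() {
|         for range l.kick {
|             l.mu.Lock()
|             batch := l.waiters
|             l.waiters = nil
|             l.mu.Unlock()
|             if len(batch) == 0 {
|                 continue
|             }
|             err := l.f.Sync()
|             for _, done := range batch {
|                 done <- err
|             }
|         }
|     }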
| stmw wrote:
| Every time someone builds one of these things and skips over
| "overcomplicated theory", aphyr destroys them. At this point, I
| wonder if we could train an AI to look over a project's
| documentation, and predict whether it's likely to lose committed
| writes just based on the marketing / technical claims. We
| probably can.
| awesome_dude wrote:
| /me strokes my long grey beard and nods
|
| People always think "theory is overrated" or "hacking is better
| than having a school education"
|
| And then proceed to shoot themselves in the foot with
| "workarounds" that break well known, well documented, well
| traversed problem spaces
| whimsicalism wrote:
| certainly a narrative that is popular among the grey beard
| crowd, yes. in pretty much every field i've worked on, the
| opposite problem has been much much more common.
| dboreham wrote:
| I've asked LLMs to do similar tasks and the results were very
| useful.
| dzonga wrote:
| nats jetstream vs say redis streams - which one have people found
| easier to work with?
| ViewTrick1002 wrote:
| When I worked with bounded Redis streams a couple of years ago
| we had to implement hand rolled backpressure which was quite
| tricky to get right.
|
| To implement backpressure without relying on out-of-band
| signals (distributed systems beware) you need a deep
| understanding of the entire Redis Streams architecture and how
| the pending entries list, consumer groups, consumers, etc. work
| and interact so that you don't lose data by overwriting
| yourself.
|
| Unbounded would have been fine if we could spill to disk and
| periodically clean up the data, but this is redis.
|
| Not sure if that has improved.
| johncolanduoni wrote:
| Wow. I've used NATS for best-effort in-memory pub/sub, which it
| has been great for, including getting subtle scaling details
| right. I never touched their persistence and would have
| investigated more before I did, but I wouldn't have expected it
| to be this bad. Vulnerability to simple single-bit file
| corruption is embarrassing.
| rishabhaiover wrote:
| NATS be trippin, no CAP.
| veverkap wrote:
| Underrated
| williamstein wrote:
| https://github.com/williamstein/nats-bugs
___________________________________________________________________
(page generated 2025-12-08 23:00 UTC)