[HN Gopher] Jepsen: NATS 2.12.1
___________________________________________________________________
Jepsen: NATS 2.12.1
Author : aphyr
Score : 207 points
Date : 2025-12-08 18:51 UTC (4 hours ago)
(HTM) web link (jepsen.io)
(TXT) w3m dump (jepsen.io)
| vrnvu wrote:
| Sort of related. Jepsen and Antithesis recently released a
| glossary of common terms which is a fantastic reference.
|
| https://jepsen.io/blog/2025-10-20-distsys-glossary
| merb wrote:
| > 3.4 Lazy fsync by Default
|
| Why? Why do some databases do that? To have better performance in
| benchmarks? It might be acceptable if the default were safer, or if
| it were at least documented prominently. But especially when you
| run stuff in a small cluster you get bitten by things like that.
| thinkharderdev wrote:
| > To have better performance in benchmarks
|
| Yes, exactly.
| millipede wrote:
| I always wondered why the fsync has to be lazy. It seems like the
| fsyncs can be bundled together, and the notification messages held
| for a few millis while the write completes. Similar to TCP corking.
| There doesn't need to be one fsync per consensus round.
| aphyr wrote:
| Yes, good call! You can batch up multiple operations into a
| single call to fsync. You can also tune the number of
| milliseconds or bytes you're willing to buffer before calling
| `fsync` to balance latency and throughput. This is how
| databases like Postgres work by default--see the
| `commit_delay` option here:
| https://www.postgresql.org/docs/8.1/runtime-config-wal.html
| to11mtm wrote:
| > This is how databases like Postgres work by default--see
| the `commit_delay` option here:
| https://www.postgresql.org/docs/8.1/runtime-config-wal.html
|
| I must note that the default for Postgres is that there is
| NO delay, which is a sane default.
|
| > You can batch up multiple operations into a single call
| to fsync.
|
| I've done this in various messaging implementations for
| throughput, and it's actually fairly easy to do in most
| languages.
|
| Basically, set up 1-N writers (depending on how you're storing
| data) that take items containing the data to be written
| alongside a TaskCompletionSource (a Promise, in Java terms).
| When something wants to write, it pushes the item onto that
| local queue; the worker(s) on the queue write out messages in
| batches, tuned for write size, number of records, etc., for both
| throughput and guaranteed forward progress. When the write
| completes, you either complete or fail the TCS/Promise.
|
| If you've got the right 'glue' with your language/libraries it's
| not that hard; this example [0] from Akka.NET's SQL persistence
| layer shows how simple the actual write processor's logic can
| be... Yeah, you have to think about queueing a little, but I've
| found this basic pattern _very_ adaptable (e.g. the queueing op
| can just send a bunch of ready-to-go bytes and you work off that
| for the threshold instead, add framing if needed, etc.).
|
| [0] https://github.com/akkadotnet/Akka.Persistence.Sql/blob/7bab...
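| A minimal Go sketch of this pattern (hypothetical names, not the
| linked Akka.NET code): callers enqueue a payload plus a completion
| channel standing in for the Promise/TCS, and a single worker
| drains the queue, writes the batch, issues one fsync, and then
| completes every caller:
|
|     // Hypothetical group-commit worker.
|     package wal
|
|     import "os"
|
|     type writeReq struct {
|         data []byte
|         done chan error // the "promise": signalled once on disk
|     }
|
|     // logWriter drains queued requests, writes them as one
|     // batch, issues a single fsync, then completes every request.
|     func logWriter(f *os.File, reqs <-chan writeReq) {
|         for first := range reqs {
|             batch := []writeReq{first}
|         drain:
|             for len(batch) < 1024 { // cap for forward progress
|                 select {
|                 case r := <-reqs:
|                     batch = append(batch, r)
|                 default:
|                     break drain
|                 }
|             }
|             var err error
|             for _, r := range batch {
|                 if err == nil {
|                     _, err = f.Write(r.data)
|                 }
|             }
|             if err == nil {
|                 err = f.Sync() // one fsync covers the whole batch
|             }
|             for _, r := range batch {
|                 r.done <- err // ack or fail every caller
|             }
|         }
|     }
|
| A caller makes `done := make(chan error, 1)`, sends
| `writeReq{data, done}` on the queue, and only acknowledges its own
| client after reading a nil error from `done`.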
| aphyr wrote:
| Ah, pardon me, spoke too quickly! I remembered that it
| fsynced by default, and offered batching, and forgot that
| the batch size is 0 by default. My bad!
| to11mtm wrote:
| Well, the write is still tunable, so you are still correct.
|
| Just wanted to clarify that the default is still at least
| safe, in case people perusing this for things to worry about
| were, well, thinking about worrying.
|
| Love all of your work and writings, thank you for all you
| do!
| senderista wrote:
| In practice, there must be a delay (from batching) if you
| fsync every transaction before acknowledging commit. The
| database would be unusably slow otherwise.
| aaronbwebber wrote:
| It's not just better performance on latency benchmarks, it
| likely improves throughput as well because the writes will be
| batched together.
|
| Many applications do not require true durability and it is
| likely that many applications benefit from lazy fsync. Whether
| it should be the default is a lot more questionable though.
| johncolanduoni wrote:
| It's like using a non-cryptographically-secure RNG: if you
| don't know enough to notice that fsync is off yourself, it's
| unlikely you know enough to evaluate the impact of durability
| on your application.
| traceroute66 wrote:
| > if you don't know enough to notice that fsync is off
| yourself,
|
| Yeah, it should use safe defaults.
|
| Then you can always go read the corners of the docs for the
| "go faster" mode.
|
| Just like Postgres's infamous "non-durable settings" page...
| https://www.postgresql.org/docs/18/non-durability.html
| senderista wrote:
| For transactional durability, the writes will definitely be
| batched ("group commit"), because otherwise throughput would
| collapse.
| dilyevsky wrote:
| Massively improves benchmark performance. Like 5-10x
| speedgoose wrote:
| /dev/null is even faster.
| formerly_proven wrote:
| /dev/null tends to lose a lot more data.
| onionisafruit wrote:
| Just wait until the jepsen report on /dev/null. It's
| going to be brutal.
| orthoxerox wrote:
| /dev/null works according to spec, can't accuse it of not
| doing something it has never promised
| mrkeen wrote:
| One of the perks of being distributed, I guess.
|
| The kind of failure that a system can tolerate with strict
| fsync but can't tolerate with lazy fsync (i.e. the software
| 'confirms' a write to its caller but then crashes) is probably
| not the kind of failure you'd expect to encounter on a majority
| of your nodes all at the same time.
| johncolanduoni wrote:
| It is if they're in the same physical datacenter. Usually the
| way this is done is to wait for at least M replicas to fsync,
| but only require the data to be in memory for the rest. It
| smooths out the tail latencies, which are quite high for
| SSDs.
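| A rough Go sketch of that shape (hypothetical interface, not any
| particular system's API): fan the write out to every replica, but
| acknowledge the client once at least m of them report a completed
| fsync; the remaining replicas may still only hold it in memory:
|
|     package repl
|
|     import "fmt"
|
|     // Replica is a hypothetical stand-in for one storage node.
|     type Replica interface {
|         // AppendDurable writes data and reports whether the node
|         // fsynced it (true) or only buffered it in memory.
|         AppendDurable(data []byte) (fsynced bool, err error)
|     }
|
|     // replicate acks once at least m replicas confirm an fsync.
|     func replicate(data []byte, replicas []Replica, m int) error {
|         type result struct {
|             fsynced bool
|             err     error
|         }
|         results := make(chan result, len(replicas))
|         for _, r := range replicas {
|             go func(r Replica) {
|                 ok, err := r.AppendDurable(data)
|                 results <- result{ok, err}
|             }(r)
|         }
|         durable := 0
|         for range replicas {
|             res := <-results
|             if res.err == nil && res.fsynced {
|                 durable++
|                 if durable >= m {
|                     return nil // safe to ack the client now
|                 }
|             }
|         }
|         return fmt.Errorf("%d/%d replicas fsynced", durable, m)
|     }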
| senderista wrote:
| You can push the safety envelope a bit further and wait for
| your data to only be in memory in N separate fault domains.
| Yes, your favorite ultra-reliable cloud service may be
| doing this.
| clemlesne wrote:
| NATS is a fantastic piece of software. But the docs are
| impractical and half-baked. It's a shame to have to
| reverse-engineer the software from GitHub just to understand
| the auth schemes.
| belter wrote:
| > NATS is a fantastic piece of software.
|
| - ACKed messages can be silently lost due to minority-node
| corruption.
|
| - Single-bit corruption can erase up to 78% of stored messages
| on some replicas.
|
| - Snapshot corruption may trigger full-stream deletion across
| the cluster.
|
| - Default lazy-fsync wipes minutes of acknowledged writes on
| crash.
|
| - Crash + delay can produce persistent split-brain and
| divergent logs.
|
| Are you the Mother? Because only a Mother could love such an
| ugly baby....
| hurturue wrote:
| do you have a better solution?
|
| as they would say, NATS is a terrible message bus system, but
| all the others are worse
| adhamsalama wrote:
| Are RabbitMQ's durable queues worse?
| johncolanduoni wrote:
| Pulsar can do most of what NATS can, but at a much higher
| cost in both compute and operations (though I haven't seen
| a head-to-head of each with durability turned on), along
| with some simply different characteristics (like NATS being
| suitable for sidecar deployment). NATS is fantastic for
| ephemeral messaging, but some of this report is really
| concerning when JetStream has been shipping for years.
| Thaxll wrote:
| "PostgreSQL used fsync incorrectly for 20 years"
|
| https://archive.fosdem.org/2019/schedule/event/postgresql_fs...
|
| It did not prevent people from using it. You won't find a
| database with perfect durability, ease of use, performance,
| etc. It's all about tradeoffs.
| dijit wrote:
| Realistically speaking, PostgreSQL wasn't handling a failed
| call to fsync, which is wrong, but materially different from a
| bad design or errors in logic stemming from many areas.
|
| PostgreSQL was able to fix its bug in 3 lines of code; how many
| would it take for the parent system?
|
| I understand your core thesis (sometimes durability guarantees
| aren't as needed as we think), but in PostgreSQL's case the
| edge was incredibly thin. It would have required a failed call
| to fsync and a system-level failure of the host _before another
| call to fsync_ (and fsync calls are reasonably common).
|
| It's far too apples-to-oranges a comparison to be meaningful to
| bring up, I'm afraid.
| Thaxll wrote:
| NATS allows you to fsync on every call; it's just not the
| default.
| cedws wrote:
| Interested to know if you found these issues yourself or from
| a source. Is Kafka any more robust?
| rockwotj wrote:
| Redpanda is https://jepsen.io/analyses/redpanda-21.10.1
| mring33621 wrote:
| NATS was originally made for simple, fast, ephemeral
| messaging.
|
| The persistence stuff is kinda new and it's not a surprise
| that there are limitations and bugs.
|
| You should see this report as a good thing, as it will add
| pressure for improvements.
| njuw wrote:
| > The persistence stuff is kinda new and it's not a
| surprise that there are limitations and bugs.
|
| It's not really that new. The precursor to JetStream was
| NATS Streaming Server [1], which was first tagged almost 10
| years ago [2].
|
| [1] https://github.com/nats-io/nats-streaming-server
|
| [2] https://github.com/nats-io/nats-streaming-server/releases/ta...
| KaiserPro wrote:
| NATS is ephemeral. If you can accept that, then you'll be
| fine.
| tptacek wrote:
| This is just a tl;dr of the article with a mean-spirited barb
| added.
| gostsamo wrote:
| Thanks, those reports are always a quiet pleasure to read even if
| one is a bit far from the domain.
| rdtsc wrote:
| > By default, NATS only flushes data to disk every two minutes,
| but acknowledges operations immediately. This approach can lead
| to the loss of committed writes when several nodes experience a
| power failure, kernel crash, or hardware fault concurrently--or
| in rapid succession (#7564).
|
| I am getting strong early MongoDB vibes. "Look how fast it is,
| it's web-scale!". Well, if you don't fsync, you'll go fast, but
| you'll go even faster piping customer data to /dev/null, too.
|
| Coordinated failures shouldn't be a novelty or a surprise any
| longer these days.
|
| I wouldn't trust a product that doesn't default to safest
| options. It's fine to provide relaxed modes of consistency and
| durability but just don't make them default. Let the user
| configure those themselves.
| CuriouslyC wrote:
| NATS data is ephemeral in many cases anyhow, so it makes a bit
| more sense here. If you wanted something fully durable with a
| stronger persistence story you'd probably use Kafka anyhow.
| nchmy wrote:
| Core NATS is ephemeral. JetStream is meant to be persistent,
| and is presented as a replacement for Kafka.
| petre wrote:
| So is MQTT, why bother with NATS then?
| KaiserPro wrote:
| MQTT doesn't have the same semantics.
| https://docs.nats.io/nats-concepts/core-nats/reqreply
| Request/reply is really useful if you need low latency but
| reasonably efficient queuing. (Make sure to mark your workers
| as busy while processing, otherwise you get latency spikes.)
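| For reference, request/reply with the official nats.go client
| looks roughly like this (a minimal sketch assuming the current
| nats.go API; URL and subject names are placeholders):
|
|     package main
|
|     import (
|         "fmt"
|         "time"
|
|         "github.com/nats-io/nats.go"
|     )
|
|     func main() {
|         nc, err := nats.Connect(nats.DefaultURL)
|         if err != nil {
|             panic(err)
|         }
|         defer nc.Drain()
|
|         // Responder: queue-group workers share the load.
|         nc.QueueSubscribe("greet", "workers", func(m *nats.Msg) {
|             m.Respond([]byte("hello, " + string(m.Data)))
|         })
|
|         // Requester: blocks until a reply or the timeout fires.
|         reply, err := nc.Request("greet", []byte("world"), time.Second)
|         if err != nil {
|             panic(err)
|         }
|         fmt.Println(string(reply.Data))
|     }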
| RedShift1 wrote:
| You can do request/reply with MQTT too, you just have to
| implement more bits yourself, whilst NATS has a nice API
| that abstracts that away for you.
| KaiserPro wrote:
| oh indeed, and clusters nicely.
| traceroute66 wrote:
| > NATS data is ephemeral in many cases anyhow, so it makes a
| bit more sense here
|
| Dude ... the guy was testing JetStream.
|
| Which, I quote from the first sentence of the first paragraph
| on the NATS website: "NATS has a built-in persistence engine
| called JetStream which enables messages to be stored and
| replayed at a later time."
| 0xbadcafebee wrote:
| Not flushing on every write is a very common tradeoff of speed
| over durability. Filesystems, databases, all kinds of systems
| do this. They have some hacks to prevent it from corrupting the
| entire dataset, but lost writes are accepted. You can often
| prevent this by enabling an option or tuning a parameter.
|
| > I wouldn't trust a product that doesn't default to safest
| options
|
| This would make most products suck, and require a crap-ton of
| manual fixes and tuning that most people would hate, if they
| even got the tuning right. You have to actually do some work
| yourself to make a system behave the way you require.
|
| For example, Postgres's isolation level is weak by default,
| leading to race conditions. You have to explicitly request
| serializable isolation to avoid them, which is a performance
| penalty.
| (https://martin.kleppmann.com/2014/11/25/hermitage-testing-th...)
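| Requesting the stricter level explicitly is a small change in
| most clients; for example, a sketch with Go's database/sql
| (driver and DSN are placeholders):
|
|     package main
|
|     import (
|         "context"
|         "database/sql"
|
|         _ "github.com/lib/pq" // any Postgres driver works here
|     )
|
|     func main() {
|         db, err := sql.Open("postgres", "postgres://localhost/app")
|         if err != nil {
|             panic(err)
|         }
|
|         // Ask for SERIALIZABLE explicitly; Postgres defaults to
|         // READ COMMITTED.
|         tx, err := db.BeginTx(context.Background(),
|             &sql.TxOptions{Isolation: sql.LevelSerializable})
|         if err != nil {
|             panic(err)
|         }
|         // ... do work; be prepared to retry on serialization
|         // failures (SQLSTATE 40001) ...
|         if err := tx.Commit(); err != nil {
|             panic(err)
|         }
|     }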
| zbentley wrote:
| I think "most people will have to turn on the setting to make
| things fast at the expense of durability" is a dubious
| assertion (plenty of system, even high-criticality ones, do
| not have a very high data rate and thus would not necessarily
| suffer unduly from e.g. fsync-every-write).
|
| Even if most users do turn out to want "fast_and_dangerous =
| true", that's not a particularly onerous burden to place on
| users: flip one setting, and hopefully learn from the setting
| name or the documentation consulted when learning about it
| that it poses operational risk.
| to11mtm wrote:
| In defense of PG: for better or worse, as far as I know, the
| 'what is the RDBMS default' question falls into two categories:
| - Read Committed default with MVCC (Oracle, Postgres,
| Firebird versions with MVCC, I -think- SQLite with WAL falls
| under this)
|
| - Read committed with write locks one way or another (MSSQL
| default, SQLite default, Firebird pre MVCC, probably Sybase
| given MSSQL's lineage...)
|
| I'm not aware of any RDBMS that treats 'serializable' as the
| default transaction level OOTB (I'd love to learn though!)
|
| ....
|
| All of that said, 'inconsistent read because you don't know
| your RDBMS and did not pay attention to the transaction model'
| has a very different blame direction than 'we YOLO fsync on a
| timer to improve throughput'.
|
| If anything, it scares me that there are no other tuning
| options involved, such as number of bytes or number of events.
|
| If I get a write-ack from a middleware I expect it to be
| written one way or another. Not 'It is written within X
| seconds'.
|
| AFAIK there's no RDBMS that will just 'lose a write' unless
| the disk happens to be corrupted (or, IDK, maybe someone
| YOLOing with chaos mode on DB2?)
| hansihe wrote:
| CockroachDB does Serializable by default
| TheTaytay wrote:
| > Filesystems, databases, all kinds of systems do this. They
| have some hacks to prevent it from corrupting the entire
| dataset, but lost writes are accepted.
|
| Woah, those are _really_ strong claims. "Lost writes are
| accepted"? Assuming we are talking about "acknowledged
| writes", which the article is discussing, I don't think it's
| true that this is a common default for databases and
| filesystems. Perhaps databases or K/V stores that are
| marketed as in-memory caches might have defaults like this,
| but I'm not familiar with other systems that do.
|
| I'm also getting MongoDB vibes from deciding not to flush
| except once every two minutes. Even deciding to wait a second
| would be pretty long, but two minutes? A lot happens in a
| busy system in 120 seconds...
| KaiserPro wrote:
| NATS is very upfront in that the only thing that is guaranteed
| is the cluster being up.
|
| I like that, and it allows me to build things around it.
|
| For us when we used it back in 2018, it performed well and was
| easy to administer. The multi-language APIs were also good.
| traceroute66 wrote:
| > NATS is very upfront in that the only thing that is
| guaranteed is the cluster being up.
|
| Not so fast.
|
| Their docs make some pretty bold claims about JetStream....
|
| They talk about JetStream addressing the _"fragility"_ of other
| streaming technology.
|
| And _"This functionality enables a different quality of service
| for your NATS messages, and enables fault-tolerant and
| high-availability configurations."_
|
| And one of their big selling points for JetStream is the whole
| "store and replay" thing. Which implies the storage bit should
| be trustworthy, no?
| KaiserPro wrote:
| Oh sorry, I was talking about NATS core, not JetStream. I'd be
| pretty sceptical about persistence.
| billywhizz wrote:
| the OP was specifically about jetstream so i guess you
| just didn't read it?
| KaiserPro wrote:
| just imagine I'm claude,
|
| _smoke bomb_
| gopalv wrote:
| > Well, if you don't fsync, you'll go fast, but you'll go even
| faster piping customer data to /dev/null, too.
|
| The trouble is that you need to specifically optimize for
| fsyncs, because usually it is either no brakes or hand-brake.
|
| The middle-ground of multi-transaction group-commit fsync seems
| to not exist anymore because of SSDs and massive IOPS you can
| pull off in general, but now it is about syscall context
| switches.
|
| Two minutes is a bit too much (also fdatasync vs fsync).
| Thaxll wrote:
| I don't think there is a modern database that has the safest
| options all turned on by default. For instance, the default
| transaction isolation level for PG is read committed, not
| serializable.
|
| One of the most used DBs in the world is Redis, and by default
| it fsyncs every second, not on every operation.
| hobs wrote:
| Pretty sure SQL Server won't acknowledge a write until it's in
| the WAL (you can go the opposite way and turn on delayed
| durability, though).
| lubesGordi wrote:
| I don't know about Jetstream, but redis cluster would only ack
| writes after replicating to a majority of nodes. I think there
| is some config on standalone redis too where you can ack after
| fsync (which apparently still doesn't guarantee anything
| because of buffering in the OS). In any case, understanding
| what the ack implies is important, and I'd be frustrated if
| jetstream docs were not clear on that.
| maxmcd wrote:
| > > You can force an fsync after each messsage [sic] with always,
| this will slow down the throughput to a few hundred msg/s.
|
| Is the performance warning in the NATS docs possible to improve
| on? Couldn't you still run fsync on an interval and queue up a
| certain number of writes to be flushed at once? I could imagine
| latency suffering, but batch throughput could be preserved to
| some extent?
| scottlamb wrote:
| > Is the performance warning in the NATS docs possible to
| improve on? Couldn't you still run fsync on an interval and
| queue up a certain number of writes to be flushed at once? I
| could imagine latency suffering, but batch throughput could be
| preserved to some extent?
|
| Yes, and you shouldn't even need a fixed interval. Just queue
| up any writes while an `fsync` is pending; then do all those in
| the next batch. This is the same approach you'd use for rounds
| of Paxos, particularly between availability zones or regions
| where latency is expected to be high. You wouldn't say "oh,
| I'll ack and then put it in the next round of Paxos", or "I'll
| wait until the next round in 2 seconds then ack"; you'd start
| the next batch as soon as the current one is done.
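| A compact Go sketch of that "no fixed interval" variant
| (hypothetical types; a real log would also handle rotation and
| shutdown): appends go straight to the file, and a single sync
| loop acknowledges, in one batch, everything appended since the
| previous fsync returned:
|
|     package wal
|
|     import (
|         "os"
|         "sync"
|     )
|
|     type Log struct {
|         mu      sync.Mutex
|         f       *os.File
|         waiters []chan error  // written but not yet fsynced
|         kick    chan struct{} // wakes the sync loop
|     }
|
|     func NewLog(f *os.File) *Log {
|         l := &Log{f: f, kick: make(chan struct{}, 1)}
|         go l.syncLoop()
|         return l
|     }
|
|     // Append writes data and returns a channel that is
|     // signalled once the data has been covered by an fsync.
|     func (l *Log) Append(data []byte) <-chan error {
|         done := make(chan error, 1)
|         l.mu.Lock()
|         if _, err := l.f.Write(data); err != nil {
|             l.mu.Unlock()
|             done <- err
|             return done
|         }
|         l.waiters = append(l.waiters, done)
|         l.mu.Unlock()
|         select {
|         case l.kick <- struct{}{}: // wake the sync loop if idle
|         default: // a wakeup is already pending
|         }
|         return done
|     }
|
|     // syncLoop runs one fsync at a time; writes that arrive
|     // while an fsync is in flight ride in the next batch.
|     func (l *Log) syncLoop() {
|         for range l.kick {
|             l.mu.Lock()
|             batch := l.waiters
|             l.waiters = nil
|             l.mu.Unlock()
|             if len(batch) == 0 {
|                 continue
|             }
|             err := l.f.Sync()
|             for _, done := range batch {
|                 done <- err
|             }
|         }
|     }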
| stmw wrote:
| Every time someone builds one of these things and skips over
| "overcomplicated theory", aphyr destroys them. At this point, I
| wonder if we could train an AI to look over a project's
| documentation, and predict whether it's likely to lose committed
| writes just based on the marketing / technical claims. We
| probably can.
| awesome_dude wrote:
| /me strokes my long grey beard and nods
|
| People always think "theory is overrated" or "hacking is better
| than having a school education"
|
| And then proceed to shoot themselves in the foot with
| "workarounds" that break well known, well documented, well
| traversed problem spaces
| whimsicalism wrote:
| certainly a narrative that is popular among the grey beard
| crowd, yes. in pretty much every field i've worked on, the
| opposite problem has been much much more common.
| dboreham wrote:
| I've asked LLMs to do similar tasks and the results were very
| useful.
| dzonga wrote:
| nats jetstream vs say redis streams - which one have people found
| easier to work with?
| ViewTrick1002 wrote:
| When I worked with bounded Redis streams a couple of years ago
| we had to implement hand rolled backpressure which was quite
| tricky to get right.
|
| To implement backpressure without relying on out-of-band
| signals (distributed systems beware) you need a deep
| understanding of the entire Redis Streams architecture and how
| the pending entries list, consumer groups, consumers, etc. work
| and interact so that you don't lose data by overwriting
| yourself.
|
| Unbounded would have been fine if we could spill to disk and
| periodically clean up the data, but this is redis.
|
| Not sure if that has improved.
| johncolanduoni wrote:
| Wow. I've used NATS for best-effort in-memory pub/sub, which it
| has been great for, including getting subtle scaling details
| right. I never touched their persistence and would have
| investigated more before I did, but I wouldn't have expected it
| to be this bad. Vulnerability to simple single-bit file
| corruption is embarrassing.
| rishabhaiover wrote:
| NATS be trippin, no CAP.
| veverkap wrote:
| Underrated
| williamstein wrote:
| https://github.com/williamstein/nats-bugs
___________________________________________________________________
(page generated 2025-12-08 23:00 UTC)