[HN Gopher] Waiting for Postgres 18: Accelerating Disk Reads wit...
       ___________________________________________________________________
        
       Waiting for Postgres 18: Accelerating Disk Reads with Asynchronous
       I/O
        
       Author : lfittl
       Score  : 342 points
       Date   : 2025-05-07 14:57 UTC (8 hours ago)
        
 (HTM) web link (pganalyze.com)
 (TXT) w3m dump (pganalyze.com)
        
       | nu11ptr wrote:
       | Is this new async I/O feature for Linux only?
       | 
       | I know Windows has IOCP and also now an IORing implementation of
       | its own (Less familiar with macOS capabilities other than POSIX
       | AIO).
       | 
       | https://learn.microsoft.com/en-us/windows/win32/api/ioringap...
       | 
       | Update: Most of the comments below seem to be missing the fact
       | that Windows now also has an IORing implementation, as I
       | mentioned above. Comparison article here:
       | 
       | https://windows-internals.com/ioring-vs-io_uring-a-compariso...
        
         | stingraycharles wrote:
         | Sounds like this feature is based on io_uring, which is a Linux
         | feature. I would be surprised if they implemented async I/O on
         | Windows before they did on Linux, given that the user/deployment
         | base is very Linux-heavy.
        
           | PaulHoule wrote:
           | For a long time there have been APIs for asynchronous file
           | I/O on the books for Linux, but they weren't worth using
           | because they didn't really speed anything up.
        
             | immibis wrote:
             | IIRC they literally just did the sync I/O on a worker
             | thread.
        
             | anarazel wrote:
             | They sped up things for a long time - but only when using
             | unbuffered IO. The new thing with io_uring is that it also
             | accelerates buffered IO. In the initial version it was all
             | through kernel worker threads, but these days several
             | filesystems have better paths for common cases.
        
           | p_ing wrote:
           | Yeah, surprise: Linux had to play catch-up to a Windows 1994
           | release! Same with the scheduler, and I'd argue Windows does
           | OOM handling better than Linux today...
           | 
           | Windows even had the concept of io_uring before, but network
           | only with Registered I/O back in the Windows 8 (8.1?) days.
           | 
           | Linux still lacks the "all I/O is async" model that NT has.
           | 
           | The underlying kernel and executive of Windows aren't
           | primitive pieces of trash. They're quite advanced, ruined by
           | PMs and the Ads division throwing crap on top.
           | 
           | And yes, Windows' I/O Ring is a near 1:1 copy of the Linux
           | implementation, but IOCP and the OVERLAPPED I/O data
           | structure have been there since NT's first release.
           | 
           | This isn't a pissing match; we should all just hope that
           | kernel devs learn from each other and take the best ideas.
           | Sometimes we, IT, don't get to choose the OS we run -- it's
           | dictated by the apps the business requires.
        
             | greenavocado wrote:
             | How difficult would it be to completely tear out the
             | Windows desktop experience and just use the system and
             | display drivers without the rest? Has anybody attempted
             | such a feat?
        
               | dboreham wrote:
               | There are several such things. The Windows installer uses
               | one. X-Box uses another.
        
               | p_ing wrote:
               | There is Windows Server Core which removes everything but
               | a CLI, but you still have the normal login experience,
               | you still have a "desktop" (no start menu, taskbar, etc),
               | you can still launch normal Win32 apps... for the most
               | part (task manager, notepad, and so on).
               | 
               | Win32 is also responsible for core Services, which means
               | you can't de-Windows-ify Windows and strip it down to the
               | NT API only. All other personalities (OS/2, POSIX, SFU)
               | have a dependency on Win32 as well.
               | 
               | You're still running the WindowServer of course; it's
               | part of the Executive.
               | 
               | That said, with a bunch of modifications, NTDEV did get
               | Windows 11 down to its bare minimum, text-only to boot.
               | So I guess it's technically possible, though not useful.
               | 
               | https://www.youtube.com/watch?v=SL6t_iuitxM
        
               | smileybarry wrote:
               | > There is Windows Server Core which removes everything
               | but a CLI, but you still have the normal login
               | experience, you still have a "desktop" (no start menu,
               | taskbar, etc), you can still launch normal Win32 apps...
               | for the most part (task manager, notepad, and so on).
               | 
               | Yep, they've replaced nearly every UI with text (the
               | login window is a TUI), though there are still some shell
               | DLLs and the whole thing still uses a window manager.
               | That's honestly for the best, since it allows you to
               | migrate full installations with some UI-based apps to a
               | Core installation with those apps intact.
               | 
               | > That said, with a bunch of modifications, NTDEV did get
               | Windows 11 down to its bare minimum, text-only to boot.
               | So I guess it's technically possible, though not useful.
               | 
               | Windows has had a "text mode" since at least Windows XP
               | IIRC, but it's really not that useful, if at all. Even
               | for rescue operations you're better off with Windows PE.
        
               | Ericson2314 wrote:
               | There is always https://en.wikipedia.org/wiki/ReactOS !
               | :)
        
               | password4321 wrote:
               | Free-as-in-beer Hyper-V Server 2019 is in extended
               | support (only security updates) until 2029.
        
             | cyberax wrote:
             | > Same with the scheduler
             | 
             | Windows does OOM far better than Linux because it doesn't
             | really overcommit RAM.
             | 
             | But the CPU _scheduler_ in Linux is far, far, far better
             | than in Windows. Linux can even do hard-realtime, after
             | all.
        
           | nu11ptr wrote:
           | > this feature is based on io_uring which is a Linux feature
           | 
           | And now also a Windows feature, see my comment above for info
        
         | lfittl wrote:
         | It depends on the I/O method - as described in the article,
         | "io_uring" is only available on Linux (and requires building
         | with liburing, as well as io_uring being enabled in the
         | kernel), but the default (as of beta1) is actually "worker",
         | which works on any operating system.
         | 
         | The "worker" method uses a dedicated pool of I/O worker
         | processes that run in the background, and whilst not as
         | performant as io_uring in our benchmark, it clearly outperformed
         | the "sync" method (which matches what Postgres currently has in
         | 17 and older).
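         | 
         | For illustration, this is roughly how it looks in
         | postgresql.conf (a sketch using the beta1 setting names;
         | io_workers sizes the pool used by the "worker" method):
         | 
         |     # async I/O method: sync | worker | io_uring
         |     io_method = worker
         |     io_workers = 3
         |     
         |     # on Linux, if built with liburing:
         |     # io_method = io_uring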
        
           | nu11ptr wrote:
           | > "io_uring" is only available on Linux
           | 
           | Windows now also has IORing (see my comment above)
        
         | spwa4 wrote:
         | Yes, although Windows has had async I/O since Windows NT 3.1,
         | that API is still not supported by Postgres.
        
           | nu11ptr wrote:
           | Yes, that was IOCP, however, Windows now also has IORing (see
           | my comment above)
        
           | anarazel wrote:
           | FWIW, there are prototype patches for an IOCP based
           | io_method. We just couldn't get them into an acceptable state
           | for PG 18. I barely survived getting in what we did...
        
       | niux wrote:
       | I recently deployed Postgres on a dedicated Hetzner EX-44 server
       | (20 cores, 64GB RAM, 2x 512GB NVMe SSDs in RAID 1) for
       | EUR39/month. The price-to-performance ratio is exceptional,
       | providing enterprise-level capacity at a fraction of typical
       | cloud costs.
       | 
       | For security, I implemented Tailscale, which adds only ~5ms of
       | latency while completely eliminating public network exposure - a
       | worthwhile tradeoff for the significant security benefits.
       | 
       | My optimization approach includes:
       | 
       | - Workload-specific configuration generated via PGTune
       | (https://pgtune.leopard.in.ua/)
       | 
       | - Real-time performance monitoring with PgHero for identifying
       | bottlenecks
       | 
       | - Automated VACUUM ANALYZE operations scheduled via pg_cron
       | targeting write-heavy tables (see the sketch at the end of this
       | comment), which prevents performance degradation and helps me
       | sleep soundly
       | 
       | - A custom CLI utility I built for ZSTD-compressed backups that
       | achieves impressive compression ratios while maintaining high
       | throughput, with automatic S3 uploading:
       | https://github.com/overflowy/pgbackup
       | 
       | This setup has been remarkably stable and performant, handling
       | our workloads with substantial headroom for growth.
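       | 
       | For example, the pg_cron part boils down to a scheduled statement
       | along these lines (table name made up for illustration):
       | 
       |     -- VACUUM ANALYZE a write-heavy table every night at 03:00
       |     SELECT cron.schedule('nightly-vacuum-events', '0 3 * * *',
       |                          'VACUUM ANALYZE events');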
        
         | trollied wrote:
         | That's great, but you need a solid HA/backup and recovery
         | strategy if you even remotely care about your data.
        
         | Tostino wrote:
         | If I were you, I would absolutely use another backup utility
         | (in addition, if you want): barman, pgbackrest, etc.
         | 
         | You are just wrapping pg_dump, which is not a full-featured
         | backup solution. Great for a snapshot...
         | 
         | Use some of the existing tools and you get point-in-time
         | recovery, easy restores to hot standbys for replication, a good
         | failover story, backup rotations, etc.
        
           | niux wrote:
           | The reason I wrote my own tool is that I couldn't find
           | anything for PG 17 at the time, and pgbackrest seemed
           | overkill for my needs. Also, the CLI handles backup rotations
           | as well. Barman looks interesting though, I'll definitely
           | have a look, thanks!
        
             | Tostino wrote:
             | pgbackrest was always easy to use in my experience. Not
             | very hard to set up or configure, and low overhead. Supports
             | spool directories for WAL shipping, compression, and block
             | incremental backups (YAY!!!). I ran my last company on it
             | for the last ~6 years I was there. Never any complaints,
             | solid software (which is what you want for backups).
             | 
             | I have been using barman indirectly through CloudNativePG
             | with my latest company, but don't have the operational
             | experience to speak on it yet.
        
             | 9dev wrote:
             | pgbackrest only looks scary because it's so flexible, but
             | the defaults work great in almost all cases. The most
             | complex thing you'll need to do is creating a storage
             | bucket to write to and configuring the appropriate storage
             | provider in pgbackrest's config file.
             | 
             | When it's set up properly, it's rock solid. I'd really
             | recommend you check it out again; it likely solves
             | everything you built more elegantly, and also covers a ton
             | of things you didn't think of. Been there, done that :)
        
           | arp242 wrote:
           | Snapshots are backups. Not sufficient backups for some cases,
           | but perfectly fine for others.
        
           | lousken wrote:
           | do you still need it? haven't managed pg for a while, but
           | shouldn't pg17 have some solution for backups?
        
         | codegeek wrote:
         | Would you be willing to open source the setup? I would love to
         | learn from it.
        
           | digdugdirk wrote:
           | Seconded. This corner of the programming world is a deep dark
           | and scary place for people who haven't had solid industry
           | experience. It'd be hugely helpful to have a barebones
           | starting point to begin learning best practices.
        
       | the8472 wrote:
       | On Linux there is also preadv2(..., RWF_NOWAIT), which can be
       | used to do an optimistic non-blocking read from the page cache.
       | That might be useful for io_method = worker to shave off a bit
       | of latency: try reading on the main thread with NOWAIT and only
       | offload to a worker when that fails.
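       | 
       | Roughly, the fast path could look like this (a sketch, not
       | Postgres code; error handling omitted):
       | 
       |     #define _GNU_SOURCE
       |     #include <sys/uio.h>
       |     #include <errno.h>
       |     
       |     /* Try to serve the read from the page cache without
       |      * blocking; returns -1 if the data isn't cached. */
       |     static ssize_t
       |     try_cached_read(int fd, void *buf, size_t len, off_t off)
       |     {
       |         struct iovec iov = { .iov_base = buf, .iov_len = len };
       |         ssize_t n = preadv2(fd, &iov, 1, off, RWF_NOWAIT);
       |         if (n < 0 && errno == EAGAIN)
       |             return -1;  /* not cached: hand off to an I/O worker */
       |         return n;       /* served directly from the page cache */
       |     }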
        
         | anarazel wrote:
         | FWIW, I played with that - unfortunately it seems that the
         | overhead of doing the page cache lookup twice is a cure worse
         | than the disease.
         | 
         | Note that we do not offload IO to workers when doing I/O that
         | the caller will synchronously wait for, just when the caller
         | actually can do IO asynchronously. That reduces the need to
         | avoid the offload cost.
         | 
         | It turns out, as some of the results in Lukas' post show, that
         | the offload to the worker is often actually beneficial
         | _particularly_ when the data is in the kernel page cache - it
         | parallelizes the memory copy from kernel to userspace and
         | postgres' checksum computation. Particularly on Intel server
         | CPUs, which have had pretty mediocre per-core memory bandwidth
         | in the last ~decade, memory bandwidth turns out to be a
         | bottleneck for page cache access and checksum computations.
         | 
         | Edit: Fix negation
        
           | gavinray wrote:
           | Do you think there's a possibility of Direct IO being adopted
           | at some point in the future now that AIO is available?
        
             | anarazel wrote:
             | > Do you think there's a possibility of Direct IO being
             | adopted at some point in the future now that AIO is
             | available?
             | 
             | Explicitly a goal.
             | 
             | You can turn it on today, with a bunch of caveats (via
             | debug_io_direct=data). If you have the right workload -
             | e.g. read-only with lots of seqscans, bitmap index scans,
             | etc. - you can see rather substantial perf gains. But it'll
             | suck in many cases in 18.
             | 
             | We need at least:
             | 
             | - AIO writes in checkpointer, bgwriter and backend buffer
             | replacement (think bulk loading data with COPY)
             | 
             | - readahead support in a few more places, most crucially
             | index range scan (works out ok today if the heap is
             | correlated with the index, sucks badly otherwise)
             | 
             | EDIT: Formatting
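             | 
             | If someone wants to experiment, a minimal postgresql.conf
             | sketch of that setup (emphatically not for production):
             | 
             |     io_method = io_uring
             |     # direct I/O for data files; debug-only, many caveats
             |     debug_io_direct = 'data'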
        
           | the8472 wrote:
           | Ah yeah, getting good kernel<>userspace oneshot memcpy
           | performance for large files is surprisingly hard. mmap has
           | setup/teardown overhead that's significant for oneshot
           | transfers, regular read/write calls suffer from page
           | cache/per page overhead. Hopefully all the large folio work
           | in the kernel will help with that.
        
             | anarazel wrote:
             | From what I've seen a surprisingly large part of the
             | overhead is due to SMAP when doing larger reads from the
             | page cache - i.e. if I boot with clearcpuid=smap (not for
             | prod use!), larger reads go _significantly_ faster. On both
             | Intel and AMD CPUs interestingly.
             | 
             | On Intel it's also not hard to simply reach the per-core
             | memory bandwidth with modern storage HW. This matters most
             | prominently for writes by the checkpointing process, which
             | needs to compute data checksums given the current postgres
             | implementation (if enabled). But even for reads it can be a
             | bottleneck, e.g. when prewarming the buffer pool after a
             | restart.
        
             | yxhuvud wrote:
             | Well, nowadays there is
             | https://www.phoronix.com/news/Linux-RWF_UNCACHED-2024
        
               | the8472 wrote:
               | That doesn't speed up userspace<>kernel memcpy, it just
               | reduces cache churn. Despite its name it still goes
               | through the page cache, it just triggers writeback and
               | drops the pages once that's done. For example when
               | copying to a tmpfs it makes zero difference since that
               | lives entirely in memory.
        
               | senderista wrote:
               | So you're less dependent on the page replacement
               | algorithm being scan-resistant, since you can use this
               | flag for scan/loop workloads, right?
        
         | zX41ZdbW wrote:
         | The usage in ClickHouse database:
         | https://github.com/ClickHouse/ClickHouse/blob/d2697c0ea112fd...
         | 
         | Aside from a few problems in specific Linux kernel versions, it
         | works great.
        
       | pseudopersonal wrote:
       | Does anyone know when the update allowing more concurrent
       | connections is dropping, so we can stop using pgbouncer?
        
         | __s wrote:
         | That'll likely need moving away from process-per-connection, so
         | not any time soon
        
       | skeptrune wrote:
       | How close is this to the way MySQL does it with InnoDB? It
       | appears to be about the same.
        
         | seunosewa wrote:
         | Yep. It's a low hanging fruit they should've picked years ago.
         | 
         | They will eventually figure out using b-trees for tables too.
        
           | ARandomerDude wrote:
           | Can you elaborate on this B-tree part of your comment? I know
           | B-tree is the default index type in pg, but it sounds like
           | there's more to the story that I'm not familiar with.
        
             | greenavocado wrote:
             | PostgreSQL uses heap files for primary table storage, not
             | B-trees: table data is stored in heap files (unordered
             | collections of pages/blocks), while indexes (including
             | primary key indexes) use B-trees (specifically B+ trees).
             | When you query a table via an index, the B-tree index
             | points to locations in the heap file.
             | 
             | InnoDB uses a clustered index approach. The primary key
             | index is a B-tree. The actual table data is stored in the
             | leaf nodes of this B-tree. Secondary indexes point to the
             | primary key.
             | 
             | One is not better than the other in general terms. InnoDB's
             | clustered B-tree approach shines when:
             | 
             | You frequently access data in primary key order
             | 
             | Your workload has many range scans on the primary key
             | 
             | You need predictable performance for primary key lookups
             | 
             | Your data naturally has a meaningful ordering that matches
             | your access patterns
             | 
             | PostgreSQL's heap approach excels when:
             | 
             | You frequently update non-key columns (fewer page
             | splits/reorganizations)
             | 
             | You have many secondary indexes (they point to heap TIDs
             | rather than carrying the primary key, so they're smaller)
             | 
             | Your access patterns vary widely and don't follow one
             | particular field
             | 
             | You need faster table scans when indexes aren't applicable
             | 
             | I personally find PostgreSQL's approach more flexible for
             | complex analytical workloads with unpredictable access
             | patterns, while InnoDB's clustered approach feels more
             | optimized for OLTP workloads with predictable key-based
             | access patterns. The "better" system depends entirely on
             | your specific workload, data characteristics, and access
             | patterns.
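             | 
             | As an aside, Postgres can approximate a clustered layout
             | once via the CLUSTER command, though unlike InnoDB the
             | ordering is not maintained as rows change (table/index
             | names made up):
             | 
             |     -- physically rewrite the heap in index order
             |     CLUSTER orders USING orders_pkey;
             |     -- new/updated rows land wherever there is free space,
             |     -- so re-run periodically if ordering matters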
        
               | farazbabar wrote:
               | Don't forget high-speed committed writes to append-only
               | tables (the opposite of scans); the Postgres approach is
               | better here as well.
        
               | saltcured wrote:
               | It's also deeply entwined with the MVCC concurrency
               | control and the ability to do DDL in transactions, right?
        
               | jiggawatts wrote:
               | SQL Server supports every combination of heap storage,
               | clustered storage, MVCC, and DDL in transactions.
        
               | Sesse__ wrote:
               | Indexes that point directly to the row's location on disk
               | are also significantly faster to access; it is a
               | persistent pain point for OLAP on InnoDB that all
               | secondary indexes are
               | indirect. You can work around it by adding additional
               | columns to the index to make your lookups covering, but
               | it's kludgy and imprecise and tends to bloat the index
               | even further. (The flip side is that if you have tons of
               | indexes, and update some unrelated column, InnoDB doesn't
               | need to update those indexes to point to the location of
               | the new row. But I'm generally very rarely annoyed by
               | that in comparison.)
        
           | tehlike wrote:
           | There's a lot more PG needs to do from a storage layout /
           | access PoV, and they have been working on it for a while.
           | OrioleDB has shown what might be possible, and they have been
           | upstreaming it.
           | 
           | Having the ability to use something LSM-like as a storage
           | engine would be great - and potentially allow better
           | compression than what we currently get with TOAST - which is
           | not a lot... PG doesn't even have out-of-the-box page
           | compression...
        
       | shayonj wrote:
       | Very nicely written post! I'd love to start running these in
       | production on NVMe and hope it's something major cloud providers
       | start to offer ASAP. The performance gains are _extremely_
       | attractive.
        
       | martinald wrote:
       | I sort of had to chuckle at the 20k IOPS AWS instance, given that
       | even a consumer $100-200 NVMe gives ~1million+ IOPS these days. I
       | suspect now that we have PCIe 5.0 NVMes this will go up further.
       | 
       | I always do wonder how much "arbitrary" cloud limits on things
       | like this cause so many issues. I'm sure that async IO is very
       | helpful anyway, but I bet on a 1million IOPS NVMe it is nowhere
       | near as important.
       | 
       | We're effectively optimising critical infrastructure tech for
       | ~2010 hardware because that's when big cloud got going, and there
       | have been so few price reductions on things since then vs the
       | underlying hardware costs.
       | 
       | Obviously a consumer NVMe is not "enterprise" but my point is we
       | are 3+ orders of magnitude off performance on cheap consumer
       | hardware vs very expensive 'enterprise' AWS/big cloud costs.
        
         | binary132 wrote:
         | Meanwhile people are running things on raspberry pi home
         | clusters thinking they're winning
        
           | brulard wrote:
           | Maybe they are. With an NVMe HAT, you get decent IO
           | performance.
        
             | adgjlsfhk1 wrote:
             | It's still moderately bad. The Raspberry Pi is limited to 2
             | Gen3 PCIe lanes, which is ~4-8x slower than the drive (and
             | you will likely be further limited by CPU speed).
        
         | lfittl wrote:
         | Yep, I find cloud storage performance to be quite frustrating,
         | but it's the reality for many production database deployments
         | I've seen.
         | 
         | It's worth noting that even on really fast local NVMe drives
         | the new asynchronous I/O work delivers performance benefits,
         | since it's so much more efficient at issuing I/Os and reducing
         | syscall overhead (for io_uring).
         | 
         | Andres Freund (one of the principal authors of the new
         | functionality) did a lot of benchmarking on local NVMe drives
         | during development. Here is one mailing list thread I could find
         | that shows a 2x and better benefit with the patch set at the
         | time: https://www.postgresql.org/message-
         | id/flat/uvrtrknj4kdytuboi...
        
         | anonymars wrote:
         | > given even a consumer $100-200 NVMe gives ~1million+ IOPS
         | these days
         | 
         | In the face of sustained writes? For how long?
        
           | merb wrote:
           | Sustained reads would not even give 1 million IOPS in that
           | case. Maybe when you only read the same file that fits into
           | the NVMe cache, which probably never happens in a production
           | database...
        
             | p_ing wrote:
             | The Samsung 9910 has 1 GB of LPDDR4X cache per 1 TB of
             | capacity. I won't pretend to understand the magic NVMe
             | drives possess, but if you got a 4TB or 8TB 9910, could you
             | not in theory pull all of the data you require into cache?
             | 
             | I would assume, and it might be a poor assumption, that
             | NVMe controllers don't pull in files, but rather blocks, so
             | even if you had a database that exceeded cache size, in
             | theory if the active blocks of that database did not exceed
             | cache size, it could be "indefinitely" cached for a read-
             | only pattern.
        
               | wtallis wrote:
               | The DRAM on a SSD like that isn't for caching user data,
               | it's for caching the drive's metadata about which logical
               | blocks (as seen by the OS) correspond to which physical
               | locations in the flash memory.
        
             | jauntywundrkind wrote:
             | I think you'd be surprised. Sustained write performance has
             | gotten pretty good. Decent but not fancy consumer drives
             | will often do 1GB/s sustained for bulkier writes. That's
             | much better than we used to expect: flash has gotten much
             | better with so many layers! This mid-range PCIe5 drive
             | sustains a nice 1.5GB/s:
             | https://www.techpowerup.com/review/team-group-ge-
             | pro-2-tb/6....
             | 
             | I don't think sustained reads are a problem? Benches like
             | CrystalDiskMark do a full-disk random read test; they're
             | designed to bust through cache afaik. 7.2GB/s of 4k reads
             | would translate to ~1.8M IOPS. Even if this is massively
             | optimistic, you need to slash a _lot_ of zeroes/orders of
             | magnitude to get down to 20k IOPS, which you will also pay
             | >$100/mo for.
        
         | __s wrote:
         | Even worse on Azure where we had to ask customers to scale up
         | vcpu to increase iops
         | 
         | https://azure.microsoft.com/en-us/pricing/details/managed-di...
         | 
         | Increasing vcpu also opened up more disk slots to try to
         | improve the situation with disk striping
        
         | maherbeg wrote:
         | instance store on aws can give up to 3.3mil iops
         | https://aws.amazon.com/blogs/aws/now-available-i3-instances-...
         | - the main problem is just using networked storage.
        
           | perching_aix wrote:
           | Instance store is also immediately wiped when the instance is
           | halted / restarted, which can theoretically happen at any
           | time, for example by a mystery instance failure, or a
           | patching tool that's helpfully restarting your boxes during
           | off-hours.
        
             | slashdev wrote:
             | My understanding is this is not true; it only happens when
             | the instance permanently fails and is moved.
        
               | the8472 wrote:
               | stop or hibernate kills it. https://docs.aws.amazon.com/
               | AWSEC2/latest/UserGuide/instance...
        
               | slashdev wrote:
               | Yeah, so restart does not.
               | 
               | Which means you can count on it about as much as a server
               | of your own, if you could not repair the server.
               | 
               | I know a database company that uses instance storage as
               | the primary storage. It's common.
        
           | coder543 wrote:
           | That quoted IOPS number is only with an 8-disk stripe
           | (requiring the full instance), even if you don't need 488GB
           | of RAM or a $3600/mo instance, I believe.
           | 
           | The per-disk performance is still nothing to write home
           | about, and 8 _actually fast_ disks would blow this instance
           | type out of the water.
        
           | the8472 wrote:
           | The NVMe on other instance types is quite throttled. E.g. on
           | a G5.4xlarge instance EBS is limited to 593MB/s and 20000IOPS
           | while instance-attached NVMe is limited to 512MB/s (read) at
           | 125000 IOPS, a fraction of the IO a workstation or gaming PC
           | with similar GPU and RAM would have. And stopping the
           | instance wipes it, which means you can't do instance warmup
           | with those, everything must be populated at boot.
        
         | codegeek wrote:
         | You probably already know this but I will say it anyway. Cloud
         | services like AWS are not succeeding in the enterprise because
         | of their hardware, outdated as it may be. They succeed because,
         | in the enterprise, CIOs and CTOs want something that is known,
         | has a brand, and that everyone else uses. It's like the old
         | adage of "No one got fired for using IBM". Now it is "No one
         | gets fired for hosting with AWS, no matter how ridiculous the
         | cost for the corresponding feature is".
        
           | the8472 wrote:
           | > No one gets fired for hosting with AWS
           | 
           | But consider the counterfactual: Non-realized customers
           | because AWS certified solutions architect(tm) software
           | couldn't deliver the price/perf they would have needed.
           | 
           | At $work this is a very real problem because a software
           | system was built on api gateway, lambdas, sqs and a whole
           | bunch of other moving pieces (serverless! scalable! easy
           | compliance!) that combined resulted in way too much latency
           | to meet a client's goal.
        
           | 9dev wrote:
           | > No one gets fired for hosting with AWS no matter how
           | ridiculous the cost and corresponding feature is
           | 
           | Actually, AWS is so expensive that hosting everything we ran
           | on Hetzner there instead would have simply depleted our
           | funding, and the company would not exist anymore.
        
         | gopalv wrote:
         | > had to chuckle at the 20k IOPS AWS instance, given even a
         | consumer $100-200 NVMe gives ~1million+ IOPS these days
         | 
         | The IOPS figure usually hides the fact that it is not a single
         | IOP that is really fast, but a collection of them.
         | 
         | More IOPS is generally best achieved by reducing the latency of
         | a single operation, but the average latency is what actually
         | contributes to the "fast query" experience, because a lot of
         | the time the next IO branches off the last one (like an index
         | or filter lookup).
         | 
         | As more and more disk-to-CPU connectivity goes over the
         | network, we can still deliver a lot of IOPS even when we have
         | very high latencies (by spreading the data across hundreds of
         | SSDs and routing it fast), because with network storage we pay
         | a huge latency cost for durability of the data, simply because
         | of location diversification.
         | 
         | Every foot is a nanosecond, approximately.
         | 
         | The tradeoff is worth it, because you don't need clusters to
         | deal with a bad CPU or two - just stop & start to fix
         | memory/CPU errors.
         | 
         | The AWS model pushes the latency problem to the customer, and
         | we see it in the IOPS measurements, but it is really latency x
         | queue depth we're seeing, not the hardware capacity.
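         | 
         | (Rough numbers, via Little's law: IOPS ~= queue depth / average
         | latency. 20k IOPS at ~1.6ms network-storage latency needs a
         | queue depth of ~32; a local NVMe at ~0.1ms latency hits the
         | same 20k with a queue depth of 2.)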
        
         | Hilift wrote:
         | Everything in the cloud is throttled. Network, IOPS, CPU. And
         | probably implemented incorrectly. AWS makes billions if the
         | customer infrastructure is great or terrible. I found that
         | anything smaller than an AWS EC2 m5.8xlarge had noticeably bad
         | performance on loaded servers (Windows). The list price for
         | that would be about $13k per year, but most organizations get
         | lower than list prices.
         | 
         | This also applies to services, not only compute. Anything
         | associated with Microsoft Office 365 Exchange, scripts may run
         | 10x slower against the cloud using the MSOnline cmdlets. It's
         | absolute insanity, I used to perform a dump of all mailbox
         | statistics that would take about one hour, it could take almost
         | 24 hours against Office 365. You have to be careful to not use
         | the same app or service account in multiple places, because the
         | throttle limits are per-account.
        
         | immibis wrote:
         | I noticed this with bandwidth. AWS price for bandwidth:
         | $90.00/TB after 0.1TB/month. Price everywhere else (low cost
         | VPSes): $1.50/TB after 1-5TB/month. Price some places
         | (dedicated servers): $0.00/TB up to ~100TB/month, $1.50/TB
         | after.
         | 
         | You pay 60 times the price for the privilege of being on AWS.
         | 
         | Bandwidth is just their most egregious price difference. The
         | servers are more expensive too. The storage is more expensive
         | (except for Glacier). The serverless platforms are mostly more
         | expensive than using a cheap server.
         | 
         | There are only two AWS products that I understand to have good
         | prices: S3 Glacier (and only if you never restore!), and
         | serverless apps (Lambda / API Gateway) if your traffic is low
         | enough to fit in the Always Free tier. For everything else, it
         | appears you get ripped off by using AWS.
        
         | yxhuvud wrote:
         | FWIW, using the same approach as in the article, i.e. io_uring,
         | is one of the few ways to actually get anywhere close to that
         | 1 million, so it is not as if they are competing concerns.
        
       | Tostino wrote:
       | Thank you for the effort that went into getting this committed. I
       | remember seeing the first discussions about async I/O (and using
       | io_uring) like 6 or 7 years ago. Amazing amount of work to get
       | the design right.
       | 
       | Looking forward to the other places that async I/O can be used in
       | future Postgres releases now that the groundwork is done.
        
       | p_ing wrote:
       | Is io_uring still plagued by the security issues its use enables?
       | Or have those largely been fixed? My understanding was that many
       | Linux admins (or even distros by default?) were disabling
       | io_uring.
        
         | hansvm wrote:
         | https://github.com/axboe/liburing/discussions/1047
        
           | p_ing wrote:
           | Thanks. It looks like it is still going through growing
           | pains.
           | 
           | https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=io_uring
           | 
           | https://www.theregister.com/2025/04/29/linux_io_uring_securi.
           | ..
           | 
           | But most of the 'off by default' recommendations are from
           | ~2023 and not a current concern.
        
         | znpy wrote:
         | Disabling io_uring because "guy on the internet said so" or
         | "$faang_company says so" is beyond dumb.
         | 
         | One should evaluate the risk according to their specific use
         | case.
         | 
         | It can be a good idea to disable it if you run untrusted
         | workloads (eg: other people's containers sharing the same
         | kernel), but if you have a kernel on a machine (virtual or
         | real) dedicated to your own workload you can pretty much keep
         | using io_uring. There are other technologies to enforce
         | security (eg: SELinux and similar).
        
           | wtallis wrote:
           | I think at this point, those "other technologies to enforce
           | security" are the main area of concern for io_uring users: if
           | those other security layers don't know about io_uring they
           | won't apply any restrictions to it.
        
       | song wrote:
       | Are there good performance comparisons between Postgres, MariaDB
       | and Percona? I'm really curious at this point about which cases
       | each of those databases shines in.
        
         | KronisLV wrote:
         | Probably depends on the particular workload, but there are at
         | least some attempts at benchmarking vaguely typical workloads:
         | https://datasystemreviews.com/postgresql-vs-mariadb-performa...
        
       | kev009 wrote:
       | A lot of work has gone into FreeBSD's aio(4) so it will be
       | interesting to see how that works, because it doesn't have the
       | drawbacks of Linux/glibc aio.
        
         | tiffanyh wrote:
         | Would you mind expanding more on this topic?
         | 
         | Is FreeBSD doing anything significantly different and/or
         | better?
        
       | WhyNotHugo wrote:
       | It's pretty disappointing that simply using O_NONBLOCK doesn't
       | work as expected on regular files. It would be such a simple and
       | portable mechanism to do async I/O using the same interfaces that
       | we already use for networking.
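       | 
       | A tiny illustration of what doesn't work (sketch only; file name
       | made up):
       | 
       |     #include <fcntl.h>
       |     #include <unistd.h>
       |     
       |     int main(void)
       |     {
       |         int fd = open("table.dat", O_RDONLY | O_NONBLOCK);
       |         char buf[8192];
       |         /* On a pipe or socket this could return -1/EAGAIN when
       |          * no data is ready; on a regular file the kernel ignores
       |          * O_NONBLOCK and read() blocks until the disk I/O is
       |          * done - hence worker threads or io_uring. */
       |         ssize_t n = read(fd, buf, sizeof(buf));
       |         (void) n;
       |         return 0;
       |     }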
        
       | dbbk wrote:
       | This looks promising! Wonder if it's coming to Neon
        
       | gitroom wrote:
       | Insane how long it took Postgres to get async I/O right - feels
       | like all the big changes spark a million little tradeoffs, right?
       | Do you think stuff like io_uring is finally gonna push Postgres
       | to catch up with the clouds?
        
       ___________________________________________________________________
       (page generated 2025-05-07 23:00 UTC)