[HN Gopher] Replacing EBS and Rethinking Postgres Storage from F...
       ___________________________________________________________________
        
       Replacing EBS and Rethinking Postgres Storage from First Principles
        
       Author : mfreed
       Score  : 92 points
       Date   : 2025-10-29 15:49 UTC (1 days ago)
        
 (HTM) web link (www.tigerdata.com)
 (TXT) w3m dump (www.tigerdata.com)
        
       | cpt100 wrote:
       | pretty cool
        
       | thr0w wrote:
       | Postgres for agents, of course! It makes too much sense.
        
         | jacobsenscott wrote:
         | The agent stuff is BS for the pointy hairs. This seems to
         | address real problems I've had with PG though.
        
           | akulkarni wrote:
           | Yeah, I know what you mean. I used to roll my eyes every time
           | someone said "agentic," too. But after using Claude Code
           | myself, and seeing how our best engineers build with it, I
           | changed my mind. Agents aren't hype, they're genuinely
           | useful, make us more productive, and honestly, fun to work
           | with. I've learned to approach this with curiosity rather
           | than skepticism.
        
         | akulkarni wrote:
         | Thanks! We agree :-)
         | 
         | We just launched a bunch around "Postgres for Agents" [0]:
         | 
         | forkable databases, an MCP server for Postgres (with semantic +
         | full-text search over the PG docs), a new BM25 text search
         | extension (pg_textsearch), pgvectorscale updates, and a free
         | tier.
         | 
         | [0] https://www.tigerdata.com/blog/postgres-for-agents
        
       | the8472 wrote:
       | Though AWS instance-attached NVMe(oF?) still has fewer IOPS per
       | TB than bare metal NVMe does. E.g.:
       |       i8g.2xlarge, 1875 GB, 300k IOPS read
       |       vs. WD_BLACK SN8100, 2 TB, 2300k IOPS read
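       | 
       | Back-of-the-envelope in Python, just dividing the numbers above
       | (read IOPS per TB):
       |       >>> 300_000 / 1.875     # i8g.2xlarge
       |       160000.0
       |       >>> 2_300_000 / 2.0     # WD_BLACK SN8100
       |       1150000.0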
        
         | everfrustrated wrote:
         | You can't do those rates 24x7 on a WD_BLACK tho.
        
       | 0xbadcafebee wrote:
       | There's a ton of jargon here. Summarized...
       | 
       | Why EBS didn't work:
       | - EBS costs for allocation
       | - EBS is slow at restores from snapshot (faster to spin up a
       |   database from a Postgres backup stored in S3 than from an EBS
       |   snapshot in S3)
       | - EBS only lets you attach 24 volumes per instance
       | - EBS only lets you resize once every 6-24 hours; you can't
       |   shrink or adjust continuously
       | - Detaching and reattaching EBS volumes can take 10s for healthy
       |   volumes to 20m for failed ones, so failover takes longer
       | 
       | Why all this matters:
       | - their AI agents are all ephemeral snapshots; they constantly
       |   destroy and rebuild EBS volumes
       | 
       | What didn't work:
       | - local NVMe/bare metal: need 2-3x nodes for durability, too
       |   expensive; snapshot restores are too slow
       | - custom page-server psql storage architecture: too
       |   complex/expensive to maintain
       | 
       | Their solution (toy sketch of the copy-on-write part below):
       | - block COWs
       | - volume changes (new/snapshot/delete) are a metadata change
       | - storage space is logical (effectively infinite), not bound to
       |   disk primitives
       | - multi-tenant by default
       | - versioned, replicated k/v transactions, horizontally scalable
       | - independent service layer abstracts blocks into volumes, is
       |   the security/tenant boundary, enforces limits
       | - user-space block device, pins I/O queues to CPUs, supports
       |   zero-copy and resizing; depends on Linux primitives for its
       |   performance limits
       | 
       | Performance stats (single volume):
       | - latency/IOPS benchmarks use 4 KB blocks; throughput benchmarks
       |   use 512 KB blocks
       | - read: 110,000 IOPS and 1.375 GB/s (bottlenecked by network
       |   bandwidth)
       | - write: 40,000-67,000 IOPS and 500-700 MB/s, synchronously
       |   replicated
       | - single-block read latency ~1 ms, write latency ~5 ms
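       | 
       | Rough toy sketch of the copy-on-write part in Python (my own
       | illustration of the idea, not their implementation):
       |       # A "volume" is a map from logical block number to an
       |       # immutable blob; a snapshot copies only the map
       |       # (metadata), never the data blocks.
       |       class Volume:
       |           def __init__(self, block_map=None):
       |               self.block_map = dict(block_map or {})
       | 
       |           def write(self, lbn, data):
       |               self.block_map[lbn] = data   # old blob untouched
       | 
       |           def read(self, lbn):
       |               return self.block_map.get(lbn, b"\0" * 4096)
       | 
       |           def snapshot(self):
       |               return Volume(self.block_map)  # metadata-only
       | 
       |       v = Volume()
       |       v.write(0, b"hello")
       |       snap = v.snapshot()          # instant fork
       |       v.write(0, b"world")
       |       assert snap.read(0) == b"hello"
       |       assert v.read(0) == b"world"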
        
         | hedora wrote:
         | Thanks for the summary.
         | 
         | Note that those numbers are terrible vs. a physical disk,
         | especially latency, which should be < 1ms read, << 1ms write.
         | 
         | (That assumes async replication of the write ahead log to a
         | secondary. Otherwise, write latency should be ~ 1 rtt, which is
         | still << 5ms.)
         | 
         | Stacking storage like this isn't great, but PG wasn't really
         | designed for performance or HA. (I don't have a better concrete
         | solution for ansi SQL that works today.)
        
           | graveland wrote:
           | (I'm on the team that made this)
           | 
           | The raw numbers are one thing, but the overall performance of
           | pg is another. If you check out
           | https://planetscale.com/blog/benchmarking-postgres-17-vs-18
           | for example, in the average QPS chart, you can see that there
           | isn't a very large difference in QPS between GP3 at 10k iops
           | and NVMe at 300k iops.
           | 
           | So currently I wouldn't recommend this new storage for the
           | highest end workloads, but it's also a beta project that's
           | still got a lot of room for growth! I'm very enthusiastic
           | about how far we can take this!
        
             | samlambert wrote:
             | it's a 70% difference at lower cost. i know math is hard
             | but c'mon try and be serious.
        
           | mfreed wrote:
           | A few datapoints that might help frame this:
           | 
           | - EBS typically operates in the millisecond range. AWS' own
           | documentation suggests "several milliseconds"; our own
           | experience with EBS is 1-2 ms. Reads/writes to local disk
           | alone are certainly faster, but it's more meaningful to
           | compare this against other forms of network-attached storage.
           | 
           | - If durability matters, async replication isn't really the
           | right baseline for local disk setups. Most production
           | deployments of Postgres/databases rely on synchronous
           | replication -- or "semi-sync," which still waits for at least
           | one or a subset of acknowledgments before committing -- which
           | in the cloud lands you in the single-digit millisecond range
           | for writes again.
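           | 
           | For concreteness, Postgres-level semi-sync looks roughly like
           | this (a generic sketch with made-up standby names, not a
           | description of our fleet); commits then wait for at least one
           | of the listed standbys to confirm the WAL flush:
           |       import psycopg2
           | 
           |       conn = psycopg2.connect("dbname=postgres")
           |       conn.autocommit = True  # ALTER SYSTEM can't run in a tx
           |       with conn.cursor() as cur:
           |           cur.execute(
           |               "ALTER SYSTEM SET synchronous_standby_names"
           |               " = 'ANY 1 (replica_a, replica_b)'")
           |           cur.execute(
           |               "ALTER SYSTEM SET synchronous_commit = 'on'")
           |           cur.execute("SELECT pg_reload_conf()")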
        
         | znpy wrote:
         | Reminds me of about ten years ago, when a large media customer
         | was running NetApp in the cloud on AWS to get most of what you
         | just wrote (because EBS features sucked, and still suck, badly
         | and are also crazy expensive).
         | 
         | I did not set that up myself, but the colleague who worked on
         | it told me that enabling TCP multipath for iSCSI yielded
         | significant performance gains.
        
         | bradyd wrote:
         | > EBS only lets you resize once every 6-24 hours
         | 
         | Is that even true? I've resized an EBS volume a few minutes
         | after another resize before.
        
           | electroly wrote:
           | AWS documents it as "After modifying a volume, you must wait
           | at least six hours and ensure that the volume is in the in-
           | use or available state before you can modify the same volume"
           | but community posts suggest you can get up to 8 resizes in
           | the six hour window.
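           | 
           | You end up coding around it, e.g. with boto3 (a sketch; the
           | volume id is made up):
           |       import boto3
           | 
           |       ec2 = boto3.client("ec2")
           |       vol = "vol-0123456789abcdef0"  # hypothetical
           |       ec2.modify_volume(VolumeId=vol, Size=200)  # grow (GiB)
           |       # Another modify_volume on this volume inside the
           |       # ~6 hour window gets rejected with a rate error,
           |       # even after the state below reaches "completed".
           |       mod = ec2.describe_volumes_modifications(
           |           VolumeIds=[vol])["VolumesModifications"][0]
           |       print(mod["ModificationState"])
           |       # 'modifying' -> 'optimizing' -> 'completed'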
        
           | jasonthorsness wrote:
           | The 6-hour counter is most certainly, painfully true. If you
           | work with an AWS rep please complain about this in every
           | session; maybe if we all do they will reduce the counter :P.
        
         | thesz wrote:
         | What does EBS mean?
         | 
         | It is used in the first line of the text, but no explanation is
         | given.
        
           | karanbhangui wrote:
           | https://aws.amazon.com/ebs/
        
         | lisperforlife wrote:
         | The 5 ms write latency and 1 ms read latency sound like they
         | are using S3 to store and retrieve data with some local cache.
         | My guess is an S3-based block store exposed as a network block
         | device. S3 supports compare-and-swap operations (Put-If-Match),
         | so you can do a copy-on-write scheme quite easily. Maybe
         | somebody from TigerData can give a little more insight into
         | this. I know slatedb supports S3 as a backend for its key-value
         | store. We could build a block device abstraction using that.
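         | 
         | Something like this, maybe (a sketch of the general idea, not
         | TigerData's design; assumes a recent boto3 with S3 conditional-
         | write support, and the bucket/key names are made up):
         |       import boto3, botocore.exceptions
         | 
         |       s3 = boto3.client("s3")
         |       BUCKET, KEY = "block-store", "vol1/manifest"
         | 
         |       def cas_update(update_fn):
         |           """Rewrite the object only if nobody raced us."""
         |           obj = s3.get_object(Bucket=BUCKET, Key=KEY)
         |           new_body = update_fn(obj["Body"].read())
         |           try:
         |               s3.put_object(Bucket=BUCKET, Key=KEY,
         |                             Body=new_body,
         |                             IfMatch=obj["ETag"])
         |               return True
         |           except botocore.exceptions.ClientError:
         |               # precondition failed (HTTP 412): lost the
         |               # race, caller re-reads and retries
         |               return False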
        
           | mfreed wrote:
           | None of this. It's in the blog post in a lot of detail =)
           | 
           | The 5ms write latency is because the backend distributed
           | block storage layer is doing synchronous replication to
           | multiple servers for high availability and durability before
           | ack'ing a write. (And this path has not yet been super-
           | performance-optimized for latency, to be honest.)
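           | 
           | Schematically (a generic sketch, not our actual code), the
           | ack is gated on every replica confirming the write, so the
           | latency tracks the slowest replica round trip:
           |       import concurrent.futures
           | 
           |       def replicated_write(block, replicas, send):
           |           # send(replica, block) stands in for the real
           |           # network call to one storage server
           |           with concurrent.futures.ThreadPoolExecutor(
           |                   len(replicas)) as pool:
           |               futs = [pool.submit(send, r, block)
           |                       for r in replicas]
           |               for f in concurrent.futures.as_completed(futs):
           |                   f.result()  # surface replica failures
           |           return "ack"        # only now ack the client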
        
         | _rs wrote:
         | > Detaching and reattaching EBS volumes can take 10s for
         | healthy volumes to 20m for failed ones
         | 
         | Is there a source for the 20m time limit for failed EBS
         | volumes? I experienced this at work for the first time recently
         | but couldn't find anything documenting the 20m SLA (and it did
         | take just about 20 full minutes).
        
           | mfreed wrote:
           | I'm not aware of any published source for this time limit,
           | nor ways to reduce it.
           | 
           | The docs do say, however, "If the volume has been impaired
           | for more than 20 minutes, you can contact the AWS Support
           | Center." [0] That suggests it's some expected cleanup/remount
           | interval.
           | 
           | It is something we regularly encounter when EC2 instances
           | fail, so we were sharing from personal experience.
           | 
           | [0] https://docs.aws.amazon.com/ebs/latest/userguide/work_vol
           | ume...
        
         | jread wrote:
         | I'm working on graduate research evaluating AWS control and
         | data plane performance.
         | 
         | EBS volume attachment is typically ~11s for GP2/GP3 and ~20-25s
         | for other types.
         | 
         | 1ms read / 5ms write latencies seem high for 4k blocks. IO1/IO2
         | is typically ~0.5ms RW, and GP2/GP3 ~0.6ms read and ~0.94ms
         | write.
         | 
         | References: https://cloudlooking.glass/matrix/#aws.ebs.us-
         | east-1--cp--at...
         | https://cloudlooking.glass/matrix/#aws.ebs.*--dp--rand-*&aws...
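         | 
         | For anyone who wants to reproduce this kind of measurement, a
         | sketch (single outstanding 4 KB direct I/O against a test file;
         | the path is made up):
         |       import subprocess
         | 
         |       def fio_latency(rw, path="/mnt/test/fio.dat"):
         |           cmd = ["fio", "--name=lat", f"--rw={rw}",
         |                  "--bs=4k", "--iodepth=1", "--direct=1",
         |                  "--ioengine=psync", f"--filename={path}",
         |                  "--size=1G", "--runtime=30", "--time_based",
         |                  "--output-format=json"]
         |           return subprocess.run(cmd, capture_output=True,
         |                                 text=True, check=True).stdout
         | 
         |       print(fio_latency("randread"))  # clat percentiles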
        
         | samat wrote:
         | Excellent tl;dr! Would pay to get them for every worthwhile
         | tech article.
        
           | akulkarni wrote:
           | <3
        
       | maherbeg wrote:
       | This has a similar flavor to xata.io's SimplyBlock-based storage
       | system:
       | * https://xata.io/blog/xata-postgres-with-data-branching-and-p...
       | * https://www.simplyblock.io/
       | 
       | It's a great way to mix copy-on-write with what is effectively
       | logical splitting of physical nodes. It's something I've wanted
       | to build at a previous role.
        
       | stefanha wrote:
       | @graveland Which Linux interface was used for the userspace block
       | driver (ublk, nbd, tcmu-runner, NVMe-over-TCP, etc)? Why did you
       | choose it?
       | 
       | Also, were existing network or distributed file systems not
       | suitable? This use case sounds like Ceph might fit, for example.
        
         | graveland wrote:
         | There's some secret sauce there I don't know if I'm allowed to
         | talk about yet, so I'll just address the existing tech that we
         | didn't use: most things either didn't have a good enough
         | license, cost too much, or would take a TON of ramp-up and
         | expertise we don't currently have to manage and maintain.
         | Generally speaking, our own stuff lets us fully control it.
         | 
         | Entirely programmable storage has so far allowed us to try a
         | few different things to make things efficient and give us the
         | features we want. We've been able to try different dedup
         | methods, copy-on-write styles, different compression methods
         | and types, different sharding strategies... all just as a
         | start. We can easily and quickly create new experimental
         | storage backends and see exactly how pg performs with each,
         | side-by-side with other backends.
         | 
         | We're a kubernetes shop, and we have our own CSI plugin, so we
         | can also transparently run a pg HA pair with one pg server
         | using EBS and the other running in our new storage layer, and
         | easily bounce between storage types with nothing but a
         | switchover event.
        
         | kjetijor wrote:
         | I was struck by how similar this seems to Ceph/RADOS/RBD. The
         | way they implemented snapshotted block storage on top sounds
         | more or less exactly like how RBD is implemented on top of
         | RADOS in Ceph.
        
       | unsolved73 wrote:
       | TimescaleDB was such a great project!
       | 
       | I'm really sad to see them waste the opportunity and instead
       | build an nth managed cloud on top of AWS, chasing buzzword after
       | buzzword.
       | 
       | Had they made deals with cloud providers to offer managed
       | TimescaleDB, so they could focus on their core value proposition,
       | they could have won the timeseries business. But ClickHouse made
       | them irrelevant, and Neon has already won the "Postgres for
       | agents" business thanks to a better architecture than this.
        
         | akulkarni wrote:
         | Thanks for the kind words about TimescaleDB :-)
         | 
         | We think we're still building great things, and our customers
         | seem to agree.
         | 
         | Usage is at an all-time high, revenue is at an all-time high,
         | and we're having more fun than ever.
         | 
         | Hopefully we'll win you back soon.
        
           | NewJazz wrote:
           | Does Tiger Cloud support multi-region clusters? We are using
           | aurora postgresql currently but it is straining (our budget
           | and itself).
        
             | mfreed wrote:
             | We currently support multi-AZ clusters and multi-region
             | disaster recovery (continuous PITR between regions).
             | 
             | We're continuing to evaluate demand for multi-region
             | clusters and would love to hear from you.
        
       | tayo42 wrote:
       | Are they not using aws anymore? I found that confusing. It says
       | they're not using ebs, not using attached nvme, but I didn't
       | think there were other options in aws?
        
         | wrs wrote:
         | There weren't, so they built one. (It is NVMe at the bottom,
         | though.)
        
         | mfreed wrote:
         | Tiger Cloud certainly continues to run on AWS. We have built it
         | to rely on fairly low-level AWS primitives like EC2, EBS, and
         | S3 (as opposed to some of the higher-level service offerings).
         | 
         | Our existing Postgres fleet, which uses EBS for storage, still
         | serves thousands of customers today; nothing has changed there.
         | 
         | What's new is Fluid Storage, our disaggregated storage layer
         | that currently powers the new free tier (while in beta). In
         | this architecture, the compute nodes running Postgres still
         | access block storage over the network. But instead of that
         | being AWS EBS, it's our own distributed storage system.
         | 
         | From a hardware standpoint, the servers that make up the Fluid
         | Storage layer are standard EC2 instances with fast local disks.
        
       | runako wrote:
       | Thanks for the writeup.
       | 
       | I'm curious whether you evaluated solutions like zfs/Gluster?
       | Also curious whether you looked at Oracle Cloud given their
       | faster block storage?
        
       | 7e wrote:
       | Yes, EBS sucks, but plenty of cloud providers implemented the
       | same thing Tiger Data has built, a decade ago. Like Google.
        
       | electroly wrote:
       | EC2 instances have dedicated throughput to EBS via Nitro that you
       | lose out on when you run your own EBS equivalent over the regular
       | network. You only get 5Gbps maximum between two EC2 instances in
       | the same AZ that aren't in the same placement group[1], and
       | you're limited by the instance type's general networking
       | throughput. Dedicated throughput to EBS from a typical EC2
       | instance is multiple times this figure. It's an interesting
       | tradeoff--I assume they must be IOPS-heavy and the throughput is
       | not a concern.
       | 
       | [1]
       | https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...
        
         | huntaub wrote:
         | I _believe_ this is also changing with instances that now allow
         | you to adjust the ratio of throughput on the NIC that's
         | dedicated to EBS vs. general network traffic (with the
         | intention, I'm sure, that people would want _more_ EBS
         | throughput than the default).
        
       | DenisM wrote:
       | IIUC they built an EBS replacement on top of NVMe attached to a
       | dynamically sized fleet of EC2 instances.
       | 
       | The advantage is that it's allocating pages on demand from an
       | elastic pool of storage, so it appears as an infinite block
       | device. Another advantage is cheap COW clones.
       | 
       | The downside is (probably) specialized tuning for Postgres access
       | patterns. I shudder to think what went into page metadata
       | management. Perhaps it's similar to, e.g., SQL Server's buffer
       | pool manager.
       | 
       | It's not clear to me why it's better than Aurora's design - on
       | the surface, page servers are higher-level concepts and should
       | allow more holistic optimizations (and less page write traffic,
       | due to shipping the log in lieu of whole pages). It's also not
       | clear what stopped Amazon from doing the same (perhaps EBS
       | serving more diverse access patterns?).
       | 
       | Very cool!
        
       | kristianp wrote:
       | So they've built a competitor to EBS that runs on EC2 and NVMe.
       | Seems like their prices will need to be much higher than those of
       | AWS to get decent profit margins. I really hate being in the
       | high-cost ecosystem of the large cloud providers, so I wouldn't
       | make use of this.
        
       ___________________________________________________________________
       (page generated 2025-10-30 23:01 UTC)