[HN Gopher] Replacing EBS and Rethinking Postgres Storage from F...
___________________________________________________________________
Replacing EBS and Rethinking Postgres Storage from First Principles
Author : mfreed
Score : 92 points
Date : 2025-10-29 15:49 UTC (1 day ago)
(HTM) web link (www.tigerdata.com)
(TXT) w3m dump (www.tigerdata.com)
| cpt100 wrote:
| pretty cool
| thr0w wrote:
| Postgres for agents, of course! It makes too much sense.
| jacobsenscott wrote:
| The agent stuff is BS for the pointy hairs. This seems to
| address real problems I've had with PG though.
| akulkarni wrote:
| Yeah, I know what you mean. I used to roll my eyes every time
| someone said "agentic," too. But after using Claude Code
| myself, and seeing how our best engineers build with it, I
| changed my mind. Agents aren't hype, they're genuinely
| useful, make us more productive, and honestly, fun to work
| with. I've learned to approach this with curiosity rather
| than skepticism.
| akulkarni wrote:
| Thanks! We agree :-)
|
| We just launched a bunch around "Postgres for Agents" [0]:
|
| forkable databases, an MCP server for Postgres (with semantic +
| full-text search over the PG docs), a new BM25 text search
| extension (pg_textsearch), pgvectorscale updates, and a free
| tier.
|
| [0] https://www.tigerdata.com/blog/postgres-for-agents
| the8472 wrote:
| Though AWS instance-attached NVMe(oF?) still has less IOPS per TB
| than bare metal NVMe does. E.g. i8g.2xlarge: 1,875 GB, 300k read
| IOPS vs. WD_BLACK SN8100: 2 TB, 2,300k read IOPS.
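| (Per TB that's roughly 300k / 1.875 = ~160k IOPS/TB for the
| i8g.2xlarge vs. 2,300k / 2 = ~1,150k IOPS/TB for the SN8100,
| about a 7x gap.)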
| everfrustrated wrote:
| You can't do those rates 24x7 on a WD_BLACK tho.
| 0xbadcafebee wrote:
| There's a ton of jargon here. Summarized...
|
| Why EBS didn't work:
| - EBS bills for allocated capacity, not actual usage
| - EBS is slow at restores from snapshot (it's faster to spin up a
| database from a Postgres backup stored in S3 than from an EBS
| snapshot in S3)
| - EBS only lets you attach 24 volumes per instance
| - EBS only lets you resize once every 6-24 hours; you can't shrink
| or adjust continuously
| - Detaching and reattaching EBS volumes can take 10s for healthy
| volumes to 20m for failed ones, so failover takes longer
|
| Why all this matters:
| - their AI agents are all ephemeral snapshots; they constantly
| destroy and rebuild EBS volumes
|
| What didn't work:
| - local NVMe/bare metal: you need 2-3x the nodes for durability,
| which is too expensive, and snapshot restores are too slow
| - custom page-server psql storage architecture: too
| complex/expensive to maintain
|
| Their solution:
| - block-level copy-on-write: volume changes (new/snapshot/delete)
| are just metadata changes
| - storage space is logical (effectively infinite), not bound to
| disk primitives
| - multi-tenant by default
| - versioned, replicated k/v transactions; horizontally scalable
| - an independent service layer abstracts blocks into volumes, is
| the security/tenant boundary, and enforces limits
| - a user-space block device pins I/O queues to CPUs and supports
| zero-copy and resizing; Linux primitives set the performance
| limits
|
| Performance stats (single volume; latency/IOPS benchmarks use 4 KB
| blocks, throughput benchmarks 512 KB blocks):
| - read: 110,000 IOPS and 1.375 GB/s (bottlenecked by network
| bandwidth)
| - write: 40,000-67,000 IOPS and 500-700 MB/s, synchronously
| replicated
| - single-block read latency ~1 ms, write latency ~5 ms
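| If you want to sanity-check the single-block numbers on a volume
| of your own, here's a minimal sketch (Python; the device path and
| sampling parameters are hypothetical, and O_DIRECT needs root --
| fio is the standard tool for this, the sketch just shows the
| mechanics):
|
|   import os, mmap, random, time
|
|   BLK = 4096
|   DEV = "/dev/vdb"          # hypothetical device under test
|   SPAN = 10 * 1024**3       # sample offsets in the first 10 GiB
|
|   buf = mmap.mmap(-1, BLK)  # page-aligned, as O_DIRECT requires
|   fd = os.open(DEV, os.O_RDONLY | os.O_DIRECT)
|   lat = []
|   for _ in range(1000):
|       off = random.randrange(SPAN // BLK) * BLK  # block-aligned
|       t0 = time.perf_counter()
|       os.preadv(fd, [buf], off)  # one 4 KB read, no page cache
|       lat.append(time.perf_counter() - t0)
|   os.close(fd)
|   lat.sort()
|   print(f"p50={lat[500]*1e3:.2f} ms  p99={lat[990]*1e3:.2f} ms")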
| hedora wrote:
| Thanks for the summary.
|
| Note that those numbers are terrible vs. a physical disk,
| especially latency, which should be < 1ms read, << 1ms write.
|
| (That assumes async replication of the write-ahead log to a
| secondary. Otherwise, write latency should be ~1 RTT, which is
| still << 5ms.)
|
| Stacking storage like this isn't great, but PG wasn't really
| designed for performance or HA. (I don't have a better concrete
| solution for ANSI SQL that works today.)
| graveland wrote:
| (I'm on the team that made this)
|
| The raw numbers are one thing, but the overall performance of
| pg is another. If you check out
| https://planetscale.com/blog/benchmarking-postgres-17-vs-18
| for example, in the average QPS chart, you can see that there
| isn't a very large difference in QPS between GP3 at 10k iops
| and NVMe at 300k iops.
|
| So currently I wouldn't recommend this new storage for the
| highest end workloads, but it's also a beta project that's
| still got a lot of room for growth! I'm very enthusiastic
| about how far we can take this!
| samlambert wrote:
| it's a 70% difference at lower cost. i know math is hard
| but c'mon try and be serious.
| mfreed wrote:
| A few datapoints that might help frame this:
|
| - EBS typically operates in the millisecond range. AWS' own
| documentation suggests "several milliseconds"; our own
| experience with EBS is 1-2 ms. Reads/writes to local disk
| alone are certainly faster, but it's more meaningful to
| compare this against other forms of network-attached storage.
|
| - If durability matters, async replication isn't really the
| right baseline for local disk setups. Most production
| deployments of Postgres/databases rely on synchronous
| replication -- or "semi-sync," which still waits for at least
| one or a subset of acknowledgments before committing -- which
| in the cloud lands you in the single-digit millisecond range
| for writes again.
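| (In Postgres terms, that's synchronous_commit = on together with
| something like synchronous_standby_names = 'ANY 1 (standby_a,
| standby_b)'; the standby names here are hypothetical.)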
| znpy wrote:
| Reminds me of about ten years ago, when a large media customer was
| running NetApp on AWS to get most of what you just wrote (because
| EBS's feature set was, and still is, very bad and also crazy
| expensive).
|
| I did not set that up myself, but the colleague who worked on it
| told me that enabling TCP multipath for iSCSI yielded significant
| performance gains.
| bradyd wrote:
| > EBS only lets you resize once every 6-24 hours
|
| Is that even true? I've resized an EBS volume a few minutes after
| another resize before.
| electroly wrote:
| AWS documents it as "After modifying a volume, you must wait at
| least six hours and ensure that the volume is in the in-use or
| available state before you can modify the same volume", but
| community posts suggest you can get up to 8 resizes in the
| six-hour window.
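| For what it's worth, a boto3 sketch of resize-then-poll (the
| volume id is hypothetical); a second modify_volume inside the
| six-hour window fails with a client error:
|
|   import time
|   import boto3
|
|   ec2 = boto3.client("ec2")
|   vol = "vol-0123456789abcdef0"              # hypothetical id
|   ec2.modify_volume(VolumeId=vol, Size=200)  # grow to 200 GiB
|
|   while True:
|       mods = ec2.describe_volumes_modifications(VolumeIds=[vol])
|       state = mods["VolumesModifications"][0]["ModificationState"]
|       if state in ("optimizing", "completed"):
|           break  # volume is usable once it reaches "optimizing"
|       time.sleep(15)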
| jasonthorsness wrote:
| The 6-hour counter is most certainly, painfully true. If you
| work with an AWS rep please complain about this in every
| session; maybe if we all do they will reduce the counter :P.
| thesz wrote:
| What does EBS mean?
|
| It is used in the first line of the text but no explanation is
| given.
| karanbhangui wrote:
| https://aws.amazon.com/ebs/
| lisperforlife wrote:
| The 5 ms write latency and 1 ms read latency sound like they are
| using S3 to store and retrieve data with some local cache. My
| guess is S3-based block storage exposed as a network block device.
| S3 supports compare-and-swap operations (Put-If-Match), so you can
| do copy-on-write quite easily. Maybe somebody from TigerData can
| give a little more insight into this. I know slatedb supports S3
| as a backend for its key-value store. We could build a block
| device abstraction using that.
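| For the record, a boto3 sketch of that CAS pattern (the bucket and
| key are hypothetical, and it assumes a recent boto3, since S3
| conditional writes are fairly new):
|
|   import boto3
|   from botocore.exceptions import ClientError
|
|   s3 = boto3.client("s3")
|   head = s3.head_object(Bucket="blocks", Key="vol1/block-42")
|   etag = head["ETag"]
|   new_block = b"\x00" * 4096  # replacement block contents
|   try:
|       # succeeds only if nobody replaced the object since we read
|       s3.put_object(Bucket="blocks", Key="vol1/block-42",
|                     Body=new_block, IfMatch=etag)
|   except ClientError as e:
|       if e.response["Error"]["Code"] == "PreconditionFailed":
|           pass  # lost the race: re-read and retry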
| mfreed wrote:
| None of this. It's in the blog post in a lot of detail =)
|
| The 5ms write latency is because the backend distributed
| block storage layer is doing synchronous replication to
| multiple servers for high availability and durability before
| ack'ing a write. (And this path has not yet been super-
| performance-optimized for latency, to be honest.)
| _rs wrote:
| > Detaching and reattaching EBS volumes can take 10s for
| healthy volumes to 20m for failed ones
|
| Is there a source for the 20m time limit for failed EBS
| volumes? I experienced this at work for the first time recently
| but couldn't find anything documenting the 20m SLA (and it did
| take just about 20 full minutes).
| mfreed wrote:
| I'm not aware of any published source for this time limit,
| nor ways to reduce it.
|
| The docs do say, however, "If the volume has been impaired for
| more than 20 minutes, you can contact the AWS Support Center." [0]
| This suggests it's some expected cleanup/remount interval.
|
| That is, it is something that we regularly encounter when EC2
| instances fail, so we were sharing from personal experience.
|
| [0] https://docs.aws.amazon.com/ebs/latest/userguide/work_volume...
| jread wrote:
| I'm working on graduate research evaluating AWS control and
| data plane performance.
|
| EBS volume attachment is typically ~11s for GP2/GP3 and ~20-25s
| for other types.
|
| 1ms read / 5ms write latencies seem high for 4k blocks. IO1/IO2
| is typically ~0.5ms RW, and GP2/GP3 ~0.6ms read and ~0.94ms
| write.
|
| References: https://cloudlooking.glass/matrix/#aws.ebs.us-
| east-1--cp--at...
| https://cloudlooking.glass/matrix/#aws.ebs.*--dp--rand-*&aws...
| samat wrote:
| Excellent tl;dr! Would pay to get them for every worthwhile
| tech article.
| akulkarni wrote:
| <3
| maherbeg wrote:
| This has a similar flavor to xata.io's SimplyBlock-based storage
| system:
| * https://xata.io/blog/xata-postgres-with-data-branching-and-p...
| * https://www.simplyblock.io/
|
| It's a great way to mix copy on write and effectively logical
| splitting of physical nodes. It's something I've wanted to build
| at a previous role.
| stefanha wrote:
| @graveland Which Linux interface was used for the userspace block
| driver (ublk, nbd, tcmu-runner, NVMe-over-TCP, etc)? Why did you
| choose it?
|
| Also, were existing network or distributed file systems not
| suitable? This use case sounds like Ceph might fit, for example.
| graveland wrote:
| There's some secret sauce there that I don't know if I'm allowed
| to talk about yet, so I'll just address the existing tech we
| didn't use: most things either didn't have a good enough license,
| cost too much, or would take a TON of ramp-up and expertise we
| don't currently have to manage and maintain. Generally speaking,
| our own stuff lets us fully control it.
|
| Entirely programmable storage has so far let us try a few
| different things to make the system efficient and give us the
| features we want. We've been able to try different dedup methods,
| copy-on-write styles, different compression methods and types,
| different sharding strategies... all just as a start. We can
| easily and quickly create a new experimental storage backend and
| see exactly how pg performs with it side-by-side with other
| backends.
|
| We're a kubernetes shop, and we have our own CSI plugin, so we
| can also transparently run a pg HA pair with one pg server
| using EBS and the other running in our new storage layer, and
| easily bounce between storage types with nothing but a
| switchover event.
| kjetijor wrote:
| I was struck by how similar this seems to Ceph/RADOS/RBD. The way
| they implemented snapshotted block storage on top sounds more or
| less exactly like how RBD is implemented on top of RADOS in Ceph.
| unsolved73 wrote:
| TimescaleDB was such a great project!
|
| I'm really sad to see them waste the opportunity and instead
| build an nth managed cloud on top of AWS, chasing buzzword after
| buzzword.
|
| Had they made deals with cloud providers to offer managed
| TimescaleDB so they can focus on their core value proposition
| they could have won the timeseries business, but ClickHouse made
| them irrelevant and Neon already has won the "Postgres for
| agents" business thanks to a better architecture than this.
| akulkarni wrote:
| Thanks for the kind words about TimescaleDB :-)
|
| We think we're still building great things, and our customers
| seem to agree.
|
| Usage is at an all-time high, revenue is at an all-time high,
| and we're having more fun than ever.
|
| Hopefully we'll win you back soon.
| NewJazz wrote:
| Does Tiger Cloud support multi-region clusters? We are using
| aurora postgresql currently but it is straining (our budget
| and itself).
| mfreed wrote:
| We currently support multi-AZ clusters and multi-region disaster
| recovery (continuous PITR between regions).
|
| We're continuing to evaluate demand for multi-region clusters, and
| we'd love to hear from you.
| tayo42 wrote:
| Are they not using aws anymore? I found that confusing. It says
| they're not using ebs, not using attached nvme, but I didn't
| think there were other options in aws?
| wrs wrote:
| There weren't, so they built one. (It is NVMe at the bottom,
| though.)
| mfreed wrote:
| Tiger Cloud certainly continues to run on AWS. We have built it
| to rely on fairly low-level AWS primitives like EC2, EBS, and
| S3 (as opposed to some of the higher-level service offerings).
|
| Our existing Postgres fleet, which uses EBS for storage, still
| serves thousands of customers today; nothing has changed there.
|
| What's new is Fluid Storage, our disaggregated storage layer
| that currently powers the new free tier (while in beta). In
| this architecture, the compute nodes running Postgres still
| access block storage over the network. But instead of that
| being AWS EBS, it's our own distributed storage system.
|
| From a hardware standpoint, the servers that make up the Fluid
| Storage layer are standard EC2 instances with fast local disks.
| runako wrote:
| Thanks for the writeup.
|
| I'm curious whether you evaluated solutions like zfs/Gluster?
| Also curious whether you looked at Oracle Cloud given their
| faster block storage?
| 7e wrote:
| Yes, EBS sucks, but plenty of cloud providers implemented the same
| thing Tiger Data has a decade ago. Google, for example.
| electroly wrote:
| EC2 instances have dedicated throughput to EBS via Nitro that you
| lose out on when you run your own EBS equivalent over the regular
| network. You only get 5Gbps maximum between two EC2 instances in
| the same AZ that aren't in the same placement group[1], and
| you're limited by the instance type's general networking
| throughput. Dedicated throughput to EBS from a typical EC2
| instance is multiple times this figure. It's an interesting
| tradeoff--I assume they must be IOPS-heavy and the throughput is
| not a concern.
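| (For scale: 5 Gbps is ~0.625 GB/s, which lines up with the
| ~0.5-0.7 GB/s write figures upthread; the 1.375 GB/s read figure
| is ~11 Gbps, so reads presumably fan out across multiple storage
| servers or placement groups.)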
|
| [1]
| https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...
| huntaub wrote:
| I _believe_ this is also changing with instances that now allow
| you to adjust the ratio of throughput on the NIC that's dedicated
| to EBS vs. general network traffic (with the intention, I'm sure,
| that people would want _more_ EBS throughput than the default).
| DenisM wrote:
| IIUC they built an EBS replacement on top of NVMe attached to a
| dynamically sized fleet of EC2 instances.
|
| The advantage is that it allocates pages on demand from an elastic
| pool of storage, so it appears as an infinite block device.
| Another advantage is cheap COW clones.
|
| The downside is (probably) specialized tuning for Postgres access
| patterns. I shudder to think what went into page metadata
| management. Perhaps it's similar to, e.g., the SQL Server buffer
| pool manager.
|
| It's not clear to me why this is better than the Aurora design: on
| the surface, page servers are higher-level concepts and should
| allow more holistic optimizations (and less page write traffic,
| due to shipping the log in lieu of whole pages). It's also not
| clear what stopped Amazon from doing the same (perhaps EBS serving
| more diverse access patterns?).
|
| Very cool!
| kristianp wrote:
| So they've built a competitor to EBS that runs on EC2 and nvme.
| Seems like their prices will need to be much higher than those of
| AWS to get decent profit margins. I really hate being in the
| high-cost ecosystem of the large cloud providers, so I wouldn't
| make use of this.
___________________________________________________________________
(page generated 2025-10-30 23:01 UTC)