[HN Gopher] Transactional Object Storage?
       ___________________________________________________________________
        
       Transactional Object Storage?
        
       Author : mbrt
       Score  : 62 points
       Date   : 2024-11-17 13:20 UTC (8 days ago)
        
 (HTM) web link (blog.mbrt.dev)
 (TXT) w3m dump (blog.mbrt.dev)
        
       | svrakitin wrote:
       | Pretty cool! Do you have any ideas already about how to make it
       | work with S3, considering it doesn't support If- headers?
        
         | boulos wrote:
         | S3 recently added basic matching support
         | (https://aws.amazon.com/about-aws/whats-
         | new/2024/08/amazon-s3..., https://docs.aws.amazon.com/AmazonS3/
         | latest/userguide/condit...).
         | 
         | They don't have the full suite of GCS's capabilities
         | (https://cloud.google.com/storage/docs/request-
         | preconditions#...) but it's something.
        
         | mbrt wrote:
         | I think it's now much easier to achieve than a year ago. The
         | critical one is conditional writes on new objects, because
         | otherwise you can't safely create transaction logs in the
         | presence of timeouts. This is not enough though.
         | 
         | My approach on S3 would be to make sure the ETag of an object
         | changes whenever other transactions looking at it must be
         | blocked. This makes it easier to use conditional reads (https:/
         | /docs.aws.amazon.com/AmazonS3/latest/userguide/condit...) on
         | COPY or GET operations.
         | 
         | For writes, I would use PUT on a temporary staging area and then
         | conditional COPY + DELETE afterward. This is certainly slower
         | than GCS, but I think it should work.
         | 
         | Locking without modifying the object is the part that needs
         | some optimization though.
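The two preconditions discussed in this comment (create only if the key is absent, and update only if the ETag still matches) can be sketched with an in-memory model. This is a minimal sketch under stated assumptions, not the S3 or GCS API: `FakeBucket`, `put_if_absent`, `put_if_match`, `create_tx_log`, and the `_tx/` key prefix are hypothetical names, and the ETag is modelled as a content hash.

```python
import hashlib


class PreconditionFailed(Exception):
    """Raised when a conditional request does not match the stored object."""


class FakeBucket:
    """In-memory stand-in for an object store with ETag preconditions (hypothetical)."""

    def __init__(self):
        self._objects = {}  # key -> (data, etag)

    @staticmethod
    def _etag(data: bytes) -> str:
        # Assumption of this model: the ETag is a hash of the content.
        return hashlib.sha256(data).hexdigest()

    def get(self, key):
        return self._objects[key]

    def put_if_absent(self, key, data):
        # Models "create only if no object exists under this key".
        if key in self._objects:
            raise PreconditionFailed(key)
        etag = self._etag(data)
        self._objects[key] = (data, etag)
        return etag

    def put_if_match(self, key, data, expected_etag):
        # Models "overwrite only if the stored ETag is the one I read".
        current = self._objects.get(key)
        if current is None or current[1] != expected_etag:
            raise PreconditionFailed(key)
        etag = self._etag(data)
        self._objects[key] = (data, etag)
        return etag


def create_tx_log(bucket, tx_id, owner):
    """Create a transaction log once; safe to retry after a timeout."""
    key = f"_tx/{tx_id}"
    payload = owner.encode()
    try:
        bucket.put_if_absent(key, payload)
        return True
    except PreconditionFailed:
        # The log already exists: either our earlier attempt succeeded
        # before the timeout, or another client owns this transaction.
        existing, _ = bucket.get(key)
        return existing == payload
```

Because the ETag in this model depends only on the content, an overwrite with identical bytes leaves it unchanged, which is one reason the object has to actually be modified for other readers to notice.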
        
           | mbrt wrote:
           | And I see more possibilities now that
           | https://aws.amazon.com/about-aws/whats-
           | new/2024/11/amazon-s3... is available. It will get easier and
           | easier to build serverless data lakes, streaming, queues.
        
         | choppaface wrote:
         | Not a full solution, but since the OP seeks to be a key-value
         | store (versus full RDBMS? despite the comparisons with Spanner
         | and Postgres?), it's important to weigh how Rockset (also mainly
         | a KV store) dealt with S3-backed caching at scale:
         | 
         | * https://rockset.com/blog/separate-compute-storage-rocksdb/
         | * https://github.com/rockset/rocksdb-cloud
         | 
         | Keep in mind Rockset is definitely a bit biased towards vector
         | search use cases.
        
           | mbrt wrote:
           | Nice, thanks for the reference!
           | 
           | BTW, the comparison was only to give an idea about isolation
           | levels, it wasn't meant to be a feature-to-feature
           | comparison.
           | 
           | Perhaps I didn't make it prominent enough, but at some point
           | I say that many SQL databases have key-value stores at their
           | core, and implement a SQL layer on top (e.g. https://www.cock
           | roachlabs.com/docs/v22.1/architecture/overvi...).
           | 
           | Basically SQL can be a feature added later to a solid KV
           | store as a base.
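The idea of a SQL layer on top of a KV store can be illustrated by mapping each table row to a set of keys. This is a minimal sketch, assuming a made-up `/table/primary_key/column` key layout; it is not the encoding used by CockroachDB or any other database, and `encode_row` and `select_row` are hypothetical helpers.

```python
def encode_row(table, primary_key, row):
    """Flatten one row into KV pairs, one key per column."""
    return {f"/{table}/{primary_key}/{col}": val for col, val in row.items()}


def select_row(kv, table, primary_key):
    """Rebuild a row by scanning every key under its prefix."""
    prefix = f"/{table}/{primary_key}/"
    return {
        key[len(prefix):]: value
        for key, value in sorted(kv.items())
        if key.startswith(prefix)
    }


# A plain dict stands in for the KV store here.
kv = {}
kv.update(encode_row("users", 1, {"name": "ada", "email": "ada@example.com"}))
kv.update(encode_row("users", 10, {"name": "bob", "email": "bob@example.com"}))
row = select_row(kv, "users", 1)
```

A `SELECT * FROM users WHERE id = 1` then becomes a prefix scan, and the trailing slash in the prefix keeps row 1 from matching row 10.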
        
       | Onavo wrote:
       | Congrats on reinventing the data lake? This is actually how most
       | of the newer generations of "cloud native" databases work, where
       | they separate compute and storage. The key is that they have a
       | more sophisticated caching layer so that the latency cost of a
       | query can be amortized across requests.
        
         | mbrt wrote:
         | It's my understanding that the newer generation of data lakes
         | still makes use of a tiny, strongly consistent metadata database
         | to keep track of what is where. This is orders of magnitude
         | smaller than what you'd have by putting everything in the same
         | database, but it's still there. This is also the case in newer
         | data streaming platforms (e.g.
         | https://www.warpstream.com/blog/kafka-is-dead-long-live-
         | kafk...).
         | 
         | I'm curious to hear if you have examples of any database using
         | _only_ object storage as a backend, because back when I
         | started, I couldn't find any.
        
           | Onavo wrote:
           | Love your article by the way. Not an expert but off the top
           | of my head:
           | 
           | https://docs.datomic.com/operation/architecture.html
           | 
           | (However they cheat with dynamo lol)
           | 
           | There's also some listed here
           | 
           | https://davidgomes.com/separation-of-storage-and-compute-
           | and...
        
             | mbrt wrote:
             | OK, thanks for the reference. Yeah, so indeed separating
             | storage and compute is nothing new. Definitely not claiming
             | I invented that :)
             | 
             | And as you mention, Datomic uses DynamoDB as well (so, not
             | a pure S3 solution). What I'm proposing is to only use
             | object storage for everything and pay the price in latency,
             | without giving up on throughput, cost and consistency. The
             | differentiator is that this comes with strict
             | serializability guarantees, so this is not an eventually
             | consistent system
             | (https://jepsen.io/consistency/models/strong-serializable).
             | 
             | No matter how sophisticated the caching is, if you want to
             | retain strict serializability, writes must be confirmed by
             | S3 and reads must validate in S3 before returning, which
             | puts a lower bound on latency.
             | 
             | I focused a lot on throughput, which is the one we can
             | really optimize.
             | 
             | Hopefully that's clear from the blog, though.
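The latency lower bound described above can be sketched with a cache that revalidates every read against the store. This is a minimal in-memory sketch under stated assumptions: `VersionedStore` and `ValidatingCache` are hypothetical names, versions are simple counters, and validation is modelled as a separate `head` call, whereas a real object store could fold it into a single conditional GET.

```python
class VersionedStore:
    """In-memory stand-in for object storage that counts round trips."""

    def __init__(self):
        self._data = {}  # key -> (version, value)
        self.round_trips = 0

    def write(self, key, value):
        self.round_trips += 1
        version = self._data.get(key, (0, None))[0] + 1
        self._data[key] = (version, value)
        return version

    def head(self, key):
        # Cheap metadata check: returns only the current version.
        self.round_trips += 1
        return self._data[key][0]

    def get(self, key):
        self.round_trips += 1
        return self._data[key]


class ValidatingCache:
    """Serves cached values only after confirming the version with the store."""

    def __init__(self, store):
        self._store = store
        self._cache = {}  # key -> (version, value)

    def read(self, key):
        cached = self._cache.get(key)
        if cached is not None and self._store.head(key) == cached[0]:
            return cached[1]  # still current, but it cost one round trip
        entry = self._store.get(key)
        self._cache[key] = entry
        return entry[1]
```

Even a cache hit costs one round trip in this model, which is the point of the comment: caching can save bandwidth, but not the trip needed to validate.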
        
               | Onavo wrote:
               | Have you seen
               | https://news.ycombinator.com/item?id=42174204
        
               | mbrt wrote:
               | I just saw it! I asked a question
               | (https://news.ycombinator.com/item?id=42180611) and it
               | seems that durability and consistency are implemented at
               | the caching layer.
               | 
               | Basically an in-memory database which uses S3 as cold
               | storage. Definitely an interesting approach, but no
               | transactions AFAICT.
        
           | eatonphil wrote:
           | > I'm curious to hear if you have examples of any database
           | using only object storage as a backend, because back when I
           | started, I couldn't find any.
           | 
           | Take a look at Delta Lake
           | 
           | https://notes.eatonphil.com/2024-09-29-build-a-serverless-
           | ac...
        
       | victorbjorklund wrote:
       | Pretty cool and could be useful for stuff that isn't updated so
       | frequently, like a CMS.
        
       | ramesh31 wrote:
       | so... Delta Lake?
        
       | jitl wrote:
       | There is also SlateDB, another work-in-progress take on this. HN
       | link: https://news.ycombinator.com/item?id=41714858
        
         | maxmcd wrote:
         | Yeah I think it's very interesting to compare the two. SlateDB
         | expects a single writer and fences writes. This means you can
         | make some serious savings on S3 costs because you're using S3
         | for consistency but you're batching writes.
         | 
         | GlassDB is much more accessible for smaller volume workloads,
         | but gets very costly at high volume because it issues requests
         | to S3 per transaction. In turn, the consistency model is easier
         | to reason about because the system is entirely stateless.
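The single-writer trade-off described above (fence stale writers, batch writes to cut request counts) can be sketched generically. This is a minimal in-memory sketch using epoch-based fencing; it is not SlateDB's actual implementation, and `Manifest`, `BatchingWriter`, and `Fenced` are hypothetical names.

```python
class Fenced(Exception):
    """Raised when a writer with a stale epoch tries to flush."""


class Manifest:
    """In-memory stand-in for a manifest object that tracks the writer epoch."""

    def __init__(self):
        self.epoch = 0
        self.batches = []
        self.requests = 0  # number of simulated object-store requests

    def claim(self):
        # A new writer bumps the epoch, which fences every older writer.
        self.requests += 1
        self.epoch += 1
        return self.epoch

    def append_batch(self, epoch, batch):
        self.requests += 1
        if epoch != self.epoch:
            raise Fenced(f"epoch {epoch} superseded by {self.epoch}")
        self.batches.append(list(batch))


class BatchingWriter:
    """Buffers writes and flushes them as one request per batch."""

    def __init__(self, manifest, batch_size):
        self._manifest = manifest
        self._epoch = manifest.claim()
        self._batch_size = batch_size
        self._buffer = []

    def put(self, key, value):
        self._buffer.append((key, value))
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        if self._buffer:
            self._manifest.append_batch(self._epoch, self._buffer)
            self._buffer = []
```

With a batch size of 3, six writes cost two flush requests plus one claim, instead of at least one request per write. The price is that the writer holds state (its buffer and epoch), which is the contrast with a fully stateless design.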
        
       | social_quotient wrote:
       | I found myself thinking about Cloudflare Durable Objects' new
       | SQLite offering.
       | 
       | Nicely detailed here: https://simonwillison.net/2024/Oct/13/zero-
       | latency-sqlite-st... and
       | https://developers.cloudflare.com/durable-objects/best-pract...
        
       | jacobmarble wrote:
       | If I had time, I'd like to implement an Iceberg catalog this way.
        
       ___________________________________________________________________
       (page generated 2024-11-25 23:00 UTC)