[HN Gopher] Amazon S3 Adds Put-If-Match (Compare-and-Swap)
       ___________________________________________________________________
        
       Amazon S3 Adds Put-If-Match (Compare-and-Swap)
        
       Author : Sirupsen
       Score  : 500 points
        Date   : 2024-11-25 22:11 UTC (1 day ago)
        
 (HTM) web link (aws.amazon.com)
 (TXT) w3m dump (aws.amazon.com)
        
       | koolba wrote:
        | This, combined with the read-after-write consistency guarantee,
        | is a perfect building block (pun intended) for incremental
        | append-only storage atop an object store. It solves the biggest
        | problem with coordinating multiple writers to a WAL.
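        | 
        | A minimal sketch of that pattern (untested; assumes a boto3
        | recent enough to expose the new IfNoneMatch parameter on
        | put_object, a made-up key layout, and simplified error
        | handling): each writer tries to create the next log segment
        | with a create-only conditional PUT and, if it loses the race,
        | moves on to the next sequence number.
        | 
        |     import boto3
        |     from botocore.exceptions import ClientError
        | 
        |     s3 = boto3.client("s3")
        | 
        |     def append_record(bucket, record, seq):
        |         # Claim segment `seq`: IfNoneMatch="*" makes the PUT
        |         # create-only, so exactly one writer wins each slot.
        |         while True:
        |             key = f"wal/{seq:020d}"
        |             try:
        |                 s3.put_object(Bucket=bucket, Key=key,
        |                               Body=record, IfNoneMatch="*")
        |                 return seq  # we own this slot
        |             except ClientError as e:
        |                 code = e.response["Error"]["Code"]
        |                 if code != "PreconditionFailed":
        |                     raise
        |                 seq += 1  # lost the race; try next slot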
        
         | IgorPartola wrote:
         | Rename for objects and "directories" also. Atomic.
        
         | ncruces wrote:
          | Both this and read-after-write consistency are single-object.
         | 
         | So coordinating writes to multiple objects still requires...
         | creativity.
        
       | sillysaurusx wrote:
       | Finally. GCP has had this for a long time. Years ago I was
       | surprised S3 didn't.
        
         | ncruces wrote:
         | GCS is just missing x-amz-copy-source-range in my book.
         | 
         | Can we have this Google?
         | 
         | ...
         | 
         | Please?
        
         | mannyv wrote:
          | GCP still didn't have triggers out of beta last time I checked
          | (which was a while ago).
        
           | fragmede wrote:
           | Gmail was in beta for five years, I don't think that label
           | really means anything.
        
             | UltraSane wrote:
             | It means that Google doesn't want to offer an SLA
        
               | sitkack wrote:
               | Not that it matters. It just changes the volume and
               | timing of "I believe I did bob"
        
           | BrandonY wrote:
           | We do have Cloud Run Functions that trigger on Cloud Storage
           | events, as well as Cloud Pub/Sub notifications for the same.
           | Is there a specific bit of functionality you're looking for?
        
         | seansmccullough wrote:
         | Azure Storage has also had this for years -
         | https://learn.microsoft.com/en-us/rest/api/storageservices/s...
        
       | 1a527dd5 wrote:
       | Be still my beating heart. I have lived to see this day.
       | 
        | Genuinely, we've wanted this for ages and we got halfway there
       | with strong consistency.
        
         | ncruces wrote:
         | Might finally be possible to do this on S3:
         | https://pkg.go.dev/github.com/ncruces/go-gcp/gmutex
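          | 
          | The acquire/release half of such a mutex might now look
          | roughly like this on S3 (untested sketch; assumes boto3
          | exposes IfNoneMatch on put_object and uses a made-up lock
          | key; a real lock, like gmutex, also needs an expiry and
          | takeover story):
          | 
          |     import boto3
          |     from botocore.exceptions import ClientError
          | 
          |     s3 = boto3.client("s3")
          | 
          |     def try_lock(bucket, name, owner):
          |         # Create-only PUT: succeeds for exactly one caller
          |         # while the lock object does not exist yet.
          |         try:
          |             s3.put_object(Bucket=bucket,
          |                           Key=f"locks/{name}",
          |                           Body=owner.encode(),
          |                           IfNoneMatch="*")
          |             return True
          |         except ClientError as e:
          |             code = e.response["Error"]["Code"]
          |             if code == "PreconditionFailed":
          |                 return False  # lock already held
          |             raise
          | 
          |     def unlock(bucket, name):
          |         s3.delete_object(Bucket=bucket,
          |                          Key=f"locks/{name}")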
        
           | phrotoma wrote:
           | Huh. Does this mean that the AWS terraform provider could
           | implement state locking without the need for a DDB table the
           | way the GCP provider does?
        
             | arianvanp wrote:
             | Correct
        
         | paulddraper wrote:
         | So....given CAP, which one did they give up
        
           | johnrob wrote:
           | I'd wager that the algorithm is slightly eager to throw a
           | consistency error if it's unable to verify across partitions.
           | Since the caller is naturally ready for this error, it's
           | likely not a problem. So in short it's the P :)
        
             | alanyilunli wrote:
             | Shouldn't that be the A then? Since the network partition
             | is still there but availability is non-guaranteed.
        
               | johnrob wrote:
               | Yes, definitely. Good point (I was knee jerk assuming the
               | A is always chosen and the real "choice" is between C and
               | P).
        
               | rhaen wrote:
               | Well, P isn't really much of a choice, I don't think you
               | can opt out of acts of god.
        
               | fwip wrote:
               | You can design to minimize P, though. For instance, if
               | you have all the services running on the same physical
               | box, and make people enter the room to use it instead of
               | over the Internet, "partition" becomes much less likely.
               | (This example is a bit silly.)
               | 
               | But you're right, if you take a broad view of P, the
               | choice is really between consistency and availability.
        
               | btown wrote:
               | https://tqdev.com/2024-the-p-in-cap-is-for-performance is
               | a really interesting take on this as a response to
               | https://blog.dtornow.com/the-cap-theorem.-the-bad-the-
               | bad-th... - essentially, the only way to get CA is if
               | you're willing to say that every request will succeed
               | eventually, but it might take an unbounded amount of time
               | for partitions to heal, and you have to be willing to
               | wait indefinitely for that to happen. Which can indeed
               | make sense for asynchronous messaging, but not for real-
               | time applications as we think about them in the modern
               | day. In practice, if you're talking about CAP for high-
               | performance systems, you're choosing either CP or AP.
        
           | moralestapia wrote:
           | A tiny bit of availability, unnoticeable at web scale.
        
           | the_arun wrote:
            | I thought they had implemented optimistic locking now to
            | coordinate concurrent writes. How does that change anything
            | in CAP?
        
             | paulddraper wrote:
             | The C stands for Consistency.
        
           | nimih wrote:
           | Based on my general experience with S3, they jettisoned A
           | years ago (or maybe never had it).
        
       | offmycloud wrote:
       | If the default ETag algorithm for non-encrypted, non-multipart
       | uploads in AWS is a plain MD5 hash, is this subject to failure
       | for object data with MD5 collisions?
       | 
       | I'm thinking of a situation in which an application assumes that
       | different (possibly adversarial) user-provided data will always
       | generate a different ETag.
        
         | revnode wrote:
          | MD5 hash collisions are unlikely to happen at random. The
          | defect is that you can make one happen on purpose, which makes
          | MD5 useless for security.
        
           | aphantastic wrote:
           | Sure, but theoretically you could have a system where a
           | distributed log of user generated content is built via this
            | CAS/MD5 primitive. A malicious actor could craft the data
           | such that entries are dropped.
        
         | UltraSane wrote:
          | The default ETag is used to detect bit errors, and MD5 is
         | fine for that. S3 does support using SHA256 instead.
        
         | CobrastanJorji wrote:
         | With Google Cloud Storage, you can solve this by conditionally
         | writing based on the "generation number" of the object, which
         | always increases with each new write, so you can know whether
         | the object has been overwritten regardless of its contents. I
         | think Azure also has an equivalent.
        
       | tonymet wrote:
       | good example of how a simple feature on the surface (a header
       | comparison) requires tremendous complexity and capacity on the
       | backend.
        
         | akira2501 wrote:
         | S3 is rated as "durable" as opposed to "best effort." It has
         | lots of interesting guarantees as a result.
        
           | tonymet wrote:
           | Also they are faithful to their consistency commitments
        
       | gravitronic wrote:
       | First thing I thought when I saw the headline was "oh! I should
       | tell Sirupsen"
        
       | JoshTriplett wrote:
       | It's also possible to enforce the use of conditional writes:
       | https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3...
       | 
       | My biggest wishlist item for S3 is the ability to enforce that an
       | object is named with a name that matches its hash. (With a modern
       | hash considered secure, not MD5 or SHA1, though it isn't
       | supported for those either.) That would make it much easier to
        | build content-addressable storage.
        
         | jiggawatts wrote:
         | That will probably never happen because of the fundamental
         | nature of blob storage.
         | 
         | Individual objects are split into multiple blocks, each of
         | which can be stored independently on different underlying
          | servers. Each server can see its own block, but not the others.
         | 
         | Calculating a hash like SHA256 would require a sequential scan
         | through all blocks. This _could_ be done with a minimum of
         | network traffic if instead of streaming the bytes to a central
         | server to hash, the _hash state_ is forwarded from block server
         | to block server in sequence. Still though, it would be a very
         | slow serial operation that could be fairly chatty too if there
         | are many tiny blocks.
         | 
         | What _could_ work would be to use a Merkle tree hash
          | construction where some of the subdivision boundaries match the
         | block sizes.
        
           | losteric wrote:
            | Why does the architecture of blob storage matter? The hash can
           | be calculated as data streams in for the first write, before
           | data gets dispersed into multiple physically stored blocks.
        
             | willglynn wrote:
             | It is common to use multipart uploads for large objects,
             | since this both increases throughput and decreases latency.
             | Individual part uploads can happen in parallel and complete
             | in any sequence. There's no architectural requirement that
             | an entire object pass through a single system on either
             | S3's side or on the client's side.
        
           | texthompson wrote:
           | Why would you PUT an object, then download it again to a
           | central server in the first place? If a service is accepting
           | an upload of the bytes, it is already doing a pass over all
           | the bytes anyway. It doesn't seem like a ton of overhead to
            | calculate SHA256 in 4096-byte chunks as the upload
           | progresses. I suspect that sort of calculation would happen
           | anyways.
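            | 
            | Concretely, for a single stream, incremental hashing only
            | needs a tiny bit of running state - something like this
            | sketch:
            | 
            |     import hashlib
            | 
            |     def sha256_of_stream(chunks):
            |         # Hash data as it streams in; only the running
            |         # state is kept, never the whole object.
            |         h = hashlib.sha256()
            |         for chunk in chunks:   # e.g. 4096-byte chunks
            |             h.update(chunk)
            |         return h.hexdigest()
            | 
            | The wrinkle, as discussed below, is multipart uploads:
            | parts can arrive in parallel on different servers, so no
            | single stream ever sees the whole object in order.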
        
             | danielheath wrote:
             | S3 supports multipart uploads which don't necessarily send
             | all the parts to the same server.
        
               | texthompson wrote:
               | Why does it matter where the bytes are stored at rest?
               | Isn't everything you need for SHA-256 just the results of
               | the SHA-256 algorithm on every 4096-byte block? I think
               | you could just calculate that as the data is streamed in.
        
               | jiggawatts wrote:
               | The data is not necessarily "streamed" in! That's a
               | significant design feature to allow _parallel_ uploads of
                | a single object using many parts ("blocks"). See: https:
               | //docs.aws.amazon.com/AmazonS3/latest/API/API_CreateMu...
        
               | Dylan16807 wrote:
               | > Isn't everything you need for SHA-256 just the results
               | of the SHA-256 algorithm on every 4096-byte block?
               | 
               | No, you need the hash of the previous block before you
               | can start processing the next block.
        
             | willglynn wrote:
             | You're right, and in fact S3 does this with the `ETag:`
             | header... in the simple case.
             | 
             | S3 also supports more complicated cases where the entire
             | object may not be visible to any single component while it
             | is being written, and in those cases, `ETag:` works
             | differently.
             | 
             | > * Objects created by the PUT Object, POST Object, or Copy
             | operation, or through the AWS Management Console, and are
             | encrypted by SSE-S3 or plaintext, have ETags that are an
             | MD5 digest of their object data.
             | 
             | > * Objects created by the PUT Object, POST Object, or Copy
             | operation, or through the AWS Management Console, and are
             | encrypted by SSE-C or SSE-KMS, have ETags that are not an
             | MD5 digest of their object data.
             | 
             | > * If an object is created by either the Multipart Upload
             | or Part Copy operation, the ETag is not an MD5 digest,
             | regardless of the method of encryption. If an object is
             | larger than 16 MB, the AWS Management Console will upload
             | or copy that object as a Multipart Upload, and therefore
             | the ETag will not be an MD5 digest.
             | 
             | https://docs.aws.amazon.com/AmazonS3/latest/API/API_Object.
             | h...
        
           | Salgat wrote:
           | Isn't that the point of the metadata? Calculate the hash
           | ahead of time and store it in the metadata as part of the
           | atomic commit for the blob (at least for S3).
        
           | flakes wrote:
           | You have just re-invented IPFS!
           | https://en.m.wikipedia.org/wiki/InterPlanetary_File_System
        
         | cmeacham98 wrote:
         | Is there any reason you can't enforce that restriction on your
         | side? Or are you saying you want S3 to automatically set the
         | name for you based on the hash?
        
           | JoshTriplett wrote:
           | > Is there any reason you can't enforce that restriction on
           | your side?
           | 
           | I'd like to set IAM permissions for a role, so that that role
            | can add objects to the content-addressable store, but only if
           | their name matches the hash of their content.
           | 
           | > Or are you saying you want S3 to automatically set the name
           | for you based on the hash?
           | 
           | I'm happy to name the files myself, if I can get S3 to
           | enforce that. But sure, if it were easier, I'd be thrilled to
           | have S3 name the files by hash, and/or support retrieving
           | files by hash.
        
             | mdavidn wrote:
             | I think you can presign PutObject calls that validate a
             | particular SHA-256 checksum. An API endpoint, e.g. in a
             | Lambda, can effectively enforce this rule. It unfortunately
             | won't work on multipart uploads except on individual parts.
        
               | UltraSane wrote:
                | The ETag of a multipart upload is simply the MD5 of the
                | concatenated part MD5s, plus the part count. I've been
                | able to replicate it.
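                | 
                | A sketch of the commonly observed algorithm (untested;
                | assumes you know the part size that was used):
                | 
                |     import hashlib
                | 
                |     def multipart_etag(data, part_size):
                |         # MD5 of the concatenated binary MD5 digests
                |         # of each part, plus "-<number of parts>".
                |         digests = b""
                |         count = 0
                |         for i in range(0, len(data), part_size):
                |             part = data[i:i + part_size]
                |             digests += hashlib.md5(part).digest()
                |             count += 1
                |         outer = hashlib.md5(digests).hexdigest()
                |         return f"{outer}-{count}"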
        
               | thayne wrote:
               | But in order to do that you need to already know the
               | contents of the file.
               | 
               | I suppose you could have some API to request a signed url
               | for a certain hash, but that starts getting complicated,
               | especially if you need support for multi-part uploads,
               | which you probably do.
        
               | JoshTriplett wrote:
               | Unfortunately, last I checked, the list of headers you're
               | allowed to enforce for pre-signing does not include the
               | hash.
        
         | anotheraccount9 wrote:
          | Could you use a metadata field on the object to store the hash,
          | and run the comparison against that?
        
         | texthompson wrote:
          | That's interesting. Would you want it to be something like a
          | bucket setting, like "any time an object is uploaded, don't let
          | the write complete unless S3 verifies, with a pre-defined hash
          | function (like SHA-256), that the object's name matches the
          | object's contents"?
        
           | BikiniPrince wrote:
           | You can already put with a sha256 hash. If it fails it just
           | returns an error.
        
         | UltraSane wrote:
         | S3 has supported SHA-256 as a checksum algo since 2022. You can
         | calculate the hash locally and then specify that hash in the
         | PutObject call. S3 will calculate the hash and compare it with
         | the hash in the PutObject call and reject the Put if they
         | differ. The hash and algo are then stored in the object's
         | metadata. You simply also use the SHA-256 hash as the key for
         | the object.
         | 
         | https://aws.amazon.com/blogs/aws/new-additional-checksum-alg...
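          | 
          | Roughly like this (untested sketch; the bucket is
          | hypothetical, and boto3's ChecksumSHA256 parameter takes the
          | base64-encoded digest):
          | 
          |     import base64
          |     import hashlib
          | 
          |     import boto3
          | 
          |     s3 = boto3.client("s3")
          | 
          |     def put_content_addressed(bucket, data):
          |         digest = hashlib.sha256(data).digest()
          |         b64 = base64.b64encode(digest).decode()
          |         # S3 recomputes SHA-256 server-side and rejects
          |         # the PUT if it doesn't match ChecksumSHA256.
          |         s3.put_object(Bucket=bucket, Key=digest.hex(),
          |                       Body=data, ChecksumSHA256=b64)
          |         return digest.hex()  # object name == content hash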
        
           | thayne wrote:
           | Unfortunately, for a multi-part upload it isn't a hash of the
           | total object, it is a hash of the hashes for each part, which
           | is a lot less useful. Especially if you don't know how the
            | file was partitioned during upload.
           | 
            | And even if it were for the whole file, it isn't used for the
            | ETag, so it can't be used for conditional PUTs.
           | 
           | I had a use case where this looked really promising, then I
           | ran into the multipart upload limitations, and ended up using
           | my own custom metadata for the sha256sum.
        
             | vdm wrote:
             | Don't the SDKs take care of computing the multi-part
             | checksum during upload?
             | 
             | > To create a trailing checksum when using an AWS SDK,
             | populate the ChecksumAlgorithm parameter with your
             | preferred algorithm. The SDK uses that algorithm to
             | calculate the checksum for your object (or object parts)
             | and automatically appends it to the end of your upload
             | request. This behavior saves you time because Amazon S3
             | performs both the verification and upload of your data in a
             | single pass. https://docs.aws.amazon.com/AmazonS3/latest/us
             | erguide/checki...
        
               | tedk-42 wrote:
                | It does, and has a good default. An issue I've come across,
                | though, is when you have the file locally and want to check
                | the ETag value - you have to compute it locally first and
                | then compare the value to the S3-stored object's.
        
               | vdm wrote:
               | https://github.com/peak/s3hash
               | 
               | It would be nice if this got updated for Additional
               | Checksums.
        
             | vdm wrote:
             | Ways to control etag/Additional Checksums without
             | configuring clients:
             | 
             | CopyObject writes a single part object and can read from a
             | multipart object, as long as the parts total less than the
             | 5 gibibyte limit for a single part.
             | 
             | For future writes, s3:ObjectCreated:CompleteMultipartUpload
             | event can trigger CopyObject, else defrag to policy size
             | parts. Boto copy() with multipart_chunksize configured is
             | the most convenient implementation, other SDKs lack an
             | equivalent.
             | 
              | For past writes, existing multipart objects can be selected
              | from inventory by filtering for ETag column lengths greater
              | than 32 characters. Dividing object size by part size might
              | hint at whether the part size matches policy.
        
               | vdm wrote:
               | > Dividing object size by part size
               | 
               | Correction: and also part _quantity_ (parsed from etag)
               | for comparison
        
             | infogulch wrote:
             | If parts are aligned on a 1024-byte boundary and you know
             | each part's start offset, it should be possible to use the
             | internals of a BLAKE3 tree to get the final hash of all the
             | parts together even as they're uploaded separately.
             | https://github.com/C2SP/C2SP/blob/main/BLAKE3.md#13-tree-
             | has...
             | 
             | Edit: This is actually already implemented in the Bao
             | project which exploits the structure of the BLAKE3 merkle
             | tree structure to offer cool features like streaming
             | verification and verifying slices of a file as I described
             | above: https://github.com/oconnor663/bao#verifying-slices
        
         | josnyder wrote:
         | While it can't be done server-side, this can be done
         | straightforwardly in a signer service, and the signer doesn't
         | need to interact with the payloads being uploaded. In other
         | words, a tiny signer can act as a control plane for massive
         | quantities of uploaded data.
         | 
         | The client sends the request headers (including the x-amz-
         | content-sha256 header) to the signer, and the signer responds
         | with a valid S3 PUT request (minus body). The client takes the
         | signer's response, appends its chosen request payload, and
         | uploads it to S3. With such a system, you can implement a
         | signer in a lambda function, and the lambda function enforces
         | the content-addressed invariant.
         | 
         | Unfortunately it doesn't work natively with multipart: while
         | SigV4+S3 enables you to enforce the SHA256 of each individual
         | part, you can't enforce the SHA256 of the entire object. If you
         | really want, you can invent your own tree hashing format atop
         | SHA256, and enforce content-addressability on that.
         | 
         | I have a blog post [1] that goes into more depth on signers in
         | general.
         | 
         | [1]
         | https://josnyder.com/blog/2024/patterns_in_s3_data_access.ht...
        
           | JoshTriplett wrote:
           | That's incredibly interesting, thank you! That's a really
           | creative approach, and it looks like it might work for me.
        
       | Sirupsen wrote:
       | To avoid any dependencies other than object storage, we've been
       | making use of this in our database (turbopuffer.com) for
       | consensus and concurrency control since day one. Been waiting for
       | this since the day we launched on Google Cloud Storage ~1 year
       | ago. Our bet that S3 would get it in a reasonable time-frame
       | worked out!
       | 
       | https://turbopuffer.com/blog/turbopuffer
        
         | amazingamazing wrote:
         | Interesting that what's basically an ad is the top comment -
         | it's not like this is open source or anything - can't even use
         | it immediately (you have to apply for access). Totally
          | proprietary. At least Elasticsearch is AGPL, to say nothing of
          | OpenSearch, which also supports use of S3.
        
           | viraptor wrote:
           | Someone made an informed technical bet that worked out.
           | Sounds like HN material to me. (Also, is it really a useful
           | ad if you can't easily use the product?)
        
             | amazingamazing wrote:
             | Worked out how? There's no implementation. It's just
             | conjecture.
        
               | hedora wrote:
               | Pretty much all other S3 implementations (including open
               | source ones) support this or equivalent primitives, so
               | this is great for interoperability with existing
               | implementations.
        
               | viraptor wrote:
               | It's right there:
               | 
               | > Our bet that S3 would get it in a reasonable time-frame
               | worked out!
        
               | amazingamazing wrote:
               | How? This is a technical forum. Unless you're saying any
               | consumer of S3 can now spam links to their product on
               | this thread with impunity. (Hey maybe they're using cas).
        
               | richardlblair wrote:
               | Oh look, someone is mad on the internet about something
               | silly.
        
           | jauntywundrkind wrote:
           | https://github.com/slatedb/slatedb will, I expect, use this
           | at some point. Object backed DB, which is open source.
        
             | benesch wrote:
             | Yes! I'm actively working on it, in fact. We're waiting on
             | the next release of the Rust `object_store` crate, which
             | will bring support for S3's native conditional puts.
             | 
             | If you want to follow along:
             | https://github.com/slatedb/slatedb/issues/164
        
           | ramraj07 wrote:
           | No one owes anyone open source. If they can make the business
           | case work or if it works in their favor, sure.
        
           | jrochkind1 wrote:
           | I don't mind hearing another developer's use case for this
           | feature, even if it's commercial proprietary software.
           | 
           | It's no longer top comment, which is fine.
        
         | CobrastanJorji wrote:
          | I'm glad that bet worked out for you, but what made you think,
          | one year ago, that S3 would introduce it soon, when it hadn't
          | for the previous 15 years?
        
       | CubsFan1060 wrote:
       | I feel dumb for asking this, but can someone explain why this is
       | such a big deal? I'm not quite sure I am grokking it yet.
        
         | Sirupsen wrote:
         | The short of it is that building a database on top of object
         | storage has generally required a complicated, distributed
         | system for consensus/metadata. CAS makes it possible to build
         | these big data systems without any other dependencies. This is
         | a win for simplicity and reliability.
        
           | CubsFan1060 wrote:
           | Thanks! Do they mention when the comparison is done? Is it
           | before, after, or during an upload? (For instance, if I have
           | a 4tb file in a multi part upload, would I only know it would
           | fail as soon as the whole file is uploaded?)
        
             | poincaredisk wrote:
             | I imagine, for it to make sense, that the comparison is
             | done at the last possible moment, before atomically
             | swapping the file contents.
        
               | Nevermark wrote:
               | I can imagine it might be useful to make this a choice
               | for databases with high frequency small swaps and
               | occasional large ones.
               | 
               | 1) default, load-compare-&-swap for small fast
               | load/swaps.
               | 
               | 2) optional, compare-load-&-swap to allow a large load to
               | pass its compare, and cut in front of all the fast small
               | swap that would otherwise create an un-hittable moving
               | target during its long loads for its own compare.
               | 
               | 3) If the load itself was stable relative to the compare,
               | then it could be pre-loaded and swapped into a holding
               | location, followed by as many fast compare-&-swaps as
               | needed to get it into the right location.
        
               | lxgr wrote:
               | Practically, they could do both: Do an early reject of a
               | given POST in case the ETag does not match, but re-
               | validate this just before swapping out the objects (and
               | committing to considering the given request as the
               | successful one globally).
               | 
                | That said, I'm not sure if common HTTP libraries look at
                | response headers before they're done sending the request
                | body, or if that's even allowed/possible in HTTP. It
               | seems feasible at a first glance with chunked encoding,
               | at least.
               | 
               | Edit: Upon looking a bit, it seems that informational
               | response codes, e.g. 100 (Continue) in combination with
               | Expect 100-continue in the requests, could enable just
               | that and avoid an extra GET with If-Match.
        
             | timmg wrote:
              | (I assume) it will fail if the ETag doesn't match -- the
              | instant it gets the header.
             | 
             | The main point of it is: I have an object that I want to
             | mutate. I _think_ I have the latest version in memory. So I
             | update in memory and upload it to S3 _with the eTag of the
             | version I have_ and tell it to only commit _if that is the
             | latest version_. If it  "fails", I re-download the object,
             | re-apply the mutation, and try again.
        
         | lxgr wrote:
         | If my memory of parallel algorithms class serves me right, you
         | can build any synchronization algorithm on top of compare-and-
         | swap as an atomic primitive.
         | 
         | As a (horribly inefficient, in case of non-trivial write
         | contention) toy example, you could use S3 as a lock-free
         | concurrent SQLite storage backend: Reads work as expected by
         | fetching the entire database and satisfying the operation
         | locally; writes work like this:
         | 
         | - Download the current database copy
         | 
         | - Perform your write locally
         | 
          | - Upload it back using "Put-If-Match" with the pre-edit copy's
          | ETag as the match condition.
         | 
         | - If you get success, consider the transaction successful.
         | 
         | - If you get failure, go back to step 1 and try again.
        
         | jayd16 wrote:
         | When you upload a change you can know you're not clobbering
         | changes you never saw.
        
           | ramraj07 wrote:
           | Brilliant single line that is better than every other
           | description above. Kudos.
        
           | papichulo2023 wrote:
          | I think it's called write-after-write (WAW), if I remember
          | correctly.
        
         | CobrastanJorji wrote:
         | It is often very important to know, when you write an object,
         | what the previous state was. Say you sold plushies and you had
         | 100 plushies in a warehouse. You create a file
         | "remainingPlushies.txt" that stores "100". If somebody buys a
         | plushie, you read the file, and if it's bigger than 0, you
         | subtract 1, write the new version of the file, and okay the
         | sale.
         | 
         | Without conditional writes, two instances of your application
         | might both read "100", both subtract 1, and both write "99". If
         | they checked the file afterward, both would think everything
          | was fine. But things aren't fine, because you've actually sold
         | two.
         | 
         | The other cloud storage providers have had these sorts of
         | conditional write features since basically forever, and it's
         | always been really weird that S3 has lacked them.
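          | 
          | With conditional writes, that read-modify-write loop looks
          | roughly like this (untested sketch; assumes a boto3 new
          | enough to expose IfMatch on put_object, plus a hypothetical
          | bucket):
          | 
          |     import boto3
          |     from botocore.exceptions import ClientError
          | 
          |     s3 = boto3.client("s3")
          | 
          |     def sell_plushie(bucket, key="remainingPlushies.txt"):
          |         while True:
          |             obj = s3.get_object(Bucket=bucket, Key=key)
          |             etag = obj["ETag"]
          |             count = int(obj["Body"].read().decode())
          |             if count <= 0:
          |                 return False  # sold out
          |             try:
          |                 # Commit only if nobody else wrote since
          |                 # our read (the ETag still matches).
          |                 s3.put_object(Bucket=bucket, Key=key,
          |                               Body=str(count - 1).encode(),
          |                               IfMatch=etag)
          |                 return True
          |             except ClientError as e:
          |                 code = e.response["Error"]["Code"]
          |                 if code != "PreconditionFailed":
          |                     raise
          |                 # Lost the race: re-read and retry.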
        
       | rrr_oh_man wrote:
       | Could anybody explain for the uninitiated?
        
         | msoad wrote:
         | It ensures that when you try to upload (or "put") a new version
         | of a file, the operation only succeeds if the file on the
         | server still has the exact version (ETag) you specify. If
         | someone else has updated the file in the meantime, your upload
         | is blocked to prevent overwriting their changes.
         | 
         | This is especially useful in scenarios where multiple users or
         | processes are working on the same data, as it helps maintain
         | consistency and avoids accidental overwrites.
         | 
          | This uses the same mechanism as HTTP's standard `If-Match` /
          | `If-None-Match` conditional request headers, so it's easy to
          | implement and learn.
        
           | rrr_oh_man wrote:
           | Thank you! That was extremely helpful (and written in a way
           | that is easy to understand)!
        
       | wanderingmind wrote:
        | Does this mean that, in theory, we will be able to manage
        | multiple concurrent writes/updates to S3 without having to use
        | new solutions like Regatta [1], which launched recently?
       | 
       | https://news.ycombinator.com/item?id=42174204
        
         | huntaub wrote:
         | Here's how I would think about this. Regatta isn't the best way
          | to _add_ synchronization primitives to S3 if you're already
          | using the S3 API and able to change your code. Regatta is most
         | useful when you need a local disk, or a higher performance
         | version of S3. In this case, the addition of these new
         | primitives actually just makes Regatta work better for our
         | customers -- because we get to achieve even stronger
         | consistency.
        
       | dvektor wrote:
       | [rejected] error: failed to push some refs to remote repository
       | 
       | Finally we can have this with s3 :)
        
         | mdaniel wrote:
         | Relevant: https://github.com/awslabs/git-remote-s3#readme
         | https://news.ycombinator.com/item?id=41887004
        
       | vlovich123 wrote:
       | I implemented that extension in R2 at launch IIRC. Thanks for
       | catching up & helping move distributed storage applications a
       | meaningful step forward. Intended sincerely. I'm sure adding this
       | was non-trivial for a complex legacy codebase like that.
        
       | ipython wrote:
        | I can't wait to see what abomination Corey Quinn can come up with
       | now given this new primitive! (see previous work abusing Route53
       | as a database:
       | https://www.lastweekinaws.com/blog/route-53-amazons-premier-...)
        
       | stevefan1999 wrote:
       | So...are we closer to getting to use S3 as a...you guessed it...a
       | database? With CAS, we are probably able to get a basic level of
       | atomicity, and S3 itself is pretty durable, now we have to deal
       | with consistency and isolation...although S3 branded itself as
       | "eventually consistent"...
        
         | mr_toad wrote:
         | People who want all those features use something like Delta
         | Lake on top of object storage.
        
         | User23 wrote:
         | There was a great deal of interest in gossip protocols,
         | eventual consistency, and such at Amazon in the mid oughts. So
         | much so that they hired a certain Cornell professor along with
         | the better part of his grad students to build out those
         | technologies.
        
         | gynther wrote:
          | S3 has been strongly consistent for four years now.
         | https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-rea...
        
       | amazingamazing wrote:
        | Ironically, with this and Lambda you could make a serverless
        | SQLite by mapping pages to objects: use HTTP range reads to read
        | the DB, and Lambda to translate queries into writes to the
        | appropriate pages via CAS. Prior to this it would have required a
        | server to handle concurrent writers, making the whole thing a
        | nonstarter for "serverless".
       | 
       | Too bad performance would be terrible without a caching layer
       | (ebs).
        
         | captn3m0 wrote:
         | For read heavy workloads, you could cache the results at
         | cloudfront. Maybe we will someday see Wordpress-on-Lambda-to-
         | Sqlite-over-S3.
        
       | m_d_ wrote:
       | s3fs's https://github.com/fsspec/s3fs/pull/917 was in response to
       | the IfNoneMatch feature from the summer. How would people imagine
       | this new feature being surfaced in a filesystem abstraction?
        
       | grahamj wrote:
       | bender_neat.gif
        
       | maglite77 wrote:
       | Noting that Azure Blob storage supports e-tag / optimistic
       | controls as well (via If-Match conditions)[1], how does this
       | differ? Or is it the same feature?
       | 
       | [1]: https://learn.microsoft.com/en-
       | us/azure/storage/blobs/concur...
        
         | simonw wrote:
         | It's the same feature. Google Cloud Storage has it too:
         | https://cloud.google.com/storage/docs/request-preconditions#...
        
       | paulsutter wrote:
       | What's amazing is that it took them so long to add these
       | functions
        
       | serbrech wrote:
       | Why is standard etag support making the frontpage?
        
       | thayne wrote:
       | Now if only you had more control over the ETag, so you could use
       | a sha256 of the total file (even for multi-part uploads), or a
       | version counter, or a global counter from an external system, or
       | a logical hash of the content as opposed to a hash of the bytes.
        
       | vytautask wrote:
        | MinIO, an open-source implementation of Amazon S3, has had this
        | for almost two years (relevant post: https://blog.min.io/leading-the-
        | way-minios-conditional-write...). Strangely, Amazon is only
        | catching up now.
        
         | topspin wrote:
         | That's not "strange" to me. Object storage has been a long time
         | coming, and it's still being figured out: the entirely typical
         | process of discovering useful and feasible primitives that
          | expand applicability to more sophisticated problems. This is
          | obviously going to occur first in smaller and/or younger, more
         | agile implementations, whereas AWS has the problem of
         | implementing this at pretty much the largest conceivable scale
         | with zero risk. The lag is, therefore, entirely unsurprising.
        
         | aseipp wrote:
         | It's not surprising at all. The scale of AWS, in particular S3,
         | is nearly unfathomable, and the kind of solutions they need for
         | "simple" things are totally different at that size. S3 was
         | doing 1.1million requests a second back in 2013.[1]
         | 
         | I wouldn't be surprised if they saw over 100mil/req/sec
         | globally by now. That's 100 million requests _a second_ that
         | need strong read-your-write consistency and atomicity at global
         | scale. The number of pieces they had to move into place for
         | this to happen is probably quite the engineering tale.
         | 
         | [1] https://aws.amazon.com/blogs/aws/amazon-s3-two-trillion-
         | obje...
        
       | lttlrck wrote:
       | Isn't this compare-and-set rather than compare-and-swap?
        
       | torginus wrote:
        | Ah, so it's not only me that uses AWS primitives for hackily
        | implementing all sorts of synchronization primitives.
       | 
       | My other favorite pattern is implementing a pool of workers by
        | querying EC2 instances with a certain tag in a stopped state and
       | starting them. Starting the instance can succeed only once - that
       | means I managed to snatch the machine. If it fails, I try again,
       | grabbing another one.
       | 
        | This is one of those things that I never advertised out of
        | professional shame, but it works, it's bulletproof and dead
        | simple, and it does not require additional infra to work.
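        | 
        | Roughly (untested sketch; the tag name is made up, and it
        | leans on the claim above that starting a given stopped
        | instance succeeds for only one caller, with the loser seeing
        | an IncorrectInstanceState error):
        | 
        |     import boto3
        |     from botocore.exceptions import ClientError
        | 
        |     ec2 = boto3.client("ec2")
        | 
        |     def grab_worker():
        |         resp = ec2.describe_instances(Filters=[
        |             {"Name": "tag:role", "Values": ["worker"]},
        |             {"Name": "instance-state-name",
        |              "Values": ["stopped"]},
        |         ])
        |         for r in resp["Reservations"]:
        |             for inst in r["Instances"]:
        |                 iid = inst["InstanceId"]
        |                 try:
        |                     ec2.start_instances(InstanceIds=[iid])
        |                     return iid  # grabbed this machine
        |                 except ClientError as e:
        |                     code = e.response["Error"]["Code"]
        |                     if code != "IncorrectInstanceState":
        |                         raise
        |                     # Lost the race; try the next one.
        |         return None  # pool exhausted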
        
         | _zoltan_ wrote:
         | this actually sounds interesting. do you precreate the workers
         | beforehand and then just keep them in a stopped state?
        
           | torginus wrote:
            | Yeah. One of the goals was startup time, so it made sense to
            | precreate them. In practice we never ran out of free machines
            | (and if we did, I have a CDK script to make more), and
            | infinite scaling is a pain in the butt anyways due to having
           | to manage subnets etc.
           | 
           | Cost-wise we're only paying for the EBS volumes for the
           | stopped instances which are like 4GB each, so they cost
           | practically nothing, we spend less than a dollar per month
           | for the whole bunch.
        
             | zild3d wrote:
             | Warm pools are a supported feature in AWS on auto scaling
             | groups. Works as you're describing (have a pool of
             | instances in stopped state ready to use, only pay for EBS
             | volume if relevant)
             | https://aws.amazon.com/blogs/compute/scaling-your-
             | applicatio...
        
             | merb wrote:
                | I always thought that stopped instances would cost money as
             | well?!
        
               | torginus wrote:
               | You're only paying for the hard drive (and the VPC stuff,
               | if you want to be pedantic). The downside is that if you
               | try to start your instance, they might not start if AWS
               | doesn't have the capacity (rare but have seen it happen,
               | particularly with larger, more exotic instances.)
        
             | rfoo wrote:
             | > we spend less than a dollar per month for the whole bunch
             | 
             | This does not change the point, I'm just being pedantic,
             | but:
             | 
             | 4GB of gp3 EBS takes $0.32 per month, assuming a 50%
             | discount (not unusual), less than a dollar gives only... 6
             | instances.
        
         | belter wrote:
          | If you use hourly billed machines... sounds like the world's
          | most expensive semaphore :-)
        
           | torginus wrote:
           | except we are actually using them :)
        
             | belter wrote:
              | Just don't kill them before the hour is up and start a
              | different one again. Otherwise, within a single hour you
              | could be billed for hundreds of instance-hours... if they
              | are of the type billed by the hour.
        
           | messe wrote:
           | EC2 bills by the second.
        
             | belter wrote:
             | Some...
             | 
             | "Your Amazon EC2 usage is calculated by either the hour or
             | the second based on the size of the instance, operating
             | system, and the AWS Region where the instances are
             | launched" - https://repost.aws/knowledge-
             | center/ec2-instance-hour-billin...
             | 
             | https://aws.amazon.com/ec2/pricing/on-demand/
        
               | QuinnyPig wrote:
               | MacOS instances appear to be the sole remaining exception
               | since RHEL got on board.
        
               | redeux wrote:
               | Thanks Corey. Always nice to get the TL;DR from an
               | authority on the subject.
        
         | williamdclt wrote:
         | What would you say would be the "clean" way to implement a pool
         | of workers (using EC2 instances too)?
        
           | ndjdjddjsjj wrote:
           | etcd?
        
           | torginus wrote:
            | Not sure; probably an EKS cluster with a job scheduler pod
            | that creates jobs via the batch API. The scheduler pod might
            | be replaced by a Lambda. Another possibility is something
            | cooked up with a Lambda creating EC2 instances via CDK, with
            | the whole thing tracked in a DynamoDB table.
            | 
            | The first one is probably cleaner (though I don't like it; it
            | means that I need the instance to be a Kubernetes node, and
            | that comes with a bunch of baggage).
        
           | Cthulhu_ wrote:
           | Autoscaling and task queue based workloads, if my cloud
           | theory is still relevant.
        
             | twodave wrote:
             | Agreed. Scaling based on the length of the queue, up to
             | some maximum.
        
               | giovannibonetti wrote:
               | Even better, based on queue latency instead of length
        
               | jcrites wrote:
               | The single best metric I've found for scaling things like
               | this is the percent of concurrent capacity that's in use.
               | I wrote about this in a previous HN comment:
               | https://news.ycombinator.com/item?id=41277046
               | 
               | Scaling on things like the length of the queue doesn't
               | work very well at all in practice. A queue length of 100
               | might be horribly long in some workloads and
               | insignificant in others, so scaling on queue length
               | requires a lot of tuning that must be adjusted over time
               | as the workload changes. Scaling based on percent of
               | concurrent capacity can work for most workloads, and
               | tends to remain stable over time even as workloads
               | change.
        
       | anonymousDan wrote:
        | Would be interesting to understand how they've implemented it and
        | whether there is any perf impact on other API calls.
        
       | londons_explore wrote:
       | So we can now implement S3-as-RAM for a worldwide million-core
       | linux VM?
        
       | juggli wrote:
       | finally
        
       | spprashant wrote:
        | I had no idea people relied on S3 beyond dumb storage. It almost
       | feels like people are trying to build out a distributed OLAP
       | database in the reverse direction.
        
         | amne wrote:
         | 1. SELECT ... INTO OUTFILE S3
         | 
         | 2. glue jobs to partition by some columns reporting uses
         | 
         | 3. query with athena
         | 
         | 4. ???
         | 
         | 5. profit (celebrate reduced cost)
         | 
          | This thing costs a couple of dollars a month for ~500 GB of
          | data. Snowflake wanted crazy amounts of money for the same
          | thing.
        
       ___________________________________________________________________
       (page generated 2024-11-26 23:01 UTC)