[HN Gopher] Amazon S3 Adds Put-If-Match (Compare-and-Swap)
___________________________________________________________________
Amazon S3 Adds Put-If-Match (Compare-and-Swap)
Author : Sirupsen
Score : 500 points
Date : 2024-11-25 22:11 UTC (1 day ago)
(HTM) web link (aws.amazon.com)
(TXT) w3m dump (aws.amazon.com)
| koolba wrote:
| This, combined with the read-after-write consistency guarantee,
| is a perfect building block (pun intended) for incremental
| append-only storage atop an object store. It solves the biggest
| problem with coordinating multiple writers to a WAL.
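|
| As a minimal sketch of that pattern with boto3 (hypothetical
| bucket and key names; assumes a boto3 recent enough to expose
| the conditional-write parameters), writers race to create the
| next WAL segment with a create-only put, and exactly one wins:
|
|   import boto3
|   from botocore.exceptions import ClientError
|
|   s3 = boto3.client("s3")
|
|   def append_segment(bucket: str, seq: int, payload: bytes) -> bool:
|       """Try to claim WAL segment `seq`; True if this writer won."""
|       try:
|           s3.put_object(
|               Bucket=bucket,
|               Key=f"wal/{seq:020d}",  # zero-padded so keys list in order
|               Body=payload,
|               IfNoneMatch="*",  # succeed only if the key doesn't exist
|           )
|           return True
|       except ClientError as e:
|           if e.response["ResponseMetadata"]["HTTPStatusCode"] == 412:
|               return False  # another writer claimed this segment first
|           raise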
| IgorPartola wrote:
| Rename for objects and "directories" also. Atomic.
| ncruces wrote:
| Both this and read-after-write consistency are single-object.
|
| So coordinating writes to multiple objects still requires...
| creativity.
| sillysaurusx wrote:
| Finally. GCP has had this for a long time. Years ago I was
| surprised S3 didn't.
| ncruces wrote:
| GCS is just missing x-amz-copy-source-range in my book.
|
| Can we have this Google?
|
| ...
|
| Please?
| mannyv wrote:
| GCP still didn't have triggers out of beta the last time I
| checked (which was a while ago).
| fragmede wrote:
| Gmail was in beta for five years, I don't think that label
| really means anything.
| UltraSane wrote:
| It means that Google doesn't want to offer an SLA
| sitkack wrote:
| Not that it matters. It just changes the volume and
| timing of "I believe I did, Bob."
| BrandonY wrote:
| We do have Cloud Run Functions that trigger on Cloud Storage
| events, as well as Cloud Pub/Sub notifications for the same.
| Is there a specific bit of functionality you're looking for?
| seansmccullough wrote:
| Azure Storage has also had this for years -
| https://learn.microsoft.com/en-us/rest/api/storageservices/s...
| 1a527dd5 wrote:
| Be still my beating heart. I have lived to see this day.
|
| Genuinely, we've wanted this for ages, and we got halfway there
| with strong consistency.
| ncruces wrote:
| Might finally be possible to do this on S3:
| https://pkg.go.dev/github.com/ncruces/go-gcp/gmutex
| phrotoma wrote:
| Huh. Does this mean that the AWS terraform provider could
| implement state locking without the need for a DDB table the
| way the GCP provider does?
| arianvanp wrote:
| Correct
| paulddraper wrote:
| So... given CAP, which one did they give up?
| johnrob wrote:
| I'd wager that the algorithm is slightly eager to throw a
| consistency error if it's unable to verify across partitions.
| Since the caller is naturally ready for this error, it's
| likely not a problem. So in short it's the P :)
| alanyilunli wrote:
| Shouldn't that be the A then? Since the network partition
| is still there but availability is non-guaranteed.
| johnrob wrote:
| Yes, definitely. Good point (I was knee-jerk assuming the
| A is always chosen and the real "choice" is between C and
| P).
| rhaen wrote:
| Well, P isn't really much of a choice, I don't think you
| can opt out of acts of god.
| fwip wrote:
| You can design to minimize P, though. For instance, if
| you have all the services running on the same physical
| box, and make people enter the room to use it instead of
| over the Internet, "partition" becomes much less likely.
| (This example is a bit silly.)
|
| But you're right, if you take a broad view of P, the
| choice is really between consistency and availability.
| btown wrote:
| https://tqdev.com/2024-the-p-in-cap-is-for-performance is
| a really interesting take on this as a response to
| https://blog.dtornow.com/the-cap-theorem.-the-bad-the-bad-th...
| - essentially, the only way to get CA is if
| you're willing to say that every request will succeed
| eventually, but it might take an unbounded amount of time
| for partitions to heal, and you have to be willing to
| wait indefinitely for that to happen. Which can indeed
| make sense for asynchronous messaging, but not for real-
| time applications as we think about them in the modern
| day. In practice, if you're talking about CAP for high-
| performance systems, you're choosing either CP or AP.
| moralestapia wrote:
| A tiny bit of availability, unnoticeable at web scale.
| the_arun wrote:
| I thought they've implemented optimistic locking now to
| coordinate concurrent writes. How does that change anything in
| CAP?
| paulddraper wrote:
| The C stands for Consistency.
| nimih wrote:
| Based on my general experience with S3, they jettisoned A
| years ago (or maybe never had it).
| offmycloud wrote:
| If the default ETag algorithm for non-encrypted, non-multipart
| uploads in AWS is a plain MD5 hash, is this subject to failure
| for object data with MD5 collisions?
|
| I'm thinking of a situation in which an application assumes that
| different (possibly adversarial) user-provided data will always
| generate a different ETag.
| revnode wrote:
| MD5 hash collisions are unlikely to happen at random. The
| defect is that you can make them happen on purpose, which
| makes MD5 useless for security.
| aphantastic wrote:
| Sure, but theoretically you could have a system where a
| distributed log of user-generated content is built via this
| CAS/MD5 primitive. A malicious actor could craft the data
| such that entries are dropped.
| UltraSane wrote:
| The default ETag is used to detect bit errors, and MD5 is
| fine for that. S3 does support using SHA-256 instead.
| CobrastanJorji wrote:
| With Google Cloud Storage, you can solve this by conditionally
| writing based on the "generation number" of the object, which
| always increases with each new write, so you can know whether
| the object has been overwritten regardless of its contents. I
| think Azure also has an equivalent.
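|
| A sketch of the GCS variant (hypothetical bucket and object
| names; uses the google-cloud-storage library, whose
| if_generation_match parameter is the documented precondition):
|
|   from google.cloud import storage
|
|   client = storage.Client()
|   blob = client.bucket("my-bucket").blob("counter.txt")
|
|   blob.reload()  # fetch current metadata, including the generation
|   data = blob.download_as_bytes(if_generation_match=blob.generation)
|
|   # Raises a 412 PreconditionFailed if the object was rewritten
|   # (i.e. its generation changed) since we read it.
|   blob.upload_from_string(
|       str(int(data) + 1).encode(),
|       if_generation_match=blob.generation,
|   )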
| tonymet wrote:
| Good example of how a feature that's simple on the surface (a
| header comparison) requires tremendous complexity and capacity
| on the backend.
| akira2501 wrote:
| S3 is rated as "durable" as opposed to "best effort." It has
| lots of interesting guarantees as a result.
| tonymet wrote:
| Also they are faithful to their consistency commitments
| gravitronic wrote:
| First thing I thought when I saw the headline was "oh! I should
| tell Sirupsen"
| JoshTriplett wrote:
| It's also possible to enforce the use of conditional writes:
| https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3...
|
| My biggest wishlist item for S3 is the ability to enforce that
| an object's name matches the hash of its contents. (With a
| modern hash considered secure, not MD5 or SHA1, though it isn't
| supported for those either.) That would make it much easier to
| build content-addressable storage.
| jiggawatts wrote:
| That will probably never happen because of the fundamental
| nature of blob storage.
|
| Individual objects are split into multiple blocks, each of
| which can be stored independently on different underlying
| servers. Each can see its own block, but not any other block.
|
| Calculating a hash like SHA256 would require a sequential scan
| through all blocks. This _could_ be done with a minimum of
| network traffic if instead of streaming the bytes to a central
| server to hash, the _hash state_ is forwarded from block server
| to block server in sequence. Still though, it would be a very
| slow serial operation that could be fairly chatty too if there
| are many tiny blocks.
|
| What _could_ work would be to use a Merkle tree hash
| construction where some of the subdivision boundaries match the
| block sizes.
| losteric wrote:
| Why does the architecture of blob storage matter? The hash can
| be calculated as data streams in for the first write, before
| data gets dispersed into multiple physically stored blocks.
| willglynn wrote:
| It is common to use multipart uploads for large objects,
| since this both increases throughput and decreases latency.
| Individual part uploads can happen in parallel and complete
| in any sequence. There's no architectural requirement that
| an entire object pass through a single system on either
| S3's side or on the client's side.
| texthompson wrote:
| Why would you PUT an object, then download it again to a
| central server in the first place? If a service is accepting
| an upload of the bytes, it is already doing a pass over all
| the bytes anyway. It doesn't seem like a ton of overhead to
| calculate SHA-256 in 4096-byte chunks as the upload
| progresses. I suspect that sort of calculation would happen
| anyway.
| danielheath wrote:
| S3 supports multipart uploads which don't necessarily send
| all the parts to the same server.
| texthompson wrote:
| Why does it matter where the bytes are stored at rest?
| Isn't everything you need for SHA-256 just the results of
| the SHA-256 algorithm on every 4096-byte block? I think
| you could just calculate that as the data is streamed in.
| jiggawatts wrote:
| The data is not necessarily "streamed" in! That's a
| significant design feature to allow _parallel_ uploads of
| a single object using many parts ("blocks"). See:
| https://docs.aws.amazon.com/AmazonS3/latest/API/API_CreateMu...
| Dylan16807 wrote:
| > Isn't everything you need for SHA-256 just the results
| of the SHA-256 algorithm on every 4096-byte block?
|
| No, you need the hash of the previous block before you
| can start processing the next block.
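|
| A quick illustration of why order matters: SHA-256 chains
| state from block to block, so absorbing the same chunks in
| sequence reproduces the whole-object digest, but no other
| order does.
|
|   import hashlib
|
|   data = b"A" * 10_000
|   h = hashlib.sha256()
|   for i in range(0, len(data), 4096):  # chunks must arrive in order
|       h.update(data[i:i + 4096])
|
|   assert h.hexdigest() == hashlib.sha256(data).hexdigest()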
| willglynn wrote:
| You're right, and in fact S3 does this with the `ETag:`
| header... in the simple case.
|
| S3 also supports more complicated cases where the entire
| object may not be visible to any single component while it
| is being written, and in those cases, `ETag:` works
| differently.
|
| > * Objects created by the PUT Object, POST Object, or Copy
| operation, or through the AWS Management Console, and are
| encrypted by SSE-S3 or plaintext, have ETags that are an
| MD5 digest of their object data.
|
| > * Objects created by the PUT Object, POST Object, or Copy
| operation, or through the AWS Management Console, and are
| encrypted by SSE-C or SSE-KMS, have ETags that are not an
| MD5 digest of their object data.
|
| > * If an object is created by either the Multipart Upload
| or Part Copy operation, the ETag is not an MD5 digest,
| regardless of the method of encryption. If an object is
| larger than 16 MB, the AWS Management Console will upload
| or copy that object as a Multipart Upload, and therefore
| the ETag will not be an MD5 digest.
|
| https://docs.aws.amazon.com/AmazonS3/latest/API/API_Object.h...
| Salgat wrote:
| Isn't that the point of the metadata? Calculate the hash
| ahead of time and store it in the metadata as part of the
| atomic commit for the blob (at least for S3).
| flakes wrote:
| You have just re-invented IPFS!
| https://en.m.wikipedia.org/wiki/InterPlanetary_File_System
| cmeacham98 wrote:
| Is there any reason you can't enforce that restriction on your
| side? Or are you saying you want S3 to automatically set the
| name for you based on the hash?
| JoshTriplett wrote:
| > Is there any reason you can't enforce that restriction on
| your side?
|
| I'd like to set IAM permissions for a role, so that that role
| can add objects to the content-addressable store, but only if
| their name matches the hash of their content.
|
| > Or are you saying you want S3 to automatically set the name
| for you based on the hash?
|
| I'm happy to name the files myself, if I can get S3 to
| enforce that. But sure, if it were easier, I'd be thrilled to
| have S3 name the files by hash, and/or support retrieving
| files by hash.
| mdavidn wrote:
| I think you can presign PutObject calls that validate a
| particular SHA-256 checksum. An API endpoint, e.g. in a
| Lambda, can effectively enforce this rule. It unfortunately
| won't work on multipart uploads except on individual parts.
| UltraSane wrote:
| The hash of multipart uploads is simply the hash of all
| the part hashes. I've been able to replicate it.
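|
| For the classic multipart ETag, the widely documented recipe is
| MD5 over the concatenated binary MD5s of each part, suffixed
| with the part count; the SHA-256 composite checksum has the
| same shape. A sketch, assuming you know the part size used at
| upload time:
|
|   import hashlib
|
|   def multipart_etag(data: bytes, part_size: int) -> str:
|       """Reproduce S3's ETag for an unencrypted multipart upload."""
|       digests = [
|           hashlib.md5(data[i:i + part_size]).digest()
|           for i in range(0, len(data), part_size)
|       ]
|       combined = hashlib.md5(b"".join(digests)).hexdigest()
|       return f'"{combined}-{len(digests)}"'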
| thayne wrote:
| But in order to do that you need to already know the
| contents of the file.
|
| I suppose you could have some API to request a signed url
| for a certain hash, but that starts getting complicated,
| especially if you need support for multi-part uploads,
| which you probably do.
| JoshTriplett wrote:
| Unfortunately, last I checked, the list of headers you're
| allowed to enforce for pre-signing does not include the
| hash.
| anotheraccount9 wrote:
| Could you save the hash in a metadata field on the object and
| compare against that?
| texthompson wrote:
| That's interesting. Would you want it to be something like a
| bucket setting: "any time an object is uploaded, don't let the
| write complete unless S3 verifies, with a pre-defined hash
| function (like SHA-256), that the object's name matches the
| object's contents"?
| BikiniPrince wrote:
| You can already put with a sha256 hash. If it fails it just
| returns an error.
| UltraSane wrote:
| S3 has supported SHA-256 as a checksum algo since 2022. You can
| calculate the hash locally and then specify that hash in the
| PutObject call. S3 will calculate the hash and compare it with
| the hash in the PutObject call and reject the Put if they
| differ. The hash and algo are then stored in the object's
| metadata. You simply also use the SHA-256 hash as the key for
| the object.
|
| https://aws.amazon.com/blogs/aws/new-additional-checksum-alg...
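|
| A sketch of that flow (hypothetical bucket name; boto3's
| ChecksumSHA256 parameter expects the digest base64-encoded):
|
|   import base64
|   import hashlib
|   import boto3
|
|   s3 = boto3.client("s3")
|   body = b"hello, content-addressed world"
|   digest = hashlib.sha256(body).digest()
|
|   s3.put_object(
|       Bucket="my-bucket",
|       Key=digest.hex(),  # the hash itself is the object's key
|       Body=body,
|       ChecksumSHA256=base64.b64encode(digest).decode(),
|   )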
| thayne wrote:
| Unfortunately, for a multipart upload it isn't a hash of the
| total object, it is a hash of the hashes of each part, which
| is a lot less useful, especially if you don't know how the
| file was partitioned during upload.
|
| And even if it were for the whole file, it isn't used for the
| ETag, so it can't be used for conditional PUTs.
|
| I had a use case where this looked really promising, then I
| ran into the multipart upload limitations, and ended up using
| my own custom metadata for the sha256sum.
| vdm wrote:
| Don't the SDKs take care of computing the multi-part
| checksum during upload?
|
| > To create a trailing checksum when using an AWS SDK,
| populate the ChecksumAlgorithm parameter with your
| preferred algorithm. The SDK uses that algorithm to
| calculate the checksum for your object (or object parts)
| and automatically appends it to the end of your upload
| request. This behavior saves you time because Amazon S3
| performs both the verification and upload of your data in a
| single pass.
|
| https://docs.aws.amazon.com/AmazonS3/latest/userguide/checki...
| tedk-42 wrote:
| It does, and has a good default. An issue I've come across,
| though: if you have the file locally and want to check the
| ETag value, you have to compute it locally first and then
| compare it to the S3-stored object's.
| vdm wrote:
| https://github.com/peak/s3hash
|
| It would be nice if this got updated for Additional
| Checksums.
| vdm wrote:
| Ways to control etag/Additional Checksums without
| configuring clients:
|
| CopyObject writes a single-part object and can read from a
| multipart object, as long as the parts total less than the
| 5 gibibyte limit for a single part.
|
| For future writes, an s3:ObjectCreated:CompleteMultipartUpload
| event can trigger CopyObject, or else a defragmentation to
| policy-size parts. Boto's copy() with multipart_chunksize
| configured is the most convenient implementation; other SDKs
| lack an equivalent.
|
| For past writes, existing multipart objects can be selected
| from an inventory by filtering for ETag values longer than 32
| characters. Dividing object size by part size might hint at
| whether the part size matches policy.
| vdm wrote:
| > Dividing object size by part size
|
| Correction: and also part _quantity_ (parsed from etag)
| for comparison
| infogulch wrote:
| If parts are aligned on a 1024-byte boundary and you know
| each part's start offset, it should be possible to use the
| internals of a BLAKE3 tree to get the final hash of all the
| parts together even as they're uploaded separately.
| https://github.com/C2SP/C2SP/blob/main/BLAKE3.md#13-tree-has...
|
| Edit: This is actually already implemented in the Bao
| project, which exploits the structure of the BLAKE3 merkle
| tree to offer cool features like streaming verification and
| verifying slices of a file, as I described above:
| https://github.com/oconnor663/bao#verifying-slices
| josnyder wrote:
| While it can't be done server-side, this can be done
| straightforwardly in a signer service, and the signer doesn't
| need to interact with the payloads being uploaded. In other
| words, a tiny signer can act as a control plane for massive
| quantities of uploaded data.
|
| The client sends the request headers (including the x-amz-
| content-sha256 header) to the signer, and the signer responds
| with a valid S3 PUT request (minus body). The client takes the
| signer's response, appends its chosen request payload, and
| uploads it to S3. With such a system, you can implement a
| signer in a lambda function, and the lambda function enforces
| the content-addressed invariant.
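|
| A sketch of such a signer (hypothetical bucket and region;
| relies on botocore's SigV4 signer honoring a caller-supplied
| x-amz-content-sha256 header):
|
|   import boto3
|   from botocore.auth import SigV4Auth
|   from botocore.awsrequest import AWSRequest
|
|   def sign_content_addressed_put(sha256_hex: str) -> dict:
|       """Sign `PUT /<hash>` pinned to that payload, sight unseen."""
|       creds = boto3.Session().get_credentials()
|       req = AWSRequest(
|           method="PUT",
|           url=f"https://my-bucket.s3.us-east-1.amazonaws.com/{sha256_hex}",
|           headers={"x-amz-content-sha256": sha256_hex},
|       )
|       SigV4Auth(creds, "s3", "us-east-1").add_auth(req)
|       # The signature covers x-amz-content-sha256, so S3 rejects
|       # any body whose SHA-256 differs from the object's name.
|       return dict(req.headers)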
|
| Unfortunately it doesn't work natively with multipart: while
| SigV4+S3 enables you to enforce the SHA256 of each individual
| part, you can't enforce the SHA256 of the entire object. If you
| really want, you can invent your own tree hashing format atop
| SHA256, and enforce content-addressability on that.
|
| I have a blog post [1] that goes into more depth on signers in
| general.
|
| [1]
| https://josnyder.com/blog/2024/patterns_in_s3_data_access.ht...
| JoshTriplett wrote:
| That's incredibly interesting, thank you! That's a really
| creative approach, and it looks like it might work for me.
| Sirupsen wrote:
| To avoid any dependencies other than object storage, we've been
| making use of this in our database (turbopuffer.com) for
| consensus and concurrency control since day one. Been waiting for
| this since the day we launched on Google Cloud Storage ~1 year
| ago. Our bet that S3 would get it in a reasonable time-frame
| worked out!
|
| https://turbopuffer.com/blog/turbopuffer
| amazingamazing wrote:
| Interesting that what's basically an ad is the top comment -
| it's not like this is open source or anything - you can't even
| use it immediately (you have to apply for access). Totally
| proprietary. At least Elasticsearch is AGPL, to say nothing of
| OpenSearch, which also supports use of S3.
| viraptor wrote:
| Someone made an informed technical bet that worked out.
| Sounds like HN material to me. (Also, is it really a useful
| ad if you can't easily use the product?)
| amazingamazing wrote:
| Worked out how? There's no implementation. It's just
| conjecture.
| hedora wrote:
| Pretty much all other S3 implementations (including open
| source ones) support this or equivalent primitives, so
| this is great for interoperability with existing
| implementations.
| viraptor wrote:
| It's right there:
|
| > Our bet that S3 would get it in a reasonable time-frame
| worked out!
| amazingamazing wrote:
| How? This is a technical forum. Unless you're saying any
| consumer of S3 can now spam links to their product on
| this thread with impunity. (Hey, maybe they're using CAS.)
| richardlblair wrote:
| Oh look, someone is mad on the internet about something
| silly.
| jauntywundrkind wrote:
| https://github.com/slatedb/slatedb will, I expect, use this
| at some point. An object-backed DB, which is open source.
| benesch wrote:
| Yes! I'm actively working on it, in fact. We're waiting on
| the next release of the Rust `object_store` crate, which
| will bring support for S3's native conditional puts.
|
| If you want to follow along:
| https://github.com/slatedb/slatedb/issues/164
| ramraj07 wrote:
| No one owes anyone open source. If they can make the business
| case work or if it works in their favor, sure.
| jrochkind1 wrote:
| I don't mind hearing another developer's use case for this
| feature, even if it's commercial proprietary software.
|
| It's no longer top comment, which is fine.
| CobrastanJorji wrote:
| I'm glad that bet worked out for you, but what made you think,
| one year ago, that S3 would introduce it soon, when it hadn't
| for the previous 15 years?
| CubsFan1060 wrote:
| I feel dumb for asking this, but can someone explain why this is
| such a big deal? I'm not quite sure I am grokking it yet.
| Sirupsen wrote:
| The short of it is that building a database on top of object
| storage has generally required a complicated, distributed
| system for consensus/metadata. CAS makes it possible to build
| these big data systems without any other dependencies. This is
| a win for simplicity and reliability.
| CubsFan1060 wrote:
| Thanks! Do they mention when the comparison is done? Is it
| before, after, or during an upload? (For instance, if I have
| a 4 TB file in a multipart upload, would I only find out it
| failed once the whole file is uploaded?)
| poincaredisk wrote:
| I imagine, for it to make sense, that the comparison is
| done at the last possible moment, before atomically
| swapping the file contents.
| Nevermark wrote:
| I can imagine it might be useful to make this a choice
| for databases with high frequency small swaps and
| occasional large ones.
|
| 1) default, load-compare-&-swap for small fast
| load/swaps.
|
| 2) optional, compare-load-&-swap to allow a large load to
| pass its compare and cut in front of all the fast small
| swaps that would otherwise create an un-hittable moving
| target during its long load for its own compare.
|
| 3) If the load itself was stable relative to the compare,
| then it could be pre-loaded and swapped into a holding
| location, followed by as many fast compare-&-swaps as
| needed to get it into the right location.
| lxgr wrote:
| Practically, they could do both: Do an early reject of a
| given POST in case the ETag does not match, but re-
| validate this just before swapping out the objects (and
| committing to considering the given request as the
| successful one globally).
|
| That said, I'm not sure if common HTTP libraries look at
| response headers before they're done sending a request
| body, or if that's even allowed/possible in HTTP? It
| seems feasible at a first glance with chunked encoding,
| at least.
|
| Edit: Upon looking a bit, it seems that informational
| response codes, e.g. 100 (Continue) in combination with
| Expect: 100-continue in the request, could enable just
| that and avoid an extra GET with If-Match.
| timmg wrote:
| (I assume) it will fail if the ETag doesn't match -- the
| instant it gets the header.
|
| The main point of it is: I have an object that I want to
| mutate. I _think_ I have the latest version in memory. So I
| update in memory and upload it to S3 _with the eTag of the
| version I have_ and tell it to only commit _if that is the
| latest version_. If it "fails", I re-download the object,
| re-apply the mutation, and try again.
| lxgr wrote:
| If my memory of parallel algorithms class serves me right, you
| can build any synchronization algorithm on top of compare-and-
| swap as an atomic primitive.
|
| As a (horribly inefficient, in case of non-trivial write
| contention) toy example, you could use S3 as a lock-free
| concurrent SQLite storage backend: Reads work as expected by
| fetching the entire database and satisfying the operation
| locally; writes work like this (sketch below):
|
| - Download the current database copy
|
| - Perform your write locally
|
| - Upload it back using "Put-If-Match" and the pre-edit copy as
| the matched object.
|
| - If you get success, consider the transaction successful.
|
| - If you get failure, go back to step 1 and try again.
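|
| A minimal boto3 sketch of that loop (hypothetical bucket and
| key; assumes a boto3 recent enough to expose IfMatch on
| put_object):
|
|   import boto3
|   from botocore.exceptions import ClientError
|
|   s3 = boto3.client("s3")
|
|   def cas_update(bucket: str, key: str, mutate) -> None:
|       """Optimistic read-modify-write; retries until the put wins."""
|       while True:
|           obj = s3.get_object(Bucket=bucket, Key=key)
|           new_body = mutate(obj["Body"].read())
|           try:
|               s3.put_object(Bucket=bucket, Key=key, Body=new_body,
|                             IfMatch=obj["ETag"])
|               return
|           except ClientError as e:
|               status = e.response["ResponseMetadata"]["HTTPStatusCode"]
|               if status != 412:  # 412: someone else wrote first; retry
|                   raise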
| jayd16 wrote:
| When you upload a change you can know you're not clobbering
| changes you never saw.
| ramraj07 wrote:
| Brilliant single line that is better than every other
| description above. Kudos.
| papichulo2023 wrote:
| I think it's called write-after-write (WAW), if I remember
| correctly.
| CobrastanJorji wrote:
| It is often very important to know, when you write an object,
| what the previous state was. Say you sell plushies and you have
| 100 plushies in a warehouse. You create a file
| "remainingPlushies.txt" that stores "100". If somebody buys a
| plushie, you read the file, and if it's bigger than 0, you
| subtract 1, write the new version of the file, and okay the
| sale.
|
| Without conditional writes, two instances of your application
| might both read "100", both subtract 1, and both write "99". If
| they checked the file afterward, both would think everything
| was fine. But things aren't fine, because you've actually sold
| two.
|
| The other cloud storage providers have had these sorts of
| conditional write features since basically forever, and it's
| always been really weird that S3 has lacked them.
| rrr_oh_man wrote:
| Could anybody explain for the uninitiated?
| msoad wrote:
| It ensures that when you try to upload (or "put") a new version
| of a file, the operation only succeeds if the file on the
| server still has the exact version (ETag) you specify. If
| someone else has updated the file in the meantime, your upload
| is blocked to prevent overwriting their changes.
|
| This is especially useful in scenarios where multiple users or
| processes are working on the same data, as it helps maintain
| consistency and avoids accidental overwrites.
|
| This uses the same mechanism as HTTP's `If-Match` header,
| so it's easier to implement/learn.
| rrr_oh_man wrote:
| Thank you! That was extremely helpful (and written in a way
| that is easy to understand)!
| wanderingmind wrote:
| Does this mean that, in theory, we'll be able to manage
| multiple concurrent writes/updates to S3 without having to use
| new solutions like Regatta [1], which launched recently?
|
| https://news.ycombinator.com/item?id=42174204
| huntaub wrote:
| Here's how I would think about this. Regatta isn't the best way
| to _add_ synchronization primitives to S3 if you're already
| using the S3 API and able to change your code. Regatta is most
| useful when you need a local disk, or a higher performance
| version of S3. In this case, the addition of these new
| primitives actually just makes Regatta work better for our
| customers -- because we get to achieve even stronger
| consistency.
| dvektor wrote:
| [rejected] error: failed to push some refs to remote repository
|
| Finally we can have this with s3 :)
| mdaniel wrote:
| Relevant: https://github.com/awslabs/git-remote-s3#readme
| https://news.ycombinator.com/item?id=41887004
| vlovich123 wrote:
| I implemented that extension in R2 at launch IIRC. Thanks for
| catching up & helping move distributed storage applications a
| meaningful step forward. Intended sincerely. I'm sure adding this
| was non-trivial for a complex legacy codebase like that.
| ipython wrote:
| I can't wait to see what abomination Corey Quinn can come up with
| now given this new primitive! (see previous work abusing Route53
| as a database:
| https://www.lastweekinaws.com/blog/route-53-amazons-premier-...)
| stevefan1999 wrote:
| So... are we closer to getting to use S3 as a... you guessed
| it... a database? With CAS we can probably get a basic level of
| atomicity, and S3 itself is pretty durable; now we have to deal
| with consistency and isolation... although S3 branded itself as
| "eventually consistent"...
| mr_toad wrote:
| People who want all those features use something like Delta
| Lake on top of object storage.
| User23 wrote:
| There was a great deal of interest in gossip protocols,
| eventual consistency, and such at Amazon in the mid oughts. So
| much so that they hired a certain Cornell professor along with
| the better part of his grad students to build out those
| technologies.
| gynther wrote:
| S3 has been strongly consistent for 4 years now.
| https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-rea...
| amazingamazing wrote:
| Ironically, with this and Lambda you could make a serverless
| SQLite by mapping pages to objects, using HTTP range reads to
| read the DB and Lambda to translate queries into writes to the
| appropriate pages via CAS. Prior to this it would have required
| a server to handle concurrent writers, making the whole thing a
| nonstarter for "serverless".
|
| Too bad performance would be terrible without a caching layer
| (EBS).
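|
| The page-granular reads are already there; a sketch of
| fetching one 4096-byte SQLite page with a ranged GET
| (hypothetical bucket and key):
|
|   import boto3
|
|   s3 = boto3.client("s3")
|   PAGE_SIZE = 4096
|
|   def read_page(bucket: str, key: str, page_no: int) -> bytes:
|       start = page_no * PAGE_SIZE
|       resp = s3.get_object(
|           Bucket=bucket,
|           Key=key,
|           Range=f"bytes={start}-{start + PAGE_SIZE - 1}",
|       )
|       return resp["Body"].read()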
| captn3m0 wrote:
| For read-heavy workloads, you could cache the results at
| CloudFront. Maybe we will someday see WordPress-on-Lambda-to-
| SQLite-over-S3.
| m_d_ wrote:
| s3fs's https://github.com/fsspec/s3fs/pull/917 was in response to
| the IfNoneMatch feature from the summer. How would people imagine
| this new feature being surfaced in a filesystem abstraction?
| grahamj wrote:
| bender_neat.gif
| maglite77 wrote:
| Noting that Azure Blob Storage supports ETag-based optimistic
| concurrency as well (via If-Match conditions) [1], how does
| this differ? Or is it the same feature?
|
| [1]: https://learn.microsoft.com/en-us/azure/storage/blobs/concur...
| simonw wrote:
| It's the same feature. Google Cloud Storage has it too:
| https://cloud.google.com/storage/docs/request-preconditions#...
| paulsutter wrote:
| What's amazing is that it took them so long to add these
| functions
| serbrech wrote:
| Why is standard ETag support making the front page?
| thayne wrote:
| Now if only you had more control over the ETag, so you could use
| a sha256 of the total file (even for multi-part uploads), or a
| version counter, or a global counter from an external system, or
| a logical hash of the content as opposed to a hash of the bytes.
| vytautask wrote:
| MinIO, an open-source implementation of Amazon S3, has had this
| for almost two years (relevant post:
| https://blog.min.io/leading-the-way-minios-conditional-write...).
| Strangely, Amazon is only catching up now.
| topspin wrote:
| That's not "strange" to me. Object storage has been a long time
| coming, and it's still being figured out: the entirely typical
| process of discovering useful and feasible primitives that
| expand applicability to more sophisticated problems. This is
| obviously going to occur first in smaller and/or younger, more
| agile implementations, whereas AWS has the problem of
| implementing this at pretty much the largest conceivable scale
| with zero risk. The lag is, therefore, entirely unsurprising.
| aseipp wrote:
| It's not surprising at all. The scale of AWS, in particular S3,
| is nearly unfathomable, and the kind of solutions they need for
| "simple" things are totally different at that size. S3 was
| doing 1.1million requests a second back in 2013.[1]
|
| I wouldn't be surprised if they saw over 100 million requests
| per second globally by now. That's 100 million requests _a
| second_ that need strong read-your-write consistency and
| atomicity at global scale. The number of pieces they had to
| move into place for this to happen is probably quite the
| engineering tale.
|
| [1] https://aws.amazon.com/blogs/aws/amazon-s3-two-trillion-obje...
| lttlrck wrote:
| Isn't this compare-and-set rather than compare-and-swap?
| torginus wrote:
| Ah, so it's not only me that uses AWS primitives for hackily
| implementing all sorts of synchronization primitives.
|
| My other favorite pattern is implementing a pool of workers by
| querying EC2 instances with a certain tag in a stopped state
| and starting them. Starting the instance can succeed only once
| - that means I managed to snatch the machine. If it fails, I
| try again, grabbing another one.
|
| This is one of those things I never advertised out of
| professional shame, but it works: it's bulletproof, dead
| simple, and does not require additional infra.
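|
| A sketch of that pattern (hypothetical tag name; per the
| parent's experience only one concurrent start performs the
| stopped-to-pending transition, so checking PreviousState and
| treating errors as a lost race should suffice):
|
|   import boto3
|   from botocore.exceptions import ClientError
|
|   ec2 = boto3.client("ec2")
|
|   def claim_worker() -> str | None:
|       """Claim one stopped worker; returns its id, or None."""
|       found = ec2.describe_instances(Filters=[
|           {"Name": "tag:role", "Values": ["worker"]},
|           {"Name": "instance-state-name", "Values": ["stopped"]},
|       ])
|       for res in found["Reservations"]:
|           for inst in res["Instances"]:
|               try:
|                   resp = ec2.start_instances(
|                       InstanceIds=[inst["InstanceId"]])
|                   prev = resp["StartingInstances"][0]["PreviousState"]
|                   if prev["Name"] == "stopped":  # our transition: won
|                       return inst["InstanceId"]
|               except ClientError:
|                   pass  # e.g. IncorrectInstanceState: lost the race
|       return None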
| _zoltan_ wrote:
| This actually sounds interesting. Do you pre-create the workers
| beforehand and then just keep them in a stopped state?
| torginus wrote:
| Yeah. One of the goals was startup time, so it made sense to
| pre-create them. In practice we never ran out of free machines
| (and if we did, I have a CDK script to make more), and
| infinite scaling is a pain in the butt anyway due to having
| to manage subnets etc.
|
| Cost-wise we're only paying for the EBS volumes for the
| stopped instances, which are like 4 GB each, so they cost
| practically nothing; we spend less than a dollar per month
| for the whole bunch.
| zild3d wrote:
| Warm pools are a supported feature in AWS on auto scaling
| groups. Works as you're describing (have a pool of
| instances in stopped state ready to use, only pay for EBS
| volume if relevant)
| https://aws.amazon.com/blogs/compute/scaling-your-applicatio...
| merb wrote:
| I always thought that stopped instances would cost money as
| well?!
| torginus wrote:
| You're only paying for the hard drive (and the VPC stuff,
| if you want to be pedantic). The downside is that if you
| try to start your instance, they might not start if AWS
| doesn't have the capacity (rare but have seen it happen,
| particularly with larger, more exotic instances.)
| rfoo wrote:
| > we spend less than a dollar per month for the whole bunch
|
| This does not change the point, I'm just being pedantic,
| but:
|
| 4GB of gp3 EBS costs $0.32 per month; assuming a 50%
| discount (not unusual), less than a dollar gives you only...
| 6 instances.
| belter wrote:
| If you use hourly billed machines... sounds like the world's
| most expensive semaphore :-)
| torginus wrote:
| except we are actually using them :)
| belter wrote:
| Just don't return them before the hour and start a different
| one again, because otherwise, within the hour, you will be
| billed for hundreds of hours... if they are of the type
| billed by the hour.
| messe wrote:
| EC2 bills by the second.
| belter wrote:
| Some...
|
| "Your Amazon EC2 usage is calculated by either the hour or
| the second based on the size of the instance, operating
| system, and the AWS Region where the instances are
| launched" - https://repost.aws/knowledge-
| center/ec2-instance-hour-billin...
|
| https://aws.amazon.com/ec2/pricing/on-demand/
| QuinnyPig wrote:
| macOS instances appear to be the sole remaining exception
| since RHEL got on board.
| redeux wrote:
| Thanks Corey. Always nice to get the TL;DR from an
| authority on the subject.
| williamdclt wrote:
| What would you say would be the "clean" way to implement a pool
| of workers (using EC2 instances too)?
| ndjdjddjsjj wrote:
| etcd?
| torginus wrote:
| Not sure. Probably an EKS cluster with a job scheduler
| pod that creates jobs via the batch API (the scheduler pod
| might be replaced by a lambda). Another possibility is
| something cooked up with a lambda creating EC2 instances via
| CDK, with the whole thing kept track of by a DynamoDB table.
|
| The first one is probably cleaner (though I don't like it: it
| means I need the instance to be a Kubernetes node, and
| that comes with a bunch of baggage).
| Cthulhu_ wrote:
| Autoscaling and task queue based workloads, if my cloud
| theory is still relevant.
| twodave wrote:
| Agreed. Scaling based on the length of the queue, up to
| some maximum.
| giovannibonetti wrote:
| Even better, based on queue latency instead of length
| jcrites wrote:
| The single best metric I've found for scaling things like
| this is the percent of concurrent capacity that's in use.
| I wrote about this in a previous HN comment:
| https://news.ycombinator.com/item?id=41277046
|
| Scaling on things like the length of the queue doesn't
| work very well at all in practice. A queue length of 100
| might be horribly long in some workloads and
| insignificant in others, so scaling on queue length
| requires a lot of tuning that must be adjusted over time
| as the workload changes. Scaling based on percent of
| concurrent capacity can work for most workloads, and
| tends to remain stable over time even as workloads
| change.
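|
| A toy sketch of sizing on that metric (names are made up;
| in_flight and per_worker would come from your own telemetry):
|
|   def desired_workers(current: int, in_flight: int,
|                       per_worker: int, target: float = 0.6) -> int:
|       """Size the pool so concurrency utilization lands near target."""
|       utilization = in_flight / (current * per_worker)
|       return max(1, round(current * utilization / target))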
| anonymousDan wrote:
| Would be interesting to understand how they've implemented it
| and whether there is any perf impact on other API calls.
| londons_explore wrote:
| So we can now implement S3-as-RAM for a worldwide million-core
| Linux VM?
| juggli wrote:
| finally
| spprashant wrote:
| I had no idea people rely on S3 beyond dumb storage. It almost
| feels like people are trying to build out a distributed OLAP
| database in the reverse direction.
| amne wrote:
| 1. SELECT ... INTO OUTFILE S3
|
| 2. glue jobs to partition by some columns reporting uses
|
| 3. query with athena
|
| 4. ???
|
| 5. profit (celebrate reduced cost)
|
| This thing costs a couple of dollars a month for ~500 GB of
| data. Snowflake wanted crazy amounts of money for the same thing.
___________________________________________________________________
(page generated 2024-11-26 23:01 UTC)