[HN Gopher] Garbage collection of object storage at scale
___________________________________________________________________
Garbage collection of object storage at scale
Author : ko_pivot
Score : 18 points
Date : 2025-05-10 13:23 UTC (3 days ago)
(HTM) web link (www.warpstream.com)
(TXT) w3m dump (www.warpstream.com)
| juancn wrote:
| Another possible mechanism for doing GC at scale (a variation on
| Asynchronous Reconciliation in the article) in some file/object
| store, is doing a probabilistic mark and sweep using bloom
| filters.
|
| The mark phase can be done in parallel building many bloom
| filters for the files/objects found.
|
| Then the bloom filters are merged (or'ed together essentially)
| and then a parallel sweep phase can use the bloom filter to
| answer the question: is this file/object live?
|
| The bloom filter then answers either "No" with 100% certainty or
| "Maybe" with some probability p that depends on the parameters
| used for the bitset and the hash function family.
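The probabilistic mark-and-sweep described above can be sketched in Python. This is a minimal illustration, not anything from the article or comment: the filter parameters, object names, and the in-memory "store" are all assumptions. The key property is that a merged Bloom filter never produces false negatives, so a "No" answer makes deletion safe, while "Maybe" objects are simply kept.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; parameters chosen arbitrarily for the sketch."""
    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0  # a Python int used as a bitset

    def _indexes(self, key):
        # Derive k indexes by salting the key with the hash number.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for idx in self._indexes(key):
            self.bits |= 1 << idx

    def might_contain(self, key):
        # False => definitely not added; True => maybe added.
        return all(self.bits & (1 << idx) for idx in self._indexes(key))

    def merge(self, other):
        # OR'ing works only for filters built with identical parameters.
        self.bits |= other.bits

# Mark phase: each worker builds a filter over the live objects it scans.
live_a = BloomFilter()
live_b = BloomFilter()
for obj in ["seg-001", "seg-002"]:
    live_a.add(obj)
live_b.add("seg-003")

# Merge the per-worker filters into one.
merged = BloomFilter()
merged.merge(live_a)
merged.merge(live_b)

# Sweep phase: only objects the filter definitively rejects are deleted.
all_objects = ["seg-001", "seg-002", "seg-003", "seg-dead-004"]
to_delete = [o for o in all_objects if not merged.might_contain(o)]
```

With these parameters the false-positive rate is tiny, but any false positive only delays a deletion; it never deletes a live object.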
| cogman10 wrote:
| What does the bloom filter solve?
|
 | The expensive portion of the mark and sweep for the object
 | store is the mark phase, not the storage of what's been marked.
 | 100s, 1000s, or even millions of live objects would hardly
 | take any space to keep in a remembered set.
 |
 | On the other hand, querying the S3 bucket to list those 1M
 | objects would be expensive no matter how you store the results.
|
 | But this does tickle my brain. Perhaps something akin to the
 | generational hypothesis can be applied? Maybe it's the case
 | that very old, very young, or very untouched objects are more
 | likely to be garbage than not. If there's some way to divide
 | the objects up and only look at objects that are in "probably
 | need to be collected" regions, then you could do minor fast
 | sweeps semi-frequently and schedule more expensive "really
 | delete untracked stuff" infrequently.
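The generational idea above could be sketched as a simple age-based partition. Everything here is hypothetical: the one-hour window and the `(key, created_at)` tuples are assumptions, and a real system would derive the thresholds from observed object lifetimes.

```python
import time

# Assumed threshold: objects younger than this are scanned by cheap,
# frequent "minor" sweeps; older objects only by the rare full sweep.
MINOR_WINDOW_S = 3600

def partition_by_age(objects, now=None):
    """Split (key, created_at_epoch_seconds) pairs into young and old keys."""
    now = now if now is not None else time.time()
    young, old = [], []
    for key, created in objects:
        (young if now - created < MINOR_WINDOW_S else old).append(key)
    return young, old

# Example: at t=5100, "a" (written at t=100) is old, "b" (t=5000) is young.
young, old = partition_by_age([("a", 100), ("b", 5000)], now=5100)
```

A minor sweep would then list and check only the `young` partition, keeping most collection cycles cheap.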
| deathanatos wrote:
| > _Why Not Just Use a Bucket Policy?_
|
 | I've heard these words _so many times_, it's refreshing to see
 | someone dig into why bucket policies aren't a cure-all.
|
| As for "Why not use synchronous deletion?" -- regarding the
| pitfall there, what about a WAL? I.e., you WAL the deletions you
| want to perform into an object in the object store, perform the
| deletions, and then delete the WAL. If you crash and find a WAL
| file, you repeat the delete commands contained in the WAL.
|
| (I've used this to handle this problem where some of the
| deletions are mixed: i.e., some in an object store, some in a SQL
| DB, etc. The object store is essentially being used as strongly
| consistent storage.)
|
 | (Perhaps this is essentially the same as your "delayed queue"?
 | All I've got is an object store though, not a queue, and it's a
 | pretty useful hammer.)
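The WAL-in-the-object-store approach described above could be sketched like this. The `FakeObjectStore`, the `wal/` key prefix, and the JSON encoding are all assumptions made for illustration; the point is only the ordering: record intent durably, delete, then retire the log, and replay any leftover log on startup.

```python
import json

class FakeObjectStore:
    """In-memory stand-in for an object store (an assumption for the sketch)."""
    def __init__(self):
        self.objects = {}
    def put(self, key, data):
        self.objects[key] = data
    def get(self, key):
        return self.objects[key]
    def delete(self, key):
        self.objects.pop(key, None)  # deleting a missing key is a no-op
    def list(self, prefix):
        return [k for k in self.objects if k.startswith(prefix)]

WAL_PREFIX = "wal/"  # hypothetical prefix for pending-delete logs

def delete_with_wal(store, wal_id, keys):
    # 1. Durably record the intended deletions before touching anything.
    store.put(WAL_PREFIX + wal_id, json.dumps(keys))
    # 2. Perform the deletions (idempotent, so a replay is harmless).
    for k in keys:
        store.delete(k)
    # 3. Only after all deletes succeed, retire the WAL entry.
    store.delete(WAL_PREFIX + wal_id)

def recover(store):
    # On startup: any surviving WAL means a crash mid-delete; replay it.
    for wal_key in store.list(WAL_PREFIX):
        for k in json.loads(store.get(wal_key)):
            store.delete(k)
        store.delete(wal_key)
```

Because the object store itself holds the log, no separate queue is needed; the cost is an extra PUT and DELETE per batch.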
___________________________________________________________________
(page generated 2025-05-13 23:00 UTC)