[HN Gopher] Garbage collection of object storage at scale
       ___________________________________________________________________
        
       Garbage collection of object storage at scale
        
       Author : ko_pivot
       Score  : 18 points
       Date   : 2025-05-10 13:23 UTC (3 days ago)
        
 (HTM) web link (www.warpstream.com)
 (TXT) w3m dump (www.warpstream.com)
        
       | juancn wrote:
        | Another possible mechanism for doing GC at scale on a
        | file/object store (a variation on the Asynchronous
        | Reconciliation described in the article) is a probabilistic
        | mark and sweep using bloom filters.
       | 
        | The mark phase can be done in parallel, building many bloom
        | filters from the live files/objects found.
       | 
        | The bloom filters are then merged (essentially or'ed
        | together), and a parallel sweep phase can use the merged
        | filter to answer the question: is this file/object live?
       | 
       | The bloom filter then answers either "No" with 100% certainty or
       | "Maybe" with some probability p that depends on the parameters
       | used for the bitset and the hash function family.
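        | 
        | A minimal sketch of that scheme (pure Python; the filter
        | size, hash construction, and delete() hook are illustrative
        | assumptions, not anything from the article):
        | 
        |   import hashlib
        | 
        |   class BloomFilter:
        |       def __init__(self, size_bits=1 << 24, num_hashes=7):
        |           self.size, self.k = size_bits, num_hashes
        |           self.bits = bytearray(size_bits // 8)
        | 
        |       def _positions(self, key):
        |           # Double hashing: derive k bit indexes from one digest.
        |           d = hashlib.sha256(key.encode()).digest()
        |           h1 = int.from_bytes(d[:8], "big")
        |           h2 = int.from_bytes(d[8:16], "big") | 1
        |           return [(h1 + i * h2) % self.size
        |                   for i in range(self.k)]
        | 
        |       def add(self, key):  # mark phase: record a live object
        |           for p in self._positions(key):
        |               self.bits[p // 8] |= 1 << (p % 8)
        | 
        |       def __contains__(self, key):
        |           return all(self.bits[p // 8] >> (p % 8) & 1
        |                      for p in self._positions(key))
        | 
        |       def merge(self, other):  # or the bitsets together
        |           self.bits = bytearray(
        |               a | b for a, b in zip(self.bits, other.bits))
        | 
        |   def sweep(all_keys, live, delete):
        |       for key in all_keys:
        |           if key not in live:  # "No" is certain: safe to delete
        |               delete(key)
        |           # "Maybe" (false-positive rate p): kept, even if it
        |           # is actually garbage, until a later pass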
        
         | cogman10 wrote:
         | What does the bloom filter solve?
         | 
          | The expensive portion of the mark and sweep for the object
          | store is the mark phase, not the storage of what's been
          | marked. 100s, 1000s, or even millions of live objects would
          | hardly take any space to keep in a remembered set.
         | 
         | On the other hand, querying the S3 bucket to list those 1M
         | objects would be expensive no matter how you store the results.
         | 
          | But this does tickle my brain. Perhaps something akin to
          | the generational hypothesis can be applied? Maybe it's the
          | case that very old, very young, or very untouched objects
          | are more likely to be garbage than not. If there's some way
          | to divide the objects up and only look at objects that are
          | in "probably need to be collected" regions, then you could
          | do fast minor sweeps semi-frequently and schedule more
          | expensive "really delete untracked stuff" sweeps
          | infrequently.
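          | 
          | A rough sketch of that split (the age cutoff and the
          | objects/is_live/delete interfaces are made up for
          | illustration):
          | 
          |   import time
          | 
          |   MINOR_AGE_DAYS = 7  # hypothetical cutoff for "young"
          | 
          |   def minor_sweep(objects, is_live, delete):
          |       # Frequent, cheap pass over young objects only,
          |       # the region assumed most likely to hold garbage.
          |       now = time.time()
          |       for obj in objects:
          |           age = (now - obj.created_at) / 86400
          |           if age <= MINOR_AGE_DAYS and not is_live(obj.key):
          |               delete(obj.key)
          | 
          |   def major_sweep(objects, is_live, delete):
          |       # Infrequent, expensive pass over everything,
          |       # including what the minor sweeps skipped.
          |       for obj in objects:
          |           if not is_live(obj.key):
          |               delete(obj.key)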
        
       | deathanatos wrote:
       | > _Why Not Just Use a Bucket Policy?_
       | 
        | I've heard these words _so many times_, it's refreshing to
        | see someone dig into why bucket policies aren't a cure-all.
       | 
       | As for "Why not use synchronous deletion?" -- regarding the
       | pitfall there, what about a WAL? I.e., you WAL the deletions you
       | want to perform into an object in the object store, perform the
       | deletions, and then delete the WAL. If you crash and find a WAL
       | file, you repeat the delete commands contained in the WAL.
       | 
       | (I've used this to handle this problem where some of the
       | deletions are mixed: i.e., some in an object store, some in a SQL
       | DB, etc. The object store is essentially being used as strongly
       | consistent storage.)
       | 
        | (Perhaps this is essentially the same as your "delayed
        | queue"? All I've got is an object store though, not a queue,
        | and it's a pretty useful hammer.)
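        | 
        | A minimal sketch of that WAL pattern over an S3-style API
        | (the bucket name, key layout, and use of boto3 here are
        | assumptions, not the actual setup described above):
        | 
        |   import json
        |   import boto3
        | 
        |   s3 = boto3.client("s3")
        |   BUCKET = "my-data"    # hypothetical bucket
        |   WAL_PREFIX = "wal/"   # hypothetical key layout
        | 
        |   def delete_with_wal(wal_id, keys):
        |       # 1. Durably record intent before deleting anything.
        |       s3.put_object(Bucket=BUCKET, Key=WAL_PREFIX + wal_id,
        |                     Body=json.dumps(keys).encode())
        |       # 2. Perform the deletions. S3 deletes are idempotent,
        |       # so replaying them after a crash is harmless.
        |       for key in keys:
        |           s3.delete_object(Bucket=BUCKET, Key=key)
        |       # 3. Only now discard the intent record.
        |       s3.delete_object(Bucket=BUCKET,
        |                        Key=WAL_PREFIX + wal_id)
        | 
        |   def recover():
        |       # On startup, replay any WAL left behind by a crash.
        |       resp = s3.list_objects_v2(Bucket=BUCKET,
        |                                 Prefix=WAL_PREFIX)
        |       for entry in resp.get("Contents", []):
        |           body = s3.get_object(Bucket=BUCKET,
        |                                Key=entry["Key"])["Body"]
        |           for key in json.loads(body.read()):
        |               s3.delete_object(Bucket=BUCKET, Key=key)
        |           s3.delete_object(Bucket=BUCKET, Key=entry["Key"])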
        
       ___________________________________________________________________
       (page generated 2025-05-13 23:00 UTC)