[HN Gopher] Copying collectors with block-structured heaps are unreliable
       ___________________________________________________________________
        
       Copying collectors with block-structured heaps are unreliable
        
       Author : ingve
       Score  : 38 points
       Date   : 2024-07-10 10:26 UTC (12 hours ago)
        
 (HTM) web link (wingolog.org)
 (TXT) w3m dump (wingolog.org)
        
       | chrisjj wrote:
       | > a form of soft reliability
       | 
       | A form of reliability that is unreliable, right? :)
        
       | bjourne wrote:
        | But this only happens if the copying collector's space is
        | growable and discontinuous. I think a better scheme is to just
        | use copying for the young generation and mark-and-sweep for the
        | older ones. Structuring the heap into blocks seems quite
        | misguided, imo.
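        | 
        | A mark-sweep old space frees dead objects in place and never
        | moves survivors. A minimal sketch of the sweep side in C;
        | structure and names are mine, not from any real collector:
        | 
        |     #include <stdbool.h>
        |     #include <stdlib.h>
        |     
        |     /* Old space as an intrusive list of objects. */
        |     struct obj {
        |         bool marked;
        |         struct obj *next;
        |         /* payload ... */
        |     };
        |     static struct obj *old_space;
        |     
        |     /* Free whatever the mark phase didn't reach and
        |      * unmark the rest.  Nothing moves, so the old
        |      * space can fragment over time -- the failure
        |      * mode the replies below describe. */
        |     void sweep(void) {
        |         struct obj **link = &old_space;
        |         while (*link) {
        |             struct obj *o = *link;
        |             if (o->marked) {
        |                 o->marked = false;
        |                 link = &o->next;
        |             } else {
        |                 *link = o->next;
        |                 free(o);
        |             }
        |         }
        |     }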
        
         | ryanobjc wrote:
          | This is exactly how the JVM's CMS collector works. What
          | happens then is that the old gen becomes too fragmented to
          | allocate from, you hit a concurrent mode failure, and the JVM
          | falls back to a stop-the-world full GC that compacts the
          | entire old heap.
          | 
          | When Hadoop depended on a single NameNode, a 64 GB+ heap
          | would take 5+ minutes to collect in these failure modes.
          | Quite painful.
        
           | bjourne wrote:
           | Yes, but there are two easy solutions: 1. Larger nurseries to
           | minimize churn in older generations. Fragmentation is a
           | symptom of too many "false survivors". 2. Concurrent heap
           | compaction.
        
             | jakewins wrote:
             | The failure mode here is usually caching, in my experience.
             | Lots of data, all dying old.
        
           | hinkley wrote:
            | When JVMs started hitting a wall around 1 GB of memory, we
           | ended up moving long-lived data out of the old generation and
           | into in-memory databases.
        
         | Someone wrote:
          | > Structuring the heap into blocks seems quite misguided, imo.
         | 
         | So, how would you keep allocations fast on multi-core hardware?
         | 
         | FTA: _"I am targetting embedders with multiple mutator threads,
         | and a classic semi-space collector doesn't scale - you could
         | access the allocation pointer atomically, but that would be a
         | bottleneck, serializing mutators, not to mention the cache
         | contention."_
        
           | bjourne wrote:
           | You have private, fixed-size, copy-collected spaces for each
           | thread.
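            | 
            | I.e. something like this sketch: a thread-local bump
            | pointer with no atomics on the fast path (details mine):
            | 
            |     #include <stddef.h>
            |     
            |     enum { NURSERY_BYTES = 1 << 20 };
            |     
            |     /* Fixed-size private space per thread; a real
            |      * one would be mmap'd, and only the "nursery
            |      * full" slow path has to synchronize. */
            |     static _Thread_local char nursery[NURSERY_BYTES];
            |     static _Thread_local char *bump;
            |     
            |     void *gc_alloc(size_t bytes) {
            |         if (!bump)
            |             bump = nursery;
            |         if (bump + bytes > nursery + NURSERY_BYTES)
            |             return NULL; /* copy survivors, reset */
            |         void *p = bump;
            |         bump += bytes;
            |         return p;
            |     }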
        
           | hinkley wrote:
           | Per-thread nurseries worked just fine for Java for close to a
           | decade. They made it more efficient using escape analysis.
            | Escape analysis indirectly led to borrow checking and Rust.
        
             | kccqzy wrote:
             | While per-thread nurseries certainly work, I recently
             | became aware that TCMalloc actually supports either per-
             | thread mode or per-CPU mode:
             | https://google.github.io/tcmalloc/overview.html#tcmalloc-
             | cac...
             | 
             | I have not looked into this very deeply but from first
             | principles it would appear to me that a per-CPU nursery
             | would be strictly better than a per-thread nursery for
             | typical applications.
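              | 
              | As a rough illustration of the per-CPU idea
              | (TCMalloc's real fast path uses rseq; this
              | getcpu-based sketch is mine, and keeps the bump
              | atomic because a thread can migrate mid-operation):
              | 
              |     #define _GNU_SOURCE   /* sched_getcpu */
              |     #include <sched.h>
              |     #include <stdatomic.h>
              |     #include <stddef.h>
              |     
              |     enum { MAX_CPUS = 64,
              |            SLAB_BYTES = 1 << 20 };
              |     
              |     /* One bump region per CPU, not per thread. */
              |     struct cpu_slab {
              |         _Atomic size_t used;
              |         char mem[SLAB_BYTES];
              |     };
              |     static struct cpu_slab slabs[MAX_CPUS];
              |     
              |     void *cpu_alloc(size_t bytes) {
              |         int cpu = sched_getcpu();
              |         if (cpu < 0)
              |             cpu = 0;  /* fallback on error */
              |         struct cpu_slab *s =
              |             &slabs[cpu % MAX_CPUS];
              |         size_t off =
              |             atomic_fetch_add(&s->used, bytes);
              |         if (off + bytes > SLAB_BYTES)
              |             return NULL; /* refill from heap */
              |         return s->mem + off;
              |     }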
        
               | convolvatron wrote:
                | That would almost certainly be true if you could make
                | statements about thread-CPU affinity.
        
               | scottlamb wrote:
               | > While per-thread nurseries certainly work, I recently
               | became aware that TCMalloc actually supports either per-
               | thread mode or per-CPU mode:
               | https://google.github.io/tcmalloc/overview.html#tcmalloc-
               | cac...
               | 
                | Yeah, that's pretty new relative to when the old
                | gperftools version of tcmalloc hit the scene. It works
                | via rseq (restartable sequences) on Linux; afaik no
                | other OS is supported.
               | 
               | > I have not looked into this very deeply but from first
               | principles it would appear to me that a per-CPU nursery
               | would be strictly better than a per-thread nursery for
               | typical applications.
               | 
               | Both are quite effective at removing the contention, so I
               | think it mostly comes down to which needs fewer
               | allocation caches to do it. In the case of a tie, per-CPU
               | is probably better because if a thread migrates across
               | cores you want to use memory likely to be hot in the CPU
               | cache. If you have far more threads than cores, per-CPU
               | should win. If you're single-threaded, per-thread should
               | win. If you have a thread-per-core reactor, and either
               | use the whole machine, or restrict it with cpuset so it
               | always stays on the same cores, per-CPU should win, and
               | more so if there are any ancillary threads for logging or
               | whatever. All those profiles exist; what a "typical
               | application" is I dunno.
               | 
                | There's also the concept of "virtual CPU IDs" (the
                | language may have changed), which can minimize the
                | downside of per-CPU for small cgroups on huge machines.
                | With those it really should be better than per-thread,
                | since the number of virtual CPU IDs in use is bounded
                | by the number of running threads, which is at most the
                | total thread count. Not sure if that's even in mainline
                | Linux now or not.
               | https://lwn.net/Articles/885818/
        
           | neonsunset wrote:
            | Per-core heaps. The count, sizes, and proportions of the
            | generational heaps can also be scaled dynamically according
            | to allocation rate, target % of CPU time spent in GC, and
            | collection efficiency, like .NET's GC does.
           | 
           | https://devblogs.microsoft.com/dotnet/put-a-dpad-on-that-gc/
           | 
           | https://maoni0.medium.com/dynamically-adapting-to-
           | applicatio...
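            | 
            | A toy version of that feedback loop (numbers and names
            | mine, nothing like the real heuristics):
            | 
            |     #include <stddef.h>
            |     
            |     /* After each GC, compare the fraction of CPU
            |      * time spent in GC against a target and
            |      * resize the young generation accordingly. */
            |     static size_t gen0_bytes = 4 << 20;
            |     
            |     void retune(double gc_cpu, double target) {
            |         if (gc_cpu > target)
            |             gen0_bytes *= 2;  /* collect less */
            |         else if (gc_cpu < target / 2 &&
            |                  gen0_bytes > (1 << 20))
            |             gen0_bytes /= 2;  /* return memory */
            |     }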
        
         | bjoli wrote:
         | I am not sure Andy is very present on social media. He would
         | probably appreciate a comment.
        
       ___________________________________________________________________
       (page generated 2024-07-10 23:01 UTC)