[HN Gopher] Copying collectors with block-structured heaps are unreliable
___________________________________________________________________
Copying collectors with block-structured heaps are unreliable
Author : ingve
Score : 38 points
Date : 2024-07-10 10:26 UTC (12 hours ago)
(HTM) web link (wingolog.org)
(TXT) w3m dump (wingolog.org)
| chrisjj wrote:
| > a form of soft reliability
|
| A form of reliability that is unreliable, right? :)
| bjourne wrote:
| But this only happens if the copying collector's space is
| growable and discontiguous. I think a better scheme is to
| just use copying for the young generation and mark-and-sweep
| for the older ones. Structuring the heap into blocks seems
| quite misguided, imo.
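|
| A minimal sketch of that scheme in C (hypothetical names;
| a contiguous copying nursery in front of a mark-sweep old
| generation, not any particular collector's implementation):
|
|   #include <stddef.h>
|   #include <stdint.h>
|
|   typedef struct {
|       uint8_t *alloc;     /* bump pointer into from-space */
|       uint8_t *limit;     /* end of from-space            */
|       uint8_t *to_space;  /* survivors copied here on a   */
|   } nursery_t;            /* minor (copying) collection   */
|
|   /* Old generation: mark-sweep, so dead objects leave
|      holes in place and the space can fragment over time. */
|   typedef struct old_gen old_gen_t;
|
|   static void *nursery_alloc(nursery_t *n, size_t size) {
|       if ((size_t)(n->limit - n->alloc) < size)
|           return NULL;    /* full: run a minor GC */
|       void *obj = n->alloc;
|       n->alloc += size;
|       return obj;
|   }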
| ryanobjc wrote:
| This is exactly how the JVM's CMS collector works: the old
| gen eventually becomes too fragmented to allocate from, you
| hit a concurrent mode failure, and the JVM falls back to a
| stop-the-world compaction of the entire old heap.
|
| When Hadoop depended on a single NameNode, a 64 GB+ heap
| would take 5+ minutes to collect in these failure modes.
| Quite painful.
| bjourne wrote:
| Yes, but there are two easy solutions: 1. Larger nurseries to
| minimize churn in older generations. Fragmentation is a
| symptom of too many "false survivors". 2. Concurrent heap
| compaction.
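|
| For solution 1 on HotSpot, the nursery size is plain
| tuning: -Xmn sizes the young generation and
| -XX:MaxTenuringThreshold controls how quickly survivors
| are promoted. Illustrative numbers only:
|
|   java -Xmx64g -Xmn16g -XX:MaxTenuringThreshold=6 App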
| jakewins wrote:
| The failure mode here is usually caching, in my experience.
| Lots of data, all dying old.
| hinkley wrote:
| When JVMs started hitting a wall around 1GB of memory, we
| ended up moving long-lived data out of the old generation and
| into in-memory databases.
| Someone wrote:
| > Structuring the heap into blocks seems quite misguided, imo.
|
| So, how would you keep allocations fast on multi-core hardware?
|
| FTA: _"I am targetting embedders with multiple mutator threads,
| and a classic semi-space collector doesn't scale - you could
| access the allocation pointer atomically, but that would be a
| bottleneck, serializing mutators, not to mention the cache
| contention."_
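|
| A minimal sketch of that bottleneck in C11 (hypothetical
| names, single shared heap cursor): every allocation from
| every mutator is an atomic fetch-add on the same cache
| line.
|
|   #include <stdatomic.h>
|   #include <stddef.h>
|   #include <stdint.h>
|
|   static _Atomic uintptr_t alloc_ptr;  /* shared cursor */
|   static uintptr_t alloc_limit;        /* end of heap   */
|
|   static void *shared_alloc(size_t size) {
|       /* Every thread contends on this one location. */
|       uintptr_t old = atomic_fetch_add(&alloc_ptr, size);
|       if (old + size > alloc_limit)
|           return NULL;    /* exhausted: collect */
|       return (void *)old;
|   }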
| bjourne wrote:
| You have private, fixed-size, copy-collected spaces for each
| thread.
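|
| A sketch of that in C11 (a hypothetical refill_nursery()
| stands in for the synchronized slow path that hands out a
| fresh fixed-size space and may trigger a collection); the
| fast path touches only thread-local state, no atomics:
|
|   #include <stddef.h>
|   #include <stdint.h>
|   #include <stdlib.h>
|
|   enum { NURSERY_BYTES = 1 << 20 };   /* fixed size */
|
|   static _Thread_local uint8_t *tl_alloc;
|   static _Thread_local uint8_t *tl_limit;
|
|   /* Stand-in for the real slow path. */
|   static uint8_t *refill_nursery(void) {
|       return malloc(NURSERY_BYTES);
|   }
|
|   /* Assumes size <= NURSERY_BYTES. */
|   static void *thread_alloc(size_t size) {
|       if ((size_t)(tl_limit - tl_alloc) < size) {
|           tl_alloc = refill_nursery();
|           tl_limit = tl_alloc + NURSERY_BYTES;
|       }
|       void *obj = tl_alloc;
|       tl_alloc += size;   /* private bump: no atomics */
|       return obj;
|   }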
| hinkley wrote:
| Per-thread nurseries worked just fine for Java for close to a
| decade. They made it more efficient using escape analysis.
| Escape analysis indirectly led to borrow checking and Rust.
| kccqzy wrote:
| While per-thread nurseries certainly work, I recently
| became aware that TCMalloc actually supports either per-
| thread mode or per-CPU mode:
| https://google.github.io/tcmalloc/overview.html#tcmalloc-
| cac...
|
| I have not looked into this very deeply but from first
| principles it would appear to me that a per-CPU nursery
| would be strictly better than a per-thread nursery for
| typical applications.
| convolvatron wrote:
| That would almost certainly be true if you could make
| statements about thread-CPU affinity.
| scottlamb wrote:
| > While per-thread nurseries certainly work, I recently
| became aware that TCMalloc actually supports either per-
| thread mode or per-CPU mode:
| https://google.github.io/tcmalloc/overview.html#tcmalloc-
| cac...
|
| Yeah, that's pretty new relative to when the old
| gperftools version of tcmalloc hit the scene. It works
| via rseq (restartable sequences) on Linux; afaik no
| other OS is supported.
|
| > I have not looked into this very deeply but from first
| principles it would appear to me that a per-CPU nursery
| would be strictly better than a per-thread nursery for
| typical applications.
|
| Both are quite effective at removing the contention, so I
| think it mostly comes down to which needs fewer
| allocation caches to do it. In the case of a tie, per-CPU
| is probably better because if a thread migrates across
| cores you want to use memory likely to be hot in the CPU
| cache. If you have far more threads than cores, per-CPU
| should win. If you're single-threaded, per-thread should
| win. If you have a thread-per-core reactor, and either
| use the whole machine, or restrict it with cpuset so it
| always stays on the same cores, per-CPU should win, and
| more so if there are any ancillary threads for logging or
| whatever. All those profiles exist; what a "typical
| application" is I dunno.
|
| There's also the concept of "virtual CPU IDs" (the
| language may have changed), which can minimize the
| downside of per-CPU for small cgroups on huge machines.
| It really should be better than per-thread, since the
| number of concurrently running threads is at most the
| total number of threads. Not sure if that's even in
| mainline Linux now or not.
| https://lwn.net/Articles/885818/
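|
| A sketch of per-CPU selection in C, approximated with
| sched_getcpu(); real TCMalloc uses rseq precisely because
| the CPU id can go stale between lookup and use, a race
| this sketch ignores:
|
|   #define _GNU_SOURCE
|   #include <sched.h>
|   #include <stddef.h>
|
|   #define MAX_CPUS 1024
|
|   struct freelist { void *head; };
|   static struct freelist per_cpu_cache[MAX_CPUS];
|
|   static struct freelist *my_cache(void) {
|       int cpu = sched_getcpu();  /* may be stale */
|       if (cpu < 0) cpu = 0;      /* fallback     */
|       return &per_cpu_cache[cpu % MAX_CPUS];
|   }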
| neonsunset wrote:
| Per-core heaps. The count, sizes, and proportions of the
| generational heaps can also be scaled dynamically according
| to the allocation rate, a target % of CPU time spent in GC,
| and collection efficiency, as .NET's GC does.
|
| https://devblogs.microsoft.com/dotnet/put-a-dpad-on-that-gc/
|
| https://maoni0.medium.com/dynamically-adapting-to-
| applicatio...
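|
| The dynamic mode described there (DATAS) is opt-in on
| .NET 8; going by those posts it is enabled with an
| environment variable (name as I recall it from the docs):
|
|   DOTNET_GCDynamicAdaptationMode=1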
| bjoli wrote:
| I am not sure Andy is very present on social media. He would
| probably appreciate a comment.
___________________________________________________________________
(page generated 2024-07-10 23:01 UTC)