[HN Gopher] Conservative GC can be faster than precise GC
       ___________________________________________________________________
        
       Conservative GC can be faster than precise GC
        
       Author : diegocg
       Score  : 117 points
       Date   : 2024-09-07 10:44 UTC (12 hours ago)
        
 (HTM) web link (wingolog.org)
 (TXT) w3m dump (wingolog.org)
        
       | notorandit wrote:
        | Just like with a number of (unrelated) algorithms, the sweet
        | spot can often be found in the middle, between two (or more)
        | extremes.
        | 
        | Precision is sometimes useless due to time and computing
        | resource constraints. Speed can come at the cost of poor
        | results.
       | 
       | I think the answer depends upon the specific use case and
       | environment.
        
       | tinco wrote:
       | I've heard about scanning of the stack but I'm not sure if I get
       | it. Is the strategy literally to not keep track of references at
       | all, and simply do a sequential scan over the entire memory
       | looking for bytes that look like they're pointers into the heap,
       | and then assuming they are roots? And then you look them up in
       | the allocation records and mark them as still in use? You'd have
       | to scan them in turn as well to know if they've got references to
       | other heap locations right?
       | 
        | Edit: ah, he says to assume the heap is precisely traced
        | (somehow?), so I guess it would already be known what references
        | are in there.
        
         | Rohansi wrote:
          | Pretty much, yes. There are optimizations so it does not need
          | to scan every possible location. Pointers should be properly
          | aligned, of course, but there are also things like having
          | separate heap regions for data that contains no pointers
          | (large data arrays) to skip scanning entirely.
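[Editor's note: a minimal sketch, not from the thread, of the word-scan Rohansi describes. The names and the fake heap bounds are invented for illustration; a real collector would also check that each candidate points at (or into) an actual allocation before treating it as a root.]

```go
package main

import "fmt"

// conservativeRoots walks a range of words (a slice standing in for the
// stack) and keeps any value that falls inside the heap's address range
// and is pointer-aligned. Anything that survives both checks is
// conservatively treated as a root.
func conservativeRoots(words []uintptr, heapStart, heapEnd uintptr) []uintptr {
	var roots []uintptr
	for _, w := range words {
		if w >= heapStart && w < heapEnd && w%8 == 0 {
			roots = append(roots, w) // looks like a pointer: keep it
		}
	}
	return roots
}

func main() {
	heapStart, heapEnd := uintptr(0x10000), uintptr(0x20000)
	// 42 is an ordinary int, 0x10009 is misaligned, 0x9999999 is outside
	// the heap; only the two aligned in-range words are kept.
	stack := []uintptr{42, 0x10008, 0x10009, 0x9999999, 0x1fff8}
	fmt.Println(conservativeRoots(stack, heapStart, heapEnd)) // [65544 131064]
}
```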
        
         | adgjlsfhk1 wrote:
          | The heap is a lot easier to trace than the stack because
          | objects on the heap are put there explicitly, so as long as
          | you know their type, it's pretty easy to know where their
          | pointers are. The tricky part of the stack is that once your
          | compiler has figured out that something can go on the stack,
          | it also might want to do things like store part of the object
          | only in registers or something like that.
        
         | Tarean wrote:
         | Both strategies start from roots (e.g. the stack) and then
         | transitively chase pointers. Any memory reachable this way is
         | live.
         | 
         | To do this chasing precisely you need some metadata, saying
         | which fields are pointers vs other data like ints. For OOP
         | languages this metadata is stored with the vtables for objects.
         | For the stack you need similar metadata, at the very least how
         | many pointers there are if you put them first in the
         | stackframe.
         | 
         | Not having this metadata and conservatively treating random
         | ints as pointers isn't always catastrophic, but it has some
         | drawbacks. Two big ones:
         | 
         | - a moving GC is tough, you have to update pointers after
         | moving an object but can't risk updating random ints that
         | happen to have that value
         | 
          | - You do more work during GC chasing random non-pointers, and
          | you free less memory, meaning more GC runs.
         | 
          | Generating this precise GC metadata for stack frames is
          | sort-of easy. You need specific safe points where all data is
          | at known locations for the GC to work anyway, usually by
          | spilling registers to the stack. These GC checkpoints usually
          | coincide with heap allocation, which is why long
          | non-allocating loops can block stop-the-world GC and send
          | other threads into a spin-lock in many GC implementations.
         | 
         | Maybe a non-precise GC could treat registers as possible
         | pointers and skip spilling to the stack for Safe-Points? There
         | are alternatives to spilling like a precise stack-map for each
         | program instruction (so every instruction is a safe point), but
          | those are expensive to process. They are usually only used
          | for debugging or exception handling, not for something as
          | frequent as GC.
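[Editor's note: a minimal sketch, not from the thread, of the per-frame metadata Tarean describes. The `frame` struct and its fields are invented for illustration; real stack maps are emitted by the compiler per safe point.]

```go
package main

import "fmt"

// frame models a stack frame plus the compiler-emitted side table: a
// bitmask recording which slots hold heap references at a safe point.
type frame struct {
	slots   []uintptr // raw words in the stack frame
	ptrMask uint64    // bit i set => slots[i] is a heap reference
}

// preciseRoots visits only the slots the metadata marks as pointers, so
// an integer that merely looks like an address is never mistaken for one.
func preciseRoots(f frame) []uintptr {
	var roots []uintptr
	for i, w := range f.slots {
		if f.ptrMask&(1<<uint(i)) != 0 {
			roots = append(roots, w)
		}
	}
	return roots
}

func main() {
	// Slot 0 is an int that happens to look like an address; slot 1 is a
	// real reference. Only slot 1 is marked in the mask.
	f := frame{slots: []uintptr{0x10008, 0x10010}, ptrMask: 0b10}
	fmt.Println(preciseRoots(f)) // [65552]
}
```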
        
       | neonsunset wrote:
       | > A compiler that does precise root-finding will typically output
       | a side-table indicating which slots in a stack frame hold
       | references to heap objects. These lifetimes aren't always
       | precise, in the sense that although they precisely enumerate heap
       | references, those heap references might actually not be used in
       | the continuation of the stack frame. When GC occurs, it might
       | mark more objects as live than are actually live, which is the
       | imputed disadvantage of conservative collectors.
       | 
        | This is not necessarily accurate with true precise tracking.
        | E.g.:
        | 
        |     using System.Runtime.CompilerServices;
        | 
        |     Example();
        | 
        |     // Skip Tier 0 compilation, which does not track gcrefs as
        |     // precisely
        |     [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        |     static void Example() {
        |         var obj = new object();
        |         var wr = new WeakReference(obj);
        |         for (var i = 0; i < 3; i++) {
        |             Console.WriteLine(obj);
        |         }
        |         GC.Collect();
        |         Console.WriteLine(wr.IsAlive);
        |     }
       | 
       | This works in much more advanced scenarios too.
       | 
       | Unfortunately, I can't link a simple document that covers this in
       | detail from the top of my head but there's a wealth of
       | information in Konrad Kokosa's works:
       | 
       | .NET GC internals lectures:
       | https://www.youtube.com/watch?v=8i1Nv7wGsjk&list=PLpUkQYy-K8...
       | 
       | Pro .NET Memory Management (which is a great book in general):
       | https://prodotnetmemory.com/
        
         | andyayers wrote:
         | In .NET, even in optimized methods, there can be "untracked"
         | lifetimes where a stack slot is reported live to GC throughout
         | the extent of a method, so presumably these can lead to the
         | "over-reporting" cases mentioned.
         | 
         | The number of trackable lifetimes was 64 in .NET Framework but
         | has been steadily increased in modern .NET and is now 1024, so
         | it's rarely a capacity issue; but there are cases where we
         | can't effectively reason about lifetimes.
         | 
         | For us another big drawback to conservative scanning is that
         | any object referred to by a conservative reference cannot be
         | relocated, since the reference might be live and is not
         | guaranteed to be a GC reference; these objects are (in our
         | parlance) effectively pinned, and this causes additional
         | overhead.
        
       | DarkNova6 wrote:
       | I'm not sure what this article is trying to convey to me. I can
       | only suppose the intended audience is entirely academia-centric.
       | 
       | Reading the conclusion, I don't know if anybody working on actual
       | top-tier GCs (Java, JavaScript, C#) finds this useful. It strikes
       | me more as a somewhat interesting factoid, as demonstrated by a
       | paper.
       | 
       | The conclusion:
       | 
       | > When it comes to designing a system with GC, don't count out
       | conservative stack scanning; the tradeoffs don't obviously go one
       | way or the other, and conservative scanning might be the right
       | engineering choice for your system.
       | 
        | If there were examples of how this relates to actual GCs in
        | production, with comparisons between them, now that would be
        | interesting.
        
         | pizlonator wrote:
         | Most production GCs are accurate and the article's position on
         | conservative GC is a minority position.
        
           | mseepgood wrote:
           | s/accurate/precise/
        
             | pizlonator wrote:
             | No.
             | 
             | The literature uses "Accurate GC", "Precise GC", and "Exact
             | GC" interchangeably.
             | 
              | Famous paper on this that uses "accurate":
              | https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&d...
             | 
             | Paper that calls it "exact":
             | https://dl.acm.org/doi/10.1145/2660193.2660198
        
           | samatman wrote:
           | As the Fine Article mentions, the JavaScriptCore GC is
           | conservative, and V8 is considering a switch.
           | 
           | Someone reading your sentence might be at risk of conflating
           | a minority position with a fringe one. Clearly this isn't the
           | case here.
        
             | pizlonator wrote:
             | I said minority, not fringe.
             | 
             | But fringe isn't far off. Among those who write GCs
             | professionally, I was definitely on the fringe as an
             | advocate for conservative-on-the-stack.
             | 
             | (I wrote most of JSC's GC and I was one of the folks
             | pushing for it to remain conservative at a time when V8 was
             | accurate and SpiderMonkey transitioned to being accurate. I
             | like that folks are now acknowledging the good choice we
             | made in JSC, but it was a fringe/minority choice at the
             | time for sure, and it's still a minority choice among high
             | performance GCs.)
        
               | samatman wrote:
               | I would say you're being a bit modest! JSC is deployed on
               | hundreds of millions of devices, and if V8 does adopt a
               | conservative GC strategy, that would be a comfortable
               | majority of JavaScript engines making use of that
               | approach.
               | 
               | It goes to show that popularity is a poor measure of
               | technical merit, and the fringe doesn't always stay that
               | way.
        
           | fweimer wrote:
            | Some are not. You can try this:
            | 
            |     diff --git a/src/runtime/mgc.go b/src/runtime/mgc.go
            |     index a2b6b979c1..d2f2852294 100644
            |     --- a/src/runtime/mgc.go
            |     +++ b/src/runtime/mgc.go
            |     @@ -144,7 +144,7 @@ const (
            |      	// debugScanConservative enables debug logging for stack
            |      	// frames that are scanned conservatively.
            |     -	debugScanConservative = false
            |     +	debugScanConservative = true
            |      
            |      	// sweepMinHeapDistance is a lower bound on the heap distance
            |      	// (in bytes) reserved for concurrent sweeping between GC
           | 
           | And observe that Go scans a few stack frames conservatively.
           | I think it only does this if a stack frame is preempted. Most
           | frames are scanned precisely.
        
         | zelphirkalt wrote:
         | Well, there are many ecosystems and languages and not all are
         | running a top tier GC. The author, afaik, has himself been a
         | driving force in developing a JIT for GNU Guile. Posts like
         | there, experts sharing their knowledge and insights, are among
         | the most valuable here on HN.
        
         | samatman wrote:
         | Andy Wingo is a professional compiler engineer. Best known for
         | his work on Guile Scheme, but he works for Igalia, and if you
         | were to read more of his blog (I recommend this highly) you'll
         | see references to work he's done on the so-called top-tier GCs
         | you seem to favor.
         | 
         | I would venture that, as a bare minimum, the intended audience
         | is people who understand and care about the subject. That
         | includes me; your mileage may vary.
        
       | nu11ptr wrote:
       | I've never understood why anyone would use a conservative
       | collector outside toy programs or academia. It is hard enough to
       | make programs deterministic even with precise collection. I can't
       | even imagine releasing software that was inherently non-
       | deterministic and could suddenly, and without notice, start
       | retaining memory (even if atypical in practice). Thus, IMHO,
       | which is faster is a moot point.
        
         | sestep wrote:
         | Yeah... I can't imagine trying to debug that. "We kept getting
         | memory leaks, so we dug into it and realized that the language
         | was interpreting local integer variables as pointers and
         | refusing to free memory, but this only happened sporadically
         | and we couldn't reproduce the bug on our development machines.
         | After banging our heads against the wall for weeks we realized
         | what was going on, and it turns out this behavior is completely
         | intentional and they have no plans to change it."
        
           | hypertele-Xii wrote:
           | Computer programming is full of probabilistic edge cases with
           | ridiculous costs that are amortized over normal use. Most
           | optimization, encoding, and compression is based on
           | statistics.
           | 
           | If your use case requires a minimum bound, use another
           | algorithm.
        
             | sestep wrote:
             | Amortized analysis actually provides a guarantee (either
             | deterministic or probabilistic) that things will tend to
             | even out in the long run. Unless I misunderstand something,
             | conservative GC provides no such guarantee, and there are
             | no hard statistics behind the claim that memory leaks
             | caused by it should be rare. There's a difference between
             | "this algorithm is actually random and so sometimes will
             | happen to exhibit suboptimal behavior" and "under the right
             | circumstances, this garbage collection scheme will
             | consistently produce memory leaks due to arcane rules, but
             | those conditions are practically impossible to reproduce in
             | a controlled setting."
        
               | reichstein wrote:
                | Many quick-sort implementations are deterministic, so
                | they will consistently have their worst-case behavior
                | on the same inputs again and again. The good ones try
                | to do a little better than choosing the center element
                | as pivot, but with a well-crafted input, it can easily
                | become quadratic anyway.
                | 
                | Luckily, sorting is something you can easily choose
                | another implementation of if the default one didn't fit
                | your use-case, unlike the GC built into the single
                | language implementation that your customer uses.
        
               | dzaima wrote:
               | Assuming that on-heap objects are tracked precisely, the
               | maximum number of objects conservative stack scanning can
               | incorrectly consider as roots is O(stack size) (with, in
               | practice, a tiny constant factor, from excluding actual
               | intentional references and frequently-changing or trivial
               | non-reference values).
               | 
               | The only way for the total amount of leaked memory to
               | grow is by replacing something else on the stack, which
               | releases whatever was there before, and, assuming there's
               | some maximum stack size, there's a finite amount of such
               | space.
               | 
               | End result being that, even if you have some code path
               | that generates an integer that looks like a reference, it
               | can only leak one object (which could be massive perhaps,
               | but with generational GC or similar there's no guarantee
               | of all garbage being GC'd at any single GC anyway); if
               | codegen clears stack slots upon their lifetime ending,
               | you need one such integer-holding frame to be on the
               | stack for every object you want to be leaked.
        
             | paulddraper wrote:
             | Yeah, but we're talking about memory leaks not some branch
             | misprediction or cache miss.
        
         | adgjlsfhk1 wrote:
          | The counterpoint here is that if your program uses 2 GB of
          | memory on a 64-bit computer, only about 1 in 8 billion random
          | 64-bit values will be a plausible address (and almost all of
          | those will only pin small amounts of memory).
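[Editor's note: the back-of-the-envelope calculation behind that figure, assuming a contiguous heap and uniformly random 64-bit words. The function name is invented for illustration.]

```go
package main

import "fmt"

// oneIn returns N such that a uniformly random 64-bit value has a
// 1-in-N chance of landing inside a heap of heapSize bytes (addresses
// are byte-granular, so heapSize distinct values hit the heap).
func oneIn(heapSize float64) float64 {
	space := float64(1<<62) * 4 // 2^64
	return space / heapSize
}

func main() {
	// For a 2 GB heap: 2^64 / 2^31 = 2^33, i.e. about 8.6 billion.
	fmt.Printf("1 in %.0f\n", oneIn(2<<30))
}
```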
        
           | PaulHoule wrote:
           | Didn't the opposite happen near the end of the 32 bit age? My
           | take was that you could get away with conservative garbage
           | collection if you had 16MB of RAM in a 32 bit address space
           | but as you filled more of it the probability of a significant
           | memory leak converged on 1.
        
             | adgjlsfhk1 wrote:
              | I assume you mean 16 MB of RAM in a 32-bit space, but
              | yes.
             | As app size approaches address space size, lots of things
             | get worse.
             | 
             | That said this new generation of only conservatively
             | sweeping the stack has the significant advantage that the
             | stack has remained pretty small (8mb is still the standard
             | stack size), so if you have precise heap scanning the odds
             | of spurious collections from conservative scanning go down
             | a ton.
        
               | PaulHoule wrote:
               | I corrected GB -> MB in my post, thanks!
        
           | IshKebab wrote:
           | That's only true if the valid addresses or values are
           | uniformly distributed, which is not the case.
        
           | dzaima wrote:
            | Less trivial, considering that typically only the low ~47
            | bits of addresses are allowed to be non-zero; then again,
            | values on the stack are still full 64-bit words, and any
            | non-zero bit in the top 17 immediately disqualifies a
            | candidate; then again, besides literally random 64-bit
            | integers, values not pushing up to the 64-bit limit are
            | probably more likely anyway.
            | 
            | Another possibility would be two adjacent 32-bit ints
            | merging into one word. For a 2 GB heap, this would require
            | the high half to hit one of the (at most) two values that
            | valid heap addresses can have there, values in the
            | 1<=x<=32767 range (a reasonably frequent range for
            | general-purpose numbers), while the bottom half can be
            | anything. Whether such packing can happen on-stack depends
            | on codegen (and, to a lesser extent, the program, as one
            | can just do "12345L<<32" and get a value with the
            | 2-in-32767 (0.006%) chance of hitting the heap). But even
            | then, fitting a million of these on the stack only gives
            | ~61 potential false roots, and for that to be significant,
            | some of them must hit massive objects (which brings back
            | some 1-in-a-billion factor, unless there are large cyclic
            | object subgraphs).
        
         | MobiusHorizons wrote:
          | The article suggests this might be used for JavaScript VMs;
          | maybe the lack of int64s in the language helps?
         | 
         | > Also, Apple's JavaScriptCore uses conservative stack
         | scanning, and V8 is looking at switching to it. Funny, right?
        
         | rbehrends wrote:
         | The same argument would apply to any non-compacting allocator,
         | because the worst case memory blowup due to external
         | fragmentation is huge. But such cases are extremely rarely
         | observed in practice, so people use e.g. standard
         | malloc()/free() implementations or non-compacting garbage
         | collectors without being concerned about that.
         | 
         | In addition, there are plenty of cases where memory usage is
         | unbounded or excessive, not because of allocator behavior, but
         | because of programming mistakes. In fact, memory can sometimes
         | blow up just because of large user inputs and very few systems
         | are prepared for properly handling OOM conditions that happen
         | legitimately.
         | 
         | Case in point: Both CRuby and Apple's JavaScriptCore have
         | garbage collectors that use conservative stack scanning and are
         | widely used in production systems without the world ending.
         | 
         | That said, you're probably not going to use conservative stack
         | scanning because of collection speed alone. There are other
         | trade-offs between conservative stack scanning and precise
         | stack scanning that weigh more heavily.
         | 
         | I'll add the caveat that I would be very cautious about using a
         | conservative GC on 32-bit systems, but on 64-bit systems this
         | is about as much of a concern as memory fragmentation.
        
           | hinkley wrote:
           | > But such cases are extremely rarely observed in practice
           | 
           | After long years of finding problems and trying to encourage,
           | pressure, bribe, cajole and finally yell at people about said
           | problems, my professional opinion is that people aren't very
           | observant, and either do not see problems or pretend not to
           | see them.
           | 
           | They just reboot the process every 48 hours and walk away.
           | 
            | And sadly, continuous deployment is making this worse. The
            | collective "we" has problems that only show up on three- or
            | four-day weekends, right when you don't want to be on call.
        
         | bjoli wrote:
         | Unless you have deterministic threading no parallel GC will be
          | deterministic. A lot of software has used the Boehm collector
          | without issue, like Inkscape and, I believe, Crystal.
         | 
         | It is not sexy, and it has other issues, but it has worked for
         | a long time.
        
         | hedora wrote:
         | I've never worked with a language that had a precise GC that
         | wasn't also a nightmare to run in production. Java's the
         | obvious example of a language with an unmanageable GC. (Yes,
         | they're claiming the next GC will work, but that was the top
          | line feature in the 1.4 marketing back in the late 1990s, and
          | I simply don't believe such claims after 25+ years and over a
          | dozen major releases that failed to deliver it.)
         | 
         | Go is supposedly a counterexample. I haven't used it enough to
         | offer an opinion, but I have heard of companies rewriting Go
         | services simply to avoid GC at runtime.
        
           | adgjlsfhk1 wrote:
           | Counterpoint: the problem with Java isn't the GC, but the
           | language. If every struct you instantiated in C required a
           | malloc/free, that would be really slow too.
        
             | pron wrote:
             | Java's memory management doesn't work like malloc/free at
             | all, and that's why it's fast [1]. Even disregarding scalar
             | replacement that means many `new` don't allocate anything
             | on the heap, nearly all allocations are a pointer bump and
             | objects that die don't require some active deallocation.
             | With tracing collectors it's keeping objects _alive_ for a
             | long time that requires some computational work, not
             | deallocating them (as is the case with refcounting
             | collectors).
             | 
             | An _extremely_ simplified description of how OpenJDK 's GCs
             | work is like this: every allocation bumps the "heap end"
             | pointer, and when memory runs out, the GC scans the objects
             | reachable from some set of "roots", resets the allocation
             | pointer to the start of the heap, and copies the live
             | objects to the allocation pointer, bumping it along the way
             | (and in the process, overwriting the dead objects). In
             | practice, the scanning and the compaction may be done
             | concurrently with the application, there is not just one
             | allocation pointer (to avoid contention), and not the
             | entire object graph needs to be scanned to free memory.
             | 
             | The cost of Java's good GC is virtually entirely in memory
             | footprint (there may be some speed costs to GC in some
             | cases, but they're offset by GC's speed benefits).
             | 
             | [1]: Java may indeed be slower than C on some workloads due
             | to memory layout, but that's because of data locality, not
             | GC. The upcoming inlined value objects will take care of
             | that.
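[Editor's note: a toy model, not from the thread, of the bump-pointer allocation pron describes; the names are invented for illustration, and a real collector's copying phase is only mentioned in comments.]

```go
package main

import "fmt"

// bumpHeap models allocation as a bounds check plus advancing a cursor,
// which is why tracing collectors make allocation so cheap.
type bumpHeap struct {
	mem []byte
	end int // the "heap end" pointer
}

// alloc returns the offset of a fresh block, or ok=false when the heap
// is exhausted -- the point at which a real runtime would trigger GC,
// copy the live objects to the front, and reset end just past them.
func (h *bumpHeap) alloc(size int) (off int, ok bool) {
	if h.end+size > len(h.mem) {
		return 0, false
	}
	off = h.end
	h.end += size
	return off, true
}

func main() {
	h := &bumpHeap{mem: make([]byte, 64)}
	a, _ := h.alloc(16)
	b, _ := h.alloc(16)
	fmt.Println(a, b) // consecutive offsets, no free lists involved
	// Note what's absent: dead objects need no per-object free(); after
	// a copying collection they are simply left behind and overwritten.
}
```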
        
           | pron wrote:
           | https://netflixtechblog.com/bending-pause-times-to-your-
           | will...
        
           | foobazgt wrote:
           | This is such a weird take. Java's been everywhere in prod in
           | the industry for decades from backends running millions of
           | qps to HFT to game clients.
           | 
            | IME, the most common GC problems I see are:
            | 
            |   - incredibly basic settings are wrong, like the max heap
            |     size
            |   - the app was written with some very poor behavior, e.g.
            |     unreal amounts of allocation or tons of allocation
            |     thrashing in tenured heap.
           | 
           | There aren't even a lot of GCs knobs to "manage" anymore
           | compared to a decade ago [0]. Don't treat your GC as a black-
           | box machine with infinitely available resources, and you'll
           | be fine.
           | 
           | 0) https://docs.oracle.com/en/java/javase/22/gctuning/ergonom
           | ic...
        
       | chrsig wrote:
       | It was a bit of a bummer when go switched from the conservative
       | gc to a precise gc. One of the implications was that they needed
       | to change how interface types were represented.
       | 
        | They had a nice little optimization to store word-sized values
        | in-place rather than as a pointer out to a value. With the
        | precise gc, they had to switch to storing only pointers, leading
        | to allocations for small values.
       | 
        | I don't know if they've done work to regain (or perhaps better
        | put: had success regaining) the performance lost to the extra
        | allocation & gc load.
       | 
        | On the flip side, my experience is that they've made the GC
        | pretty unobtrusive with regard to latency and pauses. Or perhaps
        | I'm just not stressing it as much as I had in the past.
       | 
        | Random anecdote on gc tuning: I was once dealing with a go
        | process that sped up (higher throughput, lower latency) by an
        | alarming amount when limiting the max processors. This was many
        | years ago, and I wouldn't be able to say what version of go. It
        | was after you no longer had to set GOMAXPROCS, but probably not
        | very long after.
       | 
       | Performance tuning is crazy sometimes.
        
         | znpy wrote:
         | Performance tuning is still largely a dark art from what I see,
         | having dabbled in the space.
         | 
         | It's both weird and beautiful because getting an end-to-end
         | understanding is instrumental, so you often have to go look at
         | many marginal things that might actually play a significant
         | role.
        
           | bqmjjx0kac wrote:
           | Disappointingly, it's a dark art often because the CPU is a
           | black box. Intel X86 chips translate the instructions you
           | give them to some internal microcode and then execute them
           | speculatively, out of order, etc. I'm still mystified by the
           | performance gains afforded by randomly inserting NOP
           | instructions.
        
             | pjmlp wrote:
             | That is why tooling like VTune exist.
        
             | 10000truths wrote:
              | There is a plethora of information regarding instruction
              | timings, throughput/latency, execution port usage, etc.
             | that compilers make liberal use of for optimization
             | purposes. You could, in theory, also use that information
             | to establish an upper bound on how long a series of
             | instructions would take to execute. The problem lies in the
             | difference in magnitude between average case and worst
             | case, due to dynamic execution state like CPU cache, branch
             | prediction, kernel scheduling, and so on.
        
             | cogman10 wrote:
             | Fortunately, at least in my experience, the variability
             | that CPUs introduce (which do matter in many contexts)
             | aren't often the source of slowness. In my experience plain
             | old algorithmic complexity would go a long way in making
             | stuff faster.
             | 
              | I can't tell you the number of times I've fixed code like
              | this:
              | 
              |     matches = []
              |     for (first : items) {
              |       for (second : items) {
              |         if (first.name == second.name)
              |           matches.add(first);
              |       }
              |     }
             | 
             | Very frequently a bright red spot in most profiler output.
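[Editor's note: a sketch, not from the thread, of the usual linear-time fix for the quadratic loop above, assuming the intent is "items whose name also occurs on some other item". The `item` type and `findMatches` name are invented for illustration.]

```go
package main

import "fmt"

type item struct{ name string }

// findMatches replaces the O(n^2) nested loop with two O(n) passes:
// count each name in a hash map, then keep items whose name is shared.
func findMatches(items []item) []item {
	counts := make(map[string]int, len(items))
	for _, it := range items {
		counts[it.name]++
	}
	var matches []item
	for _, it := range items {
		if counts[it.name] > 1 { // name shared with at least one other item
			matches = append(matches, it)
		}
	}
	return matches
}

func main() {
	fmt.Println(findMatches([]item{{"a"}, {"b"}, {"a"}})) // [{a} {a}]
}
```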
        
               | marginalia_nu wrote:
               | I think there's an element of selection bias in that
               | observation. Since that is the type of performance issue
               | that a profiler is good at finding, those are the
               | performance issues you'll find looking at a profiler.
        
               | cogman10 wrote:
               | > I think there's an element of selection bias in that
               | observation.
               | 
               | Almost certainly true, I can only speak of my own
               | experiences.
               | 
               | > Since that is the type of performance issue that a
               | profiler is good at finding, those are the performance
               | issues you'll find looking at a profiler.
               | 
               | I have to disagree with you on this. Sampling profilers
               | are good at finding out exactly what methods are eating
               | performance. In fact, if anything they have a tendency to
               | push you towards looking at single methods for problems
               | rather than moving up a layer of two to see the big
               | picture (it's why flame graphs are so important in
               | profiling).
               | 
               | I have, for example, seen plenty of times where the
               | profiler indicated that double math was the root cause of
               | problems yet popping a few layers up the stack revealed
               | (sometimes non-obviously) that there was n^2 behavior
               | going on.
        
             | Conscat wrote:
             | Intel also provides vtune to annotate sequenced
             | instructions and profile the microseconds and power
             | consumption down to the level of individual instructions.
             | 
             | I assume those NOPs you mention exist for alignment
             | padding. Clang and GCC let you configure the alignment
             | padding of any or all functions, and Clang lets you
             | explicitly align any for-loop anywhere you want with
             | `[[clang::code_align]]`.
        
         | soegaard wrote:
         | > Do you have more details (or a reference)?
        
           | chrsig wrote:
           | It took me a bit to hunt down, but it was part of the go 1.4
           | release in 2014.
           | 
           | the release notes[0] at the time stated
           | 
           | > The implementation of interface values has been modified.
           | In earlier releases, the interface contained a word that was
           | either a pointer or a one-word scalar value, depending on the
           | type of the concrete object stored. This implementation was
           | problematical for the garbage collector, so as of 1.4
           | interface values always hold a pointer. In running programs,
           | most interface values were pointers anyway, so the effect is
           | minimal, but programs that store integers (for example) in
           | interfaces will see more allocations.
           | 
            | @rsc wrote in some detail[1] about the initial layout on
            | his blog.
           | 
           | I couldn't find a detailed explanation about how the
           | optimization interacted w/ the gc, but my understanding is
            | that it couldn't discern between pointer and integer
            | values.
           | 
           | [0] https://go.dev/doc/go1.4#runtime [1]
           | https://research.swtch.com/interfaces
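A small illustration of the post-1.4 behavior the release notes describe: storing a non-pointer value such as an int in an interface boxes it, so the interface's data word is always a pointer. (The round-trip below just shows the language-level view; the allocation itself isn't observable here.)

```go
package main

import "fmt"

func main() {
	// Since Go 1.4, an interface value always holds (type, pointer);
	// the 42 is boxed and the interface points at the boxed copy.
	var i interface{} = 42
	n, ok := i.(int) // type assertion unboxes the value
	fmt.Println(n, ok)
}
```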
        
         | hinkley wrote:
         | The problem with precise GC is usually the same problem with
         | malloc/free - if you allocate in an inner loop you have to free
         | in that inner loop and the bookkeeping kills throughput.
         | 
         | I don't know Go. Is that the problem we are seeing here?
         | 
         | One of the realtime GC solutions that stuck with me is
         | amortized GC, which might be appropriate to Go.
         | 
         | Instead of moving dead objects immediately to the free list you
         | "just" need to stay ahead of allocation. You can accomplish
         | that by freeing memory every time you allocate memory - but not
         | a full GC, just finishing the free of a handful of dead
         | objects.
         | 
          | That puts an upper bound on allocation time, though it
          | doesn't bound how long a dead object waits before it is
          | reclaimed.
        
           | sltkr wrote:
           | That's not what precise GC means.
           | 
           | Precise means that on a GC cycle, only true pointers to heap
           | allocated objects are identified as roots, which means that
           | an unreferenced object cannot be kept alive by a value that
           | happens to have a similar bit pattern as a heap pointer. This
           | guarantees that unreferenced objects are eventually cleaned
           | up, but not that this cleanup happens immediately, certainly
           | not as part of a busy loop. (Unless the system runs out of
           | memory, but in that case, a GC iteration is required anyway.)
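The distinction can be sketched in a toy model (the heap range and function names are invented): a conservative scanner treats any in-range word as a root, while a precise scanner also consults the compiler's stack map.

```go
package main

import "fmt"

// Hypothetical heap bounds for the model.
const heapStart, heapEnd = 0x1000, 0x2000

// conservative: any word that falls in the heap range counts as a root.
func conservativeRoot(word uint) bool {
	return word >= heapStart && word < heapEnd
}

// precise: only slots the compiler's stack map marks as pointers count.
func preciseRoot(word uint, isPointer bool) bool {
	return isPointer && word >= heapStart && word < heapEnd
}

func main() {
	x := uint(0x1234) // an integer that happens to look like a heap address
	fmt.Println(conservativeRoot(x)) // true: would pin whatever lives there
	fmt.Println(preciseRoot(x, false)) // false: the stack map says "not a pointer"
}
```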
        
             | vlovich123 wrote:
             | Since reference counting is GC, isn't reference counting a
             | form of precise GC? That's certainly the scenario I had in
             | mind reading what OP wrote where deallocation within a hot
             | loop would be possible.
        
               | hinkley wrote:
                | No, it's usually called conservative unless you have
                | cycle detection.
        
             | hinkley wrote:
             | Oof. We used to just call that tracing.
        
               | chrsig wrote:
               | i think it is still tracing, it's just a matter of
               | identifying possible roots versus actual roots and
               | constraints that fall out from that.
        
           | vlovich123 wrote:
           | > if you allocate in an inner loop you have to free in that
           | inner loop and the bookkeeping kills throughput
           | 
           | I wonder if there's anything that automatically defers
           | collection within a loop and collects the garbage after the
           | loop is exited. Something like Obj-C's old @autoreleasepool
           | but inserted automatically. Static analysis has gotten so
           | fancy that that might not be a terrible idea once you work
           | out all the nuances (e.g. what if the loop is too short,
           | nested loops, what happens when your entire program is a loop
           | like game loops, etc). Could be the best of both worlds.
           | 
           | But generally I think it turns out that in refcounted GC
           | systems you end up just knowing to not allocate/free in your
           | hot loop or if you're doing it in a loop then it's probably
           | not the thing your hot loop is dominated by.
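A toy sketch of the @autoreleasepool-like idea in a refcounted model (the `object` and `releasePool` types are hypothetical): releases inside the hot loop are only queued, and the actual decrements happen in one batch after the loop exits.

```go
package main

import "fmt"

type object struct{ refs int }

type releasePool struct{ deferred []*object }

func (p *releasePool) release(o *object) {
	p.deferred = append(p.deferred, o) // O(1): no free inside the loop
}

func (p *releasePool) drain() {
	for _, o := range p.deferred {
		o.refs-- // the decrement (and possible free) happens here
	}
	p.deferred = p.deferred[:0]
}

func main() {
	p := &releasePool{}
	objs := make([]*object, 3)
	for i := range objs {
		objs[i] = &object{refs: 1}
		p.release(objs[i]) // hot loop: just enqueue
	}
	p.drain() // loop exited: do the bookkeeping in one batch
	fmt.Println(objs[0].refs, objs[1].refs, objs[2].refs)
}
```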
        
             | hinkley wrote:
             | Generational GC - particularly with escape analysis - can
             | make those temp objects just slightly more expensive than
             | stack allocation.
             | 
             | The odd thing about GenGC is that it punishes you for
             | trying to recycle old objects yourself. You create much
             | more pressure to scan the old generation by doing so, which
             | is expensive. If you have the available memory it's often
             | better to build a new object and swap it for the retained
             | reference to the old one at the end when you're done (poor
             | man's MVCC as well if the switch takes a couple context
             | switches to finish)
        
         | silisili wrote:
         | It's always important to know your bottlenecks before tuning.
         | If you're disk io bound, throwing 100 goroutines at it just
         | compounds the problem, for example.
         | 
         | One very common thing I've noticed is that people new to the
         | language overuse/abuse goroutines and channels. I completely
         | get why, they are two selling points you learn about early on.
         | But it's really easy to make things go wrong when spamming them
         | everywhere.
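One common remedy for the "100 goroutines at a disk" pattern is a buffered channel used as a semaphore; this sketch (the limit of 4 and the `run` harness are arbitrary) caps how many workers run at once and tracks the high-water mark to prove it.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// run launches `jobs` workers but never lets more than `limit` of
// them be in flight at once, and returns the observed peak.
func run(jobs, limit int) int64 {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	var cur, peak int64
	for i := 0; i < jobs; i++ {
		wg.Add(1)
		sem <- struct{}{} // blocks once `limit` workers hold a slot
		go func() {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			n := atomic.AddInt64(&cur, 1)
			// record the high-water mark of concurrent workers
			for {
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			atomic.AddInt64(&cur, -1) // the actual IO would happen before this
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	fmt.Println(run(100, 4) <= 4) // true: never more than 4 at once
}
```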
        
       | cancerhacker wrote:
       | "One day a student came to Moon and said, I understand how to
       | make a better garbage collector. We must keep a reference count
       | of the pointers to each cons. Moon patiently told the student the
       | following story:
       | 
       | One day a student came to Moon and said, I understand how to make
       | a better garbage collector..."
        
       | gok wrote:
        | I would have assumed the major benefit of precision is that it
       | enables compaction...
        
         | sfink wrote:
         | You can still compact with a conservative scanner, you just
         | have to accommodate pinned regions.
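A sketch of that accommodation in a toy address-ordered heap (the `heapObj` model is invented): unpinned objects slide down toward address zero, while objects pinned by conservative stack references stay put and relocations are bumped past them.

```go
package main

import "fmt"

type heapObj struct {
	addr, size int
	pinned     bool // a possible stack pointer referenced this object
}

// compact slides unpinned objects toward low addresses, leaving
// pinned objects in place.
func compact(objs []*heapObj) {
	// collect pinned intervals (objs are in address order)
	var pins [][2]int
	for _, o := range objs {
		if o.pinned {
			pins = append(pins, [2]int{o.addr, o.addr + o.size})
		}
	}
	next := 0
	for _, o := range objs {
		if o.pinned {
			continue // can't move: a conservative root may point into it
		}
		// bump the candidate slot past any pinned interval it overlaps
		for _, p := range pins {
			if next < p[1] && next+o.size > p[0] {
				next = p[1]
			}
		}
		o.addr = next
		next += o.size
	}
}

func main() {
	objs := []*heapObj{
		{addr: 8, size: 4, pinned: true},
		{addr: 16, size: 4},
		{addr: 24, size: 4},
		{addr: 32, size: 4},
	}
	compact(objs)
	// pinned object stays at 8; the others fill in around it
	fmt.Println(objs[0].addr, objs[1].addr, objs[2].addr, objs[3].addr) // 8 0 4 12
}
```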
        
       | kazinator wrote:
       | Conservative GC is quite vicious against the use of lazy lists:
       | 
        | You have code like this:
        | 
        |     function(make_infinite_lazy_list());
       | 
        | where function walks down the list:
        | 
        |     fun function(list)
        |     {
        |        while (list) {  // nil is false
        |           ...
        |           list = cdr(list);
        |        }
        |     }
       | 
        | The problem is, the parent stack frame contains a spurious
        | copy of the original return value from
        | make_infinite_lazy_list, and so as function marches down the
        | list, the discarded prefix of the list does not become
        | garbage, as expected.
       | 
       | This is the showstopper for conservative GC. Not the bogeyman of
       | a stray machine integer suddenly looking exactly like a heap
       | pointer.
       | 
       | Stick a conservative scanner under C, make yourself some lazy
       | lists, and this problem will _easily_ reproduce!
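The shape of the problem can be modeled in Go (whose own GC is precise; the `node`/`countdown`/`walk` names are invented for illustration). Under a conservative collector, a stale stack copy of `head` in the caller would pin the entire walked prefix; explicitly clearing the caller's reference is the manual workaround.

```go
package main

import "fmt"

// A lazy list: the tail is built on demand.
type node struct {
	val  int
	next func() *node
}

func countdown(n int) *node {
	if n == 0 {
		return nil
	}
	return &node{val: n, next: func() *node { return countdown(n - 1) }}
}

// walk marches down the list; earlier nodes become garbage as it
// goes... unless a stray copy of the head pins them.
func walk(list *node) int {
	steps := 0
	for list != nil {
		steps++
		list = list.next()
	}
	return steps
}

func main() {
	head := countdown(5)
	fmt.Println(walk(head)) // 5
	head = nil // under conservative scanning, dropping the head explicitly helps
	_ = head
}
```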
        
         | celeritascelery wrote:
         | Wouldn't a stack map have that value as well?
        
           | kazinator wrote:
           | If the compiler puts out a stack map that is conservative,
           | then the GC scan will be effectively conservative. The
           | compiler has to compensate for that somehow. If a temporary
           | location is in the stackmap, such that the value in it is a
           | dead value before a function call, the compiler has to insert
           | an instruction to null out that temporary location.
        
       | rurban wrote:
        | Of course it can be faster, because it won't need a shift,
        | pointers won't need a mask, and objects won't need to be 2
        | words.
        | 
        | But usually you want to free ints which look like ptrs
        | earlier, and not keep them around. It's rare, but when it
        | happens it's a memory hog. Esp. on long-running processes I
        | would only do precise GC. Those simple bit shifts and masks
        | cost almost nothing today.
       | 
       | For C interop you need to track pointers anyway separately.
        
       | sfink wrote:
       | I find this to be more of an interesting observation than a
       | reason to choose a conservative scanner. I've never really
       | thought of conservative GCs as faster or slower. The stack scan
       | is fast, simple, and prefetchable, but occasionally has to do
       | extra work. The precise scan has to consult separate tables,
       | often not stored nearby. I guess I'd expect the speed difference
       | to come more as a result of conservative pointers getting pinned
       | and what repercussions that has on the overall collection. I did
       | think it was interesting that precise scanning can sometimes be
       | more conservative than a conservative scanner because of
       | excessively long live ranges. It's great to have a wingo writeup
       | on these things.
       | 
       | Precision gives you a lot more flexibility in moving data.
       | Conservative scanning is way more pleasant for embedding (you
       | don't have to go through a ritual to register every root). The
       | difference in what gets collected rarely seems to matter much in
       | practice, though the determinism can matter, especially for
       | intermittently failing tests.
       | 
       | There's a lot of path dependence in choosing one over the other.
       | It's easy to go from precise to conservative, but very hard to go
       | the other way. In practice, that means you tend to stick to
       | whatever you've got--if you have a working precise collector,
        | you're loath to loosen it up because you might get stuck. I
        | was
       | heavily involved in moving the SpiderMonkey collector from
       | conservative to precise, and I wouldn't like to lose that work,
       | and I'd _really_ hate to have to do it again! And yet, the
       | maintenance cost of keeping the precise collector correct is not
       | low.
       | 
       | I suspect a lot of the argument for precision boils down to: bump
       | allocators go brrr. With a precise scanner, you can bump allocate
       | into a young generation, move everything out during a minor GC
       | without regard for conservative pinning, and then do it over
       | again. No looking at allocation bitmaps or free lists or
       | whatever. Easily JITted (at least for the basic allocation), good
       | locality, simple code.
       | 
       | A more tenuous guess is that a lot of the argument for being
       | conservative boils down to ease of embedding. You mostly don't
       | have to think about registering your stack roots, you can just
       | allocate stuff and call things.
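A minimal sketch of why "bump allocators go brrr" (the `nursery` type and sizes are invented): allocation is a bounds check plus a pointer increment, with no free lists or bitmaps, and a minor GC would evacuate survivors and reset the offset.

```go
package main

import "fmt"

type nursery struct {
	buf []byte
	off int
}

// alloc returns n bytes, or nil when the nursery is full, which is
// where a minor GC would evacuate live objects and reset off.
func (nu *nursery) alloc(n int) []byte {
	if nu.off+n > len(nu.buf) {
		return nil
	}
	p := nu.buf[nu.off : nu.off+n]
	nu.off += n
	return p
}

// reset reuses the whole buffer after evacuation.
func (nu *nursery) reset() { nu.off = 0 }

func main() {
	nu := &nursery{buf: make([]byte, 64)}
	a := nu.alloc(48)
	b := nu.alloc(48) // doesn't fit: would trigger a minor collection
	fmt.Println(len(a), b == nil) // 48 true
	nu.reset()
	fmt.Println(len(nu.alloc(48))) // 48
}
```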
        
       ___________________________________________________________________
       (page generated 2024-09-07 23:00 UTC)