[HN Gopher] Conservative GC can be faster than precise GC
___________________________________________________________________
Conservative GC can be faster than precise GC
Author : diegocg
Score : 117 points
Date : 2024-09-07 10:44 UTC (12 hours ago)
(HTM) web link (wingolog.org)
(TXT) w3m dump (wingolog.org)
| notorandit wrote:
| Just like a number of (unrelated) algorithms, the sweet spot can
| be found in the middle between two (or more) optimal solutions.
|
| Precision is sometimes useless due to time and computing
| resources constraints. Speed can come at the cost of poor
| results.
|
| I think the answer depends upon the specific use case and
| environment.
| tinco wrote:
| I've heard about scanning of the stack but I'm not sure if I get
| it. Is the strategy literally to not keep track of references at
| all, and simply do a sequential scan over the entire memory
| looking for bytes that look like they're pointers into the heap,
| and then assuming they are roots? And then you look them up in
| the allocation records and mark them as still in use? You'd have
| to scan them in turn as well to know if they've got references to
| other heap locations right?
|
| Edit: ah he says assume the heap is precisely traced (somehow?)
| so I guess it would already been known what references are in
| there.
| Rohansi wrote:
| Pretty much, yes. There are optimizations done so it would not
| need to scan every possible location. Pointers should be
| properly aligned of course but also things like having separate
| heap regions for data that has no pointers in it (large data
| arrays) to skip scanning entirely.
| adgjlsfhk1 wrote:
| the heap is a lot easier to trace than the stack because
| objects on the heap are put there explicitly, so as long as you
| know their type, it's pretty easy to know where their pointers
| are. the tricky part of the stack is that once your compiler
| has figured out that something can go in the stack, it also
| might want to do things like store part of the object only in
| registers or something like that.
| Tarean wrote:
| Both strategies start from roots (e.g. the stack) and then
| transitively chase pointers. Any memory reachable this way is
| live.
|
| To do this chasing precisely you need some metadata, saying
| which fields are pointers vs other data like ints. For OOP
| languages this metadata is stored with the vtables for objects.
| For the stack you need similar metadata, at the very least how
| many pointers there are if you put them first in the
| stackframe.
|
| Not having this metadata and conservatively treating random
| ints as pointers isn't always catastrophic, but it has some
| drawbacks. Two big ones:
|
| - a moving GC is tough, you have to update pointers after
| moving an object but can't risk updating random ints that
| happen to have that value
|
| - You do more work during GC chasing random non-pointers, and
| free less memory meaning more GC runs
|
| Generating ths precise GC metadata for stackframes is sort-of
| easy.You need specific Safe-Points where all data is at known
| locations for the GC to work anyway, usually by spilling
| resgisters to the stack. These GC checkpoints usually coincide
| with heap allocation, which is why long non-allocating loops
| can block stop-the-world GC and send other threads into a spin-
| lock in many GC implementations
|
| Maybe a non-precise GC could treat registers as possible
| pointers and skip spilling to the stack for Safe-Points? There
| are alternatives to spilling like a precise stack-map for each
| program instruction (so every instruction is a safe point), but
| those are expensive to process. Usually only used for debugging
| or exception handling, not something frequent like GC
| neonsunset wrote:
| > A compiler that does precise root-finding will typically output
| a side-table indicating which slots in a stack frame hold
| references to heap objects. These lifetimes aren't always
| precise, in the sense that although they precisely enumerate heap
| references, those heap references might actually not be used in
| the continuation of the stack frame. When GC occurs, it might
| mark more objects as live than are actually live, which is the
| imputed disadvantage of conservative collectors.
|
| This is not necessarily accurate with true precise tracking E.g.:
| using System.Runtime.CompilerServices; Example();
| // Skip Tier 0 compilation which does not track gcrefs as
| precisely
| [MethodImpl(MethodImplOptions.AggressiveOptimization)]
| static void Example() { var obj = new object();
| var wr = new WeakReference(obj); for (var i = 0; i <
| 3; i++) { Console.WriteLine(obj); }
| GC.Collect(); Console.WriteLine(wr.IsAlive);
| }
|
| This works in much more advanced scenarios too.
|
| Unfortunately, I can't link a simple document that covers this in
| detail from the top of my head but there's a wealth of
| information in Konrad Kokosa's works:
|
| .NET GC internals lectures:
| https://www.youtube.com/watch?v=8i1Nv7wGsjk&list=PLpUkQYy-K8...
|
| Pro .NET Memory Management (which is a great book in general):
| https://prodotnetmemory.com/
| andyayers wrote:
| In .NET, even in optimized methods, there can be "untracked"
| lifetimes where a stack slot is reported live to GC throughout
| the extent of a method, so presumably these can lead to the
| "over-reporting" cases mentioned.
|
| The number of trackable lifetimes was 64 in .NET Framework but
| has been steadily increased in modern .NET and is now 1024, so
| it's rarely a capacity issue; but there are cases where we
| can't effectively reason about lifetimes.
|
| For us another big drawback to conservative scanning is that
| any object referred to by a conservative reference cannot be
| relocated, since the reference might be live and is not
| guaranteed to be a GC reference; these objects are (in our
| parlance) effectively pinned, and this causes additional
| overhead.
| DarkNova6 wrote:
| I'm not sure what this article is trying to convey to me. I can
| only suppose the intended audience is entirely academia-centric.
|
| Reading the conclusion, I don't know if anybody working on actual
| top-tier GCs (Java, JavaScript, C#) finds this useful. It strikes
| me more as a somewhat interesting factoid, as demonstrated by a
| paper.
|
| The conclusion:
|
| > When it comes to designing a system with GC, don't count out
| conservative stack scanning; the tradeoffs don't obviously go one
| way or the other, and conservative scanning might be the right
| engineering choice for your system.
|
| If there would be examples of how this relates to actual GCs in
| production and compares them, now that would be interesting.
| pizlonator wrote:
| Most production GCs are accurate and the article's position on
| conservative GC is a minority position.
| mseepgood wrote:
| s/accurate/precise/
| pizlonator wrote:
| No.
|
| The literature uses "Accurate GC", "Precise GC", and "Exact
| GC" interchangeably.
|
| Famous paper on this that uses "accurate": https://citeseer
| x.ist.psu.edu/document?repid=rep1&type=pdf&d...
|
| Paper that calls it "exact":
| https://dl.acm.org/doi/10.1145/2660193.2660198
| samatman wrote:
| As the Fine Article mentions, the JavaScriptCore GC is
| conservative, and V8 is considering a switch.
|
| Someone reading your sentence might be at risk of conflating
| a minority position with a fringe one. Clearly this isn't the
| case here.
| pizlonator wrote:
| I said minority, not fringe.
|
| But fringe isn't far off. Among those who write GCs
| professionally, I was definitely on the fringe as an
| advocate for conservative-on-the-stack.
|
| (I wrote most of JSC's GC and I was one of the folks
| pushing for it to remain conservative at a time when V8 was
| accurate and SpiderMonkey transitioned to being accurate. I
| like that folks are now acknowledging the good choice we
| made in JSC, but it was a fringe/minority choice at the
| time for sure, and it's still a minority choice among high
| performance GCs.)
| samatman wrote:
| I would say you're being a bit modest! JSC is deployed on
| hundreds of millions of devices, and if V8 does adopt a
| conservative GC strategy, that would be a comfortable
| majority of JavaScript engines making use of that
| approach.
|
| It goes to show that popularity is a poor measure of
| technical merit, and the fringe doesn't always stay that
| way.
| fweimer wrote:
| Some are not. You can try this: diff --git
| a/src/runtime/mgc.go b/src/runtime/mgc.go index
| a2b6b979c1..d2f2852294 100644 ---
| a/src/runtime/mgc.go +++ b/src/runtime/mgc.go
| @@ -144,7 +144,7 @@ const ( //
| debugScanConservative enables debug logging for stack
| // frames that are scanned conservatively. -
| debugScanConservative = false + debugScanConservative
| = true // sweepMinHeapDistance is a lower
| bound on the heap distance // (in bytes) reserved
| for concurrent sweeping between GC
|
| And observe that Go scans a few stack frames conservatively.
| I think it only does this if a stack frame is preempted. Most
| frames are scanned precisely.
| zelphirkalt wrote:
| Well, there are many ecosystems and languages and not all are
| running a top tier GC. The author, afaik, has himself been a
| driving force in developing a JIT for GNU Guile. Posts like
| there, experts sharing their knowledge and insights, are among
| the most valuable here on HN.
| samatman wrote:
| Andy Wingo is a professional compiler engineer. Best known for
| his work on Guile Scheme, but he works for Igalia, and if you
| were to read more of his blog (I recommend this highly) you'll
| see references to work he's done on the so-called top-tier GCs
| you seem to favor.
|
| I would venture that, as a bare minimum, the intended audience
| is people who understand and care about the subject. That
| includes me; your mileage may vary.
| nu11ptr wrote:
| I've never understood why anyone would use a conservative
| collector outside toy programs or academia. It is hard enough to
| make programs deterministic even with precise collection. I can't
| even imagine releasing software that was inherently non-
| deterministic and could suddenly, and without notice, start
| retaining memory (even if atypical in practice). Thus, IMHO,
| which is faster is a moot point.
| sestep wrote:
| Yeah... I can't imagine trying to debug that. "We kept getting
| memory leaks, so we dug into it and realized that the language
| was interpreting local integer variables as pointers and
| refusing to free memory, but this only happened sporadically
| and we couldn't reproduce the bug on our development machines.
| After banging our heads against the wall for weeks we realized
| what was going on, and it turns out this behavior is completely
| intentional and they have no plans to change it."
| hypertele-Xii wrote:
| Computer programming is full of probabilistic edge cases with
| ridiculous costs that are amortized over normal use. Most
| optimization, encoding, and compression is based on
| statistics.
|
| If your use case requires a minimum bound, use another
| algorithm.
| sestep wrote:
| Amortized analysis actually provides a guarantee (either
| deterministic or probabilistic) that things will tend to
| even out in the long run. Unless I misunderstand something,
| conservative GC provides no such guarantee, and there are
| no hard statistics behind the claim that memory leaks
| caused by it should be rare. There's a difference between
| "this algorithm is actually random and so sometimes will
| happen to exhibit suboptimal behavior" and "under the right
| circumstances, this garbage collection scheme will
| consistently produce memory leaks due to arcane rules, but
| those conditions are practically impossible to reproduce in
| a controlled setting."
| reichstein wrote:
| Many quick-sort implementations are deterministic, so
| will consistently have their worst case behavior on the
| same inputs again and again. The good ones try to do a
| little better than choosing the center element as pivot,
| but with a well crafted input, it can easily become
| polynomial anyway.
|
| Luckily sorting is something you can easily choose
| another implementation of, if the default over didn't fit
| your use-cade, unlike the GC built into the single
| language implementation that your customer uses.
| dzaima wrote:
| Assuming that on-heap objects are tracked precisely, the
| maximum number of objects conservative stack scanning can
| incorrectly consider as roots is O(stack size) (with, in
| practice, a tiny constant factor, from excluding actual
| intentional references and frequently-changing or trivial
| non-reference values).
|
| The only way for the total amount of leaked memory to
| grow is by replacing something else on the stack, which
| releases whatever was there before, and, assuming there's
| some maximum stack size, there's a finite amount of such
| space.
|
| End result being that, even if you have some code path
| that generates an integer that looks like a reference, it
| can only leak one object (which could be massive perhaps,
| but with generational GC or similar there's no guarantee
| of all garbage being GC'd at any single GC anyway); if
| codegen clears stack slots upon their lifetime ending,
| you need one such integer-holding frame to be on the
| stack for every object you want to be leaked.
| paulddraper wrote:
| Yeah, but we're talking about memory leaks not some branch
| misprediction or cache miss.
| adgjlsfhk1 wrote:
| The counterpoint here is that a if your program uses 2gb of
| memory on a 64bit computer, only 1 in 4 billion random 64 bit
| values will be plausible addresses (and almost all of them will
| only pin small amounts of memory).
| PaulHoule wrote:
| Didn't the opposite happen near the end of the 32 bit age? My
| take was that you could get away with conservative garbage
| collection if you had 16MB of RAM in a 32 bit address space
| but as you filled more of it the probability of a significant
| memory leak converged on 1.
| adgjlsfhk1 wrote:
| I assume you mean 16 mb of ram in a 32 gb space, but yes.
| As app size approaches address space size, lots of things
| get worse.
|
| That said this new generation of only conservatively
| sweeping the stack has the significant advantage that the
| stack has remained pretty small (8mb is still the standard
| stack size), so if you have precise heap scanning the odds
| of spurious collections from conservative scanning go down
| a ton.
| PaulHoule wrote:
| I corrected GB -> MB in my post, thanks!
| IshKebab wrote:
| That's only true if the valid addresses or values are
| uniformly distributed, which is not the case.
| dzaima wrote:
| Less trivial considering that typically only the low ~47 bits
| of addresses are allowed to be non-zero; then again, values
| on the stack would still be full 64 bits and any non-zero bit
| in the top 17 immediately disqualifies it; then again,
| besides literally random 64-bit integers, ones not pushing up
| to the 64-bit limit are probably more likely anyway.
|
| Another possibility would be two adjacent 32-bit ints
| merging; for a 2GB heap this'd require the high half hitting
| one of the two valid 1<=x<=32767 values (reasonably frequent
| range for general-purpose numbers) and the bottom one can be
| anything; though whether such packing can happen on-stack
| depends on codegen (and, to a lesser extent, the program, as
| one can just do "12345L<<32" and get a thing that has the 2
| in 32767 (0.006%) chance of hitting the heap); but even then
| fitting a million of such on the stack only gives ~61
| potential false roots, and for that to be significant some of
| those must hit massive objects (brings back some 1 in a
| billion factor unless there are large cyclic object
| subgraphs).
| MobiusHorizons wrote:
| The article suggests this might be used for JavaScript VMs
| maybe the lack of int64s in the language helps?
|
| > Also, Apple's JavaScriptCore uses conservative stack
| scanning, and V8 is looking at switching to it. Funny, right?
| rbehrends wrote:
| The same argument would apply to any non-compacting allocator,
| because the worst case memory blowup due to external
| fragmentation is huge. But such cases are extremely rarely
| observed in practice, so people use e.g. standard
| malloc()/free() implementations or non-compacting garbage
| collectors without being concerned about that.
|
| In addition, there are plenty of cases where memory usage is
| unbounded or excessive, not because of allocator behavior, but
| because of programming mistakes. In fact, memory can sometimes
| blow up just because of large user inputs and very few systems
| are prepared for properly handling OOM conditions that happen
| legitimately.
|
| Case in point: Both CRuby and Apple's JavaScriptCore have
| garbage collectors that use conservative stack scanning and are
| widely used in production systems without the world ending.
|
| That said, you're probably not going to use conservative stack
| scanning because of collection speed alone. There are other
| trade-offs between conservative stack scanning and precise
| stack scanning that weigh more heavily.
|
| I'll add the caveat that I would be very cautious about using a
| conservative GC on 32-bit systems, but on 64-bit systems this
| is about as much of a concern as memory fragmentation.
| hinkley wrote:
| > But such cases are extremely rarely observed in practice
|
| After long years of finding problems and trying to encourage,
| pressure, bribe, cajole and finally yell at people about said
| problems, my professional opinion is that people aren't very
| observant, and either do not see problems or pretend not to
| see them.
|
| They just reboot the process every 48 hours and walk away.
|
| And sadly continuous deployment is making this worse. The
| collective We have problems that only show up on three or
| four day weekends, right when you don't want to be on call.
| bjoli wrote:
| Unless you have deterministic threading no parallel GC will be
| deterministic. A lot of software has used the Boehm collector
| without issue, like inkscape and I believe crystal lang.
|
| It is not sexy, and it has other issues, but it has worked for
| a long time.
| hedora wrote:
| I've never worked with a language that had a precise GC that
| wasn't also a nightmare to run in production. Java's the
| obvious example of a language with an unmanageable GC. (Yes,
| they're claiming the next GC will work, but that was the top
| line feature in the 1.4 marketing back in the late 1990's, and
| I simply don't believe such claims after 25+ years and over a
| dozen major releases that failed to deliver it.
|
| Go is supposedly a counterexample. I haven't used it enough to
| offer an opinion, but I have heard of companies rewriting Go
| services simply to avoid GC at runtime.
| adgjlsfhk1 wrote:
| Counterpoint: the problem with Java isn't the GC, but the
| language. If every struct you instantiated in C required a
| malloc/free, that would be really slow too.
| pron wrote:
| Java's memory management doesn't work like malloc/free at
| all, and that's why it's fast [1]. Even disregarding scalar
| replacement that means many `new` don't allocate anything
| on the heap, nearly all allocations are a pointer bump and
| objects that die don't require some active deallocation.
| With tracing collectors it's keeping objects _alive_ for a
| long time that requires some computational work, not
| deallocating them (as is the case with refcounting
| collectors).
|
| An _extremely_ simplified description of how OpenJDK 's GCs
| work is like this: every allocation bumps the "heap end"
| pointer, and when memory runs out, the GC scans the objects
| reachable from some set of "roots", resets the allocation
| pointer to the start of the heap, and copies the live
| objects to the allocation pointer, bumping it along the way
| (and in the process, overwriting the dead objects). In
| practice, the scanning and the compaction may be done
| concurrently with the application, there is not just one
| allocation pointer (to avoid contention), and not the
| entire object graph needs to be scanned to free memory.
|
| The cost of Java's good GC is virtually entirely in memory
| footprint (there may be some speed costs to GC in some
| cases, but they're offset by GC's speed benefits).
|
| [1]: Java may indeed be slower than C on some workloads due
| to memory layout, but that's because of data locality, not
| GC. The upcoming inlined value objects will take care of
| that.
| pron wrote:
| https://netflixtechblog.com/bending-pause-times-to-your-
| will...
| foobazgt wrote:
| This is such a weird take. Java's been everywhere in prod in
| the industry for decades from backends running millions of
| qps to HFT to game clients.
|
| IME, the most common GC problems I see are: -
| incredibly basic settings are wrong, like the max heap size
| - the app was written with some very poor behavior, e.g.
| unreal amounts of allocation or tons of allocation thrashing
| in tenured heap.
|
| There aren't even a lot of GCs knobs to "manage" anymore
| compared to a decade ago [0]. Don't treat your GC as a black-
| box machine with infinitely available resources, and you'll
| be fine.
|
| 0) https://docs.oracle.com/en/java/javase/22/gctuning/ergonom
| ic...
| chrsig wrote:
| It was a bit of a bummer when go switched from the conservative
| gc to a precise gc. One of the implications was that they needed
| to change how interface types were represented.
|
| They had a nice little optimization for word-sized values to
| store in-place rather than as a pointer out to a value. With the
| precise gc, they had to make the change to only storing pointers,
| leading to allocating small values.
|
| I don't know if they've done work to (or perhaps better put: had
| success) regain the performance hit from the extra allocation &
| gc load.
|
| On the flip side, my experience is that they've made the pretty
| unobtrusive with regards to latency and pauses. Or perhaps I'm
| just not stressing it as much as I had in the past.
|
| Random anecdote on gc tuning: I was once dealing with a go
| process that sped up (higher throughput, lower latency) by an
| alarming amount limiting the max processors. This was many years
| ago, and I wouldn't be able to say what version of go. It was
| after you no longer had to set GOMAXPROCS, but probably not very
| long after.
|
| Performance tuning is crazy sometimes.
| znpy wrote:
| Performance tuning is still largely a dark art from what I see,
| having dabbled in the space.
|
| It's both weird and beautiful because getting an end-to-end
| understanding is instrumental, so you often have to go look at
| many marginal things that might actually play a significant
| role.
| bqmjjx0kac wrote:
| Disappointingly, it's a dark art often because the CPU is a
| black box. Intel X86 chips translate the instructions you
| give them to some internal microcode and then execute them
| speculatively, out of order, etc. I'm still mystified by the
| performance gains afforded by randomly inserting NOP
| instructions.
| pjmlp wrote:
| That is why tooling like VTune exist.
| 10000truths wrote:
| There is a plethora of information regarding instruction
| timings, throughout/latency, execution port usage, etc.
| that compilers make liberal use of for optimization
| purposes. You could, in theory, also use that information
| to establish an upper bound on how long a series of
| instructions would take to execute. The problem lies in the
| difference in magnitude between average case and worst
| case, due to dynamic execution state like CPU cache, branch
| prediction, kernel scheduling, and so on.
| cogman10 wrote:
| Fortunately, at least in my experience, the variability
| that CPUs introduce (which do matter in many contexts)
| aren't often the source of slowness. In my experience plain
| old algorithmic complexity would go a long way in making
| stuff faster.
|
| I can't tell you the number of times I've fixed code like
| this matches = [] for (first :
| items) { for (second : items) { if
| (first.name == second.name)
| matches.add(first); } }
|
| Very frequently a bright red spot in most profiler output.
| marginalia_nu wrote:
| I think there's an element of selection bias in that
| observation. Since that is the type of performance issue
| that a profiler is good at finding, those are the
| performance issues you'll find looking at a profiler.
| cogman10 wrote:
| > I think there's an element of selection bias in that
| observation.
|
| Almost certainly true, I can only speak of my own
| experiences.
|
| > Since that is the type of performance issue that a
| profiler is good at finding, those are the performance
| issues you'll find looking at a profiler.
|
| I have to disagree with you on this. Sampling profilers
| are good at finding out exactly what methods are eating
| performance. In fact, if anything they have a tendency to
| push you towards looking at single methods for problems
| rather than moving up a layer of two to see the big
| picture (it's why flame graphs are so important in
| profiling).
|
| I have, for example, seen plenty of times where the
| profiler indicated that double math was the root cause of
| problems yet popping a few layers up the stack revealed
| (sometimes non-obviously) that there was n^2 behavior
| going on.
| Conscat wrote:
| Intel also provides vtune to annotate sequenced
| instructions and profile the microseconds and power
| consumption down to the level of individual instructions.
|
| I assume those NOPs you mention exist for alignment
| padding. Clang and GCC let you configure the alignment
| padding of any or all functions, and Clang lets you
| explicitly align any for-loop anywhere you want with
| `[[clang::code_align]]`.
| soegaard wrote:
| > Do you have more details (or a reference)?
| chrsig wrote:
| It took me a bit to hunt down, but it was part of the go 1.4
| release in 2014.
|
| the release notes[0] at the time stated
|
| > The implementation of interface values has been modified.
| In earlier releases, the interface contained a word that was
| either a pointer or a one-word scalar value, depending on the
| type of the concrete object stored. This implementation was
| problematical for the garbage collector, so as of 1.4
| interface values always hold a pointer. In running programs,
| most interface values were pointers anyway, so the effect is
| minimal, but programs that store integers (for example) in
| interfaces will see more allocations.
|
| @rsc wrote in some detail[0] about it the initial layout on
| his blog.
|
| I couldn't find a detailed explanation about how the
| optimization interacted w/ the gc, but my understanding is
| that it couldn't discern between pointer and integer values
|
| [0] https://go.dev/doc/go1.4#runtime [1]
| https://research.swtch.com/interfaces
| hinkley wrote:
| The problem with precise GC is usually the same problem with
| malloc/free - if you allocate in an inner loop you have to free
| in that inner loop and the bookkeeping kills throughput.
|
| I don't know Go. Is that the problem we are seeing here?
|
| One of the realtime GC solutions that stuck with me is
| amortized GC, which might be appropriate to Go.
|
| Instead of moving dead objects immediately to the free list you
| "just" need to stay ahead of allocation. You can accomplish
| that by freeing memory every time you allocate memory - but not
| a full GC, just finishing the free of a handful of dead
| objects.
|
| That puts an upper bound on allocation time without a lower
| bound on reallocation time.
| sltkr wrote:
| That's not what precise GC means.
|
| Precise means that on a GC cycle, only true pointers to heap
| allocated objects are identified as roots, which means that
| an unreferenced object cannot be kept alive by a value that
| happens to have a similar bit pattern as a heap pointer. This
| guarantees that unreferenced objects are eventually cleaned
| up, but not that this cleanup happens immediately, certainly
| not as part of a busy loop. (Unless the system runs out of
| memory, but in that case, a GC iteration is required anyway.)
| vlovich123 wrote:
| Since reference counting is GC, isn't reference counting a
| form of precise GC? That's certainly the scenario I had in
| mind reading what OP wrote where deallocation within a hot
| loop would be possible.
| hinkley wrote:
| No it's usually called conservative unless you have loop
| detection.
| hinkley wrote:
| Oof. We used to just call that tracing.
| chrsig wrote:
| i think it is still tracing, it's just a matter of
| identifying possible roots versus actual roots and
| constraints that fall out from that.
| vlovich123 wrote:
| > if you allocate in an inner loop you have to free in that
| inner loop and the bookkeeping kills throughput
|
| I wonder if there's anything that automatically defers
| collection within a loop and collects the garbage after the
| loop is exited. Something like Obj-C's old @autoreleasepool
| but inserted automatically. Static analysis has gotten so
| fancy that that might not be a terrible idea once you work
| out all the nuances (e.g. what if the loop is too short,
| nested loops, what happens when your entire program is a loop
| like game loops, etc). Could be the best of both worlds.
|
| But generally I think it turns out that in refcounted GC
| systems you end up just knowing to not allocate/free in your
| hot loop or if you're doing it in a loop then it's probably
| not the thing your hot loop is dominated by.
| hinkley wrote:
| Generational GC - particularly with escape analysis - can
| make those temp objects just slightly more expensive than
| stack allocation.
|
| The odd thing about GenGC is that it punishes you for
| trying to recycle old objects yourself. You create much
| more pressure to scan the old generation by doing so, which
| is expensive. If you have the available memory it's often
| better to build a new object and swap it for the retained
| reference to the old one at the end when you're done (poor
| man's MVCC as well if the switch takes a couple context
| switches to finish)
| silisili wrote:
| It's always important to know your bottlenecks before tuning.
| If you're disk io bound, throwing 100 goroutines at it just
| compounds the problem, for example.
|
| One very common thing I've noticed is that people new to the
| language overuse/abuse goroutines and channels. I completely
| get why, they are two selling points you learn about early on.
| But it's really easy to make things go wrong when spamming them
| everywhere.
| cancerhacker wrote:
| "One day a student came to Moon and said, I understand how to
| make a better garbage collector. We must keep a reference count
| of the pointers to each cons. Moon patiently told the student the
| following story:
|
| One day a student came to Moon and said, I understand how to make
| a better garbage collector..."
| gok wrote:
| I would have assume the major benefit to precision is that it
| enables compaction...
| sfink wrote:
| You can still compact with a conservative scanner, you just
| have to accommodate pinned regions.
| kazinator wrote:
| Conservative GC is quite vicious against the use of lazy lists:
|
| You have code like this:
| function(make_infinite_lazy_list());
|
| where function walks down the list: fun
| function(list) { while (list) { // nil is false
| ... list = cdr(list); } }
|
| problem is, the parent stack frame contains a spurious copy of
| the original return value from make_infinite_lazy_list, and so as
| function marches down the list, the discard prefix of the list is
| not becoming garbage, as expected.
|
| This is the showstopper for conservative GC. Not the bogeyman of
| a stray machine integer suddenly looking exactly like a heap
| pointer.
|
| Stick a conservative scanner under C, make yourself some lazy
| lists, and this problem will _easily_ reproduce!
| celeritascelery wrote:
| Wouldn't a stack map have that value as well?
| kazinator wrote:
| If the compiler puts out a stack map that is conservative,
| then the GC scan will be effectively conservative. The
| compiler has to compensate for that somehow. If a temporary
| location is in the stackmap, such that the value in it is a
| dead value before a function call, the compiler has to insert
| an instruction to null out that temporary location.
| rurban wrote:
| Of course it can be faster because it's wont need a shift, ptrs
| not a mask and objects not 2 words.
|
| But usually you want to free ints which look like ptrs earlier,
| sync don't keep them around. It's rare, bug when happening a
| memory hog. Esp. On long running processes I would only do
| precise GC. Those simple bit shifts and masks cost almost nothing
| today.
|
| For C interop you need to track pointers anyway separately.
| sfink wrote:
| I find this to be more of an interesting observation than a
| reason to choose a conservative scanner. I've never really
| thought of conservative GCs as faster or slower. The stack scan
| is fast, simple, and prefetchable, but occasionally has to do
| extra work. The precise scan has to consult separate tables,
| often not stored nearby. I guess I'd expect the speed difference
| to come more as a result of conservative pointers getting pinned
| and what repercussions that has on the overall collection. I did
| think it was interesting that precise scanning can sometimes be
| more conservative than a conservative scanner because of
| excessively long live ranges. It's great to have a wingo writeup
| on these things.
|
| Precision gives you a lot more flexibility in moving data.
| Conservative scanning is way more pleasant for embedding (you
| don't have to go through a ritual to register every root). The
| difference in what gets collected rarely seems to matter much in
| practice, though the determinism can matter, especially for
| intermittently failing tests.
|
| There's a lot of path dependence in choosing one over the other.
| It's easy to go from precise to conservative, but very hard to go
| the other way. In practice, that means you tend to stick to
| whatever you've got--if you have a working precise collector,
| you're loathe to loosen it up because you might get stuck. I was
| heavily involved in moving the SpiderMonkey collector from
| conservative to precise, and I wouldn't like to lose that work,
| and I'd _really_ hate to have to do it again! And yet, the
| maintenance cost of keeping the precise collector correct is not
| low.
|
| I suspect a lot of the argument for precision boils down to: bump
| allocators go brrr. With a precise scanner, you can bump allocate
| into a young generation, move everything out during a minor GC
| without regard for conservative pinning, and then do it over
| again. No looking at allocation bitmaps or free lists or
| whatever. Easily JITted (at least for the basic allocation), good
| locality, simple code.
|
| A more tenuous guess is that a lot of the argument for being
| conservative boils down to ease of embedding. You mostly don't
| have to think about registering your stack roots, you can just
| allocate stuff and call things.
___________________________________________________________________
(page generated 2024-09-07 23:00 UTC)