[HN Gopher] Boehm-Demers-Weiser conservative C/C++ Garbage Collector
       ___________________________________________________________________
        
       Boehm-Demers-Weiser conservative C/C++ Garbage Collector
        
       Author : swatson741
       Score  : 111 points
       Date   : 2024-01-21 11:30 UTC (11 hours ago)
        
 (HTM) web link (hboehm.info)
 (TXT) w3m dump (hboehm.info)
        
       | torginus wrote:
        | Ah, the spectre of the 80s, most famously used by the ancient fork
       | of Mono that Unity uses, leading to GC-pause induced stutters
       | that have plagued gaming for over a decade.
       | 
       | I'd estimate that the bad rap GC languages get can be traced back
       | to this piece of code.
        
         | pyeri wrote:
         | You don't need to go that far for GC's bad rep considering that
         | excruciatingly slow Python can be found TODAY!
        
           | DarkNova6 wrote:
           | But you must understand, Python is safe and C is difficult.
           | Therefore, to be fast we need Python (fast development that
           | is because nobody _really_ cares about performance (or
           | maintainability)).
           | 
           | If only more than two languages would exist!
        
             | h4ch1 wrote:
             | > fast development that is because nobody _really_ cares
             | about performance (or maintainability)
             | 
             | This is either
             | 
             | 1. missing a /s
             | 
             | 2. delusion
             | 
             | 3. bait
        
               | tmtvl wrote:
               | I believe it's 1, considering the statement 'If only more
               | than two languages would exist!'. Granted, sarcasm
               | doesn't transmit well in written text.
        
           | amelius wrote:
           | Python wasn't designed for speed.
        
           | torginus wrote:
            | AFAIK (C)Python uses reference counting; its slowness comes
            | from the interpreter itself.
        
             | _flux wrote:
              | It also uses a tracing GC to handle RC loops.
              | 
              | I guess doing it this way allows mostly deterministic
              | collection (that is then just an implementation feature,
              | not language semantics) while not needing the actual
              | tracing GC to be that efficient, as it presumably needs
              | to run less frequently.
        
             | kaba0 wrote:
              | Ref counting is a GC algorithm, and the slowest at that
             | (especially with multi-core sync, but that's not the case
             | with python), as a tradeoff for less memory consumption.
        
           | adev_ wrote:
           | > You don't need to go that far for GC's bad rep considering
           | that excruciatingly slow Python can be found TODAY!
           | 
            | To be fair, the GC is probably the last reason for Python's
            | speed problems.
            | 
            | Many languages designed with a GC also perform significantly
            | better than Python in terms of raw performance [^1].
           | 
           | The extremely dynamic nature of python associated with its
           | reference semantics are probably significantly more to blame
           | here.
           | 
           | [^1] https://benchmarksgame-
           | team.pages.debian.net/benchmarksgame/...
        
         | p_l wrote:
          | Mono is nowhere close to being the major use case of it; it
          | has been successfully deployed in maaaany projects.
         | 
         | C++ crazies denigrating GC even when it visibly fixed their
         | issues, or maybe especially when it did, was a phenomenon since
         | 1990. Even when code with Boehm GC ran faster - and most
         | importantly, didn't crash all the time.
        
           | jstimpfle wrote:
           | How did GC visibly fix their issues?
           | 
           | Manually managed code doesn't crash all the time. It probably
           | crashed way more in the 90s. There is an art to writing
           | manually managed code that is maintainable and fast, sure,
           | and it doesn't involve malloc() and free(). And then there is
           | RAII, automating lots of the work.
           | 
           | The C++ crazies are AFAICS still most everywhere where there
           | is lots of temporary objects and serious requirements on
           | throughput & latency. There are frameworks and languages that
           | support garbage collection that get used in this space too,
           | but I don't know how much can be inferred from that.
           | 
           | Seems like C and C++ guys are constantly looking for ways to
           | automate their garbage cleanup.
           | 
           | And seems like GC guys are constantly looking for ways to fix
           | their web-of-objects memory layout, and to prevent GC from
           | ruining their performance. Plus, GC isn't deterministic, it
           | mostly works for memory but otherwise you'll see _lots_ of
           | manual cleanup code (not making use of the GC), and it always
           | feels a bit tricky with the GC still being part of the game.
           | Not that I take a huge issue with C-like cleanup code -- I
           | think RAII can be tricky too because it's hard to "read" what
           | happens, and I often prefer the manual cleanup code. But
           | compared to GC, at least RAII is deterministic.
        
             | pjmlp wrote:
              | The deterministic myth only works out for basic types; see
              | Herb Sutter's CppCon 2015 talk on the matter, and how
              | performance workarounds turn out to be a poor man's GC.
        
               | jstimpfle wrote:
               | What? RAII is totally deterministic, not sure what you
               | want to contend here. CppCon 2015: Herb Sutter "Writing
               | Good C++14... By Default": Not sure where I'm going to
               | find my answer here.
               | 
               | Anyway, for complex state that must be cleaned up, maybe
               | flushed before etc., you'll always find a lot of manual
               | cleanup code, or otherwise find code with lots of random
               | problems, often unreported. That is true both for GC
               | languages as well as C++ and other languages with RAII or
               | RAII-likes.
        
               | pjmlp wrote:
               | Unless you can static assert how long each destructor
               | takes, it is not deterministic, even less if ownership is
               | given to background threads for cleanup like on
               | C++/WinRT.
               | 
               | The talk is "Leak-Freedom in C++", 2016 actually,
               | https://youtu.be/JfmTagWcqoE?si=MY1JnWyybh5K8I8-
        
               | jstimpfle wrote:
               | What I want to say here is the most basic thing. The
               | order in which the destructors get called (also in the
               | context of the other code) is deterministic. GC doesn't
               | give you that. It should be obvious what I said.
        
               | pjmlp wrote:
                | Even that depends on whether constructors/destructors
                | are elided when UB or LTO are part of the picture.
               | 
               | An optimization that coroutines rely upon as means to
               | remove the abstractions of their execution.
        
               | jstimpfle wrote:
               | UB? What? That's completely off-topic. Please stop that,
               | it's annoying.
               | 
               | I can't imagine what LTO should have to do with that
               | either, but if there's any substance here please
               | enlighten me.
               | 
               | The order in which destructors get called is well defined
               | (and generally useful insofar as it's the reverse order
               | of construction). Can't you just accept a simple
               | argument? I'm not even a huge fan of RAII, but it has its
               | uses.
        
               | nwlieb wrote:
               | There's some ambiguity for argument destruction order for
               | example: https://stackoverflow.com/a/36992250
               | 
               | Similarly, the construction/destruction order for
               | std::tuple elements is not well defined.
               | 
               | Granted, that's implementation defined behavior, which is
               | technically deterministic on a single compiler.
        
               | jstimpfle wrote:
               | This isn't really about constructors/destructors.
               | Expressions like function calls with multiple arguments
               | have always been "unsequenced" with respect to each
               | other. In other words the order is left to decide for the
               | compiler. It's always been like that, going back to C
               | (and probably other languages). If you call _f(x++, x++)_
               | what values get passed to _f_ is unspecified.
               | 
               | I suppose the destruction of whatever the expressions
               | constructed still happens in reverse order of
               | construction.
               | 
               | But either way I might not even care, I'm aware that at
               | the level of a single statement the execution sequences
               | aren't much defined, so I rarely put more than one
               | mutating thing per expression, or otherwise I'm sure that
               | I don't care about the order -- I could live with any
               | sequence as well as totally parallel execution.
               | 
                | Example: _buf[i++] = 5_ has 2 mutating sub-expressions,
                | but I know it's not messing up anything. I don't care
                | whether _i_ gets incremented before _5_ gets assigned or
                | the other way around.
        
               | josefx wrote:
                | That is a one-and-a-half-hour-long talk; can you point
                | to the timestamp where he talks about determinism?
               | 
               | > and how performance workarounds turn out a poor man's
               | GC.
               | 
               | If you need performance then it doesn't matter if it is a
               | "poor mans GC", what matters is that it meets the
               | space/time requirements, most general purpose GC tend to
               | care about neither.
        
               | pjmlp wrote:
                | The only deterministic thing is the call point. Herb
                | points out that, just like calling malloc()/free(), it
                | is not possible to deterministically assert the
                | execution time, more so with complex data structures
                | where not only is there a domino effect of destructors
                | being called, there is also the possibility of causing
                | stack overflows if the destructors are written
                | incorrectly. He builds this argument to present his
                | deferred pointer class, which is basically a GC in
                | disguise.
        
               | jstimpfle wrote:
                | That is an obvious thing, and totally beside the point.
               | Also, probably GC only helps little against this. Sure,
               | actually calling the destructor can be delayed in order
               | to even out latency. You can do the same with RAII.
               | 
               | Sometimes you can't do that though because you have
               | certain constraints on the execution order of whatever
               | routines you need to run. This is completely independent
               | of whether you use manual cleanup or GC or RAII, however
               | with GC it is weird/hard/impossible to synchronize the GC
               | with the non-cleanup code.
        
             | einpoklum wrote:
              | In modern C++, you no longer need to manually manage your
             | memory (except when writing custom containers and low-level
             | libraries). Memory is managed automagically, no garbage is
             | generated and no collection is necessary.
             | 
             | Of course, you can also do manual allocations if you really
             | wanted to, but - with the changes in the language since
             | C++11 - you just don't have the motivation to do it.
        
         | mihaaly wrote:
          | I assume that real-time computing should have used real-time-
          | compatible techniques. But luckily those exposed to fields
          | other than games and game development might feel less aversion
          | towards GC, probably even the other way around.
        
         | pjmlp wrote:
          | Another reason being that many people don't get GC theory,
          | nor do they get that just because a language has some form of
          | automatic resource management, it doesn't mean the language
          | offers no other mechanisms.
        
         | yxhuvud wrote:
         | Another big reason is the lack of value types in languages like
         | Java. Modern languages with a GC tend to have those, and where
         | it makes sense to use them they help, a lot.
        
           | amelius wrote:
           | What are "value types"?
        
             | greesil wrote:
             | In Java, primitives. int, boolean, float. Allocated on the
             | stack, or the heap along with the object it is a part of.
        
             | jstimpfle wrote:
             | If you nest an object within another object you often want
             | to embed the child directly in its parent, sharing the
             | memory and avoiding the indirection that you get if you
             | nest with a reference as is the typical approach with GC
             | languages. Also embedding makes it clear that it's a 1:1
             | relation, i.e. non-optional, fixed identity etc.
             | 
             | AFAICT few GC languages allow you to do this, basically you
             | have to buy into indirection and end up with webs of
             | objects.
             | 
             | This is quite cache unfriendly, but the bigger issue for me
             | is that it makes the code less clear and requires
             | additional complexities since you can't just jump from a
             | child to its parent by subtracting an offset from its
             | address.
        
               | ghusbands wrote:
               | > makes the code less clear and requires additional
               | complexities since you can't just jump from a child to
               | its parent by subtracting an offset from its address.
               | 
               | Wait, so you're saying that code that subtracts an offset
               | from a child node to get to a parent node is clearer and
               | less complex than alternatives?
        
               | groestl wrote:
               | > Wait, so you're saying that code that subtracts an
               | offset from a child node to get to a parent node is
               | clearer and less complex than alternatives?
               | 
               | Parent made me laugh out loud on this.
        
               | jstimpfle wrote:
               | I hope you've stopped laughing and can read and
               | understand my answer.
        
               | groestl wrote:
               | Please, I did not want to offend you personally. It just
               | took me by surprise. I understand your points, even
               | though I'm with your direct reply, all things considered.
        
               | jstimpfle wrote:
               | The address offset subtraction thing specifically is used
                | quite a bit in the Linux kernel for example. It allows
                | you to combine abstract code with concrete data. For
                | example they define a red-black tree as a graph of just
                | plain nodes. These nodes are totally abstract, they
                | contain only node pointers but no useful data. The code
                | is totally abstract, no need to instantiate anything a
                | dozen times with different offsets and such. You simply
                | wrap
               | the abstract nodes with whatever concrete useful data you
               | fancy.
               | 
               | The same approach works for lists and a variety of other
               | data structures.
               | 
               | You can get from the abstract link node to the containing
               | struct with a macro: _CONTAINER_OF(ptr, containertype,
                | membername)_. It's easy to use and unlikely to get
                | wrong.
        
               | jstimpfle wrote:
               | Not having the address relationship requires additional
               | complexities. To get from the child to the parent, you
               | need to spend an additional reference field, and need to
               | keep it consistent with the other reference. If you embed
                | directly, there's nothing in the relationship of the
                | objects' identities that could ever get inconsistent.
                | It's a strong guarantee.
               | 
               | How do you get from the child to the parent in, say Java?
               | And how do you tell that a child is never swapped out?
               | 
                | How do you copy an object that holds, say, 4 vec4s,
                | vec4 being a struct that itself holds 4 floats? With
                | value types, say in C, it's 1 memcpy and you're done.
        
         | rbehrends wrote:
         | This needs some qualifications.
         | 
         | The above problem is about latency of stop the world collectors
         | in a domain that requires extremely low latency. And if you
         | think that stop the world collections are representative of
         | garbage collection as a whole (i.e. the "bad rap"), this is
          | just not up to date with the state of the art. (It's just that
         | generational/incremental collection is hard to impossible
         | without having support for write barriers/read
         | barriers/snapshotting on the compiler or hardware side, which
         | makes that a practical no-go for a language that was never
         | designed to have GC support.)
         | 
         | But in terms of throughput, the BDW GC actually performs pretty
         | well, competitive with or beating malloc()/free() libraries.
         | This is plenty enough for a number of applications, especially
         | when it comes to batch processing. In fact, even the stop the
         | world latency is (combined with parallel marking) plenty good
         | enough for a number of types of regular GUI applications, where
         | you don't have the fairly extreme latency requirements of
         | standard video games.
         | 
         | It is also worth noting that manual deallocation (especially
         | via RAII) isn't a panacea for latency, as that is just as prone
         | to large pauses due to cascading deletes [1].
         | 
         | [1] While in theory you can do that lazily, in this case you're
         | losing timeliness of deletion and may actually have worse worst
         | case latency than a modern GC that can stagger its work to have
         | bounded pause times. The benefit of RAII is that you may be
         | able to plan these deletions, but even that can be surprisingly
         | difficult at times, requiring extra management to keep data
         | structures alive longer than normal. [2]
         | 
         | [2] Note that lazily deleting objects in RC to avoid pause
         | times is not necessarily the answer; for one thing, you're
         | introducing much of the complexity associated with incremental
         | GC again (especially having to do X amount of deallocation work
         | for each allocation), or risking considerable space overhead.
         | https://dl.acm.org/doi/abs/10.1145/964001.964019 It also
         | remains a difficult challenge when you're dealing with objects
         | that aren't of bounded size (e.g. large arrays).
        
           | cplusplusfellow wrote:
           | For all of the reasons you note above, this is why I've
           | almost exclusively dealt with arena-style allocations of
           | fixed-object-sizes when doing ultra low latency work.
        
             | iainctduncan wrote:
             | Hi, I'm just starting a (long) PhD in computer music
             | languages, and one of my hoped-for projects is improving
             | the GC for the Scheme implementation I work in for soft
             | realtime use. If you (or anyone else) wouldn't mind sharing
              | suggestions on resources for learning about low-latency
             | modern GCs, it would be most appreciated. :-)
        
               | moonchild wrote:
               | read the garbage collection handbook -
               | https://gchandbook.org/
        
           | dataangel wrote:
           | I have been writing low latency code for a decade and never
           | seen cascading deletes be a problem. GC on the other hand is
           | a constant latency problem for people.
        
             | p_l wrote:
             | running a throughput-oriented GC instead of
             | latency/realtime one is a common error of those people
             | (also, throughput oriented ones are usually the default)
        
             | rbehrends wrote:
             | > I have been writing low latency code for a decade and
             | never seen cascading deletes be a problem.
             | 
             | I don't know what your domain is, but the domains I'm
             | familiar with (especially hard real-time) deal with the
             | underlying constraints by way of restricting how you write
             | programs. A side effect of these restrictions is that you
              | generally avoid using constructs that give rise to cascading
             | deletes. But ultimately, having to deal with these
             | restrictions is suboptimal.
             | 
             | I remember a talk by Gil Tene (CTO and co-founder of Azul
             | Systems) about how the problem with writing real-time Java
             | was that it historically required you to write non-
             | idiomatic Java code, which came at a serious cost to reuse
             | and where he argued that one of the benefits of Azul's GC
             | was that you could write (mostly) idiomatic Java and still
             | have GC noise be less than OS noise.
             | 
             | There is of course the actual problem that the stricter
             | your latency requirements are, the fewer GCs actually meet
             | them. Plenty of modern off the shelf GCs meet typical soft
             | real time requirements, but the harder your real time
             | requirements get, the fewer implementations actually meet
             | them (and are increasingly more expensive).
             | 
             | I'll also add that depending on how strict your latency
             | requirements are, off the shelf allocators may not meet
             | them, either, even for a single malloc() (though there
             | _are_ allocators that offer hard upper bounds for
             | allocation costs, such as TLSF).
        
           | gumby wrote:
           | > It's just that generational/incremental collection is hard
           | to impossible without having support for write barriers/read
           | barriers/snapshotting on the compiler or hardware side, which
           | makes that a practical no-go for a language that was never
           | designed to have GC support.
           | 
           | It was demonstrated in the mid 80s that the MMU could be used
           | to implement the write barrier (when I saw that I knew lisp
           | machines were a dead end, though I kept using them for as
           | long as was feasible). You can use this and other compiler
           | support to implement it in a language not designed for GC
           | support, no problem.
           | 
           | The issue I think you meant is that you can't have a
           | _transporting_ collector (which also compacts your working
           | set, even allowing you to return memory to the system). A
           | language like C++ doesn't anticipate an object's address ever
           | changing, so that's a no go.
           | 
           | I suppose you could write special shared_ptr / unique_ptr
           | that supported address mutation but you'd need a compiler
           | flag to check for an object's address being taken... but to
           | some degree the destructor paradigm renders a lot of this
           | moot.
        
             | rbehrends wrote:
             | > It was demonstrated in the mid 80s that the MMU could be
             | used to implement the write barrier (when I saw that I knew
             | lisp machines were a dead end, though I kept using them for
             | as long as was feasible). You can use this and other
             | compiler support to implement it in a language not designed
             | for GC support, no problem.
             | 
             | Yes, that's why I wrote about support "on the compiler or
              | _hardware_ side" (i.e. MMUs). That said, while this might
             | work in theory, the BDW GC does have support for it, and in
             | practice it doesn't actually work well. Throughput suffers
             | and pause times can actually get worse.
             | 
             | By the way, another alternative way of having incremental
             | collection without compiler support is to use fork() for
             | snapshotting (there's an option in D for that), but that
             | has its own issues, of course.
             | 
             | > The issue I think you meant is that you can't have a
             | transporting collector (which also compacts your working
             | set, even allowing you to return memory to the system). A
             | language like C++ doesn't anticipate an object's address
             | ever changing, so that's a no go.
             | 
             | No, I wasn't talking about a compacting collector. I very
             | specifically meant generational and/or incremental
             | collection.
        
               | gumby wrote:
               | > No, I wasn't talking about a compacting collector. I
               | very specifically meant generational and/or incremental
               | collection.
               | 
               | FWIW they are orthogonal
        
           | zozbot234 wrote:
           | The compiler support you need in practice is quite limited.
           | There is an implementation of incremental cycle collection in
           | Rust: https://github.com/chc4/samsara It's made possible
           | because Rust can tell apart read-only and read-write
           | references (except for references to interior mutable
           | objects, but these are known to the compiler and references
           | to them can be treated as read-write). This avoids a global
           | stop-the-world for the entire program.
           | 
           | Cascading deletes are rare in practice, and if anything they
           | are inherent to _deterministic_ deletion, which is often a
           | desirable property. When they 're possible, one can often use
           | arena allocation to avoid the issue altogether since arenas
           | are managed as a single allocation.
        
             | rbehrends wrote:
             | > The compiler support you need in practice is quite
             | limited.
             | 
             | This is true, but it's absent especially in C. Of course,
             | as long as you can intercept writes and/or reads through
             | pointers, you are good to go in principle. However, that
             | solution may not be efficient.
             | 
             | > Cascading deletes are rare in practice
             | 
             | An array of objects (e.g. std::vector<std::string>) already
             | has unbounded deallocation cost. People mostly think of
             | tree-like structures here, but such structures only add the
             | risk of stack blowout to the lack of an upper bound. But in
             | principle, any container that does not have a size limit
             | can give rise to unbounded pause times.
        
           | torginus wrote:
            | The problem isn't mark-and-sweep GCs, it's that the Boehm
            | GC is the Ford Model T of garbage collectors.
        
             | rbehrends wrote:
             | Absolutely not. While it obviously should not be used for
             | use cases that it isn't designed for, as a conservative
             | mark-and-sweep GC it is extremely well engineered and has
             | very good throughput.
        
       | itsphilos wrote:
       | My experience with Cpp isn't that extensive, but what is the use-
       | case of a garbage collector in this language? I always had the
       | impression that with the well-defined lifetimes of objects, you
       | wouldn't really create garbage to begin with, but I guess there
       | are some use-cases I don't yet know about?
        
         | FartyMcFarter wrote:
         | > what is the use-case of a garbage collector in this language?
         | 
         | Same as other languages.
         | 
         | > I always had the impression that with the well-defined
         | lifetimes of objects, you wouldn't really create garbage to
         | begin with
         | 
         | There's no well defined lifetime of objects when it comes to
         | dynamic allocation. For example, if you allocate something with
         | the _new_ keyword, there are no language guarantees that this
          | won't leak memory.
        
         | pjmlp wrote:
         | Targeting .NET, hence C++/CLI, integration with Blueprints,
         | hence Unreal C++.
        
         | mike_hearn wrote:
         | It's pretty useful. Chrome uses one extensively for example
         | (called Oilpan). So does Unreal Engine. GC in C++ is much more
         | widely used than people realize.
         | 
         | The problem is that big programs in the real world often don't
         | have well defined lifetimes for everything. The idea that
         | everything in your app can have its lifetime worked out in
         | advance isn't true when you're modelling the world (a world),
         | or lifetimes are controlled by other people (e.g. website
         | authors).
         | 
         | Generally what you see is that these apps start out trying to
         | do manual memory management, decide it's too difficult to do
         | reliably at scale, and switch to a conservatively GCd heap with
         | a custom collector or Boehm.
         | 
         | Note that Rust doesn't fix this problem. Rust just encodes the
         | assumption of pre-known lifetimes much more strongly in the
         | language. If you're not in such a domain then you have to fall
         | back to refcounting, which is (a) slow and (b) easy to get
         | wrong such that you leak anyway. Early versions of Rust used
         | refcounting for everything and iirc they found anywhere between
         | 10-15% of the resulting binaries was refcounting instructions!
        
           | steveklabnik wrote:
           | > Early versions of Rust used refcounting for everything and
           | iirc they found anywhere between 10-15% of the resulting
           | binaries was refcounting instructions!
           | 
           | Do you happen to have a citation for this? I don't remember
           | ever hearing about it, but it's possible this was before my
           | time, as I started in the smart pointers era.
        
             | mike_hearn wrote:
             | I read it in an old HN comment by pcwalton. Algolia to the
             | rescue! Actually it's even worse, he said it was 19% of the
             | binary size:
             | 
             | https://news.ycombinator.com/item?id=5691684
             | 
             |  _The Rust compiler does this. Even so, 19% of the binary
             | size in rustc is adjusting reference counts.
             | 
             | I am not exaggerating this. One-fifth of the code in the
             | binary is sitting there wasted adjusting reference counts.
             | This is much of the reason we're moving to tracing garbage
             | collection._
             | 
             | It's interesting how many strategies Rust tried before
             | settling on linear types.
        
               | steveklabnik wrote:
               | Thank you!!!
        
               | klodolph wrote:
               | > It's interesting how many strategies Rust tried before
               | settling on linear types.
               | 
               | Rust doesn't actually have linear types. I'm not sure
               | what Rust's types are called (affine?), but linear types
               | are the "must be consumed" (can't leak) types, and Rust
               | doesn't have any support for this.
               | 
               | Rust's guarantee is that you MUST NOT use an object after
               | dropping it. Linear types would add the additional
               | requirement that you MUST drop the object.
        
               | steveklabnik wrote:
               | Rust's types are affine, yes. Sometimes some people don't
               | draw the distinction between the two and call both
               | "linear." But "affine" is more precise.
        
               | pcwalton wrote:
               | Fascinating, thank you!
               | 
               | (I completely forgot I wrote that and also I'm mildly
               | embarrassed at how I used to write back then.) :)
        
           | adgjlsfhk1 wrote:
            | It's worth noting that percent of instructions is a bad
            | metric, since modern CPUs have lots of extra compute, so
            | adding simple integer instructions that aren't on the
            | critical path will often not affect the wall time at all.
        
             | mike_hearn wrote:
             | The problem is that once threading gets involved they have
             | to be interlocked adjustments and that's very slow due to
             | all the cache coherency traffic. Refcounts are also
             | branches that may or may not be well predicted. And you're
             | filling up icache, which is valuable.
        
               | vlovich123 wrote:
                | You could have a hybrid refcount where you use plain
                | integers when adjusting the ref count on the current
                | thread and atomics to adjust the global count when you
                | hit 0 (hybrid-rc is the crate you can use, but
                | something Swift-like where the compiler does ARC for
                | specific values when you opt in may not be the worst).
                | Also, when the type isn't Send you don't need to do
                | atomic refcounts, although the interaction with unsafe
                | does get a little more complex.
                | 
                | But yeah, at this point Rust's niche is as a competitor
                | to C/C++, and in that world implicit refcounting
                | doesn't have a lot of traction; people favor explicit
                | GC and "manual" resource management (RAII mechanisms
                | like Drop and destructors are ok).
        
           | bfgeek wrote:
            | Oilpan can theoretically be used as a library as well - see:
            | https://v8.dev/blog/oilpan-library
            | https://v8.dev/blog/tags/cppgc
            | https://chromium.googlesource.com/v8/v8.git/+/HEAD/include/c...
           | 
            | Due to the nature of web engine workloads, migrating
            | objects to being GC'd isn't a performance negative
            | (contrary to what most people would expect). With care it
            | can often end up performance positive.
           | 
           | There are a few tricks that Oilpan can apply. Concurrent
           | tracing helps a lot (e.g. instead of
           | incrementing/decrementing refs, you can trace on a different
           | thread), in addition when destructing objects, the
           | destructors typically become trivial meaning the object can
           | just be dropped from memory. Both these free up main thread
           | time. (The tradeoff with concurrent tracing is that you need
           | atomic barriers when assigning pointers which needs care).
           | 
            | This is on top of the safety improvements you gain from
            | being GC'd vs. smart pointers, etc.
            | 
            | One major tradeoff is that UAF bugs become more difficult
            | to fix, as you are just accessing objects which "should" be
            | dead.
        
             | vlovich123 wrote:
             | > One major tradeoff that UAF bugs become more difficult to
             | fix, as you are just accessing objects which "should" be
             | dead.
             | 
             | Are you referring to access through a raw pointer after
             | ownership has been dropped and then garbage collection is
             | non deterministic?
        
               | bfgeek wrote:
               | > Are you referring to access through a raw pointer after
               | ownership has been dropped and then garbage collection is
               | non deterministic?
               | 
                | No - basically objects sometimes have some state of
                | when they are "destroyed", e.g. an Element detached
                | from the DOM tree[1]. Other parts of the codebase might
                | have references to these objects, and previously
                | accessing them after they were destroyed would be a
                | UAF. Now it's just a bug. This is good! It's not a
                | security bug anymore! However, it's much harder to
                | determine what is happening as it isn't a hard crash.
               | 
               | [1] This isn't a real case, just an example.
        
            | uluyol wrote:
            | Not sure early versions of Rust are the best example of
            | refcounting overhead. There are a bunch of tricks you can
            | use to decrease that, and it usually doesn't make sense to
            | invest too much time into that type of thing while there is
            | so much flux in the language.
           | 
           | Swift is probably a better baseline.
        
             | EPWN3D wrote:
             | Yeah I was thinking the same thing. "10 years ago the Rust
             | compiler couldn't produce a binary without significant size
             | coming from reference counts after spending minimal effort
             | to try and optimize it" doesn't seem like an especially
             | damning indictment of the overall strategy. Rust is a
             | language which is sensitive to binary size, so they
             | probably just saw a lower-effort, higher-reward way to get
             | that size back and made the decision to abandon reference
             | counts instead of sinking time into optimizing them.
             | 
             | It was probably right for that language at that time, but I
             | don't see it as being a generalizable decision.
             | 
             | Swift and ObjC have plenty of optimizations for reference
             | counting that go beyond "Elide unless there's an ownership
             | change".
        
           | pcwalton wrote:
           | > Note that Rust doesn't fix this problem. Rust just encodes
           | the assumption of pre-known lifetimes much more strongly in
           | the language. If you're not in such a domain then you have to
           | fall back to refcounting, which is (a) slow and (b) easy to
           | get wrong such that you leak anyway. Early versions of Rust
           | used refcounting for everything and iirc they found anywhere
           | between 10-15% of the resulting binaries was refcounting
           | instructions!
           | 
           | Well, modern idiomatic Rust only uses Arc/Rc on the few
           | objects where it's needed, so the overhead of reference count
           | adjustment is so tiny as to never show up. You typically only
           | see reference count traffic be expensive when either (a)
           | _everything_ is reference counted, as in ancient Rust; or (b)
           | on super-inefficient implementations of reference counting,
           | as in COM where AddRef() and Release() are virtual calls.
        
         | Jeaye wrote:
         | I'm using C++ to build jank, a native Clojure dialect on LLVM
         | with C++ interop. All of Clojure's runtime objects are
         | dynamically allocated and the churn of reference counting is
         | far too slow, compared to garbage collection. I had originally
         | started with an intrusive_ptr and atomic count and the Boehm GC
         | was about 2x faster for that benchmark (and at least as much
         | for every later benchmark).
         | 
         | Even outside of writing languages on top of C++, if you're
         | using something like immer, for persistent immutable data
         | structures in C++ (as jank does), it has memory policies for
         | reference counting or garbage collection. This is because
         | immutable data generally results in more garbage, even when
         | transients are used for the most pathological cases. That
         | garbage is the trade off for knowing your values will never be
         | mutated in place. The huge win of that is complete thread
         | safety for reading those values, as well as complete trust on
         | reproducibility/purity and trivial memoization.
        
       | habibur wrote:
       | In C++ you don't need GC. Use RAII. Objects only need a
       | constructor, destructor, and copy-constructor for a very
       | efficient reference-counting GC that's almost impossible to
       | leak.
       | 
       | In C you can construct your hackish tracing GC yourself. For
       | root objects, save the stack address at the beginning of main()
       | by saving the address of a dummy stack variable:
       |             int dummy = 0;
       |             void* stack_head = &dummy;
       | 
       | Later, during GC, get the current stack address again and trace
       | from the old one to the new. You also need to save the
       | registers, which can be done in a semi-portable way using the
       | Unix setjmp() / longjmp() functions, which save the register
       | values into a structure that you can then trace. There's also
       | the issue of split stacks to keep in mind. But the positive side
       | is that memory alignment ensures pointers are all 8-byte
       | aligned, so you only need to read 8-byte-aligned addresses in
       | between. It's a bit hackish regardless.
       | 
       | I built a few apps using the above technique but am now more
       | comfortable with old-style manual memory management. Valgrind
       | will tell you exactly which line allocated the memory that
       | leaked.
        
         | pjmlp wrote:
         | In theory, yes.
         | 
         | In practice, if you care about performance, one ends up
         | creating a pseudo-GC like Herb Sutter has shown in his CppCon
         | talk about deferred_ptr, or why hazard pointers are coming to
         | C++26.
         | 
         | Then there is the point that RAII handles have to be written
         | in an exception-safe way, and also be movable if used as part
         | of other data structures.
        
           | mgaunard wrote:
           | I wouldn't call hazard pointers garbage collection.
           | 
           | Also, I don't think that's necessarily as critical as you
           | suggest. I've been working with performance-critical software
           | for many years and never saw a strong need for general-
           | purpose concurrent reclamation techniques.
        
         | rst wrote:
         | The RAII/ref-counting technique often works great, but not
         | always. Cyclic structures are a problem -- so, if you've got,
         | say, a global tree structure with bidirectional parent/child
         | links, you'll need some kind of special-casing to make ref-
         | counting clean up an entire abandoned subtree. A full GC does
         | take care of cases like this.
        
           | TinkersW wrote:
           | That is super rare and we have weak_ptr like objects for
           | that.
        
         | nemetroid wrote:
         | Parallel reference counting can have extremely poor
         | performance.
        
       | shrubble wrote:
       | Bjarne Stroustrup's view:
       | https://www.stroustrup.com/bs_faq.html#garbage-collection
        
       | nickpsecurity wrote:
       | I think a mix of methods is best. We can use non-GC methods for
       | whatever is easiest to prove safe. Python, for instance, uses
       | reference counting by default, with a GC for the cases where
       | that would be a problem. I also prefer when the GC lets me tell
       | it to immediately delete something, which it then just checks
       | incrementally. Performance-sensitive modules can be coded
       | without GC using tools like Meta's Infer, which works on large
       | modules.
       | 
       | For the GC component, there's two options I'd like to see more
       | exploration of. One is a modern, concurrent GC to reduce the
       | pauses people here are worried about. Immix [1] is an interesting
       | design I remember.
       | 
       | The other option is a runtime supporting multiple, app-specific
       | GC's without much tuning. JX OS was the first variation of that
       | I saw [2]. I just stumbled on this other paper [3]. So, we have
       | a global memory manager divided into chunks managed differently
       | across apps or components.
       | 
       | [1] https://www.cs.cornell.edu/courses/cs6120/2019fa/blog/immix/
       | 
       | [2] https://www4.cs.fau.de/Projects/JX/publications.html (It was
       | in 'Beyond Address Spaces.')
       | 
       | [3] https://dl.acm.org/doi/abs/10.1145/1029873.1029880
        
       | truenorth wrote:
       | Incremental alternatives to stop-the-world garbage collectors
       | were well known over 30 years ago.
        
         | mgaunard wrote:
         | The problem with the Boehm GC has always been that it's
         | conservative.
         | 
         | If it was something built into the language then you could do
         | much better, but then that language wouldn't be C++.
        
       | NelsonMinar wrote:
       | What I love about the Boehm collector is the hack that enables
       | it to work in a language with no runtime typing like C. It
       | literally _guesses_ what's a pointer. In C pointers are
       | basically integers, but they're unusual numbers, often quite
       | large, and you know the entire set of valid values if you
       | control malloc(). Surprisingly this worked even on 32-bit
       | architectures.
        
         | chubot wrote:
         | I don't see that this is something to "love", and what you
         | describe doesn't capture the full extent of the approximation.
         | 
         | What scared me off of Boehm GC is that there are a bazillion
         | config options -- e.g. it requires tuning and blacklisting to
         | avoid memory leaks -- and probably even correctness issues,
         | because register scanning isn't portable.
         | 
         | The codebase is large and a bit scary, and there are seemingly
         | lots of ways to use it poorly. People kept saying it's "drop
         | in", but that seems like a myth.
         | 
         | At least personally it doesn't seem like the kind of thing I
         | want to debug.
         | 
         | The opinions on Boehm generally seem very split -- a bunch of
         | bad stories resulting in ripping it out of the codebase, and
         | also people who think it's surprisingly good, and codebases
         | that have used it for a very long time.
         | 
         | I think the Nix evaluator uses it and is carrying around
         | patches, and Guile Scheme uses it. Though I think there is some
         | desire to move away from it in both cases. But again people
         | have used it for a very long time, which isn't nothing!
         | 
         | ---
         | 
         | e.g. https://www.hboehm.info/gc/gcdescr.html
         | 
         |  _Recent versions of the collector use an approximate best fit
         | algorithm by keeping free lists for several large block sizes.
         | The actual implementation of GC_allochblk is significantly
         | complicated by black-listing issues (see below)._
         | 
         | ...
         | 
         |  _The collector implements black-listing of pages, as described
         | in Boehm, ``Space Efficient Conservative Collection'', PLDI
         | '93, also available here._
         | 
         |  _During the mark phase, the collector tracks ``near misses'',
         | i.e. attempts to follow a ``pointer'' to just outside the
         | garbage-collected heap, or to a currently unallocated page
         | inside the heap. Pages that have been the targets of such near
         | misses are likely to be the targets of misidentified
         | ``pointers'' in the future. To minimize the future damage
         | caused by such misidentifications they will be allocated only
         | to small pointerfree objects._
         | 
         |  _The collector understands two different kinds of black-
         | listing. A page may be black listed for interior pointer
         | references (GC_add_to_black_list_stack), if it was the target
         | of a near miss from a location that requires interior pointer
         | recognition, e.g. the stack, or the heap if
         | GC_all_interior_pointers is set. In this case, we also avoid
         | allocating large blocks that include this page._
        
           | NelsonMinar wrote:
           | "love" in the sense of someone loving their snaggletooth
           | mutt. What's remarkable to me is despite the obviously
           | terrible idea at the core of it, it seems to work pretty well
           | for some practical use. (IIRC it was at the core of GNU Java
           | as well, and I could have sworn the earliest Sun JVMs used
           | it.)
           | 
           | I don't doubt it caused all sorts of problems too. It's just
           | an interesting reminder that sometimes a really wacky hack
           | can be pretty useful.
        
       | m4nu3l wrote:
       | I once wrote a non-conservative generational GC for C++ just as
       | an exercise, with the constraint that I could only use standard
       | C++ (it was in C++17).
       | 
       | It worked based on a template type I called "gc_ptr<T>". So you
       | could create one of these with the function template
       | "make_gc(...)". This was the only way you could allocate a new
       | garbage-collected object.
       | 
       | "make_gc" would check if a type descriptor for the type was
       | already initialized and if not it would allocate a new one. It
       | would then set a thread_local with the descriptor and the address
       | of the current object being created.
       | 
       | Then it would call the object constructor. If the type being
       | instantiated had gc_ptr members, these would be initialized by
       | their constructor. The gc_ptr constructor would, in turn, check
       | the descriptor of the parent object and add a record to it
       | representing the member. The record would store the member offset
       | calculated from its address and the one of the parent object.
       | 
       | Garbage-collected objects would also use reference counting for
       | gc_ptr(s) on the stack or inside objects that were not garbage-
       | collected. It's easy to know if a gc_ptr is being constructed
       | inside a garbage-collected object or not: If it is, then the code
       | is also executing inside make_gc, so I just need a thread_local
       | flag.
       | 
       | For the mark and sweep, I would use an atomic flag to mark the GC
       | as running. If the flag was set, then all gc_ptr would stop the
       | thread (by waiting on a mutex) as soon as some code would try to
       | swap/reassign them. This means that code that didn't touch them
       | would be kept running during garbage collection.
       | 
       | I would start marking objects from those allocated in the custom
       | arena I used that had a reference count different from zero.
       | 
       | I wanted to add containers that could allocate contiguous GC'ed
       | objects by wrapping the standard containers but never finished
       | that. GC'ed containers of gc_ptr(s) just worked fine.
       | 
       | I just re-ran the benchmarks I wrote to compare it against the
       | Boehm-Demers-Weiser GC. Mine is quite a bit faster at the mark-
       | and-sweep (up to an order of magnitude faster with thousands of
       | objects and a bit less than 100 MB of GC'ed memory in total).
       | 
       | However, because of the atomic check, it was slower at swapping
       | pointers, by roughly an order of magnitude.
       | 
       | I never used it in any project.
        
         | frutiger wrote:
         | Did you ever publish this library?
        
       | rurban wrote:
       | I have a library with an extremely slow free, around 2 minutes
       | for large files because of unnaturally scattered allocation
       | patterns, but this old conservative GC didn't help at all. It
       | was about 40% slower with libgc. mimalloc was a bit better. Best
       | would be an arena allocator without free, or at least a properly
       | fast GC like MPS (https://github.com/Ravenbrook/mps), but this
       | would be too much work.
        
         | CyberDildonics wrote:
         | If you have a program that takes 2 minutes to free allocations,
         | that implies 100 million to 1 billion small allocations. An
         | arena is not the answer, a normal data structure is (most
          | likely a vector). I'm not sure an arena is ever really the
          | answer, because it seems like a poor workaround for multiple
          | small allocations, which shouldn't happen in the first place.
         | 
         | Maybe that time is coming from something even more ridiculous
         | like scanning an entire linked list to find each element, but
         | if you can open up that library there are without a doubt
         | terrible memory allocation techniques that should never happen
         | going on.
        
       | Jeaye wrote:
       | I have been using the Boehm-Demers-Weiser GC for jank [1], the
       | native Clojure dialect on LLVM with C++ interop. Since jank is a
       | Clojure dialect, and Clojure is built primarily on persistent,
       | immutable data structures, there are potentially a lot of
       | references to objects, potentially cyclical, across many threads,
       | and there's a lot of garbage being churned. I originally started
       | with reference counting and RAII, using boost::intrusive_ptr and
       | an atomic count. The GC was actually 2x faster, in the sequence
       | benchmark I was trying to optimize.
       | 
       | At this point, jank is generally several times faster than the
       | equivalent Clojure JVM code. I'm sure there are better GCs out
       | there in terms of performance, and I have my eye on MMTK [2] for
       | a future upgrade, but the fact remains: the Boehm GC is stupid
       | simple to integrate and it's surprisingly fast. Compare it to
       | MPS, MMTK, and others, and both the documentation and the actual
       | dev work required are worlds apart.
       | 
       | For a project which needs a GC but doesn't need to pick the
       | absolute fastest one first, it seems like the best option, based
       | on my research and my experience.
       | 
       | 1: https://jank-lang.org/
       | 
       | 2: https://www.mmtk.io/code
        
         | pharmakom wrote:
         | Immutable structures cannot be cyclical. Maybe I'm missing
         | something?
        
           | nlitened wrote:
           | In Clojure, the data structures may usually contain
           | references to mutable cells, which in turn may form cycles.
        
           | Jeaye wrote:
           | Exactly as nlitened mentioned, Clojure has a few difference
           | reference types, which are boxes that hold pointers to more
           | immutable values. The values within those boxes can be
           | changed transactionally. This allows for cyclical references.
        
         | mtlmtlmtlmtl wrote:
          | Just want to say, never heard of jank before and it looks
          | awesome, great work!
          | 
          | As more of a CL guy who's dabbled in Clojure but found it a
          | bit too slow (though conversely I also miss Clojure's data
          | structures and transactional state in CL; multithreaded
          | programming really isn't CL's strong suit, for obvious
          | historical reasons), this might reinvigorate me. I may sit
          | down one day soon and churn out some AOC or something in jank
          | just to get a feel.
         | 
         | Cheers :)
        
           | Jeaye wrote:
           | The shameless plugging pays off! One more HN reader
           | converted. :D
           | 
           | Check it out, join our Slack, and stay tuned. jank's still
           | under heavy development right now, and not ready for AOC this
           | year, but it's getting close. The blog is the best way to
           | watch the progress: https://jank-lang.org/blog/
           | 
           | Thanks for taking the time to comment!
        
       | dang wrote:
       | Related. Others?
       | 
       |  _A Garbage Collector for C and C++_ -
       | https://news.ycombinator.com/item?id=9862135 - July 2015 (73
       | comments)
        
       | gregfjohnson wrote:
       | This thread brings to mind an issue I've wondered about: For hard
       | real-time systems, do people use specialized CPU's and hardware?
       | Caches can introduce hard-to-predict timing issues. I'm sure
       | there are other aspects of modern hardware that present similar
       | problems; large pipelines? Speculative instruction execution?
       | Does Intel Time-Coordinated Computing address some of these
       | hardware issues? Are there embedded CPUs from TI, Intel, AMD,
       | etc. that are specifically designed to support predictable and
       | understandable "lockstep" hard real-time behavior?
        
         | deschutes wrote:
         | It's an interesting question and one I'm unqualified to answer.
         | From a brief survey I gather at least caching is enabled for
         | some hard real time systems. The name of the game seems to be
         | employing caches in ways that provably decrease worst case
         | execution time. For example, ignoring associativity, if your
         | task's worst case working set fits into a given cache level,
         | you can reason that the task misses each line at most once.
         | That's almost certainly a dramatic improvement in wcet over
         | cache disabled.
         | 
         | Worst case execution time with pipelining should be similarly
         | measurable given a worst case execution (which seems in any
         | case necessary to demonstrate a bound on the task, even in a
         | lockstep system).
         | 
          | See https://cs-people.bu.edu/rmancuso/files/papers/cache_survey_...
        
       | HarHarVeryFunny wrote:
       | I guess garbage collection has its uses, but it's very niche.
       | Since C++11 (i.e. 2011: STL, unique_ptr, shared_ptr), any need
       | for it in C++ has all but disappeared. For the most part you are
       | using STL data structures, and if you need something custom then
       | just use a class with a smart-pointer member as owner, and you
       | get memory management for free (and a lot more performant than a
       | garbage collector).
        
         | Rochus wrote:
         | There are a lot of cases where unique and shared pointers are
         | not a good solution, and where a GC would help (e.g. when
         | there is no clear ownership hierarchy but a lot of mutual
         | dependencies between objects). And not to forget that smart
         | pointers come with a non-negligible cost.
        
       | Animats wrote:
       | > In our experience, the only examples we have found of a failure
       | with the current collector, even in multi-threaded code, were
       | contrived.
       | 
       | I once found a real incompatibility. This collector assumes that
       | a pointer to a block must point to somewhere in the block, or,
       | in later versions, at least somewhere really close.
       | 
       | The original Numerical Recipes in C was a conversion of Numerical
       | Recipes in FORTRAN. FORTRAN arrays start at 1. So, NR in C had an
       | array allocator which returned an array address computed such
       | that subscripts starting at 1 would work.[1] If you allocated a
       | 100x100 array of 4-byte f32 values, the array address would be
       | offset downward by (100+1)*4 bytes. That would not look like a
       | pointer to the array to the garbage collector.
       | 
       | By the third edition, the Numerical Recipes people had converted
       | to C++, dropped FORTRAN array compatibility, and the problem
       | went away.
       | 
       | [1] http://s3.amazonaws.com/nrbook.com/book_C210.html page 19.
        
       | einpoklum wrote:
       | Related C++ Stackoverflow question/discussion:
       | 
       | "Why Doesn't C++ have a garbage collector?"
       | https://stackoverflow.com/q/147130/1593077
       | 
       | and my answer: https://stackoverflow.com/a/48046118/1593077
       | 
       | ... which develops Bjarne's classic quote:
       | 
       | "I don't like garbage. I don't like littering. My ideal is to
       | eliminate the need for a garbage collector by not producing any
       | garbage. That is now possible."
        
       ___________________________________________________________________
       (page generated 2024-01-21 23:00 UTC)