[HN Gopher] Mimalloc Cigarette: Losing one week of my life catch...
       ___________________________________________________________________
        
       Mimalloc Cigarette: Losing one week of my life catching a memory
       leak (Rust)
        
       Author : Patryk27
       Score  : 110 points
       Date   : 2024-08-21 15:09 UTC (7 hours ago)
        
 (HTM) web link (pwy.io)
 (TXT) w3m dump (pwy.io)
        
       | kibwen wrote:
       | Level 1 systems programmer: "wow, it feels so nice having control
       | over my memory and getting out from under the thumb of a garbage
       | collector"
       | 
       | Level 2 systems programmer: "oh no, my memory allocator is a
       | garbage collector"
        
         | seanthemon wrote:
         | At the very bottom of everything is a garbage collector..
        
           | riwsky wrote:
           | Market forces: the ultimate garbage collector
        
           | hinkley wrote:
           | Soil is just the biggest swap meet in the world. Where every
           | microbe, invertebrate and tree is just looking for someone
           | else's trash to turn into treasure.
        
         | amelius wrote:
         | Level 3 system programmer: "get me out of this
         | straitjacket and give me my garbage collector back so I
         | can get stuff done"
        
           | forrestthewoods wrote:
           | No. Just no.
           | 
           | For as painful as the debugging story was I have spent
           | _vastly_ more amounts of time working around garbage
           | collectors to ship performant code.
        
             | 0x457 wrote:
             | What, you don't like doing GC only every N requests
             | (ruby web servers), disabling GC completely during
             | working hours (java stock trading), or fake-allocating
             | large buffers (go's allocate-and-don't-use trick)?
        
               | pton_xd wrote:
               | Ain't nothin' wrong with configuring V8 to have unbounded
               | heap growth, disabling the memory reducer, and then
               | killing the process after a while.
        
               | mike_hearn wrote:
               | The Java shops you're thinking of didn't disable GC
               | during working hours, they just sized the generations to
               | avoid a collection given their normal allocation rates.
               | 
               | But there were / are also plenty of trading shops that
               | paid Azul for their pauseless C4 GC. Nowadays there's
               | also ZGC and Shenandoah, so if you want to both allocate
               | a lot and also not have pauses, that tech is no longer
               | expensive.
        
               | 0x457 wrote:
               | > The Java shops you're thinking of didn't disable GC
               | during working hours, they just sized the generations to
               | avoid a collection given their normal allocation rates.
               | 
               | Well, I just trivialized it. However, in one case in mid
               | 00s, I saw it disabled completely to avoid any pauses
               | during trading hours.
        
               | wpollock wrote:
               | Used to do this in C decades ago. Worked on Unix but I
               | doubt it works on Linux today, unless you disable memory
               | overcommit completely.
        
             | neonsunset wrote:
             | I'd wager it was an issue with the language of choice (or
             | its GC) being rather poorly made performance-wise or a
             | design that does not respect how GC works in the first
             | place :)
        
               | forrestthewoods wrote:
               | I'm sure you would! GC is like communism. Always some
               | excuse as to why GC isn't to blame.
               | 
               | > or a design that does not respect how GC works in the
               | first place
               | 
               | It's called shipping a 90 Hz VR game without dropping
               | frames.
        
               | neonsunset wrote:
               | Aside from finding the analogy strange and unfortunate, I
               | assume you're talking about Unity, is that correct?
               | 
               | (if that is the case, I understand where the GC PTSD
               | comes from)
        
             | gnuvince wrote:
             | I need to find a pithy way to express "we use a garbage
             | collector to avoid doing manual memory management because
             | that'd require too much effort; but since the GC causes
             | performance problems in production, we have spent more
             | effort and energy working around those issues and creating
             | bespoke tooling to mitigate them than the manual memory
             | management we were trying to avoid in the first place
             | would've required."
        
               | habibur wrote:
               | RAII <-- best of both worlds.
        
               | chubot wrote:
               | If you are talking about C++, it's nice when RAII works.
               | But if it does work, then in some sense your problem was
               | easy. Async code and concurrent code require different
               | solutions
        
           | ComputerGuru wrote:
           | That's not how _system_ programmers think..
        
             | amelius wrote:
             | A new generation of system programmers is tired of solving
             | the same old boring memory riddles over and over again and
             | no borrow checker is going to help them because it only
             | brings new riddles.
        
               | __s wrote:
               | gc replaces riddles with punchlines
        
             | troutwine wrote:
             | I agree. If we were to try and pin a thought process to an
             | additional level of systems programmer it'd involve writing
             | an allocator that's custom to your domain. The problem with
             | garbage collection for the systems' case is you're opting
             | into a set of undefined and uncontrolled runtime behavior
             | which is okay until it catastrophically isn't. An allocator
             | is the same but with less surface area and you can swap it
             | at need.
        
               | amelius wrote:
               | Meanwhile an OS uses the filesystem for just about
               | everything and it is also a garbage collected system ...
               | 
               | Why should memory be different?
        
               | troutwine wrote:
               | I'm not tracking how your question follows. If by garbage
               | collection you mean a system in which resources are
               | cleaned up at or after the moment they are marked as no
               | longer being necessary then, sure, I guess I can see a
               | thread here, although I think it a thin connection. The
               | conversation up-thread is about runtime garbage
               | collectors which are a mechanism with more semantic
               | properties than this expansive definition implies and
               | possessing an internal complexity that is opaque to the
               | user. An allocator does have the more expansive
               | definition I think you might be operating with, as does a
               | filesystem, but it's the opacity and intrinsic binding to
               | a specific runtime GC that makes it a challenging tool
               | for systems programming.
               | 
               | Go for instance bills itself as a systems language and
               | that's true for domains where bounded, predictable memory
               | consumption / CPU trade-offs are not necessary _because_
               | the runtime GC is bundled and non-negotiable. Its
               | behavior also shifts with releases. A systems program
               | relying on an allocator alone can choose to ignore the
               | allocator until it's a problem and swap the
               | implementation out for one -- perhaps custom made -- that
               | tailors to the domain.
        
               | amelius wrote:
               | An OS has the job of managing resources, such as CPU,
               | disk and memory.
               | 
               | It is easy to understand how it has grown historically,
               | but the fact that every process still manages its own
               | memory is a little absurd.
               | 
               | If your program __wants__ to manage its own memory, then
               | that is simple: allocate a large (gc'd) blob of memory
               | and run an allocator in it.
               | 
               | The problem is that the current view has it backwards.
        
               | LegionMammal978 wrote:
               | The OS already does that, though? Your program requests
               | some number of pages of virtual memory, and the OS uses a
               | GC-like mechanism to allocate physical memory to those
               | virtual pages on demand, wiping and reusing it soon after
               | the virtual pages are unmapped.
               | 
               | It's just that programs tend to want to manage objects
               | with sub-page granularity (as well as on separate threads
               | in parallel), and at that level there are infinitely many
               | possible access patterns and reachability criteria that a
               | GC might want to optimize for.
        
               | PaulDavisThe1st wrote:
               | AFAIK, no OS uses a "GC-like mechanism" to handle page
               | allocation.
               | 
               | When a process requests additional pages be added to its
               | address space, they remain in that address space until
               | the process explicitly releases them or the process
               | exits. At that time they go back on the free list to be
               | re-used.
               | 
               | GC implies "finding" unused stuff among something other
               | than a free list.
        
             | pjmlp wrote:
             | Those of us who have actually used systems programming
             | languages with automatic resource management do think
             | that way.
             | 
             | Unfortunately science only evolves one funeral at a time.
        
             | mike_hearn wrote:
             | It's how some think. Graal is a full compiler written in
             | Java. There's a long history of JVMs and databases being
             | written in GCd Java. I think you could push it a lot
             | further too. Modern JVM GCs are entirely pauseless for
             | instance.
        
         | matklad wrote:
         | The answer is clear: just don't have a malloc implementation in
         | your process' address space!
        
           | poikroequ wrote:
           | A bump allocator is all anyone really needs
        
             | mcguire wrote:
             | "Eh, it'll crash before it runs out of memory."
        
               | zokier wrote:
               | In some cases in a very literal sense (cue story about
               | missiles)
        
           | thebruce87m wrote:
           | Welcome to embedded! It's no heaps of fun!
        
             | hinkley wrote:
             | > no heaps
             | 
             | Angry upvote
        
               | sgt wrote:
               | pg needs to build that. Hold upvote icon for 5 secs=angry
               | upvote
        
               | hinkley wrote:
               | "I'm doing this as hard as I can"
        
             | eschneider wrote:
             | I'm always surprised how much I don't miss dynamic
             | allocation. :)
        
         | ckocagil wrote:
         | "stackoverflow please help me how do i fix memory
         | fragmentation"
        
       | loeg wrote:
       | Sort of tl;dr: when memory is freed on a thread other than
       | the one that allocated it, mimalloc doesn't make it reusable
       | right away; the free call marks the regions for eventual
       | delayed reclaim by the original thread. If the original
       | thread calls malloc again, those regions are collected (on
       | roughly 1 in N malloc calls). Or you can explicitly invoke
       | mi_collect[1] in the allocating thread (the Rust crate does
       | not seem to expose this API).
       | 
       | [1]:
       | https://github.com/microsoft/mimalloc/blob/dev/src/heap.c#L1...
        
         | Arnavion wrote:
         | The mimalloc crate just provides the GlobalAlloc impl that can
         | be registered with libstd as the global allocator using the
         | `#[global_allocator]` attr.
         | 
         | The underlying sys crate provides the binding for mimalloc API
         | like `mi_collect`: https://docs.rs/libmimalloc-
         | sys/0.1.39/libmimalloc_sys/fn.mi...
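         | 
         | A minimal sketch of wiring both crates together (assuming
         | `mimalloc` and `libmimalloc-sys` are added as dependencies;
         | this is an illustration, not code from the article):
         | 
         |   // Register mimalloc as the global allocator and call
         |   // mi_collect() by hand on the allocating thread to
         |   // reclaim deferred frees.
         |   use mimalloc::MiMalloc;
         | 
         |   #[global_allocator]
         |   static GLOBAL: MiMalloc = MiMalloc;
         | 
         |   fn main() {
         |       let buf = vec![0u8; 1024 * 1024];
         |       drop(buf);
         |       // `true` asks mimalloc to aggressively return freed
         |       // memory to the OS as well.
         |       unsafe { libmimalloc_sys::mi_collect(true) };
         |   }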
        
       | hinkley wrote:
       | We had learned helplessness on a drag and drop bug in jquery UI.
       | I had like three hours every second or third Friday and would
       | just step through the code trying to find the bug. That code was
       | so sketchy the jquery team was trying to rewrite it from scratch
       | one component at a time, and wouldn't entertain any bug
       | discussions on the old code even though they were a year behind
       | already.
       | 
       | After almost six months, I finally found a spot where I could
       | monkey patch a function to wrap it with a short circuit if the
       | coordinates were out of bounds. Not only fixed the bug but made
       | drag and drop several times faster. Couldn't share this with the
       | world because they weren't accepting PRs against the old widgets.
       | 
       | I've worked harder on bug fixes, but I think that's the longest
       | I've worked on one.
        
         | giancarlostoro wrote:
         | One of my favorite, most elusive bugs was a one-liner
         | change. I didn't understand the problem because nobody
         | could reproduce it or show it. Months later, after my boss
         | told his boss it was fixed, despite never being able to
         | test that it was fixed, I figured it out and fixed it. We
         | had a gift card form and we stored it in localStorage; if
         | for any reason the person left the tab and came back months
         | later, it would show the old gift card with its old, dated
         | balance. It was a client-side bug. The fix was to use
         | sessionStorage.
        
           | contingencies wrote:
           | It seems that in the context of your story the old adage
           | that organizations reproduce their own architecture in
           | the software they build rings true again, with
           | multilayered bureaucracy, lies and promises resulting in
           | "client state".
        
             | giancarlostoro wrote:
             | When I tried to explain to him that I fixed the thing he
             | claimed to have fixed, I heard him hesitantly say it wasn't
             | the same bug. Not sure what he told his boss this time
             | around the fix was for, but I was able to fully reproduce
             | the bug with this fix.
             | 
             | If you can't reproduce a bug, you cannot in my opinion say
             | that it is fixed. If you have to reproduce it via local
             | debugging and changing a value, or hard coding a value, I
             | think you're possibly close, but there's a chance it might
             | not be the case!
        
               | hinkley wrote:
               | The thing with saying "I don't know" is if you use it
               | judiciously, people believe you more when you say you can
               | do something they think can't be done.
               | 
               | If he didn't know he would just say. But he says he does.
        
           | arghwhat wrote:
           | For web, my favorite is JIT miscompilations. A tie
           | between a mobile Safari bug that caused basic math
           | operations to return 0 regardless of input values (basic,
           | positive Numbers, no shenanigans), and a mobile Samsung
           | browser bug where concatenating a specific single-
           | character string with another single-character string
           | would yield a Number.
           | 
           | Debugging errors in JS crypto and compression
           | implementations that only occurred at random, after at
           | least some ten thousand iterations, on a mobile browser
           | back when those were awful, and only with the debugger
           | closed/detached (since opening it disabled the JIT), was
           | not fun.
           | 
           | It taught me to go into debugging with no assumptions about
           | what can and cannot be to blame, which has been very useful
           | later in even trickier scenarios.
        
             | giancarlostoro wrote:
             | There's a weird one I ran into, and I for the life of me do
             | not remember which project it was under, but if I open dev
             | tools, the style changes for an element, then if I close
             | dev tools, the style goes back to normal. I never could
             | figure out what the heck is going on. I almost want to
             | blame the viewport size changing just slightly, but I
             | couldn't find a single CSS rule that would make it make
             | sense, and I think even popping it out, it behaved exactly
             | the same. It was frustrating, but I felt like I had to
             | ignore it since no normal user would ever see it, it just
             | made debugging with dev tools confusing.
             | 
             | Edit: In your case, that's where I start print debugging
             | LOL
        
               | hinkley wrote:
               | println("1");
               | 
               | ...
               | 
               | println("2");
               | 
               | ...
               | 
               | println("wtf");
        
             | hinkley wrote:
             | I think you might use "favorite" the way I mean "fun" (if I
             | say fun at work, it's because we are having none)
             | 
             | A lot of my opinions on code and the human brain started in
             | college. My roommate was washing out and didn't know it
             | yet. The rules about helping other people were very
             | clear; I was a boy scout, but also a grade-A bargainer
             | and rationalizer, so I created a protocol for helping
             | him without getting us expelled. Other kids in the lab
             | started using me the same way.
             | 
             | There were so many people who couldn't grasp that your code
             | can have three bugs at once, and fixing one won't make your
             | code behave. Some of those must have washed out too.
             | 
             | But applying the scientific method as you say above is
             | something that I came to later and it's how I mentor
             | people. If all of your assumptions say the answer should be
             | 3 but it's 4, or "4" or "Spain", one of your assumptions is
             | wrong and you need to test them. Odds of being the flaw /
             | difficulty of rechecking. Prioritize and work the problem.
             | 
             | (Hidden variable: how embarrassed you'll be if this turns
             | out to be the problem)
        
         | WalterBright wrote:
         | My longest one was an uninitialized declaration of a local
         | variable, which acquired ever-changing values.
         | 
         | This is why D, by default, initializes all variables. Note that
         | the optimizer removes dead assignments, so this is runtime
         | cost-free. D's implementation of C, ImportC, also default
         | initializes all locals. Why let that stupid C bug continue?
         | 
         | Another that repeatedly bit me was adding a field, and
         | neglecting to add initialization of it to all the constructors.
         | 
         | This is why D guarantees that all fields are initialized.
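         | 
         | (For comparison with the Rust context of the article: a
         | tiny sketch, not from the thread, showing that reading an
         | uninitialized local is a compile-time error there rather
         | than a default-initialized value.)
         | 
         |   fn main() {
         |       let x: i32;
         |       // Uncommenting the next line is rejected by the
         |       // compiler (error[E0381]: used binding `x` isn't
         |       // initialized), so the "ever-changing value" bug
         |       // can't happen.
         |       // println!("{}", x);
         |       x = 42;
         |       println!("{}", x);
         |   }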
        
           | hinkley wrote:
           | The first bug I remember writing was making native calls in
           | Java to process data. I didn't understand why in the examples
           | they kept rerunning the handle dereference in every loop.
           | 
           | If native code calls back into Java, and the GC kicks in, all
           | the objects the native code can see can be compacted and
           | moved. So my implementation worked fine for all of the
           | smaller test fixtures, and blew up half the time with the
           | largest. Because I skipped a line to make it "go faster".
           | 
           | I finally realized I was seeing raw Java objects in the
           | middle of my "array" and changing the value of final fields
           | into illegal pairings which blew everything the fuck up.
        
           | ckocagil wrote:
           | Valgrind didn't catch it?
        
       | IceTDrinker wrote:
       | PSA: do not use floating point for monetary amounts
        
         | SAI_Peregrinus wrote:
         | MS Excel uses floating point, and it's used a _ton_ in
         | finance. Don't use floating-point for monetary amounts if
         | you don't know what rounding mode you've set.
        
           | koverstreet wrote:
           | It's somewhat acceptable with double precision floats - never
           | single precision floats.
           | 
           | But far better to just use integer cents.
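           | 
           | A small standalone sketch of why (illustration only, not
           | from the article): an f32 has a 24-bit significand, so
           | past ~16.7 million cents it can no longer represent every
           | integer amount, while i64 cents stay exact.
           | 
           |   fn main() {
           |       // 16_777_217 cents is not representable as f32;
           |       // it silently rounds to 16_777_216.
           |       let cents_f32: f32 = 16_777_217.0;
           |       println!("{}", cents_f32 as i64); // 16777216
           | 
           |       // 64-bit integer cents are exact over any
           |       // realistic monetary range.
           |       let cents_i64: i64 = 16_777_217;
           |       println!("{}", cents_i64); // 16777217
           |   }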
        
         | nurettin wrote:
         | I have used single precision floats in my latest project just
         | to disprove this baloney.
        
       | znpy wrote:
       | > Allocators have different characteristics for a reason - they
       | do some things differently between each other. What do you think
       | mimalloc does that could account for this behavior?
       | 
       | Interestingly, it would seem that Java programmers play with
       | garbage collectors while Rust programmers play with memory
       | allocators.
        
       | zokier wrote:
       | I wonder if there is something that could be done on language
       | design level to have better "sympathy" for memory allocation, i.e.
       | built upon having mmap/munmap as primitives instead of
       | malloc/free; where language patterns are built around allocating
       | pages instead of arbitrarily sized objects. Probably not
       | practical for general high-level languages, but for e.g. embedded
       | or high-performance stuff might make sense?
        
         | eschneider wrote:
         | In general for embedded, you don't page memory even if you're
         | running something like embedded linux.
         | 
         | For high performance stuff where you need low, predictable
         | latency, you're probably not going to want to use dynamic
         | memory at all.
        
         | loeg wrote:
         | Not exactly what you're getting at, but you could maybe
         | imagine an explicit version of malloc where allocations are
         | destined either for thread-local-only use or for shared
         | use. Then freeing another thread's thread-local memory is
         | an invalid operation, and these kinds of assume-locality
         | optimizations are valid on many structures. I think you can
         | also imagine a version of mmap that allows for thread-local
         | mappings, to help detect accidental misuse of local
         | allocations.
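         | 
         | A purely hypothetical sketch of what the thread-local half
         | of such an API could look like in Rust (the `LocalBox` name
         | and the panic-on-cross-thread-free behavior are invented
         | for illustration; no existing allocator exposes this):
         | 
         |   use std::thread::{self, ThreadId};
         | 
         |   // Remembers the thread that allocated it and treats a
         |   // free from any other thread as an invalid operation.
         |   struct LocalBox<T> {
         |       value: Option<Box<T>>,
         |       owner: ThreadId,
         |   }
         | 
         |   impl<T> LocalBox<T> {
         |       fn new(value: T) -> Self {
         |           LocalBox {
         |               value: Some(Box::new(value)),
         |               owner: thread::current().id(),
         |           }
         |       }
         |   }
         | 
         |   impl<T> Drop for LocalBox<T> {
         |       fn drop(&mut self) {
         |           assert_eq!(
         |               thread::current().id(),
         |               self.owner,
         |               "freed on a thread other than the owner"
         |           );
         |           self.value.take();
         |       }
         |   }
         | 
         |   fn main() {
         |       let local = LocalBox::new([0u8; 4096]);
         |       drop(local); // fine: freed on the owning thread
         |   }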
        
         | dathinab wrote:
         | Most modern memory allocators use mmap internally; this is
         | why it usually makes sense not to use the system allocator
         | for long-running programs.
         | 
         | Generally, given that the page size isn't something you
         | know at compile time (or even install time), that it can
         | vary between restarts, that it can be anything from ~4KiB
         | to 1GiB, and that most natural memory objects are much
         | smaller than 4KiB while some are potentially much larger
         | than 1GiB, you kind of don't want to leak anything related
         | to page sizes into your business logic if it can be helped.
         | If you still need more control, most languages have
         | memory/allocation pools you can use to get a better handle
         | on allocation, freeing and reuse.
         | 
         | Also, the performance issues mentioned don't have much to
         | do with memory pages or anything like that; _instead they
         | are rooted in the concurrency controls around a global
         | resource (memory)_, i.e. thread-local synchronization vs.
         | process-wide synchronization.
         | 
         | Mainly: instead of using a fully general-purpose allocator,
         | they used an allocator which is still general-purpose but
         | has a design bias that improves same-thread (de)allocation
         | perf at the cost of cross-thread (de)allocation perf. And
         | they were doing a ton of cross-thread (de)allocations,
         | leading to noticeable performance degradation.
         | 
         | The thing is, even if you hypothetically only had
         | allocations at sizes that are multiples of a memory page,
         | or used a ton of manual mmap, you would still want an
         | allocator rather than returning freed memory directly to
         | the OS, since doing that (a syscall on every allocation)
         | tends to cause major performance degradation in many use
         | cases. So you still need concurrency controls, but they
         | come at a cost, especially for cross-thread
         | synchronization. Even lock-free controls based on atomics
         | have a cost over thread-local controls, often caused
         | largely by cache invalidation/synchronization.
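         | 
         | A rough, generic sketch of the cross-thread pattern being
         | described (not the article's actual code): buffers are
         | allocated on one thread but dropped on another, so every
         | free takes the allocator's cross-thread path.
         | 
         |   use std::sync::mpsc;
         |   use std::thread;
         | 
         |   fn main() {
         |       let (tx, rx) = mpsc::channel::<Vec<u8>>();
         | 
         |       // The worker thread allocates the buffers...
         |       let producer = thread::spawn(move || {
         |           for _ in 0..1_000 {
         |               tx.send(vec![0u8; 64 * 1024]).unwrap();
         |           }
         |       });
         | 
         |       // ...but they are dropped (freed) on the main
         |       // thread, so the allocator only ever sees
         |       // cross-thread deallocations.
         |       for buf in rx {
         |           drop(buf);
         |       }
         | 
         |       producer.join().unwrap();
         |   }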
        
         | PaulDavisThe1st wrote:
         | This seems to fail to understand that we already have both
         | levels.
         | 
         | Every OS will provide some mechanism to get more pages. But it
         | turns out that managing the use of those pages requires
         | specialized handling, depending on the use case, as well as a
         | bunch of boilerplate. Hence, we also have malloc and its many,
         | many cousins to allocate arbitrary size objects.
         | 
         | You're always welcome to use brk(2) or your OS's equivalent if
         | you just want pages. The question is, what are you going to do
         | with each page once you have it? That's where the next level
         | comes in ...
        
       | Exuma wrote:
       | I really love the design of this blog
        
       | Arnavion wrote:
       | jemalloc also has its own funny problem with threads - if you
       | have a multi-threaded application that uses jemalloc on all
       | threads except the main thread, then the cleanup that jemalloc
       | runs on main thread exit will segfault. In $dayjob we use
       | jemalloc as a sub-allocator in specific arenas. (*) The
       | application itself is fine in production because it allocates
       | from the main thread too, but the unit test framework only runs
       | tests in spawned threads and the main thread of the test binary
       | just orchestrates them. So the test binary triggers this segfault
       | reliably.
       | 
       | ( https://github.com/jemalloc/jemalloc/issues/1317 Unlike what
       | the title says, it's not Windows-specific.)
       | 
       | (*): The application uses libc malloc normally, but at some
       | places it allocates pages using `mmap(non_anonymous_tempfile)`
       | and then uses jemalloc to partition them. jemalloc has a feature
       | called "extent hooks" where you can customize how jemalloc gets
       | underlying pages for its allocations, which we use to give it
       | pages via such mmap's. Then the higher layers of the code that
       | just want to allocate don't have to care whether those
       | allocations came from libc malloc or mmap-backed disk file.
        
       | CraigJPerry wrote:
       | Tangent: what's the ideal data structure for this problem?
       | 
       | If there were 20 million rooms in the world with a price for
       | each day of the year, we'd be looking at around 7 billion
       | prices per year. That'd be, say, 4 TB of storage without
       | indexes.
       | 
       | The problem space seems to have a bunch of options to partition -
       | by locality, by date etc.
       | 
       | I'm curious if there's a commonly understood match for this
       | problem?
       | 
       | FWIW with that dataset size, my first experiments would be with
       | SQL server because that data will fit in ram. I don't know if
       | that's where I'd end up - but I'm pretty sure it's where I'd
       | start my performance testing grappling with this problem.
        
         | jrpelkonen wrote:
         | I think your premise is somewhat off. There might be 20
         | million hotel rooms in the world, but surely they are not
         | individually priced; e.g. all king-bed rooms in a given
         | hotel have the same price on a given day.
        
       | om8 wrote:
       | TLDR: use shitty allocators, win shitty memory leaks
        
       | malkia wrote:
       | In C++, your new_handler
       | (https://en.cppreference.com/w/cpp/memory/new/new_handler)
       | should call mi_collect.
        
       | PaulDavisThe1st wrote:
       | A perfect demonstration of how many of the harder problems
       | we face writing (especially non-browser-based) software are
       | in fact not addressed by language changes.
       | 
       | The concept of memory that is allocated by a thread and can only
       | be deallocated by that thread is useful and valid, but as TFA
       | demonstrates, can also cause problems if you're not careful with
       | your overall architecture. If the language you're using even
       | allows you to use this concept, it almost certainly will not
       | protect you from having to get the architecture correct.
        
       ___________________________________________________________________
       (page generated 2024-08-21 23:01 UTC)