[HN Gopher] Mimalloc Cigarette: Losing one week of my life catch...
___________________________________________________________________
Mimalloc Cigarette: Losing one week of my life catching a memory
leak (Rust)
Author : Patryk27
Score : 110 points
Date : 2024-08-21 15:09 UTC (7 hours ago)
(HTM) web link (pwy.io)
(TXT) w3m dump (pwy.io)
| kibwen wrote:
| Level 1 systems programmer: "wow, it feels so nice having control
| over my memory and getting out from under the thumb of a garbage
| collector"
|
| Level 2 systems programmer: "oh no, my memory allocator is a
| garbage collector"
| seanthemon wrote:
| At the very bottom of everything is a garbage collector..
| riwsky wrote:
| Market forces: the ultimate garbage collector
| hinkley wrote:
| Soil is just the biggest swap meet in the world. Where every
| microbe, invertebrate and tree is just looking for someone
| else's trash to turn into treasure.
| amelius wrote:
| Level 3 system programmer: "get me out of this straitjacket
| and give me my garbage collector back so I can get stuff done"
| forrestthewoods wrote:
| No. Just no.
|
| For as painful as the debugging story was, I have spent
| _vastly_ more time working around garbage collectors to ship
| performant code.
| 0x457 wrote:
| What, you don't like doing GC only every N requests (Ruby web
| servers), disabling GC completely during working hours (Java
| stock trading), or fake-allocating large buffers (Go's
| allocate-and-don't-use trick)?
| pton_xd wrote:
| Ain't nothin' wrong with configuring V8 to have unbounded
| heap growth, disabling the memory reducer, and then
| killing the process after a while.
| mike_hearn wrote:
| The Java shops you're thinking of didn't disable GC
| during working hours, they just sized the generations to
| avoid a collection given their normal allocation rates.
|
| But there were / are also plenty of trading shops that
| paid Azul for their pauseless C4 GC. Nowadays there's
| also ZGC and Shenandoah, so if you want to both allocate
| a lot and also not have pauses, that tech is no longer
| expensive.
| 0x457 wrote:
| > The Java shops you're thinking of didn't disable GC
| during working hours, they just sized the generations to
| avoid a collection given their normal allocation rates.
|
| Well, I just trivialized it. However, in one case in the mid
| 2000s, I saw it disabled completely to avoid any pauses
| during trading hours.
| wpollock wrote:
| Used to do this in C decades ago. Worked on Unix but I
| doubt it works on Linux today, unless you disable memory
| overcommit completely.
| neonsunset wrote:
| I'd wager it was an issue with the language of choice (or
| its GC) being rather poorly made performance-wise or a
| design that does not respect how GC works in the first
| place :)
| forrestthewoods wrote:
| I'm sure you would! GC is like communism. Always some
| excuse as to why GC isn't to blame.
|
| > or a design that does not respect how GC works in the
| first place
|
| It's called shipping a 90 Hz VR game without dropping
| frames.
| neonsunset wrote:
| Aside from finding the analogy strange and unfortunate, I
| assume you're talking about Unity, is that correct?
|
| (if that is the case, I understand where the GC PTSD
| comes from)
| gnuvince wrote:
| I need to find a pithy way to express "we use a garbage
| collector to avoid doing manual memory management because
| that'd require too much effort; but since the GC causes
| performance problems in production, we have spent more
| effort and energy working around those issues and creating
| bespoke tooling to mitigate them than the manual memory
| management we were trying to avoid in the first place
| would've required."
| habibur wrote:
| RAII <-- best of both worlds.
| chubot wrote:
| If you are talking about C++, it's nice when RAII works.
| But if it does work, then in some sense your problem was
| easy. Async code and concurrent code require different
| solutions.
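|
| When it does work, here is a minimal Rust sketch of what RAII
| buys you (hypothetical file name, nothing from the article):
|
|     use std::fs::File;
|     use std::io::Write;
|
|     fn write_log(msg: &str) -> std::io::Result<()> {
|         let mut f = File::create("/tmp/example.log")?;
|         f.write_all(msg.as_bytes())?;
|         Ok(())
|         // `f` is dropped here; Drop closes the OS handle, even
|         // on the early-return `?` error paths above.
|     }
|
|     fn main() {
|         write_log("hello\n").unwrap();
|     }
|
| The handle is released deterministically when `f` goes out of
| scope, with no explicit close and no GC involved.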
| ComputerGuru wrote:
| That's not how _system_ programmers think..
| amelius wrote:
| A new generation of system programmers is tired of solving
| the same old boring memory riddles over and over again and
| no borrow checker is going to help them because it only
| brings new riddles.
| __s wrote:
| gc replaces riddles with punchlines
| troutwine wrote:
| I agree. If we were to try and pin a thought process to an
| additional level of systems programmer it'd involve writing
| an allocator that's custom to your domain. The problem with
| garbage collection for the systems case is that you're opting
| into a set of undefined and uncontrolled runtime behavior
| which is okay until it catastrophically isn't. An allocator
| is the same but with less surface area and you can swap it
| at need.
| amelius wrote:
| Meanwhile an OS uses the filesystem for just about
| everything and it is also a garbage collected system ...
|
| Why should memory be different?
| troutwine wrote:
| I'm not tracking how your question follows. If by garbage
| collection you mean a system in which resources are
| cleaned up at or after the moment they are marked as no
| longer being necessary then, sure, I guess I can see a
| thread here, although I think it a thin connection. The
| conversation up-thread is about runtime garbage
| collectors which are a mechanism with more semantic
| properties than this expansive definition implies and
| possessing an internal complexity that is opaque to the
| user. An allocator does have the more expansive
| definition I think you might be operating with, as does a
| filesystem, but it's the opacity and intrinsic binding to
| a specific runtime GC that makes it a challenging tool
| for systems programming.
|
| Go for instance bills itself as a systems language and
| that's true for domains where bounded, predictable memory
| consumption / CPU trade-offs are not necessary _because_
| the runtime GC is bundled and non-negotiable. Its
| behavior also shifts with releases. A systems program
| relying on an allocator alone can choose to ignore the
| allocator until it's a problem and swap the
| implementation out for one -- perhaps custom made -- that
| is tailored to the domain.
| amelius wrote:
| An OS has the job of managing resources, such as CPU,
| disk and memory.
|
| It is easy to understand how it has grown historically,
| but the fact that every process still manages its own
| memory is a little absurd.
|
| If your program __wants__ to manage its own memory, then
| that is simple: allocate a large (gc'd) blob of memory
| and run an allocator in it.
|
| The problem is that the current view has it backwards.
| LegionMammal978 wrote:
| The OS already does that, though? Your program requests
| some number of pages of virtual memory, and the OS uses a
| GC-like mechanism to allocate physical memory to those
| virtual pages on demand, wiping and reusing it soon after
| the virtual pages are unmapped.
|
| It's just that programs tend to want to manage objects
| with sub-page granularity (as well as on separate threads
| in parallel), and at that level there are infinitely many
| possible access patterns and reachability criteria that a
| GC might want to optimize for.
| PaulDavisThe1st wrote:
| AFAIK, no OS uses a "GC-like mechanism" to handle page
| allocation.
|
| When a process requests additional pages be added to its
| address space, they remain in that address space until
| the process explicitly releases them or the process
| exits. At that time they go back on the free list to be
| re-used.
|
| GC implies "finding" unused stuff among something other
| than a free list.
| pjmlp wrote:
| Those of us who have actually used systems programming
| languages with automatic resource management do think that
| way.
|
| Unfortunately science only evolves one funeral at a time.
| mike_hearn wrote:
| It's how some think. Graal is a full compiler written in
| Java. There's a long history of JVMs and databases being
| written in GCd Java. I think you could push it a lot
| further too. Modern JVM GCs are entirely pauseless for
| instance.
| matklad wrote:
| The answer is clear: just don't have a malloc implementation in
| your process' address space!
| poikroequ wrote:
| A bump allocator is all anyone really needs
| mcguire wrote:
| "Eh, it'll crash before it runs out of memory."
| zokier wrote:
| In some cases in a very literal sense (cue story about
| missiles)
| thebruce87m wrote:
| Welcome to embedded! It's no heaps of fun!
| hinkley wrote:
| > no heaps
|
| Angry upvote
| sgt wrote:
| pg needs to build that. Hold the upvote icon for 5 secs =
| angry upvote
| hinkley wrote:
| "I'm doing this as hard as I can"
| eschneider wrote:
| I'm always surprised how much I don't miss dynamic
| allocation. :)
| ckocagil wrote:
| "stackoverflow please help me how do i fix memory
| fragmentation"
| loeg wrote:
| Sort of tl;dr: when memory is freed on a thread other than the
| one that allocated it, mimalloc doesn't make it available for
| reuse right away; the free call marks those regions for eventual,
| delayed reclamation by the original thread. If the original
| thread keeps calling malloc, those regions are collected on
| roughly 1 in N malloc calls. Alternatively, you can explicitly
| invoke mi_collect[1] on the allocating thread (the Rust crate
| does not seem to expose this API).
|
| [1]:
| https://github.com/microsoft/mimalloc/blob/dev/src/heap.c#L1...
| Arnavion wrote:
| The mimalloc crate just provides the GlobalAlloc impl that can
| be registered with libstd as the global allocator using the
| `#[global_allocator]` attr.
|
| The underlying sys crate provides bindings for the mimalloc API,
| such as `mi_collect`: https://docs.rs/libmimalloc-
| sys/0.1.39/libmimalloc_sys/fn.mi...
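|
| A minimal sketch of wiring the two together (assuming the
| `mimalloc` and `libmimalloc-sys` crates; not code from the
| article):
|
|     use mimalloc::MiMalloc;
|
|     // Register mimalloc as the process-wide global allocator.
|     #[global_allocator]
|     static GLOBAL: MiMalloc = MiMalloc;
|
|     fn reclaim_deferred_frees() {
|         // Forces a collection of the *calling* thread's heap,
|         // including frees that other threads queued up for it.
|         unsafe { libmimalloc_sys::mi_collect(true) };
|     }
|
|     fn main() {
|         let buf = vec![0u8; 1 << 20];
|         drop(buf);
|         reclaim_deferred_frees();
|     }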
| hinkley wrote:
| We had developed learned helplessness around a drag-and-drop bug
| in jQuery UI. I had like three hours every second or third Friday
| and would just step through the code trying to find the bug. That
| code was so sketchy the jQuery team was trying to rewrite it from
| scratch one component at a time, and wouldn't entertain any bug
| discussions about the old code even though they were a year
| behind already.
|
| After almost six months, I finally found a spot where I could
| monkey-patch a function to wrap it with a short circuit if the
| coordinates were out of bounds. That not only fixed the bug but
| made drag and drop several times faster. I couldn't share it with
| the world because they weren't accepting PRs against the old
| widgets.
|
| I've worked harder on bug fixes, but I think that's the longest
| I've worked on one.
| giancarlostoro wrote:
| One of my favorite, most elusive bugs was a one-line change. I
| didn't understand the problem because nobody could reproduce it
| or show it. Months later, after my boss told his boss it was
| fixed, despite never being able to test that it was fixed, I
| figured it out and fixed it for real. We had a gift card form
| whose state we kept in localStorage; if the person left the tab
| and came back months later, it would show the old gift card with
| its old, stale balance. It was a client-side bug, and the fix was
| to use sessionStorage.
| contingencies wrote:
| It seems that, in the context of your story, the old adage that
| organizations reproduce their own structure in the architecture
| of their software once again rings true, with multilayered
| bureaucracy, lies and promises resulting in "client state".
| giancarlostoro wrote:
| When I tried to explain to him that I had fixed the thing he
| claimed to have fixed, I heard him hesitantly say it wasn't
| the same bug. I'm not sure what he told his boss this fix was
| for, but this time I was able to fully reproduce the bug that
| the fix addressed.
|
| If you can't reproduce a bug, you cannot in my opinion say
| that it is fixed. If you have to reproduce it via local
| debugging and changing a value, or hard coding a value, I
| think you're possibly close, but there's a chance it might
| not be the case!
| hinkley wrote:
| The thing with saying "I don't know" is that if you use it
| judiciously, people believe you more when you say you can
| do something they think can't be done.
|
| If he didn't know, he would just say so. But he says he does.
| arghwhat wrote:
| For web, my favorite is JIT miscompilations. A tie between a
| mobile Safari bug that caused basic math operations to return
| 0 regardless of input values (basic, positive Numbers, no
| shenanigans), or a mobile Samsung browser bug where
| concatenating a specific single-character string with another
| single-character string would yield a Number.
|
| Debugging errors in JS crypto and compression implementations
| that only occurred at random, after at least some ten thousand
| iterations, on a mobile browser back when those were awful, and
| only with the debugger closed/detached (opening it disabled the
| JIT), was not fun.
|
| It taught me to go into debugging with no assumptions about
| what can and cannot be to blame, which has been very useful
| later in even trickier scenarios.
| giancarlostoro wrote:
| There's a weird one I ran into, and for the life of me I do
| not remember which project it was in: if I opened dev tools,
| the style of an element changed, and if I closed dev tools,
| the style went back to normal. I never could figure out what
| the heck was going on. I almost want to blame the viewport
| size changing just slightly, but I couldn't find a single CSS
| rule that would make it make sense, and I think even popping
| dev tools out into its own window behaved exactly the same. It
| was frustrating, but I felt like I had to ignore it since no
| normal user would ever see it; it just made debugging with dev
| tools confusing.
|
| Edit: In your case, that's where I start print debugging
| LOL
| hinkley wrote:
| println("1");
|
| ...
|
| println("2");
|
| ...
|
| println("wtf");
| hinkley wrote:
| I think you might use "favorite" the way I mean "fun" (if I
| say fun at work, it's because we are having none)
|
| A lot of my opinions on code and the human brain started in
| college. My roommate was washing out and didn't know it
| yet. The rules about helping other people were very clear; I
| was a Boy Scout but also a grade-A bargainer and rationalizer,
| so I created a protocol for helping him without getting us
| expelled. Other kids in the lab started using me the same way.
|
| There were so many people who couldn't grasp that your code
| can have three bugs at once, and fixing one won't make your
| code behave. Some of those must have washed out too.
|
| But applying the scientific method as you say above is
| something that I came to later and it's how I mentor
| people. If all of your assumptions say the answer should be
| 3 but it's 4, or "4" or "Spain", one of your assumptions is
| wrong and you need to test them. Rank each by the odds of it
| being the flaw versus the difficulty of rechecking it.
| Prioritize and work the problem.
|
| (Hidden variable: how embarrassed you'll be if this turns
| out to be the problem)
| WalterBright wrote:
| My longest one was an uninitialized declaration of a local
| variable, which acquired ever-changing values.
|
| This is why D, by default, initializes all variables. Note that
| the optimizer removes dead assignments, so this is runtime
| cost-free. D's implementation of C, ImportC, also default
| initializes all locals. Why let that stupid C bug continue?
|
| Another that repeatedly bit me was adding a field, and
| neglecting to add initialization of it to all the constructors.
|
| This is why D guarantees that all fields are initialized.
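|
| For what it's worth, Rust (the language in TFA) gives a similar
| guarantee at compile time: a struct literal must name every
| field, so adding a field breaks every construction site until it
| is initialized. A tiny sketch with hypothetical names:
|
|     struct Account {
|         balance_cents: i64,
|         // Newly added field: every `Account { .. }` literal in
|         // the codebase now fails to compile until it sets this.
|         overdraft_limit_cents: i64,
|     }
|
|     fn open_account() -> Account {
|         Account { balance_cents: 0, overdraft_limit_cents: 0 }
|     }
|
|     fn main() {
|         let acct = open_account();
|         println!("{}", acct.balance_cents);
|     }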
| hinkley wrote:
| The first bug I remember writing was making native calls in
| Java to process data. I didn't understand why in the examples
| they kept rerunning the handle dereference in every loop.
|
| If native code calls back into Java, and the GC kicks in, all
| the objects the native code can see can be compacted and
| moved. So my implementation worked fine for all of the
| smaller test fixtures, and blew up half the time with the
| largest. Because I skipped a line to make it "go faster".
|
| I finally realized I was seeing raw Java objects in the
| middle of my "array" and changing the value of final fields
| into illegal pairings which blew everything the fuck up.
| ckocagil wrote:
| Valgrind didn't catch it?
| IceTDrinker wrote:
| PSA: do not use floating point for monetary amounts
| SAI_Peregrinus wrote:
| MS Excel uses floating point, and it's used a _ton_ in finance.
| Don't use floating-point for monetary amounts if you don't
| know what rounding mode you've set.
| koverstreet wrote:
| It's somewhat acceptable with double precision floats - never
| single precision floats.
|
| But far better to just use integer cents.
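|
| A quick illustration in plain Rust (nothing project-specific) of
| why integer cents behave and binary floats don't:
|
|     fn main() {
|         // Binary floating point cannot represent 0.1 or 0.2
|         // exactly, so the sum is not exactly 0.3:
|         let float_total: f64 = 0.10 + 0.20;
|         println!("{:.17}", float_total);     // 0.30000000000000004
|         println!("{}", float_total == 0.30); // false
|
|         // Integer minor units (cents) stay exact:
|         let cents_total: i64 = 10 + 20;
|         println!("{}", cents_total == 30);   // true
|     }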
| nurettin wrote:
| I have used single precision floats in my latest project just
| to disprove this baloney.
| znpy wrote:
| > Allocators have different characteristics for a reason - they
| do some things differently between each other. What do you think
| mimalloc does that could account for this behavior?
|
| Interestingly, it would seem that Java programmers play with
| garbage collectors while Rust programmers play with memory
| allocators.
| zokier wrote:
| I wonder if there is something that could be done at the language
| design level to have better "sympathy" for memory allocation, i.e.
| built upon mmap/munmap as primitives instead of malloc/free, where
| language patterns are built around allocating pages instead of
| arbitrarily sized objects. Probably not practical for general
| high-level languages, but for e.g. embedded or high-performance
| stuff it might make sense?
| eschneider wrote:
| In general for embedded, you don't page memory even if you're
| running something like embedded linux.
|
| For high performance stuff where you need low, predictable
| latency, you're probably not going to want to use dynamic
| memory at all.
| loeg wrote:
| Not exactly what you're getting at, but you could maybe imagine
| an explicit version of malloc where allocations are destined
| either for thread-local only use, or shared use. Then locally
| freeing remote thread-local memory is an invalid operation and
| these kinds of assume-locality optimizations are valid on many
| structures. I think you can imagine a version of mmap that
| allows for thread-local mappings to help detect accidental
| misuse of local allocation.
| dathinab wrote:
| Most modern memory allocators use mmap internally; this is why
| it often makes sense not to use the system allocator for
| long-running programs.
|
| Generally, given that the page size isn't something you know at
| compile time (or even at install time), that it can vary between
| restarts, and that it can be anything between ~4KiB and 1GiB
| while most natural memory objects are much smaller than 4KiB but
| some are potentially much larger than 1GiB, you kind of don't
| want to leak anything related to page sizes into your business
| logic if it can be helped. If you still need to, most languages
| have memory/allocation pools you can use to get a bit more
| control over memory allocation, freeing, and reuse.
|
| Also, the performance issues mentioned don't have much to do
| with memory pages or anything like that; _instead they are
| rooted in the concurrency controls of a global resource
| (memory)_, i.e. thread-local concurrency synchronization vs.
| process-wide synchronization.
|
| Mainly, instead of using a fully general-purpose allocator, they
| used an allocator which is still general-purpose but has a
| design bias that improves same-thread (de)allocation performance
| at the cost of cross-thread (de)allocation performance. And they
| were doing a ton of cross-thread (de)allocations, leading to
| noticeable performance degradation.
|
| The thing is, even if you hypothetically only had allocations
| whose sizes were multiples of a memory page, or used a ton of
| manual mmap, you would still want an allocator that doesn't
| always return freed memory straight to the OS, because doing a
| syscall on every allocation tends to cause major performance
| degradation (in many use cases). So you still need concurrency
| controls, but they come at a cost, especially for cross-thread
| synchronization. Even lock-free controls based on atomics have
| a cost over thread-local controls, often caused largely by
| cache invalidation/synchronization.
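|
| A minimal sketch (not the article's code) of the cross-thread
| pattern in question: buffers are allocated on a worker thread
| but dropped, and therefore freed, on the receiving thread.
|
|     use std::sync::mpsc;
|     use std::thread;
|
|     fn main() {
|         let (tx, rx) = mpsc::channel::<Vec<u8>>();
|         let producer = thread::spawn(move || {
|             for _ in 0..100_000 {
|                 // Allocation happens on the worker thread...
|                 tx.send(vec![0u8; 4096]).unwrap();
|             }
|         });
|         // ...but each buffer is dropped (freed) here on the main
|         // thread, which is exactly the cross-thread free that a
|         // thread-biased allocator handles with delayed reclaim.
|         for buf in rx {
|             drop(buf);
|         }
|         producer.join().unwrap();
|     }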
| PaulDavisThe1st wrote:
| This seems to fail to understand that we already have both
| levels.
|
| Every OS will provide some mechanism to get more pages. But it
| turns out that managing the use of those pages requires
| specialized handling, depending on the use case, as well as a
| bunch of boilerplate. Hence, we also have malloc and its many,
| many cousins to allocate arbitrarily sized objects.
|
| You're always welcome to use brk(2) or your OS's equivalent if
| you just want pages. The question is, what are you going to do
| with each page once you have it? That's where the next level
| comes in ...
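|
| For illustration, requesting raw pages directly looks roughly
| like this in Rust (a sketch assuming the `libc` crate; adjust
| flags for your OS):
|
|     use std::ptr;
|
|     fn main() {
|         let len = 4 * 4096; // four pages on a typical 4KiB-page system
|         let pages = unsafe {
|             libc::mmap(
|                 ptr::null_mut(),
|                 len,
|                 libc::PROT_READ | libc::PROT_WRITE,
|                 libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
|                 -1,
|                 0,
|             )
|         };
|         assert_ne!(pages, libc::MAP_FAILED);
|         // Deciding what lives *inside* those pages -- object
|         // placement, reuse, free lists -- is the part an
|         // allocator does for you.
|         unsafe { libc::munmap(pages, len) };
|     }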
| Exuma wrote:
| I really love the design of this blog
| Arnavion wrote:
| jemalloc also has its own funny problem with threads - if you
| have a multi-threaded application that uses jemalloc on all
| threads except the main thread, then the cleanup that jemalloc
| runs on main thread exit will segfault. In $dayjob we use
| jemalloc as a sub-allocator in specific arenas. (*) The
| application itself is fine in production because it allocates
| from the main thread too, but the unit test framework only runs
| tests in spawned threads and the main thread of the test binary
| just orchestrates them. So the test binary triggers this segfault
| reliably.
|
| ( https://github.com/jemalloc/jemalloc/issues/1317 Unlike what
| the title says, it's not Windows-specific.)
|
| (*): The application uses libc malloc normally, but at some
| places it allocates pages using `mmap(non_anonymous_tempfile)`
| and then uses jemalloc to partition them. jemalloc has a feature
| called "extent hooks" where you can customize how jemalloc gets
| underlying pages for its allocations, which we use to give it
| pages via such mmap's. Then the higher layers of the code that
| just want to allocate don't have to care whether those
| allocations came from libc malloc or mmap-backed disk file.
| CraigJPerry wrote:
| Tangent: what's the ideal data structure for this problem?
|
| If there were 20 million rooms in the world with a price for each
| day of the year, we'd be looking at around 7 billion prices per
| year. That'd be, say, 4 TB of storage without indexes.
|
| The problem space seems to have a bunch of options to partition -
| by locality, by date etc.
|
| I'm curious if there's a commonly understood match for this
| problem?
|
| FWIW with that dataset size, my first experiments would be with
| SQL server because that data will fit in ram. I don't know if
| that's where I'd end up - but I'm pretty sure it's where I'd
| start my performance testing grappling with this problem.
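|
| As a rough back-of-envelope, 20 million rooms x 366 days is about
| 7.3 billion prices, i.e. roughly 30 GB at 4 bytes per price
| before any keys, row overhead, or indexes. One hedged sketch of a
| layout (hypothetical types, not from the article): a dense
| per-room (or per-room-type) grid indexed by day of year, so a
| lookup is two array indexes and a year of prices is one
| contiguous slice.
|
|     /// Prices for one room type, in minor units (cents),
|     /// one slot per day of a (leap) year.
|     struct RoomTypePrices {
|         room_type_id: u32,
|         daily_cents: [u32; 366],
|     }
|
|     fn price_on(prices: &RoomTypePrices, day_of_year: usize) -> u32 {
|         prices.daily_cents[day_of_year]
|     }
|
|     fn main() {
|         let p = RoomTypePrices {
|             room_type_id: 1,
|             daily_cents: [9_900; 366],
|         };
|         println!("{}", price_on(&p, 42));
|     }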
| jrpelkonen wrote:
| I think your premise is somewhat off. There might be 20 million
| hotel rooms in the world, but surely they are not individually
| priced; e.g. all king-bed rooms in a given hotel have the same
| price on a given day.
| om8 wrote:
| TLDR: use shitty allocators, win shitty memory leaks
| malkia wrote:
| In C++, your
| https://en.cppreference.com/w/cpp/memory/new/new_handler should
| call mi_collect.
| PaulDavisThe1st wrote:
| A perfect demonstration of how many of the harder problems we
| face writing (especially non-browser-based) software are in fact
| not addressed by language changes.
|
| The concept of memory that is allocated by a thread and can only
| be deallocated by that thread is useful and valid, but as TFA
| demonstrates, can also cause problems if you're not careful with
| your overall architecture. If the language you're using even
| allows you to use this concept, it almost certainly will not
| protect you from having to get the architecture correct.
___________________________________________________________________
(page generated 2024-08-21 23:01 UTC)