[HN Gopher] Jemalloc Postmortem
___________________________________________________________________
Jemalloc Postmortem
Author : jasone
Score : 687 points
Date : 2025-06-13 01:37 UTC (21 hours ago)
(HTM) web link (jasone.github.io)
(TXT) w3m dump (jasone.github.io)
| gdiamos wrote:
| Congrats on the great run and the future. Jemalloc was an
| inspiration to many memory allocators.
| kstrauser wrote:
| I was using FreeBSD back when jemalloc came along, and it blew
| my mind to imagine swapping out just that one (major) part of
| its libc. Honestly, it hadn't occurred to me, and made me wonder
| what else we could wholesale replace.
| Twirrim wrote:
| Oh that's interesting. jemalloc is the memory allocator used by
| redis, among other projects. Wonder what the performance impact
| will be if they have to change allocators.
| dpe82 wrote:
| Why would they have to change? Sometimes software development
| is largely "done" and there isn't much more you need to do to a
| library.
| jeffbee wrote:
| For an example of why an allocator is a maintenance
| treadmill, consider that C++ recently (relatively) added
| sized delete, and Linux recently gained transparent huge
| pages.
| Twirrim wrote:
| It's been 14 years since THP got added to the kernel[1],
| surely we're past calling that "recent" :)
|
| https://www.kernelconfig.io/config_transparent_hugepage
| jeffbee wrote:
| But if they'd declared the allocators "done" _15_ years
| ago, then you wouldn't have it.
| senderista wrote:
| Another example is rseq (which was originally implemented
| for tcmalloc).
| dymk wrote:
| Technology marches on, and in some number of years other
| allocators will exist that outperform/outfeature jemalloc.
| jcelerier wrote:
| Depending on your allocation profile, that number of
| years could easily be something like -10. New allocators
| constantly crop up.
| edflsafoiewq wrote:
| Presumably then the performance impact of any switch will
| be positive.
| Analemma_ wrote:
| Memory allocators are something I expect to rapidly degrade
| in the absence of continuous updates as the world changes
| underneath you. Changing page sizes, new ucode latencies, new
| security features etc. all introduce either outright breakage
| or at least changing the optimum allocation strategy and
| making your old profiling obsolete. Not to mention the
| article already pointed out one instance where a software
| stack (KDE, in that case) used allocation profiles that broke
| an earlier version completely. Even though that's fixed now,
| any language runtime update or new feature could introduce a
| new allocation style that grinds you down.
|
| As much as it's nice to think software can be done, I think
| something so closely tied to the kernel _and_ hardware _and_
| the application layer, which all change constantly, never can
| be.
| binary132 wrote:
| "Software is just done sometimes" is a common refrain I see
| repeated among communities where irreplaceable software
| projects are often abandoned. The community consensus has a
| tendency to become "it is reliable and good enough, it must
| be done".
| poorman wrote:
| Jemalloc is used as an easy performance boost probably by
| every major Ruby on Rails server.
| Twirrim wrote:
| While I certainly wish that more software would reach a
| "done" stage, I don't think jemalloc is necessarily there
| yet. Unfortunately I'm aware of there being bugs in the
| current version of jemalloc, when run in certain environment
| configurations, including memory leaks. I know the folks that
| found it were looking to report it, but I guess that won't
| happen now.
|
| Even from a quick look at the open issues, I can see
| https://github.com/jemalloc/jemalloc/issues/2838, and
| https://github.com/jemalloc/jemalloc/issues/2815 as two
| examples, but there's a fair number of issues still open
| against the repository.
|
| So that'll leave projects like redis & valkey with some
| decisions to make.
|
| 1) Keep jemalloc and accept things like memory leak bugs
|
| 2) Fork and maintain their own version of jemalloc.
|
| 3) Spend time replacing it entirely.
|
| 4) Hope someone else picks it up?
| senderista wrote:
| jemalloc is used enough at Amazon that it would make sense
| for them to maintain it, but that's not really their style.
| burnt-resistor wrote:
| Some people believe everything must always be constantly
| tweaked, redone, broken and fixed, and churned for no reason.
| The only things that need to be fixed in mature, working
| software are bugs and security issues. It doesn't magically
| stop working or get "stale" unless dependencies, the OS, or
| build tools break.
| almostgotcaught wrote:
| > Sometimes software development is largely "done"
|
| Lol absolutely not
| spookie wrote:
| Firefox as well.
| perbu wrote:
| Back in 2008-2009 I remember the Varnish project struggled with
| what looked very much like a memory leak. Because of the
| somewhat complex way memory was used, replacing the Glibc
| malloc with jemalloc was an immediate improvement and removed
| the leak-like behavior.
| technion wrote:
| From years of looking at Ruby on Rails performance, I know a
| commonly cited quick win was to run with jemalloc.
| swinglock wrote:
| Last I checked Redis used their own fork of jemalloc. It may
| not even be updated to the latest release.
| jeffbee wrote:
| The article mentioned the influence of large-scale profiling on
| both jemalloc and tcmalloc, but doesn't mention mimalloc. I
| consider mimalloc to be on par with these others, and now I am
| wondering whether Microsoft also used large scale profiling to
| develop theirs, or if they just did it by dead reckoning.
| bch wrote:
| How does mimalloc telemetry compare to jemalloc?
| poorman wrote:
| How cool would it be to see Doug Lea pick up the torch and create
| a modern day multi-threaded dlmalloc2!?
| ecshafer wrote:
| dl is just an observer on the OpenJDK governance board now, so
| he might have enough time.
| nevon wrote:
| I very recently used jemalloc to resolve a memory fragmentation
| issue that caused a service to OOM every few days. While jemalloc
| as it is will continue to work, same as it does today, I wonder
| what allocator I should reach for in the future. Does anyone have
| any experiences to share regarding tcmalloc or other allocators
| that aim to perform better than stock glibc?
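| (A quick way to trial an alternate allocator on Linux is to
| preload it; a minimal sketch, assuming Debian/Ubuntu's jemalloc
| package path and a hypothetical ./my-service binary:)

```shell
# Try jemalloc without rebuilding: preload it for one process.
# The library path is distro-dependent; this one is Debian/Ubuntu's.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./my-service

# Optionally tune decay so freed pages return to the OS sooner.
MALLOC_CONF=background_thread:true,dirty_decay_ms:10000 \
  LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./my-service
```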
| kev009 wrote:
| snmalloc
| sanxiyn wrote:
| mimalloc is a good choice. CPython recently switched to
| mimalloc.
| beyonddream wrote:
| Try mimalloc. I prototyped a feature on top of mimalloc and,
| while that effort was a dead end, the code (this was around
| 2020) was nicely written and well maintained, and it was fun
| to hack on. When I swapped jemalloc in our system for
| mimalloc, it was on par with, if not better than, jemalloc in
| terms of fragmentation growth control and heap usage.
| meisel wrote:
| I believe there's no other allocator besides jemalloc that can
| seamlessly override macOS malloc/free like people do with
| LD_PRELOAD on Linux (at least as of ~2020). jemalloc has a very
| nice zone-based way of making itself the default, and manages to
| accommodate Apple's odd requirements for an allocator that have
| tripped other third-party allocators up when trying to override
| malloc/free.
| adgjlsfhk1 wrote:
| I believe mimalloc works here (but might be wrong).
| glandium wrote:
| Note this requires hackery that relies on Apple not changing
| things in its system allocator, which has happened at least
| twice IIRC.
| chubot wrote:
| Nice post -- so does Facebook no longer use jemalloc at all? Or
| is it maintenance mode?
|
| Or I wonder if they could simply use tcmalloc or another
| allocator these days?
|
| _Facebook infrastructure engineering reduced investment in core
| technology, instead emphasizing return on investment._
| Svetlitski wrote:
| As of when I left Meta nearly two years ago (although I would
| be absolutely shocked if this isn't still the case) Jemalloc is
| _the_ allocator, and is statically linked into every single
| binary running at the company.
|
| > Or I wonder if they could simply use tcmalloc or another
| allocator these days?
|
| Jemalloc is very deeply integrated there, so this is a lot
| harder than it sounds. From the telemetry being plumbed through
| in Strobelight, to applications using every highly Jemalloc-
| specific extension under the sun (e.g. manually created arenas
| with custom extent hooks), to the convergent evolution of
| applications being written in ways such that they perform
| optimally with respect to Jemalloc's exact behavior.
| charcircuit wrote:
| Meta has a fork that they still are working on, where
| development is continuing.
|
| https://github.com/facebook/jemalloc
| burnt-resistor wrote:
| They take everything FLOSS and ruin it with bureaucracy,
| churn, breakage, and inconsideration to external use. They
| may claim FOSS broadly but it's mostly FOSS-washed, unusable
| garbage except for a few popular things.
| umanwizard wrote:
| React, PyTorch, and RocksDB are all extremely significant.
| Not to mention them being one of the biggest contributors
| to the Linux kernel.
| nh2 wrote:
| The point of the blog post is that repo is over-focused on
| Facebook's needs instead of "general utility":
|
| > as a result of recent changes within Meta we no longer have
| anyone shepherding long-term jemalloc development with an eye
| toward general utility
|
| > we reached a sad end for jemalloc in the hands of
| Facebook/Meta
|
| > Meta's needs stopped aligning well with those of external
| uses some time ago, and they are better off doing their own
| thing.
| nh2 wrote:
| But I'd like to know exactly what that means.
|
| How can I find out if Facebook's focus is aligned with my
| own needs?
| anonymoushn wrote:
| The big recent change is that jemalloc no longer has any of
| its previous long-term maintainers. But it is receiving more
| attention from Facebook than it has in a long time, and I am
| somewhat optimistic that, after some recent drama in which
| some of that attention was aimed in a counterproductive
| direction, the company can aim the rest of it in directions
| that Qi and Jason would agree with, and that are well aligned
| with the needs of external users.
| kstrauser wrote:
| I've wondered about this before but never when around people who
| might know. From my outsider view, jemalloc looked like a strict
| improvement over glibc's malloc, according to all the benchmarks
| I'd seen when the subject came up. So, why isn't it the default
| allocator?
| sanxiyn wrote:
| As far as I know there is no technical reason why jemalloc
| shouldn't be the default allocator. In fact, as pointed out in
| the article, it IS the default allocator on FreeBSD. My
| understanding is it is largely political.
| kstrauser wrote:
| Now that I think about it, I could easily imagine it being
| left out of glibc because it doesn't build on Hurd or
| something.
| lloeki wrote:
| > I could easily imagine it being left out of glibc because
| [...]
|
| ... its license is BSD-2-Clause ;)
|
| hence "political"
| vkazanov wrote:
| Huh? Bsd-style licenses are fully compatible with gpl.
|
| The problem is exactly this: Facebook becomes the
| upstream of a key part of your system.
|
| And Facebook can just walk away from the project. Like it
| did just now.
| lloeki wrote:
| They are compatible but that's not the point.
|
| If it were included it would instantly become a LGPL
| hard-fork because of any subsequently added line of code,
| if not by "virality" of the glibc license, at least
| because any glibc author code addition would be LGPL, per
| GNU project policy/ideology.
|
| Also, this would be a hard bar to pass:
| https://sourceware.org/glibc/wiki/CopyrightFSForDisclaim
|
| As I recall this is what prevented Apple from
| contributing C blocks+ back to upstream GCC.
|
| + https://github.com/lloeki/cblocks-clobj
| vkazanov wrote:
| What prevents apple from working with gpl-style licenses
| is strict hatred towards code that they can't use without
| opensourcing it. So this is what prevents them from
| contributing to gpl projects: the need to control access
| to code.
|
| LLVM is OK for them from this point of view: upstream is
| open but they can maintain and distribute their
| proprietary fork.
| lloeki wrote:
| > What prevents apple from working with gpl-style
| licenses is strict hatred towards code that they can't
| use without opensourcing it.
|
| Specifically regarding the C blocks feature introduced in
| Snow Leopard, as I recall, Apple wrote implementations
| for _both_ clang and gcc, attempted to upstream the gcc
| patchset, said gcc patchset was obviously under a GPL
| license, but the GCC team threw a fit because it wanted
| the code copyright to be attributed to the FSF, and that
| ended up as a stalemate.
|
| If there was any hatred they could literally have skipped
| the whole gcc implementation + patchset upstreaming
| attempt altogether. Also they did have patchsets of
| various sizes on other projects, whose code ends up
| obviously being GPL as well.
|
| The "hatred" came later with the GPLv3 family and the
| patent clause, which is a legal landmine, the FSF stating
| that signing apps is incompatible with the GPLv3, and
| getting hung up on copyright transfer.
|
| From https://lwn.net/Articles/405417/
|
| > Apple's motives may not be pure, but it has published
| the code under the license required and it's the FSF's
| own copyright assignment policies that block the
| inclusion. The code is available and licensed
| appropriately for the version of GCC that Apple adopted.
| It might be nice if Apple took the further step of
| assigning copyright to the FSF, but the GPLv3 was not
| part of the bargain that Apple agreed to when it first
| started contributing to GCC.
|
| The intent behind such copyright transfer is generally so
| that the recipient of the transfer can relicense without
| having to ask all contributors. Essentially, as a
| contributor, agreeing to a transfer means ceding control over
| the license that you initially contributed under.
|
| Read another way:
|
| - the FSF says "this code is GPLv2"
|
| - someone contributes under that GPLv2 promise, cedes
| copyright to the FSF because it's "the process"
|
| - the FSF says "this code is now GPLv3 exclusively"
|
| - that someone says "but that was not the deal!"
|
| - the FSF says "I am altering the deal, pray I don't
| alter it any further."
| Y_Y wrote:
| Big evil FSF, always trying to extract value and increase
| their stock price.
| lloeki wrote:
| That was tongue-in-cheek, I thought Darth Vader's voice
| was enough of a cue.
|
| License switches do happen though, and are the source of
| outrage. Cue redis.
|
| The cause of transferring copyright is often practical
| (hard to track down + reach out to + gather answers from
| all authors-slash-contributors which hampers some
| critical decisions down the road); for the FSF it's
| ideological (GCC source code must remain under sole FSF
| control).
|
| The consequence of the transfer though is not well
| understood by authors forfeiting their copyright: they
| essentially agree to have worked for free for whatever
| license the codebase ends up under in the future, including
| possibly becoming entirely closed source.
|
| Think of it next time you sign a CLA!
| Y_Y wrote:
| Apologies. I am generally against CLAs, and I think it's
| shitty of GNU/FSF to use them, even if they promise to
| only do good and free things.
| vkazanov wrote:
| Fsf is a non-profit organisation that proved numerous
| times that the point of its existence is making sure that
| I and we and you have freedom to change things we own.
|
| I contribute and transfer copyright to them for my
| contributions for this sole reason.
|
| Apple is not about freedoms at all.
|
| That's OK. I mean, these are the meanings of what both
| orgs do. Understanding the system, I'd rather spend my
| free time on fsf cause (and make money in a commercial
| organisation).
| jeffbee wrote:
| These allocators often have higher startup cost. They are
| designed for high performance in the steady state, but they can
| be worse in workloads that start a million short-lived
| processes in the unix style.
| kstrauser wrote:
| Oh, interesting. If that's the case, I can see why that'd be
| a bummer for short-lived command line tools. "Makes ls run
| 10x slower" would not be well received. OTOH, FreeBSD uses it
| by default, and it's not known for being a sluggish OS.
| favorited wrote:
| Disclaimer: I'm not an allocator engineer, this is just an
| anecdote.
|
| A while back, I had a conversation with an engineer who
| maintained an OS allocator, and their claim was that custom
| allocators tend to make one process's memory allocation faster
| at the expense of the rest of the system. System allocators are
| less able to make allocation fair holistically, because one
| process isn't following the same patterns as the rest.
|
| Which is why you see it recommended so frequently with
| services, where there is generally one process that you want to
| get preferential treatment over everything else.
| jeffbee wrote:
| I don't think that's really a position that can be defended.
| Both jemalloc and tcmalloc evolved and were refined in
| antagonistic multitenant environments without one
| overwhelming application. They are optimal for that exact
| thing.
| favorited wrote:
| It's possible that they were referring to something
| specific about their platform and its system allocator, but
| like I said it was an anecdote about one engineer's
| statement. I just remember thinking it sounded fair at the
| time.
| vlovich123 wrote:
| The "system" allocator is managing memory within a
| process boundary. The kernel is responsible for managing
| it across processes. Claiming that a user space allocator
| is greedily inefficient is voodoo reasoning that suggests
| the person making the claim has a poor grasp of
| architecture.
| jdsully wrote:
| The "greedy" part is likely not releasing pages back to
| the OS in a timely manner.
| nicoburns wrote:
| That seems odd though, seeing as this is one of the main
| criticisms of glibc's allocator.
| jeffbee wrote:
| In the containerized environments where these allocators
| were mainly developed, it is all but totally pointless to
| return memory to the kernel. You might as well keep
| everything your container is entitled to use, because
| it's not like the other containers can use it. Someone or
| some automatic system has written down how much memory
| the container is going to use.
| toast0 wrote:
| Returning no longer used anonymous memory is not without
| benefits.
|
| Returning pages allows them to be used for disk cache.
| They can be zeroed in the background by the kernel which
| may save time when they're needed again, or zeroing can
| be avoided if the kernel uses them as the destination of
| a full page DMA write.
|
| Also, returning no longer used pages helps get closer to
| a useful memory used measurement. Measuring memory usage
| is pretty difficult of course, but making the numbers a
| little more accurate helps.
| jeffbee wrote:
| There are shared resources involved though, for example
| one process can cause a lot of traffic in khugepaged.
| However I would point out that is an endemic risk of
| Linux's overall architecture. Any process can cause chaos
| by dirtying pages, or otherwise triggering reclaim.
| favorited wrote:
| For context, the "allocator engineer" I was talking to
| was a kernel engineer - they have an extremely solid
| grasp of their platform's architecture.
|
| The whole advantage of being the platform's system
| allocator is that you can have a tighter relationship
| between the library function and the kernel
| implementation.
| lmm wrote:
| > Both jemalloc and tcmalloc evolved and were refined in
| antagonistic multitenant environments without one
| overwhelming application. They are optimal for that exact
| thing.
|
| They were mostly optimised on Facebook/Google server-side
| systems, which were likely one application per VM, no?
| (Unlike desktop usage where users want several applications
| to run cooperatively). Firefox is a different case but
| apparently mainline jemalloc never matched Firefox
| jemalloc, and even then it's entirely plausible that
| Firefox benefitted from a "selfish" allocator.
| jeffbee wrote:
| Google runs dozens to hundreds of unrelated workloads in
| lightweight containers on a single machine, in "borg".
| Facebook has a thing called "tupperware" with the same
| property.
| mort96 wrote:
| The only way I can see that this would be true is if a custom
| allocator is worse about unmapping unused memory than the
| system allocator. After all, processes aren't sharing one
| heap, it's not like fragmentation in one process's address
| space is visible outside of that process... The only aspects
| of one process's memory allocation that's visible to other
| processes is, "that process uses N pages worth of resident
| memory so there's less available for me". But one of the
| common criticisms against glibc is that it's often really bad
| at unmapping its pages, so I'd think that most custom
| allocators are _nicer_ to the system?
|
| I'd be interested in hearing their thoughts directly. I'm
| also not an allocator engineer, and someone who maintains an
| OS allocator probably knows wayyy more about this stuff than
| me. I'm sure there's some missing nuance or context which
| would've made it make sense.
| o11c wrote:
| For a long time, one of the major problems with alternate
| allocators is that they would _never_ return free memory back
| to the OS, just keep the dirty pages in the process. This did
| eventually change, but it remains a strong indicator of
| different priorities.
|
| There's also the fact that ... a lot of processes only ever
| have a single thread, or at most have a few background threads
| that do very little of interest. So all these "multi-threading-
| first allocators" aren't actually buying anything of value, and
| they do have a lot of overhead.
|
| Semi-related: one thing that most people never think about: it
| is exactly the same amount of work for the kernel to zero a
| page of memory (in preparation for a future mmap) as for a
| userland process to zero it out (for its own internal reuse).
| vlovich123 wrote:
| That's actually particular to alternate allocators and not
| true for glibc, if I recall correctly (it's much worse at
| returning memory).
| senderista wrote:
| > Semi-related: one thing that most people never think about:
| it is exactly the same amount of work for the kernel to zero
| a page of memory (in preparation for a future mmap) as for a
| userland process to zero it out (for its own internal reuse)
|
| Possibly more work since the kernel can't use SIMD
| LtdJorge wrote:
| Why is that? Doesn't Linux use SIMD for the crypto
| operations?
| dwattttt wrote:
| Allowing SIMD instructions to be used arbitrarily in
| kernel actually has a fair penalty to it. I'm not sure
| what Linux does specifically, but:
|
| When a syscall is made, the kernel has to backup the user
| mode state of the thread, so it can restore it later.
|
| If any kernel code could use SIMD registers, you'll have
| to backup and restore that too, and those registers get
| big. You could easily be looking at adding a 1kb copy to
| every syscall, and most of the time it wouldn't be
| needed.
| kstrauser wrote:
| Why is that? Couldn't there be push_simd()/pop_simd()
| that the syscall itself uses around its SIMD calls?
|
| If no syscalls use SIMD today, I'd think we're starting
| from a safe position.
| durrrrrrrrrrrrr wrote:
| push_simd/pop_simd exist and are called
| kernel_fpu_begin/kernel_fpu_end. Their use is practically
| prohibited in most areas and iiuc not available on all
| archs, but it's available if needed.
| kstrauser wrote:
| Today I learned. Thanks!
| durrrrrrrrrrrrr wrote:
| It's not so much that you can't ever use it, it's more a
| you really shouldn't. It's more expensive, harder to use
| and rarely worth it. Main users currently are crypto and
| raid checksumming.
|
| https://www.kernel.org/doc/html/next/core-api/floating-
| point...
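| (The pattern described above can be sketched as kernel-module
| code; this is illustrative only, not runnable userspace code,
| and scalar_checksum/simd_checksum are hypothetical helpers.
| kernel_fpu_begin/kernel_fpu_end and irq_fpu_usable are the
| real x86 Linux APIs:)

```c
/* Illustrative kernel-module sketch: SIMD use in the Linux kernel
 * must be explicitly bracketed so the FPU/vector state is saved
 * and restored only around the region that needs it. */
#include <asm/fpu/api.h>

static void checksum_block(const void *buf, size_t len)
{
        if (!irq_fpu_usable()) {
                /* SIMD not safe in this context: take the scalar path. */
                scalar_checksum(buf, len);
                return;
        }
        kernel_fpu_begin();        /* save FPU/SIMD state, disable preemption */
        simd_checksum(buf, len);   /* vector path may clobber SIMD registers */
        kernel_fpu_end();          /* restore state */
}
```

| This is why it's costly: the begin/end pair has to spill and
| reload register state that can run to a kilobyte or more.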
| toast0 wrote:
| It is on FreeBSD. :P Change your malloc, change your life? May
| as well change your libc while you're there and use FreeBSD
| libc too, and that'll be easier if you also adopt the FreeBSD
| kernel.
|
| I will say, the Facebook people were very excited to share
| jemalloc with us when they acquired my employer, but we were
| using FreeBSD so we already had it and thought it was normal.
| :)
| b0a04gl wrote:
| jemalloc's been battle tested in prod at scale, its license is
| permissive, and performance wins are known. so what exactly are
| we protecting by clinging to glibc malloc? ideological purity?
| legacy inertia? who's actually benefiting from this status quo,
| and why do we still pretend it's about "compatibility"?
| skeptrune wrote:
| Kind of nuts that he worked on Jemalloc for over a decade
| while having a personal preference for garbage collection.
| I'm surprised he doesn't have more regret.
| kstrauser wrote:
| Why are those two mutually exclusive? I'd think that a high
| performance allocator would be especially crucial in the
| implementation of a fast garbage collected language. For
| example, in Python you can't alloc(n * sizeof(obj)) to reserve
| that much contiguous space for n objects. Instead, you use the
| builtins which isolate you from that low-level bookkeeping.
| Those builtins have to be pretty fast or performance would be
| terrible.
| fermentation wrote:
| A job is a job
| dikei wrote:
| I still remember the day when I used jemalloc's debug
| features to triage and resolve some nasty memory bloat issues
| in our code that uses RocksDB.
|
| Good times.
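| (That kind of triage can be sketched with jemalloc's heap
| profiler; this assumes a jemalloc built with --enable-prof and
| a hypothetical ./rocksdb-service binary:)

```shell
# Dump a heap profile when the process exits. lg_prof_sample
# trades accuracy for overhead (here ~2^19 bytes between samples).
MALLOC_CONF=prof:true,prof_final:true,lg_prof_sample:19 \
  LD_PRELOAD=/usr/lib/libjemalloc.so.2 ./rocksdb-service

# jeprof ships with jemalloc; attribute live bytes to call stacks.
jeprof --text ./rocksdb-service jeprof.*.heap
```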
| userbinator wrote:
| A bad choice of title, as "postmortem" made me think there was
| some severe outage caused by jemalloc.
| chrisweekly wrote:
| Well, that's not the only meaning of "postmortem". The fine
| article does open with,
|
| _" The jemalloc memory allocator was first conceived in early
| 2004, and has been in public use for about 20 years now. Thanks
| to the nature of open source software licensing, jemalloc will
| remain publicly available indefinitely. But active upstream
| development has come to an end. This post briefly describes
| jemalloc's development phases, each with some success/failure
| highlights, followed by some retrospective commentary."_
| stingraycharles wrote:
| I think this implies your understanding of the term "post-
| mortem" is incorrect, rather than the title.
| drysine wrote:
| Or maybe not
| runevault wrote:
| postmortem is looking back after an event. That can be a
| security event/outage, it can also be the completion of a
| project (see: game studios often do postmortems once their game
| is out to look back on what went wrong and right between
| preproduction, production, and post launch).
| gilgoomesh wrote:
| It's weird that we use "postmortem" in those cases since the
| word literally means "after death"; kind of implying
| something bad happened. I get that most of these postmortems
| are done after major development ceases, so it kind of is
| "dead" but still.
|
| Surely a "retrospective" would be a better word for a look
| back. It even means "look back".
| simonask wrote:
| It gets even better. Some companies use "mid-mortems",
| which are evaluation and reflection processes in the middle
| of a project...
| meepmorp wrote:
| sounds like an appropriate way to talk about death march
| projects, tbh
| bmacho wrote:
| The last part is unfortunate. However, it is a perfectly fine
| choice of title, as it does not make the majority of us think
| that there was an outage caused by jemalloc. You should
| update how you think of the word and align it with the
| majority usage.
| Svetlitski wrote:
| I understand the decision to archive the upstream repo; as of
| when I left Meta, we (i.e. the Jemalloc team) weren't really in a
| great place to respond to all the random GitHub issues people
| would file (my favorite was the time someone filed an issue
| because our test suite didn't pass on Itanium lol). Still, it
| makes me sad to see. Jemalloc is still IMO the best-performing
| general-purpose malloc implementation that's easily usable;
| TCMalloc is great, but is an absolute nightmare to use if you're
| not using bazel (this has become _slightly_ less true now that
| bazel 7.4.0 added cc_static_library so at least you can somewhat
| easily export a static library, but broadly speaking the point
| still stands).
|
| I've been meaning to ask Qi if he'd be open to cutting a final
| 6.0 release on the repo before re-archiving.
|
| At the same time it'd be nice to modernize the default settings
| for the final release. Disabling the (somewhat confusingly
| backwardly-named) "cache oblivious" setting by default so that
| the 16 KiB size-class isn't bloated to 20 KiB would be a major
| improvement. This isn't to disparage your (i.e. Jason's) original
| choice here; IIRC when I last talked to Qi and David about this
| they made the point that at the time you chose this default,
| typical TLB associativity was much lower than it is now. On a
| similar note, increasing the default "page size" from 4 KiB to
| something larger (probably 16 KiB), which would correspondingly
| increase the large size-class cutoff (i.e. the point at which the
| allocator switches from placing multiple allocations onto a slab,
| to backing individual allocations with their own extent directly)
| from 16 KiB up to 64 KiB would be pretty impactful. One of the
| last things I looked at before leaving Meta was making this
| change internally for major services, as it was worth a several
| percent CPU improvement (at the cost of a minor increase in RAM
| usage due to increased fragmentation). There's a few other things
| I'd tweak (e.g. switching the default setting of metadata_thp
| from "disabled" to "auto", changing the extent-sizing for slabs
| from using the nearest exact multiple of the page size that fits
| the size-class to instead allowing ~1% guaranteed wasted space in
| exchange for reducing fragmentation), but the aforementioned
| settings are the biggest ones.
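| (For reference, the settings described above map onto today's
| knobs roughly as follows; this is a sketch, not an official
| recommendation, and ./my-app is a stand-in binary:)

```shell
# Build time: cache-oblivious layout off, 16 KiB logical page size.
./configure --disable-cache-oblivious --with-lg-page=14
make

# Run time: transparent huge pages for allocator metadata set to "auto".
MALLOC_CONF=metadata_thp:auto ./my-app
```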
| kstrauser wrote:
| Stuff like this is what keeps me coming back here. Thanks for
| posting this!
|
| What's hard about using TCMalloc if you're not using bazel?
| (Not asking to imply that it's not, but because I'm genuinely
| curious.)
| Svetlitski wrote:
| It's just a huge pain to build and link against. Before the
| bazel 7.4.0 change your options were basically:
|
| 1. Use it as a dynamically linked library. This is not great
| because you're taking at a minimum the performance hit of
| going through the PLT for every call. The forfeited
| performance is even larger if you compare against statically
| linking with LTO (i.e. so that you can inline calls to
| malloc, get the benefit of FDO, etc.). Not to mention all
| the deployment headaches associated with shared libraries.
|
| 2. Painfully _manually_ create a static library. I've done
| this, it's awful; especially if you want to go the extra mile
| to capture as much performance as possible and at least get
| _partial_ LTO (i.e. of TCMalloc independent of your
| application code, compiling all of TCMalloc's compilation
| units together to create a single object file).
|
| When I was at Meta I imported TCMalloc to benchmark against
| (to highlight areas where we could do better in Jemalloc) by
| painstakingly hand-translating its bazel BUILD files to
| buck2 because there was legitimately no better option.
|
| As a consequence of being so hard to use outside of Google,
| TCMalloc has many more unexpected (sometimes problematic)
| behaviors than Jemalloc when used as a general purpose
| allocator in other environments (e.g. it basically assumes
| that you are using a certain set of Linux configuration
| options [1] and behaves rather poorly if you're not).
|
| [1] https://google.github.io/tcmalloc/tuning.html#system-
| level-o...
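Option 2 above (hand-building a static library with at least partial LTO) looks roughly like the following sketch. Paths and flags are illustrative only; the real TCMalloc build has many more compilation units plus Abseil dependencies, which is exactly why it's painful:

```shell
# Compile each TCMalloc translation unit (illustrative paths/flags;
# adding -flto here and using an LTO-aware linker is what gets the
# real cross-TU optimization within TCMalloc).
clang++ -O2 -c tcmalloc/*.cc

# Merge every object into one relocatable object with ld -r, so the
# allocator ships as a single unit.
ld -r -o tcmalloc_combined.o *.o

# Archive it so applications can link -ltcmalloc statically.
ar rcs libtcmalloc.a tcmalloc_combined.o
```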
| kstrauser wrote:
| Wow. That does sound quite unpleasant.
|
| Thanks again. This is far outside my regular work, but it
| fascinates me.
| prpl wrote:
| I've successfully used LLMs to migrate Makefiles to bazel,
| more or less. I've not tried the reverse but suspect (2)
| isn't so bad these days. YMMV, of course, but food for
| thought.
| rfoo wrote:
| Dunno why you got downvoted, but I've also tried to let
| Claude translate a bunch of BUILD files to equivalent
| CMakeLists.txt. It worked. The resulting CMakeLists.txt
| looks super terrible, but so is 95% of CMakeLists.txt in
| this world, so why bother, it's doomed anyway.
| mort96 wrote:
| They got downvoted because 1) comments of the form "I
| gave a chat bot a toy example of a task and it managed
| it" are tired and uninformative, and 2) nobody was
| talking about anything which would make translating a
| Makefile into Bazel relevant; nobody here has a
| Makefile which we wish was Bazel, we wish Google code
| was easier to work with.
| jeffbee wrote:
| The person above was saying they did a tedious manual
| port of tcmalloc to buck. Since tcmalloc provides both
| bazel and cmake builds, it seems relevant that in these
| days a person could have potentially forced a robot to do
| the job of writing the buck file given the cmake or bazel
| files.
| prpl wrote:
| People are discussing things that are tedious work. I
| think the conversion to Bazel from a makefile is much
| more tedious and error prone than the reverse, in part
| because of Bazel sandboxing although that shouldn't make
| much of a difference for a well-defined collection of
| Makefiles of a C library.
|
| The reverse should be much easier, which was the point of
| the post. Pointing it out as a capability (translation of
| build systems) that is handled well, is, well,
| informative. The future isn't evenly distributed and
| people aren't always aware of capabilities, even on HN
| mort96 wrote:
| What's really tedious is the constant chat bot spam.
| benced wrote:
| Yep I've done something similar. This is the only way I
| managed to compile Google's C++ S2 library (spatial
| indexing) which depends on absl and OpenSSL.
|
| (I managed to avoid infecting my project with boringSSL)
| MaskRay wrote:
| Thanks for sharing the insight!
|
| As I observed when I was at Google: tcmalloc didn't have
| a dedicated team; it was a project driven by server
| performance optimization engineers aiming to improve the
| performance of
| important internal servers. Extracting it to
| github.com/google/tcmalloc was complex due to intricate
| dependencies (https://abseil.io/blog/20200212-tcmalloc).
| As internal performance priorities demanded more focus,
| less time was available for maintaining the CMake build
| system. Maintaining the repo could at best be described as
| a community contribution activity.
|
| > Meta's needs stopped aligning well with those of external
| uses some time ago, and they are better off doing their own
| thing.
|
| I think Google's diverged from external uses even longer
| ago :) (For a long time the google3 and gperftools
| tcmalloc implementations were quite different.)
| mort96 wrote:
| Everything from Google is an absolute pain to work with
| unless you're in Google using their systems, FWIW. Anything
| from the Chromium project is deeply entangled with
| everything else from the Chromium project as part of one
| gigantic Chromium source tree with all dependencies and
| toolchains vendored. They do not care about ABI whatsoever,
| to the point that a lot of Google libraries change
| their public ABI based on whether address sanitizer is
| enabled or not, meaning you can't enable ASAN for your code
| if you use pre-built (e.g package manager provided)
| versions of their code. Their libraries also tend to break
| if you link against them from a project with RTTI enabled,
| a compiler set to a slightly different compiler version, or
| any number of other minute differences that most other
| developers don't let affect their ABI.
|
| And if you try to build their libraries from source, that
| involves downloading tens of gigabytes of sysroots and
| toolchains and vendored dependencies.
|
| Oh and you probably don't want multiple versions of a
| library in your binary, so be prepared to use Google's
| (probably outdated) version of whatever libraries they
| vendor.
|
| And they make no effort whatsoever to distinguish between
| public header files and their source code, so if you wanna
| package up their libraries, be prepared to make scripts to
| extract the headers you need (including headers from
| vendored dependencies), you can't just copy all of some
| 'include/' folder.
|
| And their public headers tend to do idiotic stuff like
| `#include "base/pc.h"`, where that `"base/pc.h"` path is
| _not_ relative to the file doing the include. So you're
| gonna have to pollute the include namespace. Make sure not
| to step on their toes! There's a lot of them.
|
| I have had the misfortune of working with Abseill, their
| WebRTC library, their gRPC library and their protobuf
| library, and it's all terrible. For personal projects where
| I don't have a very, _very_ good reason to use Google code,
| I try to avoid it like the plague. For professional
| projects where I've had to use libwebrtc, the only
| reasonable approach is to silo off libwebrtc into its own
| binary which _only_ deals with WebRTC, typically with a
| line-delimited JSON protocol on stdin /stdout. For things
| like protobuf/gRPC where that hasn't been possible, you
| just have to live with the suffering.
|
| ..This comment should probably have been a blog post.
| pavlov wrote:
| This matches my own experience trying to use Google's C++
| open source. You should write the blog post!
| ahartmetz wrote:
| I think your rant isn't long enough to include everything
| relevant ;) The Blink web engine (which I sometimes
| compile for qtwebengine) takes a really long time to
| compile, several times longer than Gecko according to
| some info I found online. Google has a policy of not
| using forward declarations, including everything instead.
| That's a pretty big WTF for anyone who has ever optimized
| build time. Google probably just throws hardware and
| (distributed) caching at the problem, not giving a shit
| about anyone else building it. Oh, it also needs about 2
| GB of RAM per build thread - basically nothing else does.
| LtdJorge wrote:
| Even with Firefox using Rust and requiring a build of
| many crates, qtwebengine takes more time. It was so bad
| that I had to remove packages from my system (Gentoo)
| that were pulling qtwebengine.
|
| And I build all Rust crates (including rustc) with -O3,
| same as C/C++.
| bialpio wrote:
| Chromium deviates from Google-wide policy and allows
| forward-declarations: https://chromium.googlesource.com/c
| hromium/src/+/main/styleg..., "Forward declarations vs.
| #includes".
| ahartmetz wrote:
| That is really nice to hear, but AFAICS it only means
| that it may change in the future. Because in current
| code, it was ~all includes last time I checked.
|
| Well, I remember one - very biased - example where I had
| a look at a class that was especially expensive to
| compile, like 40 seconds (on a Ryzen 7950X) and maybe 2
| GB of RAM. It had under 200 LOC and didn't seem to do
| anything that's typically expensive to compile... except
| for the stuff it included. Which also didn't seem to do
| anything fancy. But transitive includes can snowball if
| you don't add any "compile firewalls".
| stick_figure wrote:
| This is actually tracked at a publicly visible URL:
| https://commondatastorage.googleapis.com/chromium-
| browser-cl...
|
| And the include graph analysis:
| https://commondatastorage.googleapis.com/chromium-
| browser-cl...
|
| The annotated red dots correspond to the last time Chrome
| developers did a big push to prune the include graph to
| optimize build time. It was effective, but there was push
| back. C++ developers just want magic, they don't want to
| think about dependency management, and it's hard to blame
| them. But, at the end of the day, builds scale with
| sources times dependencies, and if you aren't
| disciplined, you can expect superlinear build times.
| ahartmetz wrote:
| Good that it's being tracked, but Jesus, these numbers!
|
| 110 CPU hours for a build. (Fortunately, it seems to be a
| little over half that for my CPU. "Cloud CPUs" are kinda
| slow.)
|
| I picked the 5001st largest file with includes. It's
| zoom_view_controller.cc, 140 lines in the .cc file, size
| with includes: 19.5 MB.
|
| Initially I picked the 5000th largest file with includes,
| but for devtools_target_ui.cc, I see a bit more
| legitimacy for having lots of includes. It has 384 "own"
| lines in the .cc file and, of course, also about 19.5 MB
| size with includes.
|
| A C++20 source file including some standard library
| headers easily bloats to a little under 1 MB IIRC, and
| that's already kind of unreasonable. 20x of that is very
| unreasonable.
|
| I don't think that I need to tell anyone on the Chrome
| team how to improve performance in software: you measure
| and then you grab the dumb low-hanging fruit first. From
| these results, it doesn't seem like anyone is working
| with the actual goal to improve the situation as long as
| the guidelines are followed on paper.
| bialpio wrote:
| > I picked the 5001st largest file with includes. It's
| zoom_view_controller.cc, 140 lines in the .cc file, size
| with includes: 19.5 MB.
|
| > Initially I picked the 5000th largest file with
| includes, but for devtools_target_ui.cc, I see a bit more
| legitimacy for having lots of includes. It has 384 "own"
| lines in the .cc file and, of course, also about 19.5 MB
| size with includes.
|
| > A C++20 source file including some standard library
| headers easily bloats to a little under 1 MB IIRC, and
| that's already kind of unreasonable. 20x of that is very
| unreasonable.
|
| I think you're not arguing pro-forward-declarations vs
| anti-forward-declarations here though - it sounds more
| like an argument for more granular header/source files?
| In .cc file, each and every include should be necessary
| for the file to compile (although looking at your
| example, bind.h seems to be unused and could be removed -
| looks like the file was refactored and the includes
| weren't cleaned up).
|
| With that said, in the corresponding
| zoom_view_controller.h, the tab_interface.h include looks
| to be unnecessary so you did find one good example. :)
| ahartmetz wrote:
| Yes, sure! I am arguing for whatever is necessary to
| reduce the total compilation cost. Pruning headers,
| rearranging source code to have fewer trivial modules and
| to reduce the size of very often included headers, even
| _gasp_ sometimes using pointers just to reduce compile
| time! I understand that runtime performance is a very
| high priority for Blink, but it really doesn't matter
| sometimes if certain things are heap-allocated. Like
| things that are very expensive to instantiate anyway and
| that don't occur often. These will incidentally tend to
| have "heavy" headers, too.
| bialpio wrote:
| > Because in current code, it was ~all includes last time
| I checked.
|
| That's another matter - just because forward-declares are
| allowed, doesn't mean they are mandated, but in my
| experience the reviewers were paying attention to that
| pretty well.
|
| Counter-examples to "~all includes": https://source.chro
| mium.org/chromium/chromium/src/+/main:thi..., https://sou
| rce.chromium.org/chromium/chromium/src/+/main:thi..., htt
| ps://source.chromium.org/chromium/chromium/src/+/main:thi
| ....
|
| I picked couple random headers from the directory where
| I've contributed the most to blink, and from what I'm
| seeing, most of the classes that could be forward-
| declared, were. I have not looked at .cc files given that
| those tend to need to see the declaration (except when
| it's unused, but then why have a forward-decl at all?) or
| the compiler would complain about access into incomplete
| type.
|
| > Well, I remember one - very biased - example where I
| had a look at a class that was especially expensive to
| compile, like 40 seconds (on a Ryzen 7950X) and maybe 2
| GB of RAM. It had under 200 LOC and didn't seem to do
| anything that's typically expensive to compile... except
| for the stuff it included.
|
| Maybe the stuff was actually being compiled because of
| some member in a class (so it was actually expensive to
| compile). Or maybe you stumbled upon a place where folks
| weren't paying attention. Hard to say without a concrete
| example. The "compile firewall" was added pretty recently
| I think, but I don't know if it's going to block anything
| from landing.
|
| Edit: formatting (switched bulleted list into comma-
| separated because clearly I don't know how to format it).
| rfoo wrote:
| > they make no effort what so ever to distinguish between
| public header files and their source code
|
| They did, in a different way. The rest of the world
| distinguishes them by convention, putting them in
| different directory hierarchies (src/, include/).
| google3 depends on
| the build system to do so, "which header file is public"
| is documented in BUILD files. You are then required to
| use their build system to grasp the difference :(
|
| > And their public headers tend to do idiotic stuff like
| `#include "base/pc.h"`, where that `"base/pc.h"` path is
| not relative to the file doing the include.
|
| I have to disagree on this one. Relying on relative
| include paths suck. Just having one `-I/project/root` is
| the way to go.
| mort96 wrote:
| > I have to disagree on this one. Relying on relative
| include paths suck. Just having one `-I/project/root` is
| the way to go.
|
| Oh to be clear, I'm not saying that they should've used
| relative includes. I'm complaining that they don't put
| their includes in their own namespace. If public headers
| were in a folder called `include/webrtc` as is the
| typical convention, and they all contained `#include
| <webrtc/base/pc.h>` or `#include "webrtc/base/pc.h"` I
| would've had no problem. But as it is, WebRTC's headers
| are in include paths which it's really difficult to avoid
| colliding with. You'll cause collisions if your project
| has a source directory called `api`, or `pc`, or `net`,
| or `media`, or a whole host of other common names.
| rfoo wrote:
| Thanks for the clarification. Yeah, that's pretty
| frustrating.
|
| Now I'm curious why grpc, webrtc and some other Chromium
| repos were set up like this. Google projects which
| started in google3 and later exported as an open source
| project don't have this defect, for example tensorflow,
| abseil etc. They all had a top-level directory containing
| all their codes so it becomes `#include "tensorflow/...`.
|
| Feels like a weird collision of coding style and starting
| a project outside of their monorepo
| alextingle wrote:
| >> `#include "base/pc.h"`, where that `"base/pc.h"` path
| is not relative to the file doing the include.
|
| > I have to disagree on this one.
|
| The double-quotes literally mean "this dependency is
| relative to the current file". If you want to depend on a
| -I, then signal that by using angle brackets.
| mort96 wrote:
| Eh, no. The quotes mean "this is not a dependency on a
| system library". Quotes can include relative to the
| files, or they can include things relative to directories
| specified with -I. The only thing they can't do is include
| things relative to directories specified with -isystem
| and system include directories.
|
| I would be surprised if I read some project's code where
| angle brackets are used to include headers from within
| the same project. I'm not surprised when quotes are used
| to include code from within the project but relative to
| the project's root.
| fc417fc802 wrote:
| Reading this perspective was interesting. I can
| appreciate that things didn't fit into your workflow very
| well, but my experience has been the opposite. Their
| projects seem to be structured from the perspective of
| building literally everything from source on the spot.
| That matches my mindset - I choose to build from scratch
| in a network isolated environment. As a result google
| repos are some of the few that I can count on to be
| fairly easy to get up and running. An alarming number of
| projects apparently haven't been tested under such
| conditions and I'm forced to spend hours patching up
| cmake scripts. (Even worse are the projects that require
| 'npm install' as part of the build process. Absurdity.)
|
| > Oh and you probably don't want multiple versions of a
| library in your binary, so be prepared to use Google's
| (probably outdated) version of whatever libraries they
| vendor.
|
| This is the only complaint I can relate to. Sometimes
| they lag on rolling dependencies forward. Not so
| infrequently there are minor (or not so minor) issues
| when I try to do so myself and I don't want to waste time
| patching my dependencies up so I get stuck for a while
| until they get around to it. That said, usually rolling
| forward works without issue.
|
| > if you try to build their libraries from source, that
| involves downloading tens of gigabytes of sysroots and
| toolchains and vendored dependencies.
|
| Out of curiosity which project did you run into this
| with? That said, isn't the only alternative for them
| moving to something like nix? Otherwise how do you
| tightly specify the build environment?
| bluGill wrote:
| > I choose to build from scratch in a network isolated
| environment. As a result google repos are some of the few
| that I can count on to be fairly easy to get up and
| running.
|
| If you are building a single google project they are easy
| to get up and running. If you are building your own
| project on top of theirs, things get difficult. Those
| library issues will get you.
|
| I don't know about OP, but we have our own in house
| package manager. If Conan was ready a couple years sooner
| we would have used that instead.
| mort96 wrote:
| I don't really have the care nor time to respond as
| thoroughly as you deserve, but here are some thoughts:
|
| > Out of curiosity which project did you run into this
| with?
|
| Their WebRTC library for the most part, but also the gRPC
| C++ library. Unlike WebRTC, grpc++ is in most package
| managers so the need to build it myself is less, but
| WebRTC is a behemoth and not in any package manager.
|
| > That said, isn't the only alternative for them moving
| to something like nix? Otherwise how do you tightly
| specify the build environment?
|
| I don't expect my libraries to tightly specify the build
| environment. I expect my libraries to conform to my
| software's build environment, to use versions of other
| libraries that I provide to it, etc etc. I don't mind
| that Google builds their application software the way
| they do, Google Chrome should tightly constrain its build
| environment if Google wants; but their libraries should
| fit in to _my_ environment.
|
| I'm wondering, what is your relationship with Google
| software that you build from source? Are you building
| their libraries to integrate with your own applications,
| or do you just build Google's applications from source
| and use them as-is?
| ewalk153 wrote:
| I've hit similar problems with their Ruby gRPC library.
|
| The counter example is the language Go. The team running
| Go has put considerable care and attention into making
| this project welcoming for developers to contribute,
| while still adhering to Google code contribution
| requirements. Building from source is straightforward and
| IIRC it's one of the easier cross-compilers to set up.
|
| Install docs: https://go.dev/doc/install/source#bootstrap
| FromBinaryRelease
| rstat1 wrote:
| Go is kind of a pain to build from source. Build one
| version to build another, and another..
|
| Or rather it was the last time I tried.
| rstat1 wrote:
| I agree to a point. grpc++ (and protobuf and boringssl
| and abseil and....) was the biggest pain in the ass to
| integrate in to a personal project I've ever seen. I
| ended up having to write a custom tool to convert their
| Bazel files to the format my projects tend to use (GN and
| Ninja). Many hours wasted. There were no library-specific
| "sysroots" or "toolchains" involved, though, thankfully,
| because I'm sure that would have made things even worse.
|
| Upside is (I guess) if I ever want to use grpc in another
| project the work's already done and it'll just be a
| matter of copy/paste.
| matoro wrote:
| That was me that filed the Itanium test suite failure. :)
| boulos wrote:
| The Itanic was kind of great :). I'm convinced it helped sink
| SGI.
| froh wrote:
| Sunk by the Great Itanic ?
| sitkack wrote:
| Why was the sinking of SGI great?
| boulos wrote:
| Oh, that wasn't the intent. I meant two separate things.
| The Itanic itself was kind of fascinating, but mostly
| panned (hence the nickname).
|
| SGI's decision to build out Itanium systems may have
| helped precipitate their own downfall. That was sad.
| cogman10 wrote:
| Still makes me sad. I partially think a major reason for
| the demise was that it was simply constructed too soon.
| Compiler tech wasn't nearly good enough to handle the
| ISA.
|
| Nowadays because of the efforts that have gone in to
| making SIMD effective, I'd think modern compilers would
| have an easier time taking advantage of that unique and
| strange uarch.
| acdha wrote:
| SGI and HP! Intel should have a statue of Rick Belluzzo on
| their campus.
| crest wrote:
| Itanium did its most important job: it killed
| everything but ARM and POWER.
| apaprocki wrote:
| Ah, porting to HP Superdome servers. It's like being handed a
| brochure describing the intricate details of the iceberg the
| ship you just boarded is about to hit in a few days.
|
| A fellow traveler, ahoy!
| cogman10 wrote:
| I worked on the Superdome servers back in the day. What a
| weird product. I still can't believe it was a profitable
| division (at my time circa 2011).
|
| HP was going through some turbulent waters in those days.
| kabdib wrote:
| one of the best books on Linux architecture i've read was the
| one on the Itanium port
|
| i think, because Itanic broke a _ton_ of assumptions
| EnPissant wrote:
| Do you have any opinions on mimalloc?
| gazpacho wrote:
| I would love to see these changes - or even some sort of blog
| post or extended documentation explaining the rationale. As
| is, the docs are somewhat barren. I feel that there's a lot of
| knowledge that folks like you have right now from all of the
| work that was done internally at Meta that would be best shared
| now before it is lost.
| einpoklum wrote:
| > TCMalloc is great, but is an absolute nightmare to use if
| you're not using bazel
|
| custom-malloc-newbie question: Why is the choice of build
| system (generator) significant when evaluating the usability of
| a library?
| fc417fc802 wrote:
| Because you need to build it to use it, and you likely
| already have significant build related infrastructure, and
| you are going to need to integrate any new dependencies into
| that. I'm increasingly convinced that the various build
| systems are elaborate and wildly successful ploys intended
| only to sap developer time and energy.
| CamouflagedKiwi wrote:
| Because you have to build it. If they don't use the same
| build system as you, you either want to invoke their system,
| or import it into yours. The former is unappealing if it's
| 'heavy' or doesn't play well as a subprocess; the latter can
| take a lot of time if the build process you're replicating is
| complex.
|
| I've done both before, and seen libraries at various levels
| of complexity; there is definitely a point where you just
| want to give up and not use the thing when it's very complex.
| username223 wrote:
| This. When step one is "install our weird build system,"
| I'll immediately look for something else that meets my
| needs. All build systems suck, so everyone thinks they can
| write a better one, and too many people try. Pretty soon
| you end up having to learn a majority of this (https://en.w
| ikipedia.org/wiki/List_of_build_automation_softw...) to get
| your code to compile.
| einpoklum wrote:
| If TCMalloc uses bazel, then you build it with Bazel. It
| just needs to install itself where you tell it to, and
| then either it has given you a pkg-config file, or
| otherwise, your own build system needs some library-
| finding logic for it ("find module" in CMake terms). Or -
| are you saying the problem is that you need to install
| Bazel?
| klabb3 wrote:
| > we (i.e. the Jemalloc team) weren't really in a great place
| to respond to all the random GitHub issues people would file
|
| Why not? I mean this is complete drive-by comment, so please
| correct me, but there was a fully staffed team at Meta that
| maintained it, but was not in the best place to manage the
| issues?
| xcrjm wrote:
| They said the team was not _in_ a great place to do it, e.g.
| they probably had competing priorities that overshadowed
| triaging issues.
| anonymoushn wrote:
| Well, to be blunt, the company does not care about this, so
| it does not get done.
| Thaxll wrote:
| It's kind of wild that great software is hindered by a
| complicated build and integration process.
| mavis wrote:
| Switching to jemalloc instantly fixed an irksome memory leak in
| an embedded Linux appliance I inherited many moons ago. Thank you
| je, we salute you!
| vlovich123 wrote:
| That's because sane allocators that aren't glibc will return
| unused memory periodically to the OS while glibc prefers to
| permanently retain said memory.
| masklinn wrote:
| glibc will return memory to the OS just fine, the problem is
| that its arena design is _extremely_ prone to fragmentation,
| so you end up with a bunch of arenas which are almost but not
| quite empty and can't be released, but can't really be used
| either.
|
| In fact, Jason himself (the author of jemalloc and TFA)
| posted an article on glibc malloc fragmentation 15 years ago:
| https://web.archive.org/web/20160417080412/http://www.canonw.
| ..
|
| And it's an issue to this day:
| https://blog.arkey.fr/drafts/2021/01/22/native-memory-
| fragme...
| nh2 wrote:
| glibc does NOT return memory to the OS just fine.
|
| In my experience it delays it way too much, causing memory
| overuse and OOMs.
|
| I have a Python program that allocates 100 GB for some
| work, free()s it, and then calls a subprocess that takes
| 100 GB as well. Because the memory use is serial, it should
| fit in 128 GB just fine. But it gets OOM-killed, because
| glibc does not turn the free() into an munmap() before the
| subprocess is launched, so it needs 200 GB total, with 100
| GB sitting around pointlessly unused in the Python process.
|
| This means if you use glibc, you have no idea how much
| memory your system will use and whether they will OOM-
| crash, even if your applications are carefully designed to
| avoid it.
|
| Similar experience:
| https://news.ycombinator.com/item?id=24242571
|
| I commented there 4 years ago the glibc settings
| MALLOC_MMAP_THRESHOLD_ and MALLOC_TRIM_THRESHOLD_ should
| fix that, but I was wrong: MALLOC_TRIM_THRESHOLD_ is
| apparently bugged and has no effect in some situations.
|
| A bug I think might be involved: "free() doesn't honor
| M_TRIM_THRESHOLD"
| https://sourceware.org/bugzilla/show_bug.cgi?id=14827
|
| Open since 13 years ago. This stuff doesn't seem to get
| fixed.
|
| The fix in general is to use jemalloc with
| MALLOC_CONF="retain:false,muzzy_decay_ms:0,dirty_decay_ms:0"
|
| which tells it to immediately munmap() at free().
|
| So in jemalloc, the settings to control this behaviour seem
| to actually work, in contrast to glibc malloc.
|
| (I'm happy to be proven wrong here, but so far no
| combination of settings seem to actually make glibc return
| memory as written in their docs.)
|
| From this perspective, it is frightening to see the
| jemalloc repo being archived, because that was my way to
| make sure stuff doesn't OOM in production all the time.
| Crespyl wrote:
| Can you elaborate on this? I don't know much about
| allocators.
|
| How would the allocator know that some block is unused, short
| of `free` being called? Does glibc not return all memory
| after a `free`? Do other allocators do something clever to
| automatically release things? Is there just a lot of
| bookkeeping overhead that some allocators are better at
| handling?
| adwn wrote:
| When `free()` is called, the allocator internally marks
| that specific memory area as _unused_ , but it doesn't
| necessarily return that area back to the OS, for two main
| reasons:
|
| 1. `malloc()` is usually called with sizes smaller than the
| sizes by which the allocator requests memory from the OS,
| which are at least page-sized (4096 bytes on x86/x86-64)
| and often much larger. After a `free()`, the freed memory
| can't be returned to the OS because it's only a small chunk
| in a larger OS allocation. Only after all memory within a
| page has been `free()`d, the allocator may, but doesn't
| have to, return that page back to the OS.
|
| 2. After a `free()`, the allocator wants to hang on to that
| memory area because the next `malloc()` is sure to follow
| soon.
|
| This is a very simplified overview, and different
| allocators have different strategies for gathering new
| `malloc()`s in various areas and for returning areas back
| to the OS (or not).
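The bookkeeping described above can be poked at from Python on Linux/glibc. This is a hedged sketch: `malloc_trim(3)` is a real glibc call that asks the allocator to give free heap memory back to the OS, but how much is actually released depends on fragmentation and on glibc's trim thresholds:

```python
import ctypes
import resource

def rss_bytes():
    # Resident set size from /proc/self/statm (field 2 = resident pages).
    with open("/proc/self/statm") as f:
        return int(f.read().split()[1]) * resource.getpagesize()

# Allocate many ~4 KiB blocks so they come out of glibc's heap arenas
# (a single huge allocation would be mmap()ed and returned immediately
# on free, which would hide the effect).
blobs = [bytes(4096) for _ in range(20000)]  # roughly 80 MiB total
before_free = rss_bytes()
del blobs  # every block is free()d, but the pages may stay mapped

# malloc_trim(0) returns 1 if glibc released any memory, 0 otherwise.
libc = ctypes.CDLL("libc.so.6")
released = libc.malloc_trim(0)
after_trim = rss_bytes()
print(f"RSS before free: {before_free}, after trim: {after_trim}")
```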
| mort96 wrote:
| They're not really correct, glibc will return stuff back to
| the OS. It just has some quirks about how and when it does
| it.
|
| First, some background: no allocator will return memory
| back to the kernel for every `free`. That's for performance
| and memory consumption reasons: the smallest unit of memory
| you can request from and return to the kernel is a _page_
| (typically 4kiB or 16kiB), and requesting and returning
| memory (typically called "mapping" and "unmapping" memory
| in the UNIX world) has some performance overhead.
|
| So if you allocate space for one 32-byte object for
| example, your `malloc` implementation won't map a whole new
| 4k or 16k page to store 32 bytes. The allocator probably
| has some pages from earlier allocations, and it will make
| space for your 32-byte allocation in pages it has already
| mapped. Or it can't fit your allocation, so it will map
| more pages, and then set aside 32 bytes for your
| allocation.
|
| This all means that when you call `free()` on a pointer,
| the allocator can't just unmap a page immediately, because
| there may be other allocations on the same page which
| haven't been freed yet. Only when all of the allocations
| which happen to be on a specific page are freed, can the
| page be unmapped. In a worst-case situation, you could in
| theory allocate and free memory in such a way that you end
| up with 100 1-byte allocations allocated across 100 pages,
| none of which can be unmapped; you'd be using 400kiB or
| 1600kiB of memory to store 100 bytes. (But that's not
| necessarily a huge problem, because it just means that
| future allocations would probably end up in the existing
| pages and not increase your memory consumption.)
|
| Now, the glibc-specific quirk: glibc will only ever unmap
| _the last page_, from what I understand. So you can
| allocate megabytes upon megabytes of data, which causes
| glibc to map a bunch of pages, then free() every allocation
| except for the last one, and you'd end up still consuming
| many megabytes of memory. Glibc won't unmap those megabytes
| of unused pages until you free the allocation that sits in
| the last page that glibc mapped.
|
| This typically isn't a huge deal; yes, you're keeping more
| memory mapped than you strictly need, but if the
| application needs more memory in the future, it'll just re-
| use the free space in all the pages it has already mapped.
| So it's not like those pages are "leaked", they're just
| kept around for future use.
|
| It can sometimes be a real problem though. For example, a
| program could do a bunch of memory-intensive computation on
| launch requiring gigabytes of memory at once, then all that
| computation culminates in one relatively small allocated
| object, then the program calls free() on all the
| allocations it did as part of that computation. The
| application could potentially keep around gigabytes worth
| of pages which serve no purpose but can't be unmapped due
| to that last small allocation.
|
| If any of this is wrong, I would love to be corrected. This
| is my current impression of the issue but I'm not an
| authoritative source.
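A short experiment can make the described behavior concrete. This is a sketch, not an authoritative test: it assumes glibc 2.33+ on Linux (mallinfo2 is a glibc extension), and the exact numbers depend on malloc tunables such as the mmap threshold.

```c
#include <malloc.h>  /* mallinfo2(), a glibc >= 2.33 extension */
#include <stdlib.h>

enum { N = 10000, SZ = 4096 };

/* Allocate N small heap blocks, free all but the most recent one,
 * and report how many bytes of freed space glibc still holds.
 * The last block sits near the top of the heap and pins the freed
 * space below it, so the heap cannot shrink. */
size_t freed_but_retained_bytes(void) {
    static void *blocks[N];
    for (int i = 0; i < N; i++)
        blocks[i] = malloc(SZ);     /* below the mmap threshold */
    for (int i = 0; i < N - 1; i++)
        free(blocks[i]);            /* ~40 MiB freed, none unmapped */
    struct mallinfo2 mi = mallinfo2();
    return mi.fordblks;             /* total free bytes in the arena */
}
```

Freeing blocks[N - 1] afterwards would let glibc trim the heap top, and calling malloc_trim(0) can also release some of the interior free space back to the kernel.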
| p0w3n3d wrote:
| Thank you. Jemalloc was recently recommended to me in a
| presentation about Java optimization.
|
| I wonder whether you got everything you should have from the
| companies that use it. I mean, sometimes I feel that big tech
| firms only use free software without ever giving anything back,
| so I hope you were the exception here.
| jeffbee wrote:
| Imagine being a Java developer and thinking "what have big tech
| corporations ever done for me?"
| keybored wrote:
| Things that are good for me, the developer.
| masklinn wrote:
| > jemalloc was probably booted from Rust binaries sooner than the
| natural course of development might have otherwise dictated.
|
| FWIW while it was _a_ factor it was just one of a number:
| https://github.com/rust-lang/rust/issues/36963#issuecomment-...
|
| And jemalloc was only removed two years after that issue was
| opened: https://github.com/rust-lang/rust/pull/55238
| Aissen wrote:
| Interesting that one of the factors listed there, the
| hardcoded page size on arm64, is still an unsolved issue
| upstream, which forces app developers to either ship
| multiple arm64 Linux binaries or drop support for some
| platforms.
|
| I wonder if some kind of dynamic page-size (with dynamic
| ftrace-style binary patching for performance?) would have been
| that much slower.
| pkhuong wrote:
| You can run jemalloc configured with 16KB pages on a 4KB page
| system.
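Per jemalloc's INSTALL documentation, the page size the allocator assumes is fixed at build time via --with-lg-page (the log2 of the page size), so a build sketched like the following assumes 16 KiB pages while still running on 4 KiB-page hardware:

```shell
# Sketch: build jemalloc assuming 16 KiB pages (2^14 bytes).
./configure --with-lg-page=14
make
```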
| schrep wrote:
| Your work was so impactful over a long period from Firefox to
| Facebook. Honored to have been a small part of it.
| lbrandy wrote:
| Suppose this is as good a place to pile-on as any.
|
| Though this was not the post I was expecting to show up today,
| it was super awesome for me to get to have played my tiny part
| in this big journey. Thanks for everything @je (and qi + david
| -- and all the contributors before and after my time!).
| liuliu wrote:
| Your leadership in continuing to invest in core technologies at
| Facebook was as fruitful as it could ever be. GraphQL, PyTorch,
| and React, to name a few, could not have happened without it.
| dao- wrote:
| Hmm, if I had to choose between not having Facebook and
| having React, I'd pick the former in a heartbeat. Not that
| this was a real choice, but it was nonetheless bitter to see
| colleagues join the behemoth that was Facebook.
| Omarbev wrote:
| This is a good thing
| adityapatadia wrote:
| Jason, here is a story about how much your work impacts us. We
| run a decently sized company that processes hundreds of millions
| of images/videos per day. When we first started about 5 years
| ago, we spent countless hours debugging issues related to memory
| fragmentation.
|
| One fine day, we discovered Jemalloc and put it into the service
| that was causing a lot of memory fragmentation. We did not think
| that those two lines of changes in a Dockerfile were going to fix
| all of our woes, but we were pleasantly surprised. Every single
| issue went away.
|
| Today, our multi-million dollar revenue company is using your
| memory allocator on every single service and on every single
| Dockerfile.
|
| Thank you! From the bottom of our hearts!
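For reference, the kind of two-line Dockerfile change described above usually looks something like this; the package name and library path are Debian/Ubuntu assumptions and vary by distribution and architecture:

```dockerfile
RUN apt-get update && apt-get install -y libjemalloc2
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
```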
| laszlojamf wrote:
| I really don't mean to be snarky, but honest question: Did you
| donate? Nothing says thank you like some $$$...
| onli wrote:
| It was a Meta project and development ceased. For a regular
| project that expectation is fine, but here it does not apply
| IMHO.
| adityapatadia wrote:
| We regularly donate to projects via Open Collective. We
| frankly did not consider it here due to the FB involvement,
| I think.
| thewisenerd wrote:
| indeed! most image processing golang services suggest/use
| jemalloc
|
| the top 3 from https://github.com/topics/resize-images (as of
| 2025-06-13)
|
| imaginary:
| https://github.com/h2non/imaginary/blob/1d4e251cfcd58ea66f83...
|
| imgproxy:
| https://web.archive.org/web/20210412004544/https://docs.imgp...
| (linked from a discussion in the imaginary repo)
|
| imagor:
| https://github.com/cshum/imagor/blob/f6673fa6656ee8ef17728f2...
| tecleandor wrote:
| Yep, imgproxy seems to use libvips, that recommends jemalloc.
| I was checking and this is a funny (not) bug report:
|
| https://github.com/libvips/libvips/discussions/3019
| b0a04gl wrote:
| been using jemalloc unknowingly for a long time. only after
| reading this post it hit how much of it was under the hood in
| things I've built. didn't know the gc-style decay mechanism was
| that involved, or that it handled fragmentation with time-based
| heuristics. surprising how much tuning was exposed through env
| vars. solid closure
| brcmthrowaway wrote:
| What allocator does Apple use?
| half-kh-hacker wrote:
| you probably want to look at their 'libmalloc'
| forty wrote:
| Probably iMalloc ;)
| wiz21c wrote:
| FTA:
|
| > And people find themselves in impossible situations where the
| main choices are 1) make poor decisions under extreme pressure,
| 2) comply under extreme pressure, or 3) get routed around.
|
| It doesn't sound like a workplace :-(
| bravetraveler wrote:
| Sounds like every workplace I've 'enjoyed' since ~2008
| throwaway314155 wrote:
| nice username
|
| - fsociety
| mrweasel wrote:
| Now I'm not one for victim blaming, but if that's more than
| three places of employment, maybe you need to rethink the
| positions you apply for.
| acdha wrote:
| There's something to that but it is victim blaming if
| you're not acknowledging the larger trends. There are a lot
| of places whose MBAs are attending the same conferences,
| getting the same recommendations from consultants, and
| hearing the same demands from investors. The push against
| remote work, for example, was all driven by ideology
| against most of the available data but it affected a huge
| number of jobs.
| throw0101d wrote:
| > _The push against remote work, for example, was all
| driven by ideology against most of the available data but
| it affected a huge number of jobs._
|
| And before that, open office plans.
|
| You're saving on rent: great. But what is it doing to
| productivity?
|
| * https://business.adobe.com/blog/perspectives/what-
| science-sa...
|
| Of course productivity doesn't show up on a spreadsheet,
| but rent does, so it's all about what "the numbers" say.
| the_mitsuhiko wrote:
| All the allocators have the same issue. They largely work against
| a shared set of allocation APIs. Many of their users mostly
| engage via malloc and free.
|
| So the flow is like this: user has an allocation looking issue.
| Picks up $allocator. If they have an $allocator type problem then
| they keep using it, otherwise they use something else.
|
| There are tons of users of these allocators, but many rarely
| engage with the developers. Many wouldn't even notice
| improvements or regressions on upgrades because after the initial
| choice they stop looking.
|
| I'm not sure how to fix that, but this is not healthy for such
| projects.
| Cloudef wrote:
| malloc is a bad API in general; if you want to go fast, you
| don't rely on a general-purpose allocator
| const_cast wrote:
| This is true, but the unfortunate thing with how C and C++
| were developed is that pretty much everything just assumes
| the existence of malloc/free. So if you're using third-party
| libraries, then it's mostly out of your control. Linking a new
| allocator is a very easy and pretty much free way to improve
| performance.
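Concretely, swapping in an allocator requires no code changes, because it only has to export the standard malloc/free symbols. Both of these are common ways to do it (the preload path is a Debian-style assumption):

```shell
# Link jemalloc at build time so its malloc/free override libc's:
cc -o my_program my_program.c -ljemalloc

# Or preload it into an unmodified binary at run time:
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./my_program
```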
| dazzawazza wrote:
| I've used jemalloc in every game engine I've written for years.
| It's just the thing to do. WAY faster on win32 than the default
| allocator. It's also nice to have the same allocator across all
| platforms.
|
| I learned of it from its integration in FreeBSD and never looked
| back.
|
| jemalloc has helped entertain a lot of people :)
| Iwan-Zotow wrote:
| +1
|
| windows def allocator is pos. Jemalloc rules
| ahartmetz wrote:
| >windows def allocator is pos
|
| Wow, still? I remember allocator benchmarks from 10-15 years
| ago where there were some notable differences between
| allocators... and then Windows with like 20% the performance
| of everything else!
| int_19h wrote:
| > windows def allocator
|
| Which one of them? These days it could mean HeapAlloc, or it
| could mean malloc from uCRT.
| carey wrote:
| malloc in uCRT just calls HeapAlloc, though? You can see
| the code in ucrt\heap\malloc_base.cpp if you have the
| Windows SDK installed.
|
| Programs can opt in to the _segment_ heap in their
| manifest, but it's not necessarily any faster.
| mrweasel wrote:
| Looking at all the comments and lightly browsing the source code,
| I'm amazed. Both at how much impact a memory allocator can make,
| but also how much code is involved.
|
| I'm not really sure what I expected, but somehow I expected a
| memory allocator to be ... smaller, simpler perhaps?
| ratorx wrote:
| Memory allocators can be simple. In fact it was an assignment
| for a course in the 2nd year of my CS degree to make an
| (almost) complete allocator.
|
| However it is typically always more complex to make production
| quality software, especially in a performance sensitive domain.
| burnt-resistor wrote:
| Naive allocators are very easy: just subdivide RAM and
| defragment only when absolutely necessary (if virtual memory
| is unavailable). Performant allocators are _hard._
|
| I think we lost a great deal of potential when ORCA was too
| tied to Pony and never extracted into a framework, tool, or
| library useful outside of it, such as one integrated with
| LLVM.
| const_cast wrote:
| It's the same way with garbage collectors.
|
| You can write a naive mark-and-sweep in an afternoon. You can
| write a reference counter in even less time. And for some
| runtimes this is fine.
|
| But writing a generational, concurrent, moving GC takes a lot
| of time. But if you can achieve it, you can get amazing
| performance gains. Just look at recent versions of Java.
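To give a sense of scale for the "even less time" claim, here is a hypothetical minimal intrusive reference counter in C: no thread safety, no cycle detection, just the core mechanism.

```c
#include <stddef.h>

/* Hypothetical header embedded at the start of every counted object. */
typedef struct {
    long refs;                     /* current reference count */
    void (*destroy)(void *self);   /* optional finalizer, may be NULL */
} RefHeader;

void ref_retain(RefHeader *h) {
    h->refs++;
}

/* Drop one reference; returns 1 if the object was destroyed. */
int ref_release(RefHeader *h) {
    if (--h->refs == 0) {
        if (h->destroy)
            h->destroy(h);
        return 1;
    }
    return 0;
}
```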
| swinglock wrote:
| mimalloc is cleaner but lacks the very useful profiling
| features. To be fair it also has not gone through decades of
| changes as described in the postmortem either.
| senderista wrote:
| You can write a simple size-class allocator (even lock-free) in
| just a couple dozen lines of code. (I've done it both for
| interviews and for a work presentation.) But an allocator that
| is fast, scalable, and performs well over diverse workloads--
| that is HARD.
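In that spirit, here is a hypothetical sketch of such a size-class allocator: a static arena, a bump pointer, and one LIFO free list per size class. It is not thread-safe and never returns memory to the OS; the hard parts the comment alludes to are exactly what is missing here.

```c
#include <stddef.h>
#include <stdint.h>

#define ARENA_SIZE (1 << 20)
#define NCLASSES 4

static uint8_t arena[ARENA_SIZE];            /* backing memory */
static size_t bump;                          /* high-water mark */
static const size_t class_size[NCLASSES] = { 16, 32, 64, 128 };
static void *free_list[NCLASSES];            /* one LIFO list per class */

static int size_to_class(size_t n) {
    for (int c = 0; c < NCLASSES; c++)
        if (n <= class_size[c])
            return c;
    return -1;                               /* too large for this toy */
}

void *toy_malloc(size_t n) {
    int c = size_to_class(n);
    if (c < 0)
        return NULL;
    if (free_list[c]) {                      /* reuse a freed block */
        void *p = free_list[c];
        free_list[c] = *(void **)p;          /* pop: first word is "next" */
        return p;
    }
    if (bump + class_size[c] > ARENA_SIZE)
        return NULL;                         /* arena exhausted */
    void *p = &arena[bump];                  /* carve a fresh block */
    bump += class_size[c];
    return p;
}

void toy_free(void *p, size_t n) {           /* sized free, like C++ sized delete */
    int c = size_to_class(n);
    *(void **)p = free_list[c];              /* push onto the class list */
    free_list[c] = p;
}
```

Rounding every request up to a class size wastes some space (internal fragmentation) but makes the free lists trivial; production allocators like jemalloc layer arenas, thread caches, and page-level management on top of this basic idea.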
| burnt-resistor wrote:
| Lesson: Don't let one megacorp dominate or take over your FOSS
| project. Push back somewhat and say "no" to too much help from
| one source.
| igrunert wrote:
| I think the author was happy to be employed by a megacorp,
| along with a team to push jemalloc forward.
|
| He and the other previous contributors are free to find new
| employers to continue such an arrangement, if any are willing
| to make that investment. Alternatively they could cobble
| together funding from a variety of smaller vendors. I think the
| author is happy to move on to other projects, after spending a
| long time in this problem space.
|
| I don't think that "don't let one megacorp hire a team of
| contributors for your FOSS project" is the lesson here. I'd say
| it's a lesson in working upstream - the contributions made
| during their Facebook / Meta investment are available for the
| community to build upon. They could've just as easily been made
| in a closed source fork inside Facebook, without violating the
| terms of the license.
|
| Also Mozilla were unable to switch from their fork to the
| upstream version, and didn't easily benefit from the Facebook /
| Meta investment as a result.
| ecshafer wrote:
| He worked for like a decade at Facebook, it looks like. I would
| guess at least at a Staff level. How many millions of dollars
| do you think he got from that? It doesn't sound like the worst
| trade in the world.
| didip wrote:
| Thanks for everything, JE!
|
| jemalloc is always the first thing I installed whenever I had to
| provision bare servers.
|
| If jemalloc were somehow the default allocator on Linux, I
| think it would not have a hard time retaining contributors.
| soulbadguy wrote:
| Maybe add a link to the post on the github repo. I feel like this
| is important context for people visiting the repo in the future.
___________________________________________________________________
(page generated 2025-06-13 23:00 UTC)