https://erikmcclure.com/blog/windows-malloc-implementation-is-a-trash-fire/ * Blog * Projects * Bandcamp * Github * Twitter * Websites Erik McClure The Windows malloc() Implementation Is A Trash Fire --------------------------------------------------------------------- I am currently porting an experimental language to Windows. This experimental language was built in C++ with LLVM, and relies heavily on GCC extensions like VLAs and Compound Statement Expressions, which basically made it impossible to build with MSVC (although I have a truly horrifying idea I may attempt later). Luckily, you can now build things on Windows with Clang, which solves a lot of problems. However, clang-cl simply compiles the code - it still uses the Microsoft C++ headers and links to the Microsoft C++ runtime. This is a good thing, because it ensures maximum compatibility with win32 APIs and other Windows executables. Unfortunately, it also means you get the Windows malloc() implementation from MSVCRT (specifically, it's statically linking with the CRT shipped with Visual Studio), which is quite possibly one of the worst piles of rotten garbage ever compiled in the history of C. I learned how to program, like many people did, by doing indie game dev. Like many people, I never released a single game, but I did write a bunch of code which now lies forgotten in lost GitHub repositories. I was taught that to allocate memory was to summon death itself to ruin your performance. A single call to malloc() during any frame is likely to render your game unplayable. Any sort of allocations that needed to happen with any regularity required writing a custom, purpose-built allocator, usually either a fixed-size block allocator using a freelist, or a greedy allocator freed after the level ended. Even more optimization could be made by using thread-local storage, to maintain thread-specific allocators without ever needing to waste time on concurrency. It turns out you don't actually need any of this on Linux. You can basically malloc() whatever you want (within reason) and it'll be surprisingly fast. LLVM was built for Linux - or rather, it was built to work for Mac OSX, which is POSIX compliant and looks like a Linux system if you squint. Most optimizations are designed to make it go fast on Mac or Linux. Since it's a compiler, it makes a lot of tiny allocations, because it basically represents control flow as a gigantic digraph. I actually thought it used a custom allocator for allocating these tiny nodes, because that's what I would've done, but in reality, it just calls new everywhere and lets the Linux malloc() implementation deal with it. The reason I care is because this experimental language I'm working on needs to JIT its core library when booting up - it takes about 1.1 seconds to do this on Linux, and 31 seconds on Windows. At first, I thought this inefficiency came from this experimental language utilizing std::unordered_map everywhere, since this allocates a new chunk of memory for every single item to ensure iterator stability, and is well known for being incredibly inefficient compared to basically any other hash implementation. I substituted this with Google's Abseil flat_hash_map implementation, and achieved an impressive 2x speedup on Windows, dropping the startup time to 16 seconds. Pretty good, and in line with what I expected. Can you guess what the corresponding Linux speedup was? Nothing. Literally. Fucking. NOTHING. We tested this both on my WSL instance on Windows and a native NixOS desktop, with identical results. Deciding to forge on ahead, I found another inefficient area of the compiler that was needlessly allocating an entire vector to support bignum types, even though none of the tests ever actually used this, so the vector ended up only holding a single integer. I changed it to skip the vector if there was only 1 element. This cut another 2 seconds from Windows. It appeared to either do nothing on Linux, or somehow made it slightly slower. Worse, based on my profiler analysis, I was running out of stuff outside of LLVM to optimize - the remaining 14 seconds of startup time was mostly LLVM. Okay, fuck it, this is clearly an allocator issue, and incidentally, some game developers had started using LLVM for… some reason, and of course, most games work on Windows, so they needed LLVM to be fast on Windows. They introduced a very janky way of replacing malloc () in LLVM, which I kinda had to hack into a custom vcpkg fork to make it work with my dependency handling, but it seemed to work! Except it made LLVM crash. It turns out that if you replace malloc() with either rpmalloc or mimalloc, the windows kernel would sometimes trigger a debug break instruction while inside std::recursive_mutex, but only if you are JITing code. This happens even if you disable threading in LLVM. At first I thought this was some kind of ABI mismatch (since this has happened before), but no amount of tweaking things to match up fixed the issue. I spoke with a personal friend of mine who happens to work at Apple on LLVM, and they suspect this is a kernel sanity check intended to catch potential deadlocks, maybe because the critical section isn't re-entrant and something screwed up. However, this was inside of a std::recursive_mutex implementation, so… it should be re-entrant, by definition. There may simply be a race-condition or implementation error somewhere inside LLVM, but I'm really not in the mood to debug extremely arcane multithreading issues in LLVM. So, instead, I did the worst hack of my entire life by replacing the efficient std::recursive_mutex implementation that used critical sections with an extremely inefficient re-entrant spinlock. This actually worked, and instantly made the startup time on windows approximately 1.4 seconds, now within the realm of the Linux start times. The fucking Windows allocator was the source of all of my performance problems the entire time. aganea, who has my undying gratitude for taking over the task of unfucking the catastrophic trainwreck that was LLD, happens to also be the one who introduced the method of patching LLVM's allocator. Sadly, it seems whatever they were using LLVM for did not involve JITing code, or may have used a different method, as nobody seems to have run into this problem except me. The best part about this whole fiasco is that mimalloc was developed by Microsoft for Microsoft services because of how slow Microsoft's own MSVCRT malloc() was. So this entire time I've been trying to replace Microsoft's malloc() with Microsoft's other malloc() because Microsoft's malloc() was too slow for Microsoft. For some insane reason, mimalloc is not shipped in Visual Studio, not even as something that you have to opt-in to (in case there's some backwards compatibility concern). Microsoft's instructions provide operator new overloads instead of actually integrating this with their own compiler despite being an order of magnitude faster in rapid small allocation scenarios! So, at this point, we have learned three things: the Windows allocator is a complete trash fire, Microsoft invented an alternative to their own allocator but refuse to provide it as an alternative, and there's clearly some kind of race condition in LLVM's JIT in certain edge cases related to the usage of mutexes on Windows. If someone wants to try to reproduce this and file a proper bug on LLVM, go ahead, but it would take days to distill a minimal example for this and I honestly have better things to do right now. The inefficient spinlock doesn't matter when threading is disabled because it has no contention, and having threading enabled does not appear to actually make my use-case go any faster anyway, for whatever reason. Everything is broken and I am too tired to do anything but contribute another obscene hack on top of this pile of cards we have built modern civilization on. --------------------------------------------------------------------- Published on July 2, 2022 at 2:36am share: * * * * Comments --------------------------------------------------------------------- Avatar Archive 1. 2022 + The Windows malloc() Implementation Is A Trash Fire + Camp Vista - Growing Up Next To Microsoft 2. 2021 + Blockchain Is The New JavaScript + C++ Constructors, Memory, and Lifetimes + Factorio Is The Best Technical Interview We Have 3. 2020 + Why You Can't Use Prebuilt LLVM 10.0 with C++17 + Someone Is Stealing Tracker Songs And Selling Them + Pressure Based Anti-Spam for Discord Bots 4. 2019 + Name Shadowing Should Be An Operator + A Rant On Terra + RISC Is Fundamentally Unscalable 5. 2018 + Software Engineering Is Bad, But That's Not Why + Why Do People Use The Wrong Email? + Software Optimizes to Single Points of Failure + Migrating To A Static Blog + How To Avoid Memorizing Times Tables 6. 2017 + Ignoring Outliers Creates Racist Algorithms + I Used To Want To Work For Google + Sexist Programmers Are Awful Engineers + Why I Never Built My SoundCloud Killer + Integrating LuaJIT and Autogenerating C Bindings In Visual Studio + Discord: Rise Of The Bot Wars + Programmers Should Take Linguistics + Companies Can't Be Apolitical + I Can't Hear Anything Below 80 Hz* + Windows Won't Let My Program Crash + DirectX Is Terrifying + Our Software Is a Beacon of Hope 7. 2016 + Everyone Does sRGB Wrong Because Everyone Else Does sRGB Wrong + Mathematical Notation Is Awful + The GPL Is Usually Overkill + The Right To Ignore: The Difference Between Free Speech And Harassment 8. 2015 + There Will Never Be One True Programming Language + Abortion Has No Moral High Ground + I Tried To Install Linux And Now I Regret Everything + We Aren't Designing Software For Robots + Using Data To Balance Your Game: Pony Clicker Analysis + Is There A Commercial Open Source License? + Does Anyone Actually Want Good Software? + Why Don't You Just Fire Them? 9. 2014 + How Not To Install Software + Never Reinventing The Wheel Is Anticompetitive + Everyone Can Be Above Average + Can We Choose What We Enjoy? + How To Make Your Profiler 10x Faster + The Problem With Photorealism + Success Is Relative 10. 2013 + Google's Decline Really Bugs Me + The Educational Imbroglio + The Ladder-Climbing Generation + The Microsoft Word Problem + Write Less Code + Most People Have Shitty Computers + Leap Motion Impressions, Input Sanitation, and 3D Gesture Ideas + Aurora Theory Released! + What I Learned In College + How To Complain About Men And Be Sexist At The Same Time + Course Notes + Contact + The Dark Side of Web Development + Windows Breaks assert() Inside WM_CANCELMODE + Dreams Are Worth Fighting For + The Productivity Fallacy + The Earbud Loudness Wars 11. 2012 + Giant List of FREE SAMPLES + The Weekend I Got Lost All The Time + C# to C++ Tutorial - Part 4: Operator Overload + 7 Problems Raytracing Doesn't Solve + Teenage Rebellion as a Failure of Society + Analyzing XKCD: Click and Drag + What Is A Right Answer? + Coordinate Systems And Cascading Stupidity + How Joysticks Ruined My Graphics Engine + Properly Dreaming About Success + Multithreading Problems In Game Design + IP Law Makes You an Asshole + Stop Following The Rules + The Standards Problem + An evidence-based refutation of the Project Glass parodies + Language Wars Are Pointless + Why Windows 8 Does The Right Thing The Wrong Way + Well That Was Interesting + Visual Studio Broke My Computer + Hiring the Wrong People + Chill Out + Implicit UI Design + Linux Mint 12 KDE + 'Programmer' is an Overgeneralization + Wikipedia's Identity Crisis 12. 2011 + Your Esoteric Language is Useless + The Great Mystery of Linear Gradient Lighting + Signed Integers Considered Stupid (Like This Title) + Why Kids Hate Math + Don't Work on Someone Else's Dream + C# to C++ Tutorial - Part 3: Classes and Structs and Inheritance (OH MY!) + The Problem of Vsync + Musical Genres + C# to C++ Tutorial - Part 2: Pointers Everywhere! + Radians: An explanation + C# to C++ Tutorial - Part 1: Basics of Syntax + On Hacking (or Why We Need Security Ratings) + My Mom Had a Heart Attack + The Religion Problem: Perspectives + The Ninth Circle of Bugs + Save RSS + College Is Broken + An Outside Perspective on the PSN Fiasco + Future Predictions + Investigating Low-level CPU Performance + The Bullying Never Stops 13. 2010 + Aquaria Hints + The IM Failure + How To Train Your Dragon + Album For Sale! [Renascent] + WavSaver + Pixel Perfect Hit Testing + Fractal Cycle Screensaver + 8-bit color cycling + Physics Networking + Assembly CAS implementation + Function Pointer Speed + Most Bizarre Error Ever + Floating Point Preformance + Watch This Now 14. 2009 + Physics-oriented Network Interpolation + Proving Strong AI + I now have an LJ Copyright (c)2022 Erik McClure Sitemap | RSS Feed