[HN Gopher] Honey, I shrunk {fmt}: bringing binary size to 14k a...
___________________________________________________________________
Honey, I shrunk {fmt}: bringing binary size to 14k and ditching the
C++ runtime
Author : karagenit
Score : 212 points
Date : 2024-09-01 08:30 UTC (14 hours ago)
(HTM) web link (vitaut.net)
(TXT) w3m dump (vitaut.net)
| magnio wrote:
| > All the formatting in {fmt} is locale-independent by default
| (which breaks with the C++'s tradition of having wrong defaults)
|
| _Chuckles_
| tialaramex wrote:
| It's really more of a committee thing - so we wouldn't
| necessarily expect fmt, a third party library, to have wrong
| defaults.
|
| Astoundingly, when this was standardised (as std::format for
| C++ 20) the committee didn't add back this mistake (which is
| present in numerous other parts of the standard). Which does
| give small hope for the proposers who plead with the committee
| to not make things unnecessarily worse in order to make C++
| "consistent".
| formerly_proven wrote:
| I'm filing a Defect Report about std::format disrespecting
| locale as we speak.
| arunc wrote:
| How/where do you do that?
| vitaut wrote:
| You send an email to the Library Working Group chair.
| johannes1234321 wrote:
| See https://isocpp.org/std/submit-issue
| tialaramex wrote:
| Of course, just because a defect is reported doesn't mean
| it'll get fixed, or that the fix will be of any use.
|
| The most famous (technically a C defect) is probably
| DR#260:
| https://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_260.htm
| ape4 wrote:
| You can pass in a locale as a parameter. (Of course this
| doesn't fix the default)
| h4ck_th3_pl4n3t wrote:
| It's kind of mindblowing to see how much code floating point
| formatting needs.
|
| The linked dragonbox [1] project is also worth a read. Pretty
| optimized for the least used branches.
|
| [1] https://github.com/jk-jeon/dragonbox
| mananaysiempre wrote:
| > It's kind of mindblowing to see how much code floating point
| formatting needs.
|
| If you want it to be fast. The baseline implementation isn't
| terrible[1,2] even if it is still ultimately an implementation
| of arbitrary-precision arithmetic.
|
| [1] https://research.swtch.com/ftoa
|
| [2] https://go.dev/src/strconv/ftoa.go
| vitaut wrote:
| {fmt} has an optional implementation of the old Dragon4
| algorithm that is smaller in terms of code size but not as
| fast.
| ziml77 wrote:
| I learned how much floating point formatting needs when I was
| doing work with Zig recently.
|
| Usually the Zig compiler can generate binaries smaller than
| MSVC because it doesn't link in a bunch of useless junk from
| the C Runtime (on Windows, Zig has no dependency on the C
| runtime). But this time the binary seemed to be much larger
| than I've seen Zig generate before and it didn't make sense
| based on how little the tool was actually doing. Dropping it
| into Binary Ninja revealed that the majority of the code was
| there to support floating point formatting. So I changed the
| code to cast the floating point number to an integer before
| printing it out. That change resulted in a binary that was down
| at the size I had been expecting.
| jk-jeon wrote:
| https://github.com/jk-jeon/dragonbox/discussions/57#discussi...
|
| We have been doing some experiments on optimizing for size, and
| currently it can be reduced to ~3k on 8-bit AVR. It only
| contains impl/table for single-precision binary32, and double-
| precision requires quite a bit more, but at the same time much of the
| bloat is due to how limited AVR is. On platforms like x64 it
| should be much smaller.
|
| You can certainly say 3k is still huge though.
| londons_explore wrote:
| I kinda hoped a formatting library designed to be small and able
| to print strings, and ints ought to be ~50 bytes...
|
| strings are ~4 instructions (test for null terminator, output
| character, branch back two).
|
| Ints are ~20 instructions. Check if negative and if so output '-'
| and invert. Put 1000000000 into R1. divide input by R1, saving
| remainder. add ASCII '0' to result. Output character. Divide R1
| by 10. put remainder into input. Loop unless R1=0.
|
| Floats aren't used by many programs so shouldn't be compiled
| unless needed. Same with hex and pointers and leading zeros etc.
|
| I can assure you that when writing code for microcontrollers with
| 2 kilobytes of code space, we don't include a 14 kilobyte string
| formatting library...
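|
| A minimal sketch of the integer routine described above, written in
| C++ rather than assembly. The name format_int and the 12-byte buffer
| contract are assumptions for illustration, and it assumes a 32-bit
| int. Unlike the description, it skips leading zeros.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: writes the decimal form of `value` into `out`
// (at least 12 bytes), NUL-terminates it, and returns the digit count
// including any leading '-'. Assumes a 32-bit int.
inline std::size_t format_int(int value, char* out) {
    std::size_t n = 0;
    // Work in unsigned so the most negative int negates without overflow.
    unsigned int u = static_cast<unsigned int>(value);
    if (value < 0) {
        out[n++] = '-';
        u = 0u - u;
    }
    bool started = false;
    // Start at 10^9 and divide the divisor by 10 each round, as described.
    for (unsigned int divisor = 1000000000u; divisor != 0; divisor /= 10) {
        unsigned int digit = u / divisor;
        u %= divisor;
        // Skip leading zeros, but always emit the last digit.
        if (digit != 0 || started || divisor == 1) {
            out[n++] = static_cast<char>('0' + digit);
            started = true;
        }
    }
    out[n] = '\0';
    return n;
}
```

| This is nowhere near 50 bytes once compiled, but it shows how little
| logic the no-modifiers case needs compared to a full formatter.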
| vient wrote:
| It is a featureful formatting library, not simply a library for
| slow printing of ints and strings without any modifiers. You
| can't create a library which is full of features, fast, and
| small simultaneously.
| jstimpfle wrote:
| You'd hope the unused stuff gets stripped out but I don't
| know much about this topic so not going to argue.
| vlovich123 wrote:
| -ffunction-sections and -fdata-sections would need to be used
| at a minimum to strip dead code. But even with LTO
| it's highly unlikely this could be trimmed unless all
| format strings are parsed at compile time because the
| compiler wouldn't know that the code wouldn't be asked to
| format a floating point number at some point. There could
| be other subtle things that hide it from the compiler as
| dead code.
|
| The surest bet would be a compile time feature flag to
| disable floating point formatting support which it does
| have.
|
| Still, that's 8kib of string formatting library code
| without floating point and a bunch of other optimizations
| which is really heavy in a microcontroller context
| CoastalCoder wrote:
| I think this is one scenario where C++ type-templated
| string formatters _could_ shine.
|
| Especially if you extended them to indicate assumptions
| about the values at compile time. E.g., possible ranges
| for integers, whether or not a floating point value can
| have certain special values, etc.
| vlovich123 wrote:
| You'd be surprised. I'm pretty sure std::format is
| templated. That still doesn't mean it's easy to
| convince the compiler to delete that code.
| josephg wrote:
| > it's highly unlikely this could be trimmed unless all
| format strings are parsed at compile time
|
| They probably should be passed at compile time, like how
| zig does it. It seems so weird to me that in C & C++
| something as simple as format strings are handled
| dynamically.
|
| Clang even parses format strings anyway, to look for
| mismatched arguments. It just - I suppose - doesn't do
| anything with that.
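|
| A rough sketch of what "parsing the format string at compile time"
| can look like in C++. count_placeholders is a made-up helper, not
| part of {fmt} or std::format, and it only understands a bare "{}"
| (no escapes, no format specs).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: counts "{}" placeholders in a string.
// Because it is constexpr, it can run entirely inside the compiler.
constexpr std::size_t count_placeholders(const char* s) {
    std::size_t count = 0;
    for (std::size_t i = 0; s[i] != '\0'; ++i) {
        // s[i] is not the terminator here, so reading s[i + 1] is in bounds.
        if (s[i] == '{' && s[i + 1] == '}') {
            ++count;
            ++i;  // skip the closing brace
        }
    }
    return count;
}

// A mismatch between placeholders and arguments becomes a compile error.
static_assert(count_placeholders("{} + {} = {}") == 3, "expected three placeholders");
static_assert(count_placeholders("no placeholders") == 0, "expected none");
```

| Once the string is known to the compiler like this, it can in
| principle also tell which argument types are never formatted.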
| vitaut wrote:
| It is indeed possible to remove unused code with techniques
| like format string compilation but that's a topic for
| another post.
| jstimpfle wrote:
| Curious what space you work in? What kind of devices,
| what are they used for?
| londons_explore wrote:
| Not me but a friend. Things like making electronics for
| singing birthday cards and toys that make noise.
|
| But there are plenty of other similar things - like making
| the code that determines the flashing pattern of a bicycle
| light or flashlight. Or the code that does the countdown
| timer on a microwave. Or the code that makes the 'ding' sound
| on a non-smart doorbell. Or the code that makes a hotel safe
| open when the right combination is entered. Or the code that
| measures the battery voltage on a USB battery bank and puts
| 1-4 indicator LEDs on so you know how full it is.
|
| You don't tend to hear about it because the design of most of
| this stuff doesn't happen in the USA anymore - the software
| devs are now in China for all except high-end stuff.
| furyofantares wrote:
| Do any of those need a string formatting library?
| toast0 wrote:
| Hotel safe might, if it logs somewhere (serial port?).
|
| The others may have a serial port setup during
| development, too. If you have a truly small formatter,
| you can just disable it for final builds (or leave it on,
| assuming output is non blocking, if someone finds the
| serial pins, great for them), rather than having larger
| rom for development and smaller for production.
| londons_explore wrote:
| mostly used for debugging with "printf debugging" -
| either on the developer's desk, or in the field ("we've
| got a dead one. Can you hook up this pin to a USB-serial
| converter and tell me what it's saying?")
| IshKebab wrote:
| It isn't designed to be small; it's designed to be a fully
| featured string formatting library with size as an important
| secondary goal.
|
| If you want something that _has_ to be microscopic at the cost
| of not supporting basic features there are definitely better
| options.
|
| > I can assure you that when writing code for microcontrollers
| with 2 kilobytes of code space, we don't include a 14 kilobyte
| string formatting library...
|
| No shit. If you only have 2kB (unlikely these days) don't use
| this. Fortunately the vast majority of modern microcontrollers
| have way more than that. E.g. esp32 _starts_ at 1MB. Perfectly
| reasonable to use a 14kB formatting library there.
| londons_explore wrote:
| When you're designing something that sells for a dollar to
| retailers, eg. a birthday card that sings, your boss won't
| let you spend more than about 5 cents on the microcontroller,
| and probably wants you to spend 1-2 cents if you can.
| edflsafoiewq wrote:
| Perhaps a singing birthday card doesn't need to format
| strings.
| nikbackm wrote:
| How else would you get nice looking logs for debugging
| it?
| a1o wrote:
| Using log4c
| swagonomixxx wrote:
| I kind of get where you're coming from but at what point do
| we admit that such use cases are the fringe and not the
| main?
| IshKebab wrote:
| Sure, but such extreme use cases are rare and don't need to
| be constantly brought up.
| cozzyd wrote:
| Even on larger microcontrollers you often have to write a
| bootloader...
| IshKebab wrote:
| Very occasionally I guess. They're almost always bare
| metal.
| cozzyd wrote:
| You still want a bootloader to support firmware updates,
| typically in the first 8 kB of flash or something like
| that.
| IshKebab wrote:
| Good point. I guess don't use `fmt` for that...
| tialaramex wrote:
| > When you're designing something that sells for a dollar
| to retailers
|
| Then you shouldn't prioritize compatibility with 1980s Unix
| code, which is what C++ is for.
| Narishma wrote:
| > esp32 starts at 1MB
|
| Which models? The most I've ever seen on an ESP32 is 512KB of
| SRAM.
| ta988 wrote:
| I think they are talking about the flash. The code is by
| default run from flash (a mechanism called XIP, execute in
| place). But you can annotate functions (with a macro called
| IRAM_ATTR) that you want to have in ram if you need
| performance (you have to also be careful about the data
| types you use inside as they are not guaranteed to be put
| in RAM).
| maccard wrote:
| What do you use instead?
|
| Iostream is... far bigger than this, for example.
| Sharlin wrote:
| I presume the sort of custom routines that GP described?
| londons_explore wrote:
| most platforms come with their own libraries for this, which
| are usually a mix of hand coded assembly and C. You #include
| the whole library/sdk, but the linker strips out all bits you
| don't use.
|
| Even then, if you read the disassembled code, you can usually
| find within a few minutes looking some
| stupid/unused/inefficient code - so you could totally do a
| better job if you wrote the assembly by hand, but it would
| take much more time (especially since most of these
| architectures tend to have very irregular instruction sets)
| maccard wrote:
| If you're just going to use the platform built in, then the
| size of a third party library doesn't matter to you.
| criddell wrote:
| If you only have 2 kB of code space, you would likely be
| doing custom routines in assembly that do exactly what you
| need and nothing more.
| maccard wrote:
| Right - so no matter how small libfmt gets Op isn't going
| to use it
| secondcoming wrote:
| Would someone writing code for a 2 kB microcontroller even be
| using full-fledged C++, or just C With Classes?
| londons_explore wrote:
| It's still full fledged C++, you just don't use many of the
| features, and the compiler leaves out all of the associated
| code.
|
| Pretty easy to accidentally use some iostream and
| accidentally pull in loads of code you didn't want though.
| astrobe_ wrote:
| To me the only reason to use C++ over C in that case is
| the slightly stronger type-checking and maybe some extra
| syntactic sugar.
| formerly_proven wrote:
| avrlibc's small variant of printf (which still has a ton of
| features) is like 600 bytes.
| jeroenhd wrote:
| I don't think the requirements for your specific programming
| niche should influence the language like that. Your
| requirements are valid, but they should be served by a bottom
| of the barrel microcontroller compiler rather than the language
| spec.
| MobiusHorizons wrote:
| It's relevant because the author mentions microcontrollers as
| the reason for focusing on binary size.
| Karliss wrote:
| There are many orders of magnitude difference between
| smallest and higher end microcontrollers. You can have an
| 8bit micro with <1k of ram and <8k of flash memory, and you
| can have something with >8MB of flash memory or more
| running a RTOS possibly even capable of running Linux with
| moderate effort. In the latter case 14k of formatting
| library is probably fine.
| jeroenhd wrote:
| All of the optimisation work in this article is done for
| Linux aarch64 ELF binaries.
|
| Besides that, microcontrollers have megabytes of storage
| these days. To be restricted by a dozen or two kilobytes of
| code storage, you need to be _very_ storage restricted.
|
| I have run into code size issues myself when trying to run
| Rust on an ESP32 with 2MiB of storage (thought I bought
| 16MB, but MB stood for megabits, oops). Through tweaking
| the default Rust options, I managed to save half a megabyte
| or more to make the code work again. The article also links
| to a project where the fmt library is still much bigger
| (over 300KiB rather than the current 57KiB).
|
| There are microcontrollers where you need to watch out for
| your dependencies and compiler options, and then there are
| _tiny_ microcontrollers where every bit matters. For those
| specific constraints, it doesn't make a lot of sense to
| assume you can touch every language feature and load every
| standard library to just work. Much older language features
| (such as template classes) will also add hundreds of
| kilobytes of code to your program already, you have to work
| around that stuff if you're in an extremely constrained
| environment.
|
| The important thing with language features that includes
| targets like these is that you can disable the entire
| feature and enable your own. Sharing design goals between
| x64 supercomputers and RISC-V chips with literal dozens of
| bytes of RAM makes for an unreasonably restricted language
| for anything but the minimal spec. Floats are just
| expensive on minimum cost chips.
| sixfiveotwo wrote:
| > I can assure you that when writing code for microcontrollers
| with 2 kilobytes of code space, we don't include a 14 kilobyte
| string formatting library...
|
| I'm pretty sure you wouldn't use C++ in that situation anyway,
| so I don't really see your point.
| usrnm wrote:
| If you get rid of the runtime, which most compilers allow you
| to do, C++ is just as suitable for this task as C. Not as
| good as hand-rolled assembly, but usable
| sixfiveotwo wrote:
| Okay, vtables in 2kb code space?
| CyberDildonics wrote:
| Where are the vtables coming from if you use no
| inheritance and no memory allocation?
| ta988 wrote:
| You can use c++ yes and a lot of people do. You just keep the
| exceptions, stdlib and runtime in general at the door.
| aseipp wrote:
| The design of any library for a microcontroller and an
| "equivalent" for general end user application is going to be
| different in pretty much every major design point. I'm not sure
| how this is any more relevant to fmt than it is just general
| complaining out in the open.
|
| The code for an algorithm like Dragonbox or Dragon4 alone is
| already blowing your size budget, so the "optional" stuff
| doesn't really matter. And that's 1 of like 20 features people
| want.
| ska wrote:
| Sure, but there are also microcontrollers with a lot more
| space. This probably won't ever usefully target the small ones,
| but that doesn't mean it isn't useful in the space at all.
| fsckboy wrote:
| > _I can assure you that when writing code for microcontrollers
| with 2 kilobytes of code space, we don't include a 14 kilobyte
| string formatting library..._
|
| then the thing to do is publish the libraries you do use,
| right, then document what formatting features they support?
| then other people might discover more and clever ways to pack
| more features in than you thought of
|
| otherwise, I don't get your point.
| dinkumthinkum wrote:
| What? I mean, you realize that fmtlib is much more complicated
| than that, right? What you are describing is something very
| basic, primitive by comparison. I'm also puzzled why you think
| floats are not used by many programs, that's kind of mind-
| boggling. I get that you wouldn't load it on a microcontroller
| but you wouldn't do that with the standard library either.
| pzmarzly wrote:
| > However, since it may be used elsewhere, a better solution is
| to replace the default allocator with one that uses malloc and
| free instead of new and delete.
|
| C++ noob here, but is libc++'s default allocator (I mean, the
| default implementation of new and delete) actually doing
| something different than calling libc's malloc and free under the
| hood? If so, why?
| murderfs wrote:
| No, modulo the aligned allocation overloads, but applications
| are allowed to override the default standard library operator
| new with their own, even on platforms that don't have an
| equivalent to ELF symbol interposition.
| masklinn wrote:
| That doesn't really explain where the dependency on the C++
| runtime comes from tho, as far as I know the dependency chain
| is std::allocator -> operator new -> malloc, but from the
| post the replacement only strips out the `operator new`.
|
| Notably I thought the issue would be the throwing of
| `std::bad_alloc`, but the new version still implements
| std::allocator, and throws bad_alloc.
|
| And so I assume the issue is that the global `operator new`
| is concrete (it just takes the size of the allocation), thus
| you need to link to the C++ runtime just to get that
| function? In which case you might be able to get the same
| gains by redefining the global `operator new` and `operator
| delete`, without touching the allocator.
|
| Alternatively, you might be able to statically link the C++
| runtime and have DCE take care of the rest.
| kllrnohj wrote:
| Yes they could have just defined their own global operator
| new/delete to have a micro-runtime. Same as you'd do if you
| were doing a kernel in C++. Super easy, barely an
| inconvenience
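|
| A minimal sketch of that approach: replacing the global operator
| new/delete with thin wrappers over malloc/free. Aborting on failure
| (instead of throwing std::bad_alloc) is a choice made here to avoid
| needing exception support, and g_new_calls is only a demo counter.
| A real build would likely also want the array and aligned overloads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Demo-only counter showing that the replacement is what gets called.
static std::size_t g_new_calls = 0;

// Replacement for the global allocation function.
void* operator new(std::size_t size) {
    ++g_new_calls;
    if (size == 0) size = 1;  // operator new must return a unique pointer
    void* p = std::malloc(size);
    if (p == nullptr) std::abort();  // abort instead of throwing std::bad_alloc
    return p;
}

// Matching deallocation functions (plain and sized).
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }
```

| With these defined in the program, std::allocator ends up calling
| them rather than the versions shipped in the C++ runtime library.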
| gobblegobble2 wrote:
| > Notably I thought the issue would be the throwing of
| `std::bad_alloc`, but the new version still implements
| std::allocator, and throws bad_alloc.
|
| The new version uses `FMT_THROW` macro instead of a bare
| throw. The article says "One obvious problem is exceptions
| and those can be disabled via FMT_THROW, e.g. by defining
| it to abort". If you check the `g++` invocation, that's
| exactly what the author does.
| pjmlp wrote:
| ISO C++ doesn't require new and delete default implementations
| to call down into malloc()/free().
|
| Many implementations do it, only because it is already there
| and thus it is easy just to reach for them.
| 1000100_1000101 wrote:
| Not the strongest on C++ myself, but the new[] will attempt to
| run constructors on each element after calling the new operator
| to get the RAM. The delete[] will attempt to run destructors
| for each element before calling operator delete[] to free the
| RAM.
|
| In order for delete[] to work, C++ must track the allocation
| size somewhere. This could be co-located with the allocation
| (at ptr - sizeof(size_t) for example), or it could be in some
| other structure. Using another structure lowers the odds of it
| getting trampled if/when something writes to memory beyond an
| object, but comes with a lookup cost, and code to handle this
| new structure.
|
| I'm sure proper C++ libraries are doing even more, but you
| already get the idea, new and delete are not the same as malloc
| and free.
| OskarS wrote:
| > In order for delete[] to work, C++ must track the
| allocation size somewhere.
|
| That is super-interesting, I had never considered this, but
| you're absolutely right. I am now incredibly curious how the
| standard library implementations do this. I've heard normal
| malloc() sometimes colocates data in similar ways, I wonder
| if C++ then "doubles up" on that metadata. Or maybe the
| standard library has its own entirely custom allocator that
| doesn't use malloc() at all? I can't imagine that's true,
| because you'd want to be able to swap system allocators with
| e.g. LD_PRELOAD (especially for Valgrind and stuff). They
| could also just be tracking it "to the side" in some hash
| table or something, but that seems bad for performance.
| tom_ wrote:
| new[] and delete[] both know the type of the object.
| Therefore both know whether a destructor needs to be
| called.
|
| When a destructor doesn't - e.g., new int[] - operator
| new[] is called upon to allocate N*sizeof(T) bytes. The
| code stores off no metadata. The result of operator new[]
| is the array address.
|
| When a destructor does - e.g., new std::string[] - operator
| new[] is called upon to allocate sizeof(size_t)+N*sizeof(T)
| bytes. The code stores off the item count in the size_t,
| adds sizeof(size_t) to the value returned by operator
| new[], uses that as the address for the array, and calls
| T() on each item. And delete[] performs the opposite:
| fishes out the size_t, calls ~T() on each item, subtracts
| sizeof(size_t) from the array address, and passes that to
| operator delete[] to free the buffer.
|
| (There are also some additional things to cater for: null
| checks, alignment, and so on. Just details.)
|
| Note that operator new[] is not given any information about
| whether a destructor needs to run, or whether there is any
| metadata being stored off. It just gets called with a byte
| count. Exercise caution when using placement operator
| new[], because a preallocated buffer of N*sizeof(T) may not
| be large enough.
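|
| The bookkeeping described above can be imitated by hand. This is
| only a sketch of the idea, not what any particular compiler emits:
| array_new, array_delete, kCookie and Tracked are invented names, the
| header is padded to max alignment for simplicity, and T is assumed
| to need no more than that alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Header placed in front of the array; padded so elements stay aligned.
constexpr std::size_t kCookie = alignof(std::max_align_t);
static_assert(kCookie >= sizeof(std::size_t), "header must fit the count");

template <typename T>
T* array_new(std::size_t n) {
    void* raw = std::malloc(kCookie + n * sizeof(T));
    if (raw == nullptr) std::abort();
    *static_cast<std::size_t*>(raw) = n;  // store the element count
    T* items = reinterpret_cast<T*>(static_cast<char*>(raw) + kCookie);
    for (std::size_t i = 0; i < n; ++i) new (items + i) T();  // run constructors
    return items;
}

template <typename T>
void array_delete(T* items) {
    if (items == nullptr) return;
    char* raw = reinterpret_cast<char*>(items) - kCookie;
    std::size_t n = *reinterpret_cast<std::size_t*>(raw);  // fish out the count
    for (std::size_t i = n; i > 0; --i) items[i - 1].~T();  // destroy in reverse
    std::free(raw);  // free the original, un-offset pointer
}

// Small type that counts live instances, to observe ctor/dtor calls.
struct Tracked {
    static int live;
    Tracked() { ++live; }
    ~Tracked() { --live; }
};
int Tracked::live = 0;
```

| array_new<Tracked>(5) followed by array_delete runs five
| constructors and then five destructors without the caller ever
| passing the count back in.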
| jeffbee wrote:
| jemalloc and tcmalloc use size classes, so if you allocate
| 23 bytes the allocator reserves 32 bytes of space on your
| behalf. Both of them can find the size class of a pointer
| with simple manipulation of the pointer itself, not with
| some global hash table. E.g. in tcmalloc the pointer
| belongs to a "page" and every pointer on that page has the
| same size.
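|
| A toy illustration of the size-class idea. Real allocators use
| finer-grained tables; the power-of-two rounding and the 8-byte
| minimum below are simplifications chosen for this sketch, and
| size_class is a made-up name. It is only meant for small requests.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical size-class function: round a request up to the next
// power of two, with a minimum class of 8 bytes.
constexpr std::size_t size_class(std::size_t request) {
    std::size_t cls = 8;
    while (cls < request) cls *= 2;
    return cls;
}

// The example from the comment: a 23-byte request lands in the 32-byte class.
static_assert(size_class(23) == 32, "23 bytes should round up to 32");
```

| Since every object in a class has the same size, the allocator
| never needs to store a per-allocation byte count.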
| Someone wrote:
| That doesn't help for C++ if you allocated an array of
| objects with destructors. It has to know that you
| allocated 23 objects, so that it can call 23 destructors,
| not 32 ones, 9 of which on uninitialized memory.
| jeffbee wrote:
| I believe the question was more around how the program
| knows how much memory to deallocate. The compiler
| generates the destructor calls the same way the compiler
| generates everything else in the program.
| bangaladore wrote:
| realloc is the same, as the old memory needs to be copied to
| the new memory.
| progmetaldev wrote:
| Isn't it also possible for other logic to run in a
| destructor, such as freeing pointers to external resources?
| Doesn't this cause (at the very least) the possibility for
| more advanced logic to be run beyond freeing the object's own
| memory?
| janos95 wrote:
| The main point of replacing it with malloc is that new will
| throw std::bad_alloc so using it requires linking against the
| c++ runtime.
| neonsunset wrote:
| It's always fmt. Incredibly funny that _this exact problem_ now
| happens in .NET. If you touch enough numeric (esp. fp and
| decimal) formatting/parsing bits, linker ends up rooting a lot
| of floating point and BigInt related code, bloating binary size.
| pjmlp wrote:
| Still looking forward to the Delphi-like experience with
| Native AOT, thankfully getting better.
| msephton wrote:
| Very enjoyable. I love these sort of thinking outside the box
| optimisations.
| a1o wrote:
| > Considering that a C program with an empty main function is 6kB
| on this system, {fmt} now adds less than 10kB to the binary.
|
| Interesting, I've never done this test!
| JonChesterfield wrote:
| It varies widely with whether the C library is dynamically or
| statically linked and with how the application (and C library)
| were built. And on which C library it is. Also a little on
| whether you're using elf or some other container.
| ptspts wrote:
| Shameless plug: printf("Hello, World!\n"); is possible with an
| executable size of 1008 bytes, including libc with output
| buffering: https://github.com/pts/minilibc686
|
| Please note that a direct comparison would be apples-to-oranges
| though.
| jart wrote:
| That's because the compiler turns it into fputs
| rty32 wrote:
| Maybe I am slow, it took me a while to realize the "14k" in the
| title refers to "14kB"
| hrydgard wrote:
| What else would it possibly mean?
|
| k is very common shorthand for kB, at least historically.
| Rygian wrote:
| 14000 lines of assembler?
___________________________________________________________________
(page generated 2024-09-01 23:00 UTC)