[HN Gopher] Honey, I shrunk {fmt}: bringing binary size to 14k a...
       ___________________________________________________________________
        
       Honey, I shrunk {fmt}: bringing binary size to 14k and ditching the
       C++ runtime
        
       Author : karagenit
       Score  : 212 points
       Date   : 2024-09-01 08:30 UTC (14 hours ago)
        
 (HTM) web link (vitaut.net)
 (TXT) w3m dump (vitaut.net)
        
       | magnio wrote:
       | > All the formatting in {fmt} is locale-independent by default
       | (which breaks with the C++'s tradition of having wrong defaults)
       | 
       |  _Chuckles_
        
         | tialaramex wrote:
         | It's really more of a committee thing - so we wouldn't
         | necessarily expect fmt, a third party library, to have wrong
         | defaults.
         | 
         | Astoundingly, when this was standardised (as std::format for
         | C++ 20) the committee didn't add back this mistake (which is
         | present in numerous other parts of the standard). Which does
         | give small hope for the proposers who plead with the committee
         | to not make things unnecessarily worse in order to make C++
         | "consistent".
        
           | formerly_proven wrote:
           | I'm filing a Defect Report about std::format disrespecting
           | locale as we speak.
        
             | arunc wrote:
             | How/where do you do that?
        
               | vitaut wrote:
               | You send an email to the Library Working Group chair.
        
               | johannes1234321 wrote:
               | See https://isocpp.org/std/submit-issue
        
               | tialaramex wrote:
               | Of course, just because a defect is reported doesn't mean
               | it'll get fixed, or that the fix will be of any use.
               | 
               | The most famous (technically a C defect) is probably
               | DR#260:
               | https://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_260.htm
        
           | ape4 wrote:
           | You can pass in a locale as a parameter. (Of course this
           | doesn't fix the default)
        
       | h4ck_th3_pl4n3t wrote:
       | It's kind of mindblowing to see how much code floating point
       | formatting needs.
       | 
       | The linked dragonbox [1] project is also worth a read. Pretty
       | optimized for the least used branches.
       | 
       | [1] https://github.com/jk-jeon/dragonbox
        
         | mananaysiempre wrote:
         | > It's kind of mindblowing to see how much code floating point
         | formatting needs.
         | 
         | If you want it to be fast. The baseline implementation isn't
         | terrible[1,2] even if it is still ultimately an implementation
         | of arbitrary-precision arithmetic.
         | 
         | [1] https://research.swtch.com/ftoa
         | 
         | [2] https://go.dev/src/strconv/ftoa.go
        
         | vitaut wrote:
         | {fmt} has an optional implementation of the old Dragon4
         | algorithm that is smaller in terms of code size but not as
         | fast.
        
         | ziml77 wrote:
         | I learned how much floating point formatting needs when I was
         | doing work with Zig recently.
         | 
         | Usually the Zig compiler can generate binaries smaller than
         | MSVC because it doesn't link in a bunch of useless junk from
         | the C Runtime (on Windows, Zig has no dependency on the C
         | runtime). But this time the binary seemed to be much larger
         | than I've seen Zig generate before and it didn't make sense
         | based on how little the tool was actually doing. Dropping it
         | into Binary Ninja revealed that the majority of the code was
         | there to support floating point formatting. So I changed the
         | code to cast the floating point number to an integer before
         | printing it out. That change resulted in a binary that was down
         | at the size I had been expecting.
        
         | jk-jeon wrote:
         | https://github.com/jk-jeon/dragonbox/discussions/57#discussi...
         | 
         | We have been doing some experiments on optimizing for size, and
         | currently it can be reduced to ~3k on 8-bit AVR. It only
         | contains impl/table for single-precision binary32, and double-
         | precision requires quite a bit more, but at the same time much of the
         | bloat is due to how limited AVR is. On platforms like x64 it
         | should be much smaller.
         | 
         | You can certainly say 3k is still huge though.
        
       | londons_explore wrote:
       | I kinda hoped a formatting library designed to be small and able
       | to print strings, and ints ought to be ~50 bytes...
       | 
       | strings are ~4 instructions (test for null terminator, output
       | character, branch back two).
       | 
       | Ints are ~20 instructions. Check if negative and if so output '-'
       | and invert. Put 1000000000 into R1. divide input by R1, saving
       | remainder. add ASCII '0' to result. Output character. Divide R1
       | by 10. put remainder into input. Loop unless R1=0.
       | 
       | Floats aren't used by many programs so shouldn't be compiled
       | unless needed. Same with hex and pointers and leading zeros etc.
       | 
       | I can assure you that when writing code for microcontrollers with
       | 2 kilobytes of code space, we don't include a 14 kilobyte string
       | formatting library...
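
The string and integer routines described in the comment above can be sketched in C++ roughly as follows. This is a minimal illustration, not code from {fmt} or from any vendor SDK: `put_char`, `put_str`, `put_int` and the `g_out` buffer are hypothetical names, and the sink appends to a string only so the result can be inspected. It follows the described steps literally, so it always emits ten digits (leading zeros included).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical output sink. On a microcontroller this would write one byte
// to a UART data register instead of appending to a string.
static std::string g_out;
static void put_char(char c) { g_out.push_back(c); }

// String output: test for the terminator, emit a character, loop.
void put_str(const char* s) {
    assert(s != nullptr);
    while (*s != '\0') put_char(*s++);
}

// Integer output following the steps in the comment literally.
void put_int(std::int32_t value) {
    // Work on an unsigned magnitude so that INT32_MIN can be negated safely.
    std::uint32_t rest = static_cast<std::uint32_t>(value);
    if (value < 0) {
        put_char('-');
        rest = 0u - rest;
    }
    // Start at 10^9 (the largest power of ten that fits in 32 bits) and
    // peel off one decimal digit per iteration until the divisor reaches 0.
    for (std::uint32_t divisor = 1000000000u; divisor != 0; divisor /= 10) {
        put_char(static_cast<char>('0' + rest / divisor));
        rest %= divisor;
    }
}
```

Suppressing the leading zeros needs one extra flag and branch, which is the kind of feature the comment argues should only be paid for when it is used.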
        
         | vient wrote:
         | It is a featureful formatting library, not simply a library for
         | slow printing of ints and strings without any modifiers. You
         | can't create a library which is full of features, fast, and
         | small simultaneously.
        
           | jstimpfle wrote:
           | You'd hope the unused stuff gets stripped out but I don't
           | know much about this topic so not going to argue.
        
             | vlovich123 wrote:
             | -ffunction-sections and -fdata-sections would need at a
             | minimum to be used to strip dead code. But even with LTO
             | it's highly unlikely this could be trimmed unless all
             | format strings are parsed at compile time because the
             | compiler wouldn't know that the code wouldn't be asked to
             | format a floating point number at some point. There could
             | be other subtle things that hide it from the compiler as
             | dead code.
             | 
             | The surest bet would be a compile time feature flag to
             | disable floating point formatting support which it does
             | have.
             | 
             | Still, that's 8kib of string formatting library code
             | without floating point and a bunch of other optimizations
             | which is really heavy in a microcontroller context
        
               | CoastalCoder wrote:
               | I think this is one scenario where C++ type-templated
               | string formatters _could_ shine.
               | 
               | Especially if you extended them to indicate assumptions
               | about the values at compile time. E.g., possible ranges
               | for integers, whether or not a floating point value can
               | have certain special values, etc.
        
               | vlovich123 wrote:
               | You'd be surprised. I'm pretty sure std::format is
               | templated. That still doesn't mean it's easy to
               | convince the compiler to delete that code.
        
               | josephg wrote:
               | > it's highly unlikely this could be trimmed unless all
               | format strings are parsed at compile time
               | 
               | They probably should be parsed at compile time, like how
               | zig does it. It seems so weird to me that in C & C++
               | something as simple as format strings are handled
               | dynamically.
               | 
               | Clang even parses format strings anyway, to look for
               | mismatched arguments. It just - I suppose - doesn't do
               | anything with that.
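
As a rough illustration of what parsing a format string at compile time can look like in C++, here is a minimal sketch. It is not how {fmt}, std::format or Zig implement it: `count_placeholders` and `check_args` are hypothetical helpers, and only plain `{}` placeholders and the `{{` escape are recognised.

```cpp
#include <cassert>
#include <cstddef>

// Counts plain "{}" placeholders in a format string. "{{" is treated as an
// escaped brace; specs such as "{:d}" are deliberately not handled here.
constexpr std::size_t count_placeholders(const char* fmt) {
    std::size_t count = 0;
    for (std::size_t i = 0; fmt[i] != '\0'; ++i) {
        if (fmt[i] == '{') {
            if (fmt[i + 1] == '{') { ++i; continue; }  // escaped "{{"
            if (fmt[i + 1] == '}') { ++count; ++i; }   // one placeholder
        }
    }
    return count;
}

// Evaluated entirely by the compiler: a mismatch is a build error and no
// parsing code has to be present in the binary for these strings.
static_assert(count_placeholders("{} + {} = {}") == 3, "three arguments expected");
static_assert(count_placeholders("100% {{literal}}") == 0, "escaped braces only");

// Runtime fallback for strings that are only known at run time.
inline void check_args(const char* fmt, std::size_t supplied) {
    assert(count_placeholders(fmt) == supplied);
}
```

Because the count is a constant expression, the compiler also knows which argument types a given call site needs, which is the information a linker would need to drop unused formatters.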
        
             | vitaut wrote:
             | It is indeed possible to remove unused code with techniques
             | like format string compilation but that's a topic for
             | another post.
        
         | jstimpfle wrote:
         | Curious what space you work in? What kind of devices,
         | what are they used for?
        
           | londons_explore wrote:
           | Not me but a friend. Things like making electronics for
           | singing birthday cards and toys that make noise.
           | 
           | But there are plenty of other similar things - like making
           | the code that determines the flashing pattern of a bicycle
           | light or flashlight. Or the code that does the countdown
           | timer on a microwave. Or the code that makes the 'ding' sound
           | on a non-smart doorbell. Or the code that makes a hotel safe
           | open when the right combination is entered. Or the code that
           | measures the battery voltage on a USB battery bank and puts
           | 1-4 indicator LED's on so you know how full it is.
           | 
           | You don't tend to hear about it because the design of most of
           | this stuff doesn't happen in the USA anymore - the software
           | devs are now in China for all except high-end stuff.
        
             | furyofantares wrote:
             | Do any of those need a string formatting library?
        
               | toast0 wrote:
               | Hotel safe might, if it logs somewhere (serial port?).
               | 
               | The others may have a serial port setup during
               | development, too. If you have a truly small formatter,
               | you can just disable it for final builds (or leave it on,
               | assuming output is non blocking, if someone finds the
               | serial pins, great for them), rather than having larger
               | rom for development and smaller for production.
        
               | londons_explore wrote:
               | mostly used for debugging with "printf debugging" -
               | either on the developers desk, or in the field ("we've
               | got a dead one. Can you hook up this pin to a USB-serial
               | converter and tell me what it's saying?")
        
         | IshKebab wrote:
         | It isn't designed to be small; it's designed to be a fully
         | featured string formatting library with size as an important
         | secondary goal.
         | 
         | If you want something that _has_ to be microscopic at the cost
         | of not supporting basic features there are definitely better
         | options.
         | 
         | > I can assure you that when writing code for microcontrollers
         | with 2 kilobytes of code space, we don't include a 14 kilobyte
         | string formatting library...
         | 
         | No shit. If you only have 2kB (unlikely these days) don't use
         | this. Fortunately the vast majority of modern microcontrollers
         | have way more than that. E.g. esp32 _starts_ at 1MB. Perfectly
         | reasonable to use a 14kB formatting library there.
        
           | londons_explore wrote:
           | When you're designing something that sells for a dollar to
           | retailers, eg. a birthday card that sings, your boss won't
           | let you spend more than about 5 cents on the microcontroller,
           | and probably wants you to spend 1-2 cents if you can.
        
             | edflsafoiewq wrote:
             | Perhaps a singing birthday card doesn't need to format
             | strings.
        
               | nikbackm wrote:
               | How else would you get nice looking logs for debugging
               | it?
        
               | a1o wrote:
               | Using log4c
        
             | swagonomixxx wrote:
             | I kind of get where you're coming from but at what point do
             | we admit that such use cases are the fringe and not the
             | main?
        
             | IshKebab wrote:
             | Sure, but such extreme use cases are rare and don't need to
             | be constantly brought up.
        
               | cozzyd wrote:
               | Even on larger microcontrollers you often have to write a
               | bootloader...
        
               | IshKebab wrote:
               | Very occasionally I guess. They're almost always bare
               | metal.
        
               | cozzyd wrote:
               | You still want a bootloader to support firmware updates,
               | typically in the first 8 kB of flash or something like
               | that.
        
               | IshKebab wrote:
               | Good point. I guess don't use `fmt` for that...
        
             | tialaramex wrote:
             | > When you're designing something that sells for a dollar
             | to retailers
             | 
             | Then you shouldn't prioritize compatibility with 1980s Unix
             | code, which is what C++ is for.
        
           | Narishma wrote:
           | > esp32 starts at 1MB
           | 
           | Which models? The most I've ever seen on an ESP32 is 512KB of
           | SRAM.
        
             | ta988 wrote:
             | I think they are talking about the flash. The code is by
             | default run from flash (a mechanism called XIP execute in
             | place). But you can annotate functions (with a macro called
             | IRAM_ATTR) that you want to have in ram if you need
             | performance (you have to also be careful about the data
             | types you use inside as they are not guaranteed to be put
             | in RAM).
        
         | maccard wrote:
         | What do you use instead?
         | 
         | Iostream is... far bigger than this, for example.
        
           | Sharlin wrote:
           | I presume the sort of custom routines that GP described?
        
           | londons_explore wrote:
           | most platforms come with their own libraries for this, which
           | are usually a mix of hand coded assembly and C. You #include
           | the whole library/sdk, but the linker strips out all bits you
           | don't use.
           | 
           | Even then, if you read the disassembled code, you can usually
           | find within a few minutes looking some
           | stupid/unused/inefficient code - so you could totally do a
           | better job if you wrote the assembly by hand, but it would
           | take much more time (especially since most of these
           | architectures tend to have very irregular instruction sets)
        
             | maccard wrote:
             | If you're just going to use the platform built in, then the
             | size of a third party library doesn't matter to you.
        
           | criddell wrote:
           | If you only have 2 kB of code space, you would likely be
           | doing custom routines in assembly that do exactly what you
           | need and nothing more.
        
             | maccard wrote:
             | Right - so no matter how small libfmt gets Op isn't going
             | to use it
        
         | secondcoming wrote:
         | Would someone writing code for a 2Kb microcontroller even be
         | using full-fledged C++, or just C With Classes?
        
           | londons_explore wrote:
           | It's still full fledged C++, you just don't use many of the
           | features, and the compiler leaves out all of the associated
           | code.
           | 
           | Pretty easy to accidentally use some iostream and
           | accidentally pull in loads of code you didn't want though.
        
             | astrobe_ wrote:
             | To me the only reason to use C++ over C in that case is the
             | slightly stronger type-checking and maybe some extra
             | syntactic sugar.
        
         | formerly_proven wrote:
         | avrlibc's small variant of printf (which still has a ton of
         | features) is like 600 bytes.
        
         | jeroenhd wrote:
         | I don't think the requirements for your specific programming
         | niche should influence the language like that. Your
         | requirements are valid, but they should be served by a bottom
         | of the barrel microcontroller compiler rather than the language
         | spec.
        
           | MobiusHorizons wrote:
           | It's relevant because the author mentions microcontrollers as
           | the reason for focusing on binary size.
        
             | Karliss wrote:
             | There are many orders of magnitude difference between
             | smallest and higher end microcontrollers. You can have an
             | 8bit micro with <1k of ram and <8k of flash memory, and you
             | can have something with >8MB of flash memory or more
             | running a RTOS possibly even capable of running Linux with
             | moderate effort. In the latter case 14k of formatting
             | library is probably fine.
        
             | jeroenhd wrote:
             | All of the optimisation work in this article is done for
             | Linux aarch64 ELF binaries.
             | 
             | Besides that, microcontrollers have megabytes of storage
             | these days. To be restricted by a dozen or two kilobytes of
             | code storage, you need to be _very_ storage restricted.
             | 
             | I have run into code size issues myself when trying to run
             | Rust on an ESP32 with 2MiB of storage (thought I bought
             | 16MB, but MB stood for megabits, oops). Through tweaking
             | the default Rust options, I managed to save half a megabyte
             | or more to make the code work again. The article also links
             | to a project where the fmt library is still much bigger
             | (over 300KiB rather than the current 57KiB).
             | 
             | There are microcontrollers where you need to watch out for
             | your dependencies and compiler options, and then there are
             | _tiny_ microcontrollers where every bit matters. For those
             | specific constraints, it doesn't make a lot of sense to
             | assume you can touch every language feature and load every
             | standard library to just work. Much older language features
             | (such as template classes) will also add hundreds of
             | kilobytes of code to your program already, you have to work
             | around that stuff if you're in an extremely constrained
             | environment.
             | 
             | The important thing with language features that includes
             | targets like these is that you can disable the entire
             | feature and enable your own. Sharing design goals between
             | x64 supercomputers and RISC-V chips with literal dozens of
             | bytes of RAM makes for an unreasonably restricted language
             | for anything but the minimal spec. Floats are just
             | expensive on minimum cost chips.
        
         | sixfiveotwo wrote:
         | > I can assure you that when writing code for microcontrollers
         | with 2 kilobytes of code space, we don't include a 14 kilobyte
         | string formatting library...
         | 
         | I'm pretty sure you wouldn't use C++ in that situation anyway,
         | so I don't really see your point.
        
           | usrnm wrote:
           | If you get rid of the runtime, which most compilers allow you
           | to do, C++ is just as suitable for this task as C. Not as
           | good as hand-rolled assembly, but usable
        
             | sixfiveotwo wrote:
             | Okay, vtables in 2kb code space?
        
               | CyberDildonics wrote:
               | Where are the vtables coming from if you use no
               | inheritance and no memory allocation?
        
           | ta988 wrote:
           | You can use c++ yes and a lot of people do. You just keep the
           | exceptions, stdlib and runtime in general at the door.
        
         | aseipp wrote:
         | The design of any library for a microcontroller and an
         | "equivalent" for general end user application is going to be
         | different in pretty much every major design point. I'm not sure
         | how this is any more relevant to fmt than it is just general
         | complaining out in the open.
         | 
         | The code for an algorithm like Dragonbox or Dragon4 alone is
         | already blowing your size budget, so the "optional" stuff
         | doesn't really matter. And that's 1 of like 20 features people
         | want.
        
         | ska wrote:
         | Sure, but there are also microcontrollers with a lot more
         | space. This probably won't ever usefully target the small ones,
         | but that doesn't mean it isn't useful in the space at all.
        
         | fsckboy wrote:
         | > _I can assure you that when writing code for microcontrollers
         | with 2 kilobytes of code space, we don't include a 14 kilobyte
         | string formatting library..._
         | 
         | then the thing to do is publish the libraries you do use,
         | right, then document what formatting features they support?
         | then other people might discover more and clever ways to pack
         | more features in than you thought of
         | 
         | otherwise, I don't get your point.
        
         | dinkumthinkum wrote:
         | What? I mean, you realize that fmtlib is much more complicated
         | than that, right? What you are describing is something very
         | basic, primitive by comparison. I'm also puzzled why you think
         | floats are not used by many programs, that's kind of mind-
         | boggling. I get that you wouldn't load it on a microcontroller
         | but you wouldn't do that with the standard library either.
        
       | pzmarzly wrote:
       | > However, since it may be used elsewhere, a better solution is
       | to replace the default allocator with one that uses malloc and
       | free instead of new and delete.
       | 
       | C++ noob here, but is libc++'s default allocator (I mean, the
       | default implementation of new and delete) actually doing
       | something different than calling libc's malloc and free under the
       | hood? If so, why?
        
         | murderfs wrote:
         | No, modulo the aligned allocation overloads, but applications
         | are allowed to override the default standard library operator
         | new with their own, even on platforms that don't have an
         | equivalent to ELF symbol interposition.
        
           | masklinn wrote:
           | That doesn't really explain where the dependency on the C++
           | runtime come from tho, as far as I know the dependency chain
           | is std::allocator -> operator new -> malloc, but from the
           | post the replacement only strips out the `operator new`.
           | 
           | Notably I thought the issue would be the throwing of
           | `std::bad_alloc`, but the new version still implements
           | std::allocator, and throws bad_alloc.
           | 
           | And so I assume the issue is that the global `operator new`
           | is concrete (it just takes the size of the allocation), thus
           | you need to link to the C++ runtime just to get that
           | function? In which case you might be able to get the same
           | gains by redefining the global `operator new` and `operator
           | delete`, without touching the allocator.
           | 
           | Alternatively, you might be able to statically link the C++
           | runtime and have DCE take care of the rest.
        
             | kllrnohj wrote:
             | Yes they could have just defined their own global operator
             | new/delete to have a micro-runtime. Same as you'd do if you
             | were doing a kernel in C++. Super easy, barely an
             | inconvenience
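
The approach suggested in the two comments above, replacing the global allocation functions instead of touching the allocator, might look roughly like this. It is a sketch under assumptions rather than what the article or {fmt} does: it assumes a hosted C library that provides malloc and free, it aborts where the default operator new would throw std::bad_alloc, and `g_allocations` exists only to make the replacement observable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Counter used only to show that the replacements are actually called.
static std::size_t g_allocations = 0;

// Replacement global allocation function backed by malloc. It aborts on
// failure instead of throwing, so no exception machinery is referenced.
void* operator new(std::size_t size) {
    if (size == 0) size = 1;  // a zero-size request must still yield a valid pointer
    void* p = std::malloc(size);
    if (p == nullptr) std::abort();
    ++g_allocations;
    return p;
}
void* operator new[](std::size_t size) { return ::operator new(size); }

// Matching deallocation functions, including the C++14 sized overloads.
void operator delete(void* p) noexcept { std::free(p); }
void operator delete[](void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }
void operator delete[](void* p, std::size_t) noexcept { std::free(p); }
```

With these definitions in the program, the linker has no reason to pull the default implementations from the C++ runtime. Whether that alone removes the runtime dependency depends on what else the program references.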
        
             | gobblegobble2 wrote:
             | > Notably I thought the issue would be the throwing of
             | `std::bad_alloc`, but the new version still implements
             | std::allocator, and throws bad_alloc.
             | 
             | The new version uses `FMT_THROW` macro instead of a bare
             | throw. The article says "One obvious problem is exceptions
             | and those can be disabled via FMT_THROW, e.g. by defining
             | it to abort". If you check the `g++` invocation, that's
             | exactly what the author does.
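
The general pattern of routing every throw through an overridable macro can be sketched as below. `MYLIB_THROW` and `parse_digit` are hypothetical names chosen for this illustration and are not {fmt}'s actual definitions. With GCC or Clang, a build without exceptions could override the macro on the command line, for example with '-DMYLIB_THROW(x)=std::abort()'.

```cpp
#include <cassert>
#include <cstdlib>
#include <stdexcept>

// Hypothetical macro for this sketch. By default it throws; a build that
// cannot use exceptions can predefine it to something like std::abort().
#ifndef MYLIB_THROW
#  define MYLIB_THROW(x) throw x
#endif

// Converts one decimal digit. All error reporting goes through the macro,
// so the library source never contains a bare throw expression.
int parse_digit(char c) {
    if (c < '0' || c > '9') MYLIB_THROW(std::invalid_argument("not a digit"));
    return c - '0';
}
```

When the macro expands to an abort call, the exception object is never constructed and the unwinding support has nothing left to reference in this translation unit.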
        
         | pjmlp wrote:
         | ISO C++ doesn't require new and delete default implementations
         | to call down into malloc()/free().
         | 
         | Many implementations do it, only because it is already there
         | and thus it is easy just to reach for them.
        
         | 1000100_1000101 wrote:
         | Not the strongest on C++ myself, but the new[] will attempt to
         | run constructors on each element after calling the new operator
         | to get the RAM. The delete[] will attempt to run destructors
         | for each element before calling operator delete[] to free the
         | RAM.
         | 
         | In order for delete[] to work, C++ must track the allocation
         | size somewhere. This could be co-located with the allocation
         | (at ptr - sizeof(size_t) for example), or it could be in some
         | other structure. Using another structure lowers the odds of it
         | getting trampled if/when something writes to memory beyond an
         | object, but comes with a lookup cost, and code to handle this
         | new structure.
         | 
         | I'm sure proper C++ libraries are doing even more, but you
         | already get the idea, new and delete are not the same as malloc
         | and free.
        
           | OskarS wrote:
           | > In order for delete[] to work, C++ must track the
           | allocation size somewhere.
           | 
           | That is super-interesting, I had never considered this, but
           | you're absolutely right. I am now incredibly curious how the
           | standard library implementations do this. I've heard normal
           | malloc() sometimes colocates data in similar ways, I wonder
           | if C++ then "doubles up" on that metadata. Or maybe the
           | standard library has its own entirely custom allocator that
           | doesn't use malloc() at all? I can't imagine that's true,
           | because you'd want to be able to swap system allocators with
           | e.g. LD_PRELOAD (especially for Valgrind and stuff). They
           | could also just be tracking it "to the side" in some hash
           | table or something, but that seems bad for performance.
        
             | tom_ wrote:
             | new[] and delete[] both know the type of the object.
             | Therefore both know whether a destructor needs to be
             | called.
             | 
             | When a destructor doesn't - e.g., new int[] - operator
             | new[] is called upon to allocate N*sizeof(T) bytes. The
             | code stores off no metadata. The result of operator new[]
             | is the array address.
             | 
             | When a destructor does - e.g., new std::string[] - operator
             | new[] is called upon to allocate sizeof(size_t)+N*sizeof(T)
             | bytes. The code stores off the item count in the size_t,
             | adds sizeof(size_t) to the value returned by operator
             | new[], uses that as the address for the array, and calls
             | T() on each item. And delete[] performs the opposite:
             | fishes out the size_t, calls ~T() on each item, subtracts
             | sizeof(size_t) from the array address, and passes that to
             | operator delete[] to free the buffer.
             | 
             | (There are also some additional things to cater for: null
             | checks, alignment, and so on. Just details.)
             | 
             | Note that operator new[] is not given any information about
             | whether a destructor needs to run, or whether there is any
             | metadata being stored off. It just gets called with a byte
             | count. Exercise caution when using placement operator
             | new[], because a preallocated buffer of N*sizeof(T) may not
             | be large enough.
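
The extra size_t described above can be observed with a class-specific operator new[] that records the byte count it receives. This is a sketch: `Recording`, `Plain`, `Noisy` and `bytes_requested_for` are hypothetical names, and the amount of overhead is implementation-defined. On the common ABIs used by GCC, Clang and MSVC, an element type with a trivial destructor gets no overhead and one with a user-provided destructor does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Last byte count passed to the class-specific operator new[] below.
static std::size_t g_requested = 0;

// Shared allocation functions that record the requested size.
struct Recording {
    static void* operator new[](std::size_t bytes) {
        g_requested = bytes;
        void* p = std::malloc(bytes);
        if (p == nullptr) std::abort();
        return p;
    }
    static void operator delete[](void* p) noexcept { std::free(p); }
};

struct Plain : Recording { int value; };              // trivial destructor
struct Noisy : Recording { int value; ~Noisy() {} };  // user-provided destructor

// Allocates and frees an array of T, returning the byte count that the
// compiler asked operator new[] for.
template <typename T>
std::size_t bytes_requested_for(std::size_t count) {
    T* array = new T[count];
    delete[] array;
    assert(g_requested >= count * sizeof(T));
    return g_requested;
}
```

Comparing `bytes_requested_for<Plain>(10)` with `bytes_requested_for<Noisy>(10)` shows the difference: the second request is larger than `10 * sizeof(Noisy)` by the space used to store the element count.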
        
             | jeffbee wrote:
             | jemalloc and tcmalloc use size classes, so if you allocate
             | 23 bytes the allocator reserves 32 bytes of space on your
             | behalf. Both of them can find the size class of a pointer
             | with simple manipulation of the pointer itself, not with
             | some global hash table. E.g. in tcmalloc the pointer
             | belongs to a "page" and every pointer on that page has the
             | same size.
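
The rounding and page lookup described above can be illustrated with a small sketch. The table and the page size here are hypothetical and much simpler than what jemalloc or tcmalloc really use. The sketch only shows why a 23-byte request ends up in a 32-byte slot and how a page index falls out of the pointer value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical size-class table; real allocators use longer, tuned tables.
constexpr std::size_t kSizeClasses[] = {8, 16, 32, 48, 64, 96, 128};

// Returns the smallest class that fits `bytes`, or 0 if the request is
// larger than anything in this small table.
constexpr std::size_t size_class_for(std::size_t bytes) {
    for (std::size_t cls : kSizeClasses) {
        if (bytes <= cls) return cls;
    }
    return 0;
}

// Hypothetical 8 KiB pages: shifting the address gives the page index, which
// an allocator can use to look up the size class shared by that page.
constexpr unsigned kPageShift = 13;
constexpr std::uintptr_t page_of(std::uintptr_t address) {
    return address >> kPageShift;
}

static_assert(size_class_for(23) == 32, "23 bytes rounds up to the 32-byte class");
```

The per-page metadata tells the allocator how many bytes to release. It says nothing about how many objects were constructed in the array, which is why the element count discussed elsewhere in this thread is stored separately.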
        
               | Someone wrote:
               | That doesn't help for C++ if you allocated an array of
               | objects with destructors. It has to know that you
               | allocated 23 objects, so that it can call 23 destructors,
               | not 32 ones, 9 of which on uninitialized memory.
        
               | jeffbee wrote:
               | I believe the question was more around how the program
               | knows how much memory to deallocate. The compiler
               | generates the destructor calls the same way the compiler
               | generates everything else in the program.
        
           | bangaladore wrote:
           | realloc is the same, as the old memory needs to be copied to
           | the new memory.
        
           | progmetaldev wrote:
           | Isn't it also possible for other logic to run in a
           | destructor, such as freeing pointers to external resources?
           | Doesn't this cause (at the very least) the possibility for
           | more advanced logic to be run beyond freeing the object's own
           | memory?
        
         | janos95 wrote:
         | The main point of replacing it with malloc is that new will
         | throw std::bad_alloc so using it requires linking against the
         | c++ runtime.
        
       | neonsunset wrote:
       | It's always fmt. Incredibly funny that _this exact problem_ now
       | happens in .NET. If you touch enough numeric (esp. fp and
       | decimal) formatting /parsing bits, linker ends up rooting a lot
       | of floating point and BigInt related code, bloating binary size.
        
         | pjmlp wrote:
         | Still looking forward to the Delphi-like experience with
         | Native AOT, thankfully getting better.
        
       | msephton wrote:
       | Very enjoyable. I love this sort of thinking outside the box
       | optimisations.
        
       | a1o wrote:
       | > Considering that a C program with an empty main function is 6kB
       | on this system, {fmt} now adds less than 10kB to the binary.
       | 
       | Interesting, I've never done this test!
        
         | JonChesterfield wrote:
         | It varies widely with whether the C library is dynamically or
         | statically linked and with how the application (and C library)
         | were built. And on which C library it is. Also a little on
         | whether you're using elf or some other container.
        
       | ptspts wrote:
       | Shameless plug: printf("Hello, World!\n"); is possible with an
       | executable size of 1008 bytes, including libc with output
       | buffering: https://github.com/pts/minilibc686
       | 
       | Please note that a direct comparison would be apples-to-oranges
       | though.
        
         | jart wrote:
         | That's because the compiler turns it into fputs
        
       | rty32 wrote:
       | Maybe I am slow, it took me a while to realize the "14k" in the
       | title refers to "14kB"
        
         | hrydgard wrote:
         | What else would it possibly mean?
         | 
         | k is very common shorthand for kB, at least historically.
        
           | Rygian wrote:
           | 14000 lines of assembler?
        
       ___________________________________________________________________
       (page generated 2024-09-01 23:00 UTC)