[HN Gopher] Reducing code size with LLVM Machine Outliner on 32-bit Arm targets
       ___________________________________________________________________
        
       Reducing code size with LLVM Machine Outliner on 32-bit Arm targets
        
       Author : matt_d
       Score  : 45 points
       Date   : 2021-04-14 11:55 UTC (11 hours ago)
        
 (HTM) web link (www.linaro.org)
 (TXT) w3m dump (www.linaro.org)
        
       | viraptor wrote:
        | I'm surprised that cache sizes and jump distance were not
        | mentioned. Does that mean the size reduction is good enough that
        | we can safely ignore the potential fetch of another page of code?
        
         | cesaref wrote:
         | Yes, I noticed this omission. It would be helpful to at least
         | discuss the performance implications of this. I imagine the
          | target audience is compiling for limited-memory embedded
          | devices, where reducing memory pressure could allow a cheaper
          | SoC to be used and performance is not a concern.
         | 
         | From a performance perspective, I was wondering how this would
         | interact with inlining, as there is the potential to inline and
         | then extract more generalised call sequences and hence get
         | better performance without the memory bloat from inlining.
         | Definitely sounds like an interesting option.
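          | 
          | A rough source-level sketch of that inline-then-outline idea
          | (hypothetical names and values, not taken from the article):
          | 
          |     /* A tiny helper that would normally be inlined at every
          |        call site. */
          |     static inline int clamp8(int v) {
          |         return v < 0 ? 0 : v > 255 ? 255 : v;
          |     }
          | 
          |     int scale_a(int v) { return clamp8(v * 3 / 2); }
          |     int scale_b(int v) { return clamp8(v + 17); }
          | 
          |     /* After inlining, both functions end with the same
          |        compare-and-clamp instruction sequence; an outliner
          |        could then pull that shared tail back out as a single
          |        OUTLINED_FUNCTION_N, keeping most of the speed benefit
          |        of inlining without duplicating the clamp code. */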
        
           | yroux wrote:
            | Yes, that's it. I wanted to put the focus on code size, but
            | maybe I should have said a word about performance as well.
           | 
            | I replied to the speed question in a previous comment, but I
            | can put it here as well: I don't have exact numbers to
            | report, but on average you can expect a regression of around
            | 2%. It will depend on the call latency of the core and is
            | also affected by cache effects, so in some cases you can even
            | see a performance improvement.
        
         | yroux wrote:
          | The focus here is purely on code size reduction; it doesn't
          | look at cache locality or performance. Those might improve if
          | the size reduction avoids some i-cache misses, or they might
          | not, in which case that's a price to pay.
        
         | wyldfire wrote:
         | With a code size reduction like this, I'd expect an icache hit
         | ratio improvement. But to your point - do we have to add more
         | trampolines to critical paths? If so it might not be a net
         | performance improvement.
         | 
          | The fact that so many other architectures enable it seems to
          | hint that it would likely pay off for arm-32 too.
        
       | wuxb wrote:
        | In addition to code reduction, this may also help reduce the
        | number of occupied branch prediction slots (mostly on x86 I
        | guess?). Say two conditional branches each have a 50/50 chance,
        | which does not benefit from branch prediction. Then merging the
        | code with outlining can reduce them to one conditional branch
        | instruction. Since it's still "unpredictable" during execution,
        | one prediction slot is saved for free.
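        | 
        | A made-up illustration of that point (hypothetical code, not
        | from the article): both functions below contain the same
        | effectively unpredictable test, so compiled separately they
        | carry two conditional branch instructions and may use two
        | predictor entries. If the outliner could fold the shared
        | sequence into one outlined function, only a single branch
        | instruction would remain to occupy a prediction slot.
        | 
        |     #include <stdlib.h>
        | 
        |     int handle_a(int x) {
        |         if (rand() & 1)   /* ~50/50, prediction doesn't help  */
        |             x = -x;       /* same sequence in both functions  */
        |         return x + 1;
        |     }
        | 
        |     int handle_b(int x) {
        |         if (rand() & 1)   /* candidate to share with handle_a */
        |             x = -x;
        |         return x - 1;
        |     }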
        
       | chrisseaton wrote:
       | Does anyone know if there's ever been a compiler which completely
       | ignores the programmer's function boundaries and rediscovers
       | sensible compilation units from scratch itself? Rather than
       | starting with the user's function boundaries and inlining and
       | outlining a bit around that?
       | 
       | I suppose tracing is a bit like this, but I've only seen that
       | done dynamically.
        
         | nine_k wrote:
         | Read about supercompilation.
        
           | [deleted]
        
         | [deleted]
        
       | pfdietz wrote:
       | What does it do to speed?
        
         | rurban wrote:
          | If the outlined code is cold, the icache benefits from this
          | part being skipped (because it's behind a new function call),
          | so more hot code stays cached. That can be worth a few percent
          | speedwise; in my case more than 3%.
          | 
          | If it's hot, it's a bit slower because of the added function
          | call overhead, so it's better to do that manually.
        
         | yroux wrote:
          | The focus is code size, but in terms of performance I don't
          | have numbers to report. On average you can expect a regression
          | of around 2%, but it will depend on the call latency of the
          | core and is also affected by cache effects, so in some cases
          | you can even see a performance improvement.
        
       | Someone wrote:
        | In the first example, one of the "b OUTLINED_FUNCTION_0" jumps
        | can be removed (in the code as presented, the last one; a
        | sufficiently smart compiler would figure out which of the
        | functions is called most often). If the outliner pass doesn't do
        | that, is there an LLVM pass that can do that kind of thing?
        
         | throwaway222145 wrote:
         | The Jump Threading pass likely removes that unnecessary branch
         | instruction.
        
       | [deleted]
        
       | userbinator wrote:
       | I remember playing around with something like this many years ago
       | in a demoscene context, using a slightly modified LZ compressor
       | to identify areas of repeated instructions. Interestingly enough,
        | you can repeat this process multiple times, because jump offsets
        | will also get smaller with each block compressed, until there are
        | no more opportunities to do so.
       | 
       | "Machine Outliner" maybe somewhat more descriptive, but I think
       | calling it "LLVM-LZ" might be more catchy and memorable.
        
       ___________________________________________________________________
       (page generated 2021-04-14 23:02 UTC)