[HN Gopher] Reducing code size with LLVM Machine Outliner on 32-...
___________________________________________________________________
Reducing code size with LLVM Machine Outliner on 32-bit Arm targets
Author : matt_d
Score : 45 points
Date : 2021-04-14 11:55 UTC (11 hours ago)
(HTM) web link (www.linaro.org)
(TXT) w3m dump (www.linaro.org)
| viraptor wrote:
| I'm surprised that cache sizes and jump distance were not
| mentioned. Would that mean the size reduction is good enough that
| we're safely ignoring potential fetch of another page of code?
| cesaref wrote:
| Yes, I noticed this omission. It would be helpful to at least
| discuss the performance implications of this. I imagine the
| target audience are compiling for limited memory embedded
| devices where reducing the memory pressure could allow a
| cheaper SoC to be used where performance is not a concern.
|
| From a performance perspective, I was wondering how this would
| interact with inlining, as there is the potential to inline and
| then extract more generalised call sequences and hence get
| better performance without the memory bloat from inlining.
| Definitely sounds like an interesting option.
| yroux wrote:
| Yes that's it, I wanted to put the focus on code size but
| maybe I should have said a word about performance as well.
|
| I replied to the speed question in a previous comment, but I
| can repeat it here as well: I don't have exact numbers to
| report, but on average you can expect a regression of around
| 2%. It will depend on the call latency of the core and is also
| affected by cache effects, so in some cases you can see a
| performance improvement.
| yroux wrote:
| The focus here is pure code size reduction; it doesn't look at
| cache locality or performance. Performance might improve if the
| size reduction avoids some i-cache misses, or it might not, but
| that's a price to pay.
| wyldfire wrote:
| With a code size reduction like this, I'd expect an icache hit
| ratio improvement. But to your point - do we have to add more
| trampolines to critical paths? If so it might not be a net
| performance improvement.
|
| The fact that so many other architectures enable it seems to
| hint that it would likely pay off for arm-32 too.
| wuxb wrote:
| In addition to reducing code size, this may also help reduce the
| number of occupied branch prediction slots (mostly x86 I guess?).
| Say two conditional branches each have a 50/50 chance of being
| taken, so they do not benefit from branch prediction. Then merging
| the code with outlining can reduce them to one conditional branch
| instruction. Since it's still "unpredictable" during execution,
| one prediction slot is saved for free.
| chrisseaton wrote:
| Does anyone know if there's ever been a compiler which completely
| ignores the programmer's function boundaries and rediscovers
| sensible compilation units from scratch itself? Rather than
| starting with the user's function boundaries and inlining and
| outlining a bit around that?
|
| I suppose tracing is a bit like this, but I've only seen that
| done dynamically.
| nine_k wrote:
| Read about supercompilation.
| pfdietz wrote:
| What does it do to speed?
| rurban wrote:
| If the outlined code is cold, the Icache benefits from this
| part being skipped (because it's behind a new function call),
| so that more hot code is cached. That can gain a few percent
| speedwise, in my case more than 3%.
|
| If it's hot, it's a bit slower, because you have the added fn
| call overhead. So better do that manually.
| yroux wrote:
| The focus is code size. In terms of performance I don't have
| numbers to report, but on average you can expect a regression
| of around 2%. It will depend on the call latency of the core
| and is also affected by cache effects, so in some cases you
| can see a performance improvement.
| Someone wrote:
| In the first example, one of the "b OUTLINED_FUNCTION_0" jumps can
| be removed (in the code as presented, the last one. A
| sufficiently smart compiler would figure out which of the
| functions is called most often). If the outliner pass doesn't do
| that, is there an LLVM pass that can do that kind of stuff?
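The point about the removable jump can be sketched with a toy model. This is not an LLVM pass and not the article's actual listing: the instruction strings, the block layout and the helper name `drop_fallthrough_branch` are all hypothetical. The only assumption is that an unconditional branch to the block placed immediately after it in memory is redundant, because execution would fall through anyway.

```python
def drop_fallthrough_branch(layout):
    """Remove a trailing 'b <label>' when <label> is the very next block.

    layout is a list of (label, instructions) pairs in memory order.
    Instructions are plain strings; this is a toy model, not real codegen.
    """
    result = []
    for idx, (label, insns) in enumerate(layout):
        insns = list(insns)
        if insns and idx + 1 < len(layout):
            next_label = layout[idx + 1][0]
            # A jump to the block that physically follows is a no-op:
            # falling through reaches the same instruction.
            if insns[-1] == f"b {next_label}":
                insns.pop()
        result.append((label, insns))
    return result


# Hypothetical layout: two callers followed by the outlined body.
layout = [
    ("f", ["mov r0, #1", "b OUTLINED_FUNCTION_0"]),
    ("g", ["mov r0, #2", "b OUTLINED_FUNCTION_0"]),
    ("OUTLINED_FUNCTION_0", ["add r0, r0, #1", "bx lr"]),
]
optimized = drop_fallthrough_branch(layout)
```

Only the caller placed directly before the outlined body loses its branch, which matches the remark that, in the code as presented, it is the last one. Choosing which caller to place there (for example the most frequently called one) is a separate layout decision that this sketch does not attempt.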
| throwaway222145 wrote:
| The Jump Threading pass likely removes that unnecessary branch
| instruction.
| userbinator wrote:
| I remember playing around with something like this many years ago
| in a demoscene context, using a slightly modified LZ compressor
| to identify areas of repeated instructions. Interestingly enough,
| you can repeat this process multiple times, because jump offsets
| will also get smaller with each block compressed, until there
| are no more opportunities to do so.
|
| "Machine Outliner" may be somewhat more descriptive, but I think
| calling it "LLVM-LZ" might be more catchy and memorable.
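The repeated-sequence idea in the comment above can be illustrated with a minimal sketch. It is neither the LLVM Machine Outliner nor an LZ compressor: it is a brute-force toy that treats instructions as opaque strings, and the `call`/`ret` pseudo-instructions, the function names and the cost model (one call per occurrence, plus the outlined body and one return) are all assumptions made for illustration. It also does not model jump offsets shrinking; it simply re-runs the search until no profitable repeat is left.

```python
from collections import defaultdict


def find_best_repeat(code, min_len=2):
    """Return (saving, sequence) for the most profitable repeat, or None."""
    best = None
    n = len(code)
    for length in range(min_len, n // 2 + 1):
        positions = defaultdict(list)
        for i in range(n - length + 1):
            seq = tuple(code[i:i + length])
            # Count only non-overlapping occurrences, left to right.
            if not positions[seq] or i >= positions[seq][-1] + length:
                positions[seq].append(i)
        for seq, starts in positions.items():
            count = len(starts)
            if count < 2:
                continue
            # Assumed cost model: each occurrence becomes one call, and
            # the outlined function costs its body plus one return.
            saving = count * length - (count + length + 1)
            if saving > 0 and (best is None or saving > best[0]):
                best = (saving, seq)
    return best


def outline_once(code, outlined, min_len=2):
    """Outline the single best repeat; return (new_code, changed)."""
    best = find_best_repeat(code, min_len)
    if best is None:
        return code, False
    _, seq = best
    name = f"OUTLINED_FUNCTION_{len(outlined)}"
    outlined[name] = list(seq) + ["ret"]
    result, i, k = [], 0, len(seq)
    while i < len(code):
        if tuple(code[i:i + k]) == seq:
            result.append(f"call {name}")
            i += k
        else:
            result.append(code[i])
            i += 1
    return result, True


def outline(code, min_len=2):
    """Repeat outlining until no repeat saves space any more."""
    outlined = {}
    changed = True
    while changed:
        code, changed = outline_once(code, outlined, min_len)
    return code, outlined


code = ["a", "b", "c", "x", "a", "b", "c", "y", "a", "b", "c"]
new_code, outlined = outline(code)
```

Each pass strictly reduces the total instruction count under the assumed cost model, so the loop terminates. Later passes can pick up sequences that contain calls to earlier outlined functions, which is one way repeated application can find further savings.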
___________________________________________________________________
(page generated 2021-04-14 23:02 UTC)