[HN Gopher] Exploring Clang/LLVM optimization on programming horror
___________________________________________________________________
Exploring Clang/LLVM optimization on programming horror
Author : maattdd
Score : 167 points
Date : 2021-08-17 07:37 UTC (15 hours ago)
(HTM) web link (blog.matthieud.me)
(TXT) w3m dump (blog.matthieud.me)
| woodruffw wrote:
| Great writeup! This is a nice overview of how LLVM's optimization
| passes compose to produce nicely optimized assembly.
|
| One nitpicky observation: mem2reg doesn't "move values from
| memory [...] to registers (directly inside the CPU)." It's an SSA
| expansion pass: it takes the minimal SSA form produced by the C/C++
| frontend and turns it into a more aggressive SSA form by
| eliminating as many `alloca`s as possible. SSA is characterized
| by the presence of infinitely many "fresh" _abstract_ registers,
| none of which correspond to actual CPU registers -- LLVM is
| responsible at a later stage for performing efficient lowering
| and allocation of registers/stack slots for the SSA registers.
| slavik81 wrote:
| Could you give an example of an alloca mem2reg would eliminate?
| dragontamer wrote:
| I don't think that would be helpful in learning what's going
| on.
|
| Instead, I'll point you to the keywords "Dominator
| Frontier" and "SSA Construction Algorithm".
|
| https://en.wikipedia.org/wiki/Static_single_assignment_form#.
| ..
|
| You should read that only after you understand SSA and Phi
| functions in particular. Once you know how Phi works, the
| next question is the algorithm for how they're constructed.
|
| https://en.wikipedia.org/wiki/Dominator_(graph_theory)
|
| This page may also help.
| dragontamer wrote:
| https://llvm.org/docs/Passes.html#passes-mem2reg
|
| > -mem2reg: Promote Memory to Register
|
| > This file promotes memory references to be register
| references. It promotes alloca instructions which only have
| loads and stores as uses. An alloca is transformed by using
| dominator frontiers to place phi nodes, then traversing the
| function in depth-first order to rewrite loads and stores as
| appropriate. This is just the standard SSA construction
| algorithm to construct "pruned" SSA form.
|
| -----------
|
| https://llvm.org/docs/LangRef.html#alloca-instruction
|
| > The 'alloca' instruction allocates memory on the stack frame
| of the currently executing function, to be automatically
| released when this function returns to its caller. If the
| address space is not explicitly specified, the object is
| allocated in the alloca address space from the datalayout
| string.
|
| --------
|
| Based on my reading and understanding of alloca and mem2reg
| above, I have to disagree with your assessment. It seems like
| alloca roughly corresponds to a "push" in your typical x86
| assembly language, or maybe a "sub esp, #ofBytes".
|
| By removing alloca instructions, the mem2reg step is turning
| stack-memory into registers.
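|
| As a rough sketch of what that looks like (hand-written IR with
| made-up names, not output copied from any particular clang
| version): a local like `int y = x + 1; return y;` initially gets a
| stack slot,
|
|       define i32 @f(i32 %x) {
|         %y = alloca i32
|         %t = add i32 %x, 1
|         store i32 %t, i32* %y
|         %r = load i32, i32* %y
|         ret i32 %r
|       }
|
| and since the alloca's only uses are that load and store, mem2reg
| can delete the slot and rewrite the function as
|
|       define i32 @f(i32 %x) {
|         %t = add i32 %x, 1
|         ret i32 %t
|       }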
| woodruffw wrote:
| > By removing alloca instructions, the mem2reg step is
| turning stack-memory into registers.
|
| This is true, but it's also misleading: the transformation of
| stack slots into machine registers is a beneficial _side
| effect_ of mem2reg. Reducing stack use is an important
| optimization, but producing an optimal SSA form is even
| _more_ important, since key passes like folding, DCE and SRoA
| rely _heavily_ on the SSA form. The "reg" in "mem2reg"
| refers explicitly to the latter (SSA registers), as the text
| you've excerpted directly says.
|
| You can also prove this to yourself by contriving an IR
| function that contains more SSA registers than machine
| registers: you'll see that, in addition to any ABI
| constraints, LLVM will be forced to spill any excess SSA
| registers back onto the stack.
|
| Edit: but also yes, to confirm: `alloca` corresponds more or
| less directly to `sub ESP, <adjust>` within the x86 stack
| model. But it doesn't have to! LLVM's semantics are target
| independent even when the IR isn't.
| dragontamer wrote:
| Gotcha. I think I see what you're talking about now.
|
| I have crude familiarity with SSA (no personal experience
| but I've read a book once) but not with LLVM's terminology.
| I'm looking at the example in the blog post, and I can see
| that it's clearly adding in the important Phi functions
| needed.
|
| Strange that LLVM names its aggressive SSA step in this
| manner. But yes, adding the phi functions like this is
| a very important step before we can think about
| optimizations.
|
| The "induction variable" is far easier to detect once it's
| in a "Phi form", so to speak.
| woodruffw wrote:
| It's an extremely confusing naming choice! I don't know
| why they did it, but to take a guess: a lot of compiler
| literature refers to SSA values as "registers," since the
| SSA model of a computer program directly corresponds to
| the concept of an (infinite) register machine in abstract
| computation.
|
| The end result being that we have the same word
| "register" referring to two _opposed_ resources (one
| infinite, one extremely finite) :-).
| gsnedders wrote:
| At least historically LLVM has itself been inconsistent
| in the naming of SSA "values", and mem2reg is arguably
| just the most prominent example of the "register" name
| being used.
| eatonphil wrote:
| The core simplification is due to LLVM's induction-detection
| step. Even after doing induction-based proofs in school, until
| reading this I had the sense that induction is kinda magical and
| kinda not real.
|
| I still cannot fathom how you can encode induction detection into
| an algorithm; any pointers welcome (keeping it simple, please,
| and ideally with working code).
|
| The only case that makes sense to me is if you did numeric
| computation against some fixed number of cases and if that worked
| out then you assume it's right.
|
| I guess this is what proof assistants do (among other things).
| Maybe I should look into how to write a basic proof assistant.
| jcranmer wrote:
| Induction detection is relatively simple in LLVM IR. The first
| thing to remember is that the compiler doesn't work on the same
| source code that the human does; it works on a different
| representation. The canonical loop structure would look
| something like this:
| 
|       loop.preheader:
|         ; This is the block that executes before the loop does.
|         br label %loop.body
| 
|       loop.body:
|         %i = phi i32 [0, %loop.preheader], [%i.next, %loop.body]
|         %val = phi i1 [false, %loop.preheader], [%val.next, %loop.body]
|         %val.next = xor i1 %val, true
|         %i.next = add i32 %i, 1
|         %cmp = icmp slt i32 %i.next, %N
|         br i1 %cmp, label %loop.body, label %loop.exit
| 
|       loop.exit:
|         ; After the loop is done
|
| Because of how phis work, in constructing this form, we
| immediately realize how to describe a value either in terms of
| its value in the first iteration (this is the value it takes
| coming from the preheader), or in terms _solely_ of its value
| in the previous iteration. In other words, %val.next = f(%val)
| for some function f, which we can easily identify in the source
| code.
|
| Now, if we can compute the repeated composition of f easily, we
| can replace the recurrence in the loop with repeated
| composition. Repeated addition of the same value is
| multiplication, and repeated multiplication of the same value
| is exponentiation--two very easy cases. We
| recognize that there is no use of the value except its final
| occurrence, and so instead of doing the loop, we can generate
| the composition itself. Thus, after N iterations of the loop,
| we can conclude that %val.next = f^N(0), and if we can identify
| the loop iteration count (that's %N in this case), we can
| simply write out f^N instead of having to compute every
| individual value.
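|
| For the xor recurrence above, f is just boolean negation, so
| applying f twice cancels out and f^N(0) only depends on whether N
| is odd. A hand-written sketch of what the final value could be
| rewritten to (reusing %N from the example, other names made up):
|
|       %parity    = and i32 %N, 1           ; low bit of the trip count
|       %val.final = icmp ne i32 %parity, 0  ; f^N(0) = parity of N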
| vnorilo wrote:
| Well, basically it's about converting iteratively computed
| state into a function of the iteration count.
|
| Consider indvar[n] = indvar[n-1] + k, indvar[0] = a. It follows
| that indvar[n] = a + n * k.
|
| This is the type of indvar canonicalization that can zap loops
| such as the one in the article. Suddenly the final state is
| just a function of the loop trip count.
|
| The return value is replaced with the canonicalized indvar.
| That makes the loop itself dead (nobody observes any of its
| effects), so it is removed by a subsequent dead code
| elimination pass.
|
| My example transforms addition to multiplication. Another
| common one is multiplication into power. That's close to what's
| going on in the article example: xor is just a multiplication
| of signed one-bit integers.
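|
| A minimal sketch of that rewrite in IR terms (hypothetical, with
| made-up names %a, %k, %n): inside the loop the induction variable
| is a recurrence,
|
|       %sum      = phi i32 [ %a, %preheader ], [ %sum.next, %body ]
|       %sum.next = add i32 %sum, %k
|
| and once the trip count %n is known, the value after the loop can
| be computed directly as a + n * k:
|
|       %step      = mul i32 %n, %k
|       %sum.final = add i32 %a, %step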
| benmmurphy wrote:
| the induction step is pretty cool. it will remove the loop
| calculating the sum of an arithmetic progression and replace it
| with just multiplies/shifts/subtracts and adds.
| contravariant wrote:
| Well in this case it seems reasonable that they simply added a
| rule for the specific line
| 
|       even = !even
| 
| and the line
| 
|       numberCompare++
| 
| resulting in
| 
|       even = (indvar & 1)
|       numberCompare = indvar
|
| You can then solve from the exit condition that number ==
| numberCompare, which gives you the indvar which can then be
| substituted into 'even'.
|
| I'm not saying it isn't magical, and certainly requires some
| symbolic solving, but it's doable.
|
| Of course the real goal is to eventually get it to optimize
| 
|       while (number != 1) {
|           if (number & 1)
|               number = number * 3 + 1;
|           number >>= 1;
|       }
| [deleted]
| onedognight wrote:
| Maybe LLVM should add a Collatz[0] optimization pass that
| would replace your loop with
| 
|       number = 1
| 
| for at least up to 64-bit types?
|
| [0] https://en.m.wikipedia.org/wiki/Collatz_conjecture
| missblit wrote:
| > Of course the real goal is to eventually get it to optimize
| `while (number != 1) { ... }`
|
| Valid C++ programs are guaranteed to be able to make forward
| progress [1].
|
| So if `number` is a normal stack allocated integer (and not
| volatile, etc), then the infinite looping case is undefined
| behavior here.
|
| So it would be a valid optimization to transform this to
| `number = 1;`.
|
| (And indeed: https://godbolt.org/z/eodhfWe6h )
|
| [1] https://en.cppreference.com/w/cpp/language/memory_model
| secondcoming wrote:
| Amazing, but crazy.
| [deleted]
| geofft wrote:
| This appears to be the source code:
| https://github.com/llvm/llvm-project/blob/main/llvm/lib/Tran...
| It's long, but it looks well-commented! Some Googling also
| finds
| https://llvm.org/devmtg/2009-10/ScalarEvolutionAndLoopOptimi...
| which looks relevant.
|
| I think "induction" here isn't exactly the sense of proofs by
| mathematical induction as you learn in school (as in proving
| something for all positive integers given a base case and a
| recursive case) - I think all they mean by it is "a loop, over
| a single integer." Quoting
| https://en.wikipedia.org/wiki/Induction_variable :
|
| > _In computer science, an induction variable is a variable
| that gets increased or decreased by a fixed amount on every
| iteration of a loop or is a linear function of another
| induction variable._
|
| The description at
| https://www.cs.cornell.edu/courses/cs6120/2019fa/blog/ive/
| makes it clearer:
|
| > _An induction variable is any variable whose value can be
| represented as a function of: loop invariants; the number of
| loop iterations that have executed; and other induction
| variables._
|
| That is to say, the link to proofs-by-induction is that if you
| have a loop like "for (i = 0; i < 10; i++)", you can prove the
| statement "after N loops, i = N". Why? Because after 0 loops, i
| = 0 (the first part), and on each loop, i is one greater (the
| i++ part), and nothing else that happens in the loop changes i
| (which is what that post means by "loop invariants").
|
| In other words, it's not that the compiler is _detecting_ that
| the programmer is doing induction (the programmer is not, in
| fact the programmer is simply writing a for loop and not
| thinking about it), it's that the compiler is _performing_
| induction on the loop.
|
| So a simple case is if you have a loop like
| 
|       int i;
|       for (i = 0; i < 10; i++)
|           printf("Hi!\n");
|       printf("i = %d\n", i);
|
| You can prove that at the end of the loop, i = 10, and you
| don't actually need to increment i during each iteration to
| get there. And if you didn't have that printf (i.e., if the
| loop body was empty), you could optimize the loop out entirely
| and just print out 10.
|
| So it's not terribly magical that LLVM is able to turn
| 
|       while (number != numberCompare) {
|           even = !even;
|           numberCompare++;
|       }
| 
| into, in pseudocode,
| 
|       numberCompare = number;
|       even = !!!...!!!even (numberCompare times);
|
| i.e., that it's able to get numberCompare out of the loop, to
| conclude that even in turn is only a function of numberCompare,
| and then that there's no loop left.
|
| What's more magical to me, I think, is that it's able to make
| sense of "run the ! operation numberCompare times" and turn
| that into "if (numberCompare % 2 == 1) even = !even". I don't
| think that's too surprising either - a compiler should know
| that, for booleans, !!foo can be simplified to foo. But
| repeatedly applying that rule across an _abstract_ number of
| calls to !!, plus possibly one more, seems pretty clever.
|
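| For the article's loop, that abstract "apply ! numberCompare times"
| boils down to flipping the starting value by the parity of the
| count. A hand-written sketch of the end state (made-up names, not
| the actual pass output):
|
|       %parity = trunc i32 %number to i1     ; 1 if %number is odd
|       %even   = xor i1 %even.init, %parity  ; flip the initial value on odd counts
|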
| If I'm skimming the code right, that part of the simplification
| is a different optimization pass entirely. I don't know off
| hand where that's implemented. I'd be curious if someone can
| find it.
| CalChris wrote:
| These optimizations go from LLVM IR to LLVM IR. So basically
| every backend benefits. I don't think most backend engineers
| would even understand them. I don't.
| dragontamer wrote:
| None of these steps are too hard to think about in SSA form (a
| specific transformation of code that LLVM does near the
| beginning of its analysis).
|
| So the key is to first understand SSA form. After that, a lot
| of optimization passes become obvious to think about.
|
| ---------
|
| SSA has two bits to understand:
|
| 1. Static single assignment -- variables can be assigned, but
| once assigned they can never change. (This part is easy).
|
| 2. Graph-based, and therefore requires the 'Phi' function to
| understand -- a variable may take on two different values
| depending on which path control flow took. EX: "r0" may come from
| Node#1 on one path, or from Node#2 on another. r0 may have been
| renamed %1 in #1, and %2 in #2. "Phi" is a "fake function" that
| resolves the ambiguity and kinda makes the SSA code "real" again.
|
| Hmm, maybe that's a bad explanation. But whatever, my point is
| that "Phi" is the concept you want to focus on in
| SSA. Phi is the harder part, but it's still not that hard to
| understand.
|
| Once you get it, the whole form "clicks" in your brain and you
| suddenly understand how optimizations can become possible.
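|
| A tiny hand-written example (block and value names made up): both
| arms of an if/else define the "same" source variable, and the Phi
| at the join point picks whichever definition actually reached it:
|
|       then:
|         %a = add i32 %n, 1
|         br label %merge
|       else:
|         %b = mul i32 %n, 2
|         br label %merge
|       merge:
|         ; %x is %a if control came from %then, %b if from %else
|         %x = phi i32 [ %a, %then ], [ %b, %else ]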
| jcranmer wrote:
| Another explanation of SSA:
|
| SSA is static single assignment. Every variable has a single
| point of declaration. If you want to change the value of a
| variable (say, i = i + 1), you instead have to create a new
| variable. The advantage of this form is that tracking the
| definition of a variable is trivial; in LLVM IR, to use a
| variable, you literally pass in a pointer to the instruction
| that defined it. The inverse, finding all uses of a value, is
| done by building a use-list: whenever you create an
| instruction, it adds itself to the use-list of all the values
| it's using, which allows easy iteration of the users of a
| value.
|
| In short, SSA gives you a trivially correct-by-construction
| way to understand the dataflow of a program (at least, the data
| which lives in registers; memory is more complicated).
| There's no separate analysis that has to be maintained
| throughout optimizations, and hence no extra source of
| bugs (as there is for failure to maintain the dominator
| tree through CFG transformations, a historically bug-prone
| step in LLVM).
|
| Now, as you may notice, there are cases where it would seem
| hard to pin down a _single_ definition of a variable:
| conditional assignment. SSA solves this by creating Phi
| values. Phi values can be better thought of as basic block
| arguments: when you enter a basic block, you must provide a
| value for each of the phi values in that basic block. The
| actual value of the phi is dependent on which edge you used
| to enter a basic block, or in other words, the control
| dependencies of the code.
|
| In short, phis are the intersection of control flow in the
| code with the dataflow--and they are the only things in that
| intersection.
| mhh__ wrote:
| LLVM's IR has linear basic blocks inside a function-level graph;
| there are IRs which are graphs all the way down as well.
| woodruffw wrote:
| > I don't think most backend engineers would even understand
| them. I don't.
|
| Most engineers are intimidated by compilers, but they really
| shouldn't be! LLVM's source code is _very_ approachable and
| well-isolated by concern, and most of the fundamental concepts
| in an optimizing compiler are understandable with a beginner's
| understanding of data structures, assembly, and resource
| allocation.
|
| One slightly funny thing, though: LLVM's optimizations are on
| IR, so every frontend can _theoretically_ benefit from them.
| However, many assume patterns originating from the C/C++
| frontend or heavy canonicalization, so you need to be a
| _little_ careful as a language frontend developer to take full
| advantage of all optimizations!
| jagger27 wrote:
| > so you need to be a little careful as a language frontend
| developer to take full advantage of all optimizations!
|
| So does this mean that it's possible to express exotic
| patterns in LLVM-IR that aren't typically produced when
| compiling C or C++?
|
| Are Rust's optimizer passes significantly different than
| clang's? I would guess borrow checking has a special pass
| that wouldn't be applicable to C.
|
| This topic is really fascinating to me!
| CalChris wrote:
| I'm actually working on an LLVM backend. I'd read the Dragon
| Book before, long before, but I think my real education was
| the LLVM Developer Meeting tutorials.
|
| LLVM is pass oriented but broadly there are front ends
| (Clang, Rust, Fortran), a middle end (LLVM IR SSA
| optimizations) and multiple in-tree and out-of-tree backends.
| I was a little resentful of having to write all of 5 lines or
| so inside of Clang. Otherwise, I happily live in the backend.
| Doing a merge with a new release is an hour or so of conflict
| resolution.
|
| The pass structure importantly means that you can test your
| backend code, even at the GlobalISel subpass level, without
| needing to understand Clang, LLVM, .... Pretty sweet if you
| ask me.
| mhh__ wrote:
| LLVM is approachable but some things are no better than GCC,
| I find. In particular I have noticed there seems to be very
| little high level documentation for the instruction
| scheduler.
|
| I find GCC machine def files actually slightly easier to read
| than LLVM's.
| srcreigh wrote:
| Every computer science student in Waterloo learns about this in
| 2nd year (i.e. age ~19, in some cases ~1 year after learning to
| code). The course is called CS241. You build a compiler from
| scratch including tokenizer, assembler, parsing, type checking,
| and optimizations. The class ends with an optional code size
| optimization contest.
|
| Here are some publicly available notes; search for "Optimization".
| The passes discussed in the article are register allocation and
| constant propagation.
|
| http://anthony-zhang.me/University-Notes/CS241/CS241.html
| CalChris wrote:
| Your notes, while honestly very good, don't cover data flow
| analysis or _Static Single Assignment_ (SSA), which is what
| LLVM IR [1] uses and which are the tools that the article's
| optimizations are based on. Hell, LLVM backend engineers
| rarely even look at IR. We mostly look at Generic Machine IR
| [2].
|
| [1] https://llvm.org/docs/LangRef.html
|
| [2] https://llvm.org/docs/GlobalISel/GMIR.html
| dleslie wrote:
| Interesting, in my day SFU kept this til 3rd year, as CMPT379
| IIRC, and only as an option for Theoretical or Systems paths.
|
| Looks like it's still 379; not sure about the optional
| component.
|
| http://anoopsarkar.github.io/compilers-class/
| macintux wrote:
| Decades ago I participated in some group programming contest
| for universities (in person, not online). We did terribly,
| but mainly I remember being in awe of how quickly Waterloo
| was solving the challenges.
| martincmartin wrote:
| _you get a linear time O(n) isEven function_
|
| In complexity theory, the size of a problem is the number of bits
| needed to represent the input, so if the input integer is _i_, then
| the size is _n = log_2(i)_, and the algorithm is actually
| exponential in the number of bits it takes to represent the input.
| trias wrote:
| is complexity theory constrained to binary representations? Why
| should it be?
| dragontamer wrote:
| In a strict sense, you're right.
|
| But in practice, you're not. Case in point: matrix
| multiplication is commonly quoted as O(n^3) (naive), when in
| fact, the amount of data used is O(n^2), and therefore it should
| be quoted as O(size^1.5) with respect to data-size.
| (EDIT: was bad at math for a second).
|
| But no, people mean "n-by-n matrix" has "O(n^3) naive"
| implementation. In practice, the "n" is just whatever is most
| convenient for a problem.
|
| In many cases, n is proportional to the input size. But in
| other cases (such as the famous matrix multiplication
| examples), n is the sqrt(input-size).
| thrasumachos wrote:
| Nice that it even provides the right answer for negative numbers
| (via undefined behavior), but only when optimizations are enabled!
___________________________________________________________________
(page generated 2021-08-17 23:00 UTC)