[HN Gopher] Meta LLM Compiler: neural optimizer and disassembler
___________________________________________________________________
Meta LLM Compiler: neural optimizer and disassembler
Author : foobazgt
Score : 122 points
Date : 2024-06-28 11:12 UTC (11 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| HanClinto wrote:
| Huh. This is a very... "interesting" application for an LLM. I'm
| not the brightest crayon in the box, but if anyone else would
| like to follow along with my non-expert opinion as I read through
| the paper, here's my take on it.
|
| It's pretty important for compilers / decompilers to be reliable
| and accurate -- compilers behaving in a deterministic and
| predictable way is a fundamental requirement of build pipelines.
|
| LLMs are inherently unpredictable, and so using an LLM for
| compilation / decompilation -- even an LLM that has 99.99%
| accuracy -- feels a bit odd to include as a piece in my build
| pipeline.
|
| That said, let's look at the paper and see what they did.
|
| They essentially started with CodeLlama, and then went further to
| train the model on three tasks -- one primary, and two
| downstream.
|
| The first task is compilation: given input code and a set of
| compiler flags, can we predict the output assembly? Given the
| inability to verify correctness without using a traditional
| compiler, this feels like it's of limited use on its own.
| However, training a model on this as a primary task enables a
| couple of downstream tasks. Namely:
|
| The second task (and first downstream task) is compiler flag
| tuning: predicting flags that optimize for smaller assembly
| sizes. It's a bit disappointing that they only seem to
| be able to optimize for assembly size (and not execution speed),
| but it's not without its uses. Because the output of this task
| (compiler flags) is then passed to a deterministic function (a
| traditional compiler), the instability of the LLM is mitigated.
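|
| (To make that mitigation concrete, here's a rough sketch of my
| own -- not from the paper -- of how a predicted pass list might
| be consumed; `suggest_passes` is a hypothetical stand-in for
| querying the model, and it assumes LLVM's `opt` is on PATH.)
|
|     import subprocess
|
|     def suggest_passes(ir_path: str) -> str:
|         # Hypothetical stand-in for querying LLM Compiler;
|         # returns a pass pipeline string such as "default<Oz>".
|         return "default<Oz>"
|
|     def optimize(ir_path: str, out_path: str) -> None:
|         passes = suggest_passes(ir_path)
|         # The LLM only picks *which* passes run; a traditional,
|         # deterministic compiler (LLVM's opt) does the actual
|         # transformation, so a bad suggestion can cost code size
|         # but the output for a given pass list stays stable.
|         subprocess.run(
|             ["opt", f"-passes={passes}", ir_path, "-o", out_path],
|             check=True,
|         )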
|
| The third task (second downstream task) is decompilation. This is
| not the first time that LLMs have been trained to do better
| decompilation -- however, because of the pretraining that they
| did on the primary task, they feel that this provides some
| advantages over previous approaches. Sadly, they only compare LLM
| Compiler to Code Llama and GPT-4 Turbo, and not against any other
| LLMs fine-tuned for the decompilation task, so it's difficult to
| see in context how much better their approach is.
|
| Regarding the verifiability of the disassembly approach, the
| authors note that there are issues regarding correctness. So the
| authors employ round-tripping -- recompiling the decompiled code
| (using the same compiler flags) to verify correctness / exact-
| match. This still puts accuracy at only around 45% (if I
| understand their reported numbers), so it's not entirely
| trustworthy yet, but it might still be useful (especially if
| used alongside a traditional decompiler, with this model's
| outputs only used when they are verifiably correct).
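|
| (For what it's worth, here's a rough sketch -- mine, not the
| paper's code -- of what that round-trip check looks like,
| assuming the model emits LLVM-IR and clang is on PATH.)
|
|     import subprocess
|
|     def round_trip_ok(original_asm: str, predicted_ir: str,
|                       flags: list[str]) -> bool:
|         # Recompile the model's predicted IR with the same flags
|         # and accept it only on an exact assembly match.
|         with open("pred.ll", "w") as f:
|             f.write(predicted_ir)
|         subprocess.run(["clang", *flags, "-S", "pred.ll",
|                         "-o", "pred.s"], check=True)
|         with open("pred.s") as f:
|             return f.read() == original_asm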
|
| Overall I'm happy to see this model be released as it seems like
| an interesting use-case. I may need to read more, but at first
| blush I'm not immediately excited by the possibilities that this
| unlocks. Most of all, I would like to see it explored whether
| these methods could be extended to optimize for performance --
| not just size of assembly.
| riedel wrote:
| It is normally not a necessary feature of a compiler to be
| deterministic. A compiler should be correct against a
| specification. If the specification allows nondeterminism, a
| compiler should be able to exploit it. I remember the story
| of the Sather-K compiler that did things differently based on
| the phase of the moon.
| munificent wrote:
| It's technically correct that a language specification is
| rarely precise enough to require compiler output to be
| deterministic.
|
| But it's pragmatically true that engineers will want to
| murder you if your compiler is non-deterministic. All sorts
| of build systems, benchmark harnesses, supply chain
| validation tools, and other bits of the surrounding ecosystem
| will shit the bed if the compiler doesn't produce bitwise
| identical output on the same input and compiler flags.
| foobazgt wrote:
| Can vouch for this, having fixed non-determinism bugs in a
| compiler. Nobody is happy if your builds aren't
| reproducible. You'll also suffer crazy performance problems
| as everything downstream rebuilds randomly and all your
| build caches randomly miss.
| SushiHippie wrote:
| NixOS with its nixpkgs [0] and cache [1] would also not
| work if compilers weren't reproducible. Though they won't
| use something like PGO or some specific optimization
| flags as these would very likely lead to unreproducible
| builds. For example, most distros ship a PGO-optimized build
| of Python, while NixOS does not.
|
| [0] https://github.com/nixos/nixpkgs
|
| [1] https://cache.nixos.org/
| boomanaiden154 wrote:
| PGO can be used in such situations, but the profile needs
| to be checked in. Same code + same profile -> same binary
| (assuming the compiler is deterministic, which is tested
| quite extensively).
|
| There are several big projects that use PGO (like
| Chrome), and you can get a deterministic build at
| whatever revision using PGO as the profiles are checked
| in to the repository.
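|
| As a toy illustration (not Chrome's actual setup, and the file
| names are placeholders), the determinism is easy to check:
| compile twice against the same checked-in profile and compare
| hashes.
|
|     import hashlib
|     import subprocess
|
|     def build(out: str) -> str:
|         # prof.profdata is checked in; same source + same
|         # profile should yield a bit-identical object file.
|         subprocess.run(["clang", "-O2",
|                         "-fprofile-use=prof.profdata",
|                         "-c", "foo.c", "-o", out], check=True)
|         return hashlib.sha256(
|             open(out, "rb").read()).hexdigest()
|
|     assert build("a.o") == build("b.o")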
| pton_xd wrote:
| > It is normally not a necessary feature of a compiler to be
| deterministic. A compiler should be correct against a
| specification.
|
| That sounds like a nightmare. Optimizing code to play nice
| with black-box heuristic compilers like V8's TurboFan is
| already, in fact, a continual maintenance nightmare.
|
| If you don't care about performance, non-deterministic
| compilation is probably "good enough." See TurboFan.
| stavros wrote:
| LLMs are deterministic. We inject randomness after the fact,
| just because we don't like our text being deterministic. Turn
| temperature to 0 and you're good.
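|
| E.g. with a Hugging Face causal LM (the model id below is just
| a placeholder), greedy decoding gives you the repeatable
| behaviour:
|
|     from transformers import (AutoModelForCausalLM,
|                               AutoTokenizer)
|
|     name = "facebook/llm-compiler-7b"  # placeholder id
|     tok = AutoTokenizer.from_pretrained(name)
|     model = AutoModelForCausalLM.from_pretrained(name)
|
|     inputs = tok("define i32 @f(i32 %x) {",
|                  return_tensors="pt")
|     # do_sample=False is greedy decoding, i.e. temperature 0:
|     # no sampling noise, so the same prompt gives the same
|     # tokens on the same setup.
|     out = model.generate(**inputs, max_new_tokens=64,
|                          do_sample=False)
|     print(tok.decode(out[0]))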
| vessenes wrote:
| Thank you for the summary. My memory of SOTA on disassembly
| about a year ago was sub-30% accuracy, so this is a
| significant step forward.
|
| I do think the idea of a 90%+-ish forward and backward
| assembler LLM is pretty intriguing. There are bound to be a lot
| of uses for it, especially if you're of the mind that to get
| there it would have to have learned a lot about computers in
| the foundation model training phase.
|
| Like, you'd definitely want to have those weights somehow baked
| into a typical coding assistant LLM. You'd also be able to
| automate round one of a lot of historical archiving projects
| that would like to get compilable modern code but only have a
| binary, or turn some PDP-1 code into something that would
| compile on a modern machine, ... You could probably leverage it
| into building chip simulations / code easily, and it might be
| really useful for writing Verilog (maybe). Anyway, the use
| cases seem pretty broad to me.
| boomanaiden154 wrote:
| Sure, performance is more interesting, but it's significantly
| harder.
|
| With code size, you just need to run the code through the
| compiler and you have a deterministic measurement for
| evaluation.
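|
| (Rough sketch, assuming opt/llc/llvm-size on PATH: scoring a
| candidate pass ordering is just a compile plus a size read.)
|
|     import subprocess
|
|     def text_size(ir_path: str, passes: str) -> int:
|         # Deterministic score for one candidate ordering:
|         # run opt + llc, then read the text section size.
|         subprocess.run(["opt", f"-passes={passes}", ir_path,
|                         "-o", "opt.bc"], check=True)
|         subprocess.run(["llc", "-filetype=obj", "opt.bc",
|                         "-o", "opt.o"], check=True)
|         out = subprocess.run(["llvm-size", "opt.o"],
|                              capture_output=True, text=True,
|                              check=True).stdout
|         return int(out.splitlines()[1].split()[0])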
|
| Performance has no such metric. Benchmarks are expensive and
| noisy. Cost models seem like a promising direction, but they
| aren't really there yet.
| mmphosis wrote:
| https://xcancel.com/AIatMeta/status/1806361623831171318
| chad1n wrote:
| As usual, Twitter is impressed by this, but I'm very skeptical:
| the chance of it breaking your program is pretty high. The
| thing that makes optimizations so hard to write is that they
| have to match the behavior of the unoptimized code (unless you
| have UB), which is something LLMs will probably struggle with,
| since they can't exactly understand the code and its execution
| tree.
| ramon156 wrote:
| Let's be real, at least 40% of those comments are bots
| extheat wrote:
| People simply have no idea what they're talking about. It's
| just jumping onto the latest hype train. My first impression,
| going by the name, was that it was actually some sort of
| compiler in and of itself -- i.e. programming language in,
| pure machine code or some other IR out. It's got bits and
| pieces of that here and there, but that's not what it really
| is at all. It's more of a predictive engine for an optimizer,
| and not a very generalized one at that.
|
| What would be more interesting is training a large model on
| pure (code, assembly) pairs like a normal translation task.
| Presumably a very generalized model would be good at even
| doing the inverse: given some assembly, write code that will
| produce the given assembly. Unlike human language there is a
| finite set of possible correct answers here and you have the
| convenience of being able to generate synthetic data for
| cheap. I think optimizations would arise as a natural side
| effect this way: if there are multiple trees of possible
| generations (like choosing between logits in an LLM), you
| could try different branches to see which is smaller in terms
| of bytecode or faster in terms of execution.
| quonn wrote:
| > Presumably a very generalized model would be good at even
| doing the inverse: given some assembly, write code that
| will produce the given assembly.
|
| ChatGPT does this, unreliably.
| hughleat wrote:
| It can emulate the compiler (IR + passes -> IR or ASM).
|
| > What would be more interesting is training a large model
| on pure (code, assembly) pairs like a normal translation
| task.
|
| It is that.
|
| > Presumably a very generalized model would be good at even
| doing the inverse: given some assembly, write code that
| will produce the given assembly.
|
| It has been trained to disassemble. It is much, much better
| than other models at that.
| solarexplorer wrote:
| If I understand correctly, the AI is only choosing the
| optimization passes and their relative order. Each individual
| optimization step would still be designed and verified
| manually, and maybe even proven to be correct mathematically.
| boomanaiden154 wrote:
| Right, it's only solving phase ordering.
|
| In practice though, correctness even over ordering of hand-
| written passes is difficult. Within the paper they describe a
| methodology to evaluate phase orderings against a small test
| set as a smoke test for correctness (PassListEval) and
| observe that ~10% of the phase orderings result in assertion
| failures/compiler crashes/correctness issues.
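|
| Roughly (my own sketch, not the paper's code), that kind of
| smoke test amounts to building a small suite twice -- once as
| a reference, once with the candidate ordering -- and rejecting
| the ordering on any crash or output mismatch:
|
|     import subprocess
|
|     def smoke_test(passes: str, srcs: list[str]) -> bool:
|         for src in srcs:
|             try:
|                 # Reference build at -O0, candidate build via
|                 # the suggested pass ordering over the IR.
|                 subprocess.run(["clang", "-O0", src,
|                                 "-o", "ref"], check=True)
|                 subprocess.run(["clang", "-O0", "-emit-llvm",
|                                 "-c", src, "-o", "t.bc"],
|                                check=True)
|                 subprocess.run(["opt", f"-passes={passes}",
|                                 "t.bc", "-o", "t.opt.bc"],
|                                check=True)
|                 subprocess.run(["clang", "t.opt.bc",
|                                 "-o", "cand"], check=True)
|             except subprocess.CalledProcessError:
|                 return False  # crash / assertion failure
|             ref = subprocess.run(["./ref"],
|                                  capture_output=True)
|             cand = subprocess.run(["./cand"],
|                                   capture_output=True)
|             if ref.stdout != cand.stdout:
|                 return False  # behavioural mismatch
|         return True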
|
| You will end up with a lot more correctness issues adjusting
| phase orderings like this than you would using one of the
| more battle-tested default optimization pipelines.
|
| Correctness in a production compiler is a pretty hard
| problem.
| hughleat wrote:
| There are two models.
|
| - The foundation model is pretrained on ASM and IR, then
| trained to emulate the compiler (IR + passes -> IR or ASM).
|
| - The FTD model is fine-tuned for solving phase ordering and
| disassembling.
|
| FTD is there to demo capabilities. We hope people will fine-
| tune for other optimisations. It will be much, much cheaper
| than starting from scratch.
|
| Yep, correctness in compilers is a pain. Auto-tuning is a
| very easy way to break a compiler.
| Lockal wrote:
| As this LLM operates on the LLVM intermediate representation
| language, the result can be fed into
| https://alive2.llvm.org/ce/ and formally verified. For those
| who don't know what to put there: here is an example of the C++
| spaceship operator: https://alive2.llvm.org/ce/z/YJPr84 (try
| replacing -1 with -2 there to break it). This is a kind of
| Swiss Army knife for LLVM developers; they often start
| optimizations with this tool.
|
| What they missed is mentioning verification (they probably
| don't know about alive2) and a comparison with other compilers.
| It is very likely that LLM Compiler "learned" from GCC and,
| with huge computational effort, simply generates what GCC can
| do out of the box.
| boomanaiden154 wrote:
| I'm reasonably certain the authors are aware of alive2.
|
| The problem with using alive2 to verify LLM based compilation
| is that alive2 isn't really designed for that. It's an
| amazing tool for catching correctness issues in LLVM, but
| it's expensive to run and will time out reasonably often,
| especially on cases involving floating point. It's explicitly
| designed to minimize the rate of false-positive correctness
| issues to serve the primary purpose of alerting compiler
| developers to correctness issues that need to be fixed.
| hughleat wrote:
| Yep, we tried it :-) These were exactly the problems we had
| with it.
| boomanaiden154 wrote:
| I'm not sure it's likely that the LLM here learned from gcc.
| The size optimization work here is focused on learning phase
| orderings for LLVM passes/the LLVM pipeline, which wouldn't
| be at all applicable to gcc.
|
| Additionally, they train approximately half on assembly and
| half on LLVM-IR. They don't talk much about how they generate
| the dataset other than that they generated it from the
| CodeLlama dataset, but I would guess they compile as much
| code as they can into LLVM-IR and then just lower that into
| assembly, leaving gcc out of the loop completely for the vast
| majority of the compiler specific training.
| hughleat wrote:
| Yep! No GCC on this one. And yep, that's not far off how
| the pretraining data was gathered - but with random
| optimisations to give it a bit of variety.
| boomanaiden154 wrote:
| Do you have more information on how the dataset was
| constructed?
|
| It seems like build systems were somehow invoked, given
| the different targets present in the final version?
|
| Was it mostly C/C++ (if so, how did you resolve missing
| includes/build flags), or something else?
| moffkalast wrote:
| > C++ spaceship operator
|
| > (A <=> B) < 0 is true if A < B
|
| > (A <=> B) > 0 is true if A > B
|
| > (A <=> B) == 0 is true if A and B are equal/equivalent.
|
| TIL of the spaceship operator. Was this added as an April
| Fools' joke?
| samatman wrote:
| This is one of the oldest computer operators in the game:
| the arithmetic IF statement from FORTRAN.
|
| It's useful for stable-sorting collections with a single
| test. Also, overloading <=> for a type gives all the
| comparison operators "for free": ==, !=, <, <=, >=, >
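|
| In Python terms (just an analogy -- <=> itself is C++20), the
| same idiom looks like this:
|
|     from functools import cmp_to_key
|
|     def spaceship(a, b):
|         # Arithmetic three-way compare: negative, zero, or
|         # positive, like strcmp or FORTRAN's arithmetic IF.
|         return (a > b) - (a < b)
|
|     print(spaceship(1, 2))   # -1
|     print(sorted(["b", "a"], key=cmp_to_key(spaceship)))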
| bbor wrote:
| AFAIK this is a heuristic, not a category. The underlying
| grammar would be preserved.
|
| Personally I thought we were way too close to perfect to make
| meaningful progress on compilation, but that's probably just
| naivete
| boomanaiden154 wrote:
| I would not say we are anywhere close to perfect in
| compilation.
|
| Even just looking at inlining for size, there are multiple
| recent studies showing ~10+% improvement
| (https://dl.acm.org/doi/abs/10.1145/3503222.3507744,
| https://arxiv.org/abs/2101.04808).
|
| There is a massive amount of headroom, and even tiny bits
| still matter as ~0.5% gains on code size, or especially
| performance, can be huge.
| cec wrote:
| Hey! The idea isn't to replace the compiler with an LLM; the
| tech is not there yet. Where we see value is in using these
| models to guide an existing compiler, e.g. orchestrating
| optimization passes. That way the LLM won't break your code,
| nor will the compiler (to the extent that your compiler is free
| from bugs, which can be tricky to detect -- cf. Sec 3.1 of our
| paper).
| verditelabs wrote:
| I've done some similar LLM compiler work, obviously not on
| Meta's scale, teaching an LLM to do optimization by feeding
| an encoder/decoder model pairs of -O0 and -O3 code, and even
| at my small scale I managed to get the LLM to spit out the
| correct optimization every once in a while.
|
| I think there's a lot of value in LLM compilers being used
| specifically for superoptimization, where you can generate
| many candidate optimizations, verify their correctness, and
| pick the best one. I'm excited to see where y'all go with
| this.
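|
| Roughly the loop I have in mind (sample, verify and size are
| caller-supplied hooks -- model sampling, an equivalence check,
| and a size measurement -- nothing from Meta's release):
|
|     def superoptimize(asm, sample, verify, size, n=32):
|         # Sample many candidate rewrites, keep the ones that
|         # verify as equivalent, and return the smallest.
|         cands = [sample(asm) for _ in range(n)]
|         ok = [c for c in cands if verify(asm, c)]
|         return min(ok, key=size) if ok else asm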
| viraptor wrote:
| Thank you for freeing me from one of my to-do projects. I
| wanted to do a similar autoencoder with optimisations. Did
| you write about it anywhere? I'd love to read the details.
| verditelabs wrote:
| No writeup, but the code is here:
|
| https://github.com/SuperOptimizer/supercompiler
|
| There's code there to generate unoptimized / optimized
| pairs via C generators like yarpgen and csmith, then
| compile, train, run inference, and disassemble the results.
| moffkalast wrote:
| > The idea isn't to replace the compiler with an LLM, the
| tech is not there yet
|
| What do you mean the tech isn't there _yet_, why would it
| ever even go in that direction? We do those kinds of things
| for shits and giggles, but for any practical use? I mean,
| come on. From fast and reliable to glacial and not even
| working a quarter of the time.
|
| I guess maybe if all compiler designers die in a freak
| accident and there's literally nobody to replace them, then
| we'll have to resort to that after the existing versions
| break.
| ldjkfkdsjnv wrote:
| I love this company. Advancing AI and keeping the rest of us
| in the loop.
| LoganDark wrote:
| I hate the company (Facebook), but I still think their public
| release of so much of the research they've been doing (and the
| models they've been making) has been a net good for almost
| everybody, at least in terms of exploring the field of LLMs.
| ein0p wrote:
| My love for Meta is strictly confined to FAIR and the PyTorch
| team. The rest of the company is basically cancer.
| muglug wrote:
| Unlike many other AI-themed papers from Meta, this one omits
| any mention of the model's output being used at Instagram,
| Facebook, or elsewhere at Meta. Research is great! But it
| doesn't seem all that actionable today.
| boomanaiden154 wrote:
| This would be difficult to deploy as-is in production.
|
| There are correctness issues mentioned in the paper regarding
| adjusting phase orderings away from the well-trodden
| O0/O1/O2/O3/Os/Oz path. Their methodology works quite well for
| a research project, but I personally wouldn't trust it in
| production. While some obvious issues can be caught by a small
| test suite and unit tests, there are others that won't be, and
| that's really risky in production scenarios.
|
| There are also some practical software engineering concerns,
| like deploying the model inside the compiler. There is
| actually tooling in
| upstream LLVM to do this
| (https://www.youtube.com/watch?v=mQu1CLZ3uWs), but running
| models on a GPU would be difficult and I would expect CPU
| inference to massively blow up compile times.
| soist wrote:
| How do they verify the output preserves the semantics of the input?
| zitterbewegung wrote:
| Some previous work in the space is at
| https://github.com/albertan017/LLM4Decompile
| LoganDark wrote:
| Reading the title, I thought this was a tool for optimizing and
| disassembling LLMs, not an LLM designed to optimize and
| disassemble. Seeing it's just a model is a little disappointing
| in comparison.
| jameshart wrote:
| Pretty sure I remember trading 300 creds for a Meta Technologies
| Neural Optimizer and Disassembler in one of the early _Deus Ex_
| games.
___________________________________________________________________
(page generated 2024-06-28 23:00 UTC)