[HN Gopher] Meta LLM Compiler: neural optimizer and disassembler
       ___________________________________________________________________
        
       Meta LLM Compiler: neural optimizer and disassembler
        
       Author : foobazgt
       Score  : 122 points
       Date   : 2024-06-28 11:12 UTC (11 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | HanClinto wrote:
       | Huh. This is a very... "interesting" application for an LLM. I'm
       | not the brightest crayon in the box, but if anyone else would
       | like to follow along with my non-expert opinion as I read through
       | the paper, here's my take on it.
       | 
        | It's pretty important for compilers / decompilers to be reliable
        | and accurate -- deterministic, predictable compiler behavior is
        | a fundamental requirement of build pipelines.
       | 
       | LLMs are inherently unpredictable, and so using an LLM for
       | compilation / decompilation -- even an LLM that has 99.99%
       | accuracy -- feels a bit odd to include as a piece in my build
       | pipeline.
       | 
       | That said, let's look at the paper and see what they did.
       | 
       | They essentially started with CodeLlama, and then went further to
       | train the model on three tasks -- one primary, and two
       | downstream.
       | 
       | The first task is compilation: given input code and a set of
       | compiler flags, can we predict the output assembly? Given the
       | inability to verify correctness without using a traditional
       | compiler, this feels like it's of limited use on its own.
       | However, training a model on this as a primary task enables a
       | couple of downstream tasks. Namely:
       | 
       | The second task (and first downstream task) is compiler flag
       | prediction / optimization to predict / optimize for smaller
       | assembly sizes. It's a bit disappointing that they only seem to
       | be able to optimize for assembly size (and not execution speed),
        | but it's not without its uses. Because the output of this task
        | (compiler flags) is then passed to a deterministic function (a
        | traditional compiler), the instability of the LLM is mitigated.
       | 
       | The third task (second downstream task) is decompilation. This is
       | not the first time that LLMs have been trained to do better
       | decompilation -- however, because of the pretraining that they
       | did on the primary task, they feel that this provides some
       | advantages over previous approaches. Sadly, they only compare LLM
       | Compiler to Code Llama and GPT-4 Turbo, and not against any other
       | LLMs fine-tuned for the decompilation task, so it's difficult to
       | see in context how much better their approach is.
       | 
       | Regarding the verifiability of the disassembly approach, the
       | authors note that there are issues regarding correctness. So the
       | authors employ round-tripping -- recompiling the decompiled code
       | (using the same compiler flags) to verify correctness / exact-
        | match. This still puts accuracy at around 45% (if I understand
        | their reported numbers), so it's not entirely trustworthy yet,
        | but it might still be useful (especially alongside a traditional
        | decompiler, with this model's outputs only used when they are
        | verifiably correct).
       | 
       | Overall I'm happy to see this model be released as it seems like
       | an interesting use-case. I may need to read more, but at first
       | blush I'm not immediately excited by the possibilities that this
       | unlocks. Most of all, I would like to see it explored if these
       | methods could be extended to optimize for performance -- not just
       | size of assembly.
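The round-tripping verification described above can be sketched in a few lines of Python. `decompile_fn` and `compile_fn` are hypothetical stand-ins (here toy functions) for the model and a real compiler:

```python
# Round-trip verification sketch: a decompilation is accepted only if
# recompiling it (with the same flags) reproduces the original assembly
# exactly. `decompile_fn` and `compile_fn` are placeholders for the
# model and a real compiler, respectively.

def round_trip_ok(asm: str, decompile_fn, compile_fn) -> bool:
    source = decompile_fn(asm)
    return compile_fn(source) == asm

# Toy stand-ins: the "compiler" upper-cases, the "decompiler" lower-cases,
# so round-tripping reproduces the input exactly.
good = round_trip_ok("MOV R0, R1", lambda a: a.lower(), lambda s: s.upper())
# A broken "decompiler" fails the check.
bad = round_trip_ok("MOV R0, R1", lambda a: a, lambda s: s.upper() + "X")
```

In a real pipeline, only the decompilations that survive this check would be surfaced to the user.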
        
         | riedel wrote:
          | It is normally not a necessary feature of a compiler to be
          | deterministic. A compiler should be correct against a
          | specification. If the specification allows nondeterminism, a
          | compiler should be able to exploit it. I remember the story
          | of the sather-k compiler that did things differently based on
          | the phase of the moon.
        
           | munificent wrote:
           | It's technically correct that a language specification is
           | rarely precise enough to require compiler output to be
           | deterministic.
           | 
           | But it's pragmatically true that engineers will want to
           | murder you if your compiler is non-deterministic. All sorts
           | of build systems, benchmark harnesses, supply chain
           | validation tools, and other bits of surrounding ecosystem
           | will shit the bed if the compiler doesn't produce bitwise
           | identical output on the same input and compiler flags.
        
             | foobazgt wrote:
              | Can vouch for this, having fixed non-determinism bugs in
              | a compiler. Nobody is happy if your builds aren't
             | reproducible. You'll also suffer crazy performance problems
             | as everything downstream rebuilds randomly and all your
             | build caches randomly miss.
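The reproducibility property at issue -- same input, bitwise-identical output -- is typically checked with a digest comparison. A minimal sketch, with `build` as a toy stand-in for a real compiler invocation:

```python
import hashlib

# Minimal reproducibility check: building the same source twice must
# yield byte-identical artifacts. `build` is a placeholder for a real
# compiler invocation; here it is a toy deterministic "compiler".

def artifact_digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def reproducible(build, source: str) -> bool:
    # Build twice and compare digests of the artifacts.
    return artifact_digest(build(source)) == artifact_digest(build(source))

ok = reproducible(lambda s: s.encode()[::-1], "int main() { return 0; }")
```

Build caches and supply-chain tools effectively run this comparison at scale, which is why a non-deterministic compiler makes them miss or fail.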
        
               | SushiHippie wrote:
                | NixOS with its nixpkgs [0] and cache [1] would also not
                | work if compilers weren't reproducible. They won't use
                | something like PGO or certain optimization flags,
                | though, as these would very likely lead to
                | unreproducible builds. For example, most distros ship a
                | PGO-optimized build of Python while NixOS does not.
               | 
               | [0] https://github.com/nixos/nixpkgs
               | 
               | [1] https://cache.nixos.org/
        
               | boomanaiden154 wrote:
               | PGO can be used in such situations, but the profile needs
               | to be checked in. Same code + same profile -> same binary
               | (assuming the compiler is deterministic, which is tested
               | quite extensively).
               | 
               | There are several big projects that use PGO (like
               | Chrome), and you can get a deterministic build at
               | whatever revision using PGO as the profiles are checked
               | in to the repository.
        
           | pton_xd wrote:
           | > It is normally not a necessary feature of a compiler to be
           | determistic. A compiler should be correct against a
           | specification.
           | 
            | That sounds like a nightmare. Optimizing code to play nice
            | with black-box heuristic compilers like V8's TurboFan is
            | already, in fact, a continual maintenance nightmare.
           | 
           | If you don't care about performance, non-deterministic
           | compilation is probably "good enough." See TurboFan.
        
           | stavros wrote:
           | LLMs are deterministic. We inject randomness after the fact,
           | just because we don't like our text being deterministic. Turn
           | temperature to 0 and you're good.
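A toy sampler illustrates the claim: at temperature 0, decoding degenerates to an argmax over the logits, which is deterministic (a sketch only, not how production inference stacks are implemented):

```python
import math
import random

# Temperature-0 decoding is just argmax over the logits, so it is
# deterministic; sampling at temperature > 0 is where the randomness
# gets injected.

def next_token(logits, temperature=0.0, rng=random):
    if temperature == 0.0:
        # Greedy: always pick the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Otherwise sample from the softmax of temperature-scaled logits.
    weights = [math.exp(l / temperature) for l in logits]
    return rng.choices(range(len(logits)), weights=weights)[0]

logits = [0.1, 2.3, -1.0]
greedy = [next_token(logits, temperature=0.0) for _ in range(5)]
```

(One practical caveat: floating-point and batching effects in real serving stacks can still perturb temperature-0 outputs, but in principle the decoding rule itself is deterministic.)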
        
         | vessenes wrote:
          | Thank you for the summary. My memory of SOTA on disassembly
          | about a year ago was sub-30% accuracy, so this is a
          | significant step forward.
         | 
         | I do think the idea of a 90%+-ish forward and backward
         | assembler LLM is pretty intriguing. There's bound to be a lot
         | of uses for it; especially if you're of the mind that to get
         | there it would have to have learned a lot about computers in
         | the foundation model training phase.
         | 
          | Like, you'd definitely want those weights somehow baked into
          | a typical coding assistant LLM. You could also automate round
          | one of a lot of historical archiving projects that only have
          | a binary but would like compilable modern code -- turning
          | some PDP-1 code into something that compiles on a modern
          | machine. You could probably leverage it into building chip
          | simulations / code easily; it might be really useful for
          | writing Verilog (maybe). Anyway, the use cases seem pretty
          | broad to me.
        
         | boomanaiden154 wrote:
         | Sure, performance is more interesting, but it's significantly
         | harder.
         | 
         | With code size, you just need to run the code through the
         | compiler and you have a deterministic measurement for
         | evaluation.
         | 
         | Performance has no such metric. Benchmarks are expensive and
         | noisy. Cost models seem like a promising direction, but they
         | aren't really there yet.
        
       | mmphosis wrote:
       | https://xcancel.com/AIatMeta/status/1806361623831171318
        
       | chad1n wrote:
        | As usual, Twitter is impressed by this, but I'm very skeptical:
        | the chance of it breaking your program is pretty high. What
        | makes optimizations so hard is that they have to match the
        | behavior of the unoptimized code (unless you invoke UB), which
        | is something LLMs will probably struggle with since they can't
        | exactly understand the code and its execution tree.
        
         | ramon156 wrote:
         | Let's be real, at least 40% of those comments are bots
        
           | extheat wrote:
            | People simply have no idea what they're talking about. It's
            | just jumping on the latest hype train. My first impression,
            | going by the name, was that this was actually some sort of
            | compiler in and of itself -- i.e. programming language in,
            | and pure machine code or some other IR out. It's got bits
            | and pieces of that here and there, but that's not what it
            | really is at all. It's more of a predictive engine for an
            | optimizer, and not a very generalized one at that.
           | 
           | What would be more interesting is training a large model on
           | pure (code, assembly) pairs like a normal translation task.
           | Presumably a very generalized model would be good at even
           | doing the inverse: given some assembly, write code that will
           | produce the given assembly. Unlike human language there is a
           | finite set of possible correct answers here and you have the
           | convenience of being able to generate synthetic data for
            | cheap. I think optimizations would arise as a natural side
            | effect this way: if there are multiple trees of possible
            | generations (like choosing between logits in an LLM), you
            | could try different branches to see what's smaller in terms
            | of byte code or faster in terms of execution.
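The branch-and-pick idea above amounts to best-of-n selection: generate several candidates, keep only the ones that pass an equivalence check, and choose the smallest. A toy sketch, where `verify` stands in for a real equivalence test such as round-trip compilation:

```python
# Best-of-n selection over candidate outputs: filter candidates through
# an equivalence check, then pick the smallest survivor. `verify` is a
# placeholder for a real correctness test; here it just inspects the
# toy candidate strings.

def pick_smallest_valid(candidates, verify):
    valid = [c for c in candidates if verify(c)]
    return min(valid, key=len) if valid else None

# Toy candidates: two "correct" ones of different sizes, one broken one.
candidates = ["push;pop;ret", "ret", "retX"]
best = pick_smallest_valid(candidates, lambda c: c.endswith("ret"))
```

Swapping the `len` key for a benchmark score would turn this from size superoptimization into performance superoptimization, at the cost of a much noisier objective.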
        
             | quonn wrote:
             | > Presumably a very generalized model would be good at even
             | doing the inverse: given some assembly, write code that
             | will produce the given assembly.
             | 
             | ChatGPT does this, unreliably.
        
             | hughleat wrote:
             | It can emulate the compiler (IR + passes -> IR or ASM).
             | 
             | > What would be more interesting is training a large model
             | on pure (code, assembly) pairs like a normal translation
             | task.
             | 
             | It is that.
             | 
             | > Presumably a very generalized model would be good at even
             | doing the inverse: given some assembly, write code that
             | will produce the given assembly.
             | 
              | It has been trained to disassemble. It is much, much
              | better than other models at that.
        
         | solarexplorer wrote:
         | If I understand correctly, the AI is only choosing the
         | optimization passes and their relative order. Each individual
         | optimization step would still be designed and verified
         | manually, and maybe even proven to be correct mathematically.
        
           | boomanaiden154 wrote:
           | Right, it's only solving phase ordering.
           | 
           | In practice though, correctness even over ordering of hand-
           | written passes is difficult. Within the paper they describe a
           | methodology to evaluate phase orderings against a small test
           | set as a smoke test for correctness (PassListEval) and
           | observe that ~10% of the phase orderings result in assertion
           | failures/compiler crashes/correctness issues.
           | 
           | You will end up with a lot more correctness issues adjusting
           | phase orderings like this than you would using one of the
           | more battle-tested default optimization pipelines.
           | 
           | Correctness in a production compiler is a pretty hard
           | problem.
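The PassListEval-style filtering described here can be sketched as follows; `compile_with_passes` and `run_tests` are hypothetical stand-ins for a real compiler invocation and a correctness test suite:

```python
# Sketch of a PassListEval-style smoke test: a candidate pass ordering
# is kept only if compilation succeeds (no crash / assertion failure)
# and the compiled tests still pass. `compile_with_passes` and
# `run_tests` are hypothetical stand-ins.

def safe_orderings(candidates, compile_with_passes, run_tests):
    accepted = []
    for passes in candidates:
        try:
            binary = compile_with_passes(passes)
        except Exception:
            # Compiler crash or assertion failure: reject the ordering.
            continue
        if run_tests(binary):
            # Correctness smoke test passed: keep the ordering.
            accepted.append(passes)
    return accepted

# Toy stand-in: any ordering containing "bad-pass" crashes the "compiler".
def fake_compile(passes):
    if "bad-pass" in passes:
        raise RuntimeError("compiler crashed")
    return passes

accepted = safe_orderings(
    [["inline", "gvn"], ["bad-pass"], ["sroa"]],
    fake_compile,
    lambda binary: True,
)
```

As the paper's numbers suggest, a filter like this catches the loud failures (crashes, broken tests) but not subtle miscompiles that the small test suite never exercises.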
        
             | hughleat wrote:
             | There are two models.
             | 
              | - The foundation model is pretrained on ASM and IR, then
              | trained to emulate the compiler (IR + passes -> IR or
              | ASM).
              | 
              | - The FTD model is fine-tuned for solving phase ordering
              | and disassembling.
              | 
              | FTD is there to demo capabilities. We hope people will
              | fine-tune for other optimisations. It will be much, much
              | cheaper than starting from scratch.
             | 
             | Yep, correctness in compilers is a pain. Auto-tuning is a
             | very easy way to break a compiler.
        
         | Lockal wrote:
         | As this LLM operates on LLVM intermediate representation
         | language, the result can be fed into
         | https://alive2.llvm.org/ce/ and formally verified. For those
         | who don't know what to print there: here is an example of C++
         | spaceship operator: https://alive2.llvm.org/ce/z/YJPr84 (try to
         | replace -1 with -2 there to break). This is kind of a Swiss
         | knife for LLVM developers, they often start optimizations with
         | this tool.
         | 
         | What they missed is to mention verification (they probably
         | don't know about alive2) and comparison with other compilers.
         | It is very likely that LLM Compiler "learned" from GCC and with
         | huge computational effort simply generates what GCC can do out
         | of the box.
        
           | boomanaiden154 wrote:
           | I'm reasonably certain the authors are aware of alive2.
           | 
           | The problem with using alive2 to verify LLM based compilation
           | is that alive2 isn't really designed for that. It's an
           | amazing tool for catching correctness issues in LLVM, but
           | it's expensive to run and will time out reasonably often,
           | especially on cases involving floating point. It's explicitly
           | designed to minimize the rate of false-positive correctness
           | issues to serve the primary purpose of alerting compiler
           | developers to correctness issues that need to be fixed.
        
             | hughleat wrote:
             | Yep, we tried it :-) These were exactly the problems we had
             | with it.
        
           | boomanaiden154 wrote:
           | I'm not sure it's likely that the LLM here learned from gcc.
           | The size optimization work here is focused on learning phase
           | orderings for LLVM passes/the LLVM pipeline, which wouldn't
           | be at all applicable to gcc.
           | 
           | Additionally, they train approximately half on assembly and
           | half on LLVM-IR. They don't talk much about how they generate
           | the dataset other than that they generated it from the
           | CodeLlama dataset, but I would guess they compile as much
           | code as they can into LLVM-IR and then just lower that into
           | assembly, leaving gcc out of the loop completely for the vast
           | majority of the compiler specific training.
        
             | hughleat wrote:
             | Yep! No GCC on this one. And yep, that's not far off how
             | the pretraining data was gathered - but with random
             | optimisations to give it a bit of variety.
        
               | boomanaiden154 wrote:
               | Do you have more information on how the dataset was
               | constructed?
               | 
               | It seems like somehow build systems were invoked given
               | the different targets present in the final version?
               | 
               | Was it mostly C/C++ (if so, how did you resolve missing
               | includes/build flags), or something else?
        
           | moffkalast wrote:
           | > C++ spaceship operator
           | 
           | > (A <=> B) < 0 is true if A < B
           | 
           | > (A <=> B) > 0 is true if A > B
           | 
           | > (A <=> B) == 0 is true if A and B are equal/equivalent.
           | 
              | TIL of the spaceship operator. Was this added as an April
              | Fools' joke?
        
             | samatman wrote:
             | This is one of the oldest computer operators in the game:
             | the arithmetic IF statement from FORTRAN.
             | 
              | It's useful for stable-sorting collections with a single
              | test. Also, overloading <=> for a type gives all the
              | comparison operators "for free": ==, !=, <, <=, >=, >.
        
         | bbor wrote:
         | AFAIK this is a heuristic, not a category. The underlying
         | grammar would be preserved.
         | 
         | Personally I thought we were way too close to perfect to make
         | meaningful progress on compilation, but that's probably just
         | naivete
        
           | boomanaiden154 wrote:
           | I would not say we are anywhere close to perfect in
           | compilation.
           | 
           | Even just looking at inlining for size, there are multiple
           | recent studies showing ~10+% improvement
           | (https://dl.acm.org/doi/abs/10.1145/3503222.3507744,
           | https://arxiv.org/abs/2101.04808).
           | 
           | There is a massive amount of headroom, and even tiny bits
           | still matter as ~0.5% gains on code size, or especially
           | performance, can be huge.
        
         | cec wrote:
         | Hey! The idea isn't to replace the compiler with an LLM, the
         | tech is not there yet. Where we see value is in using these
         | models to guide an existing compiler. E.g. orchestrating
         | optimization passes. That way the LLM won't break your code,
            | nor will the compiler (to the extent that your compiler is
            | free from bugs, which can be tricky to detect - cf. Sec 3.1
            | of our paper).
        
           | verditelabs wrote:
            | I've done some similar LLM compiler work, obviously not at
            | Meta's scale, teaching an LLM to do optimization by feeding
            | pairs of -O0 and -O3 code to an encoder/decoder. Even at my
            | small scale, I managed to get the LLM to spit out the
            | correct optimization every once in a while.
           | 
           | I think there's a lot of value in LLM compilers to
           | specifically be used for superoptimization where you can
           | generate many possible optimizations, verify the correctness,
           | and pick the most optimal one. I'm excited to see where y'all
           | go with this.
        
             | viraptor wrote:
             | Thank you for freeing me from one of my to-do projects. I
             | wanted to do a similar autoencoder with optimisations. Did
             | you write about it anywhere? I'd love to read the details.
        
               | verditelabs wrote:
               | No writeup, but the code is here:
               | 
               | https://github.com/SuperOptimizer/supercompiler
               | 
               | There's code there to generate unoptimized / optimized
               | pairs via C generators like yarpgen and csmith, then
               | compile, train, inference, and disassemble the results
        
           | moffkalast wrote:
           | > The idea isn't to replace the compiler with an LLM, the
           | tech is not there yet
           | 
            | What do you mean the tech isn't there _yet_ -- why would it
            | ever even go in that direction? We do those kinds of things
            | for shits and giggles, but for any practical use? I mean,
            | come on. From fast and reliable to glacial and not even
            | working a quarter of the time.
           | 
           | I guess maybe if all compiler designers die in a freak
           | accident and there's literally nobody to replace them, then
           | we'll have to resort to that after the existing versions
           | break.
        
       | ldjkfkdsjnv wrote:
       | I love this company. Advancing ai and keeping the rest of us in
       | the loop.
        
         | LoganDark wrote:
         | I hate the company (Facebook), but I still think them having
         | been publicly releasing a bunch of the research they've been
         | doing (and models they've been making) has been a net good for
         | almost everybody, at least in terms of exploring the field of
         | LLMs.
        
         | ein0p wrote:
         | My love for Meta is strictly confined to FAIR and the PyTorch
         | team. The rest of the company is basically cancer.
        
       | muglug wrote:
       | Unlike many other AI-themed papers at Meta this one omits any
       | mention of the model output getting used at Instagram, Facebook
       | or Meta. Research is great! But doesn't seem all that actionable
       | today.
        
         | boomanaiden154 wrote:
         | This would be difficult to deploy as-is in production.
         | 
         | There are correctness issues mentioned in the paper regarding
         | adjusting phase orderings away from the well-trodden
          | O0/O1/O2/O3/Os/Oz path. Their methodology works quite well
          | for a research project, but I personally wouldn't trust it in
          | production. While some obvious issues can be caught by a
          | small test suite and unit tests, there are others that won't
          | be, and that's really risky in production scenarios.
         | 
         | There are also some practical software engineering things like
         | deployment in the compiler. There is actually tooling in
         | upstream LLVM to do this
         | (https://www.youtube.com/watch?v=mQu1CLZ3uWs), but running
         | models on a GPU would be difficult and I would expect CPU
         | inference to massively blow up compile times.
        
       | soist wrote:
       | How do they verify the output preserves semantics of the input?
        
       | zitterbewegung wrote:
       | Some previous work in the space is at
       | https://github.com/albertan017/LLM4Decompile
        
       | LoganDark wrote:
       | Reading the title, I thought this was a tool for optimizing and
       | disassembling LLMs, not an LLM designed to optimize and
       | disassemble. Seeing it's just a model is a little disappointing
       | in comparison.
        
       | jameshart wrote:
       | Pretty sure I remember trading 300 creds for a Meta Technologies
       | Neural Optimizer and Disassembler in one of the early _Deus Ex_
       | games.
        
       ___________________________________________________________________
       (page generated 2024-06-28 23:00 UTC)