[HN Gopher] Meta LLM Compiler: neural optimizer and disassembler
       ___________________________________________________________________
        
       Meta LLM Compiler: neural optimizer and disassembler
        
       Author : foobazgt
       Score  : 231 points
       Date   : 2024-06-28 11:12 UTC (1 day ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | HanClinto wrote:
       | Huh. This is a very... "interesting" application for an LLM. I'm
       | not the brightest crayon in the box, but if anyone else would
       | like to follow along with my non-expert opinion as I read through
       | the paper, here's my take on it.
       | 
       | It's pretty important for compilers / decompilers to be reliable
       | and accurate -- compilers behaving in a deterministic and
       | predictable way is an important fundamental of pipelines.
       | 
       | LLMs are inherently unpredictable, and so using an LLM for
       | compilation / decompilation -- even an LLM that has 99.99%
       | accuracy -- feels a bit odd to include as a piece in my build
       | pipeline.
       | 
       | That said, let's look at the paper and see what they did.
       | 
       | They essentially started with CodeLlama, and then went further to
       | train the model on three tasks -- one primary, and two
       | downstream.
       | 
       | The first task is compilation: given input code and a set of
       | compiler flags, can we predict the output assembly? Given the
       | inability to verify correctness without using a traditional
       | compiler, this feels like it's of limited use on its own.
       | However, training a model on this as a primary task enables a
       | couple of downstream tasks. Namely:
       | 
        | The second task (and first downstream task) is compiler flag
        | tuning: predicting the set of flags that yields the smallest
        | assembly. It's a bit disappointing that they only seem to be
        | able to optimize for assembly size (and not execution speed),
        | but it's not without its uses. Because the output of this task
        | (compiler flags) is then passed to a deterministic function (a
        | traditional compiler), the instability of the LLM is mitigated.
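        | 
        | As a minimal sketch of that loop (suggest_flags() below is a
        | hypothetical stand-in for the model query; the compiling and
        | measuring are done by a stock clang, so a bad suggestion can
        | only cost size, never correctness):
        | 
        |     # suggest_flags() is a stand-in for a model query
        |     import os, subprocess
        | 
        |     def suggest_flags(src):
        |         return ["-Os", "-ffunction-sections"]  # placeholder
        | 
        |     def obj_size(src, flags):
        |         subprocess.run(["clang", "-c", src, "-o", "out.o"]
        |                        + flags, check=True)
        |         return os.path.getsize("out.o")
        | 
        |     baseline = obj_size("foo.c", ["-Oz"])
        |     tuned = obj_size("foo.c", suggest_flags("foo.c"))
        |     print("saved", baseline - tuned, "bytes")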
       | 
       | The third task (second downstream task) is decompilation. This is
       | not the first time that LLMs have been trained to do better
       | decompilation -- however, because of the pretraining that they
       | did on the primary task, they feel that this provides some
       | advantages over previous approaches. Sadly, they only compare LLM
       | Compiler to Code Llama and GPT-4 Turbo, and not against any other
       | LLMs fine-tuned for the decompilation task, so it's difficult to
       | see in context how much better their approach is.
       | 
       | Regarding the verifiability of the disassembly approach, the
       | authors note that there are issues regarding correctness. So the
       | authors employ round-tripping -- recompiling the decompiled code
       | (using the same compiler flags) to verify correctness / exact-
        | match. This still puts accuracy at only around 45% (if I
        | understand their output numbers), so it's not entirely
        | trustworthy yet, but it might still be useful (especially if
        | used alongside a traditional decompiler, with this model's
        | outputs only used when they are verifiably correct).
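        | 
        | A minimal sketch of that round-trip check, assuming the model
        | emitted LLVM-IR for a function whose original assembly we kept
        | (file names illustrative; an exact match proves the IR is
        | right, a mismatch only means "check by hand"):
        | 
        |     import subprocess
        | 
        |     def round_trip_ok(ir_file, original_asm, flags):
        |         # clang accepts .ll input and re-lowers it to asm
        |         subprocess.run(["clang", "-S", "-o", "regen.s",
        |                         ir_file] + flags, check=True)
        |         return (open("regen.s").read()
        |                 == open(original_asm).read())
        | 
        |     print(round_trip_ok("model_out.ll", "original.s", ["-O2"]))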
       | 
       | Overall I'm happy to see this model be released as it seems like
       | an interesting use-case. I may need to read more, but at first
       | blush I'm not immediately excited by the possibilities that this
       | unlocks. Most of all, I would like to see it explored if these
       | methods could be extended to optimize for performance -- not just
       | size of assembly.
        
         | riedel wrote:
          | It is normally not a necessary feature of a compiler to be
          | deterministic. A compiler should be correct against a
          | specification. If the specification allows indeterminism, a
          | compiler should be able to exploit it. I remember the story
          | of the Sather-K compiler that did things differently based on
          | the phase of the moon.
        
           | munificent wrote:
           | It's technically correct that a language specification is
           | rarely precise enough to require compiler output to be
           | deterministic.
           | 
           | But it's pragmatically true that engineers will want to
           | murder you if your compiler is non-deterministic. All sorts
           | of build systems, benchmark harnesses, supply chain
           | validation tools, and other bits of surrounding ecosystem
           | will shit the bed if the compiler doesn't produce bitwise
           | identical output on the same input and compiler flags.
        
             | foobazgt wrote:
             | Can vouch for this having fixed non-determinism bugs in a
             | compiler. Nobody is happy if your builds aren't
             | reproducible. You'll also suffer crazy performance problems
             | as everything downstream rebuilds randomly and all your
             | build caches randomly miss.
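              | 
              | The property being relied on is easy to state as a check
              | (a sketch, with the source file and flags illustrative):
              | build the same input twice and the artifacts must hash
              | identically, which is exactly what build caches key on.
              | 
              |     import hashlib, subprocess
              | 
              |     def build_digest(src, flags):
              |         subprocess.run(["clang", "-c", src, "-o",
              |                         "out.o"] + flags, check=True)
              |         data = open("out.o", "rb").read()
              |         return hashlib.sha256(data).hexdigest()
              | 
              |     assert (build_digest("foo.c", ["-O2"])
              |             == build_digest("foo.c", ["-O2"]))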
        
               | SushiHippie wrote:
               | NixOS with its nixpkgs [0] and cache [1] would also not
               | work if compilers weren't reproducible. Though they won't
               | use something like PGO or some specific optimization
               | flags as these would very likely lead to unreproducible
                | builds. For example, most distros ship a PGO-optimized
                | build of Python while NixOS does not.
               | 
               | [0] https://github.com/nixos/nixpkgs
               | 
               | [1] https://cache.nixos.org/
        
               | boomanaiden154 wrote:
               | PGO can be used in such situations, but the profile needs
               | to be checked in. Same code + same profile -> same binary
               | (assuming the compiler is deterministic, which is tested
               | quite extensively).
               | 
               | There are several big projects that use PGO (like
               | Chrome), and you can get a deterministic build at
               | whatever revision using PGO as the profiles are checked
               | in to the repository.
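                | 
                | A sketch of that instrumented-PGO flow with clang's
                | standard flags (the merged .profdata is the artifact
                | you'd check in; same source + same profile +
                | deterministic compiler => same binary):
                | 
                |     import glob, subprocess
                | 
                |     def run(cmd):
                |         subprocess.run(cmd, check=True)
                | 
                |     run(["clang", "-O2", "-fprofile-generate",
                |          "app.c", "-o", "app_inst"])
                |     run(["./app_inst"])  # writes *.profraw
                |     run(["llvm-profdata", "merge", "-o",
                |          "app.profdata"] + glob.glob("*.profraw"))
                |     run(["clang", "-O2",
                |          "-fprofile-use=app.profdata",
                |          "app.c", "-o", "app"])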
        
               | vlovich123 wrote:
               | It's called autofdo although I've struggled to get it
               | working well in Rust.
        
               | boomanaiden154 wrote:
               | It's not called AutoFDO. AutoFDO refers to a specific
               | sampling-based profile technique out of Google
               | (https://dl.acm.org/doi/abs/10.1145/2854038.2854044).
               | Sometimes people will refer to that as PGO though (with
               | PGO and FDO being somewhat synonymous, but PGO seeming to
               | be the preferred term in the open source LLVM world).
               | Chrome specifically uses instrumented PGO which is very
               | much not AutoFDO.
               | 
               | PGO works just fine in Rust and has support built into
               | the compiler (https://doc.rust-lang.org/rustc/profile-
               | guided-optimization....).
        
               | vlovich123 wrote:
                | I wasn't trying to conflate the two. PGO traditionally
                | meant a trace build, but as a term it's pretty generic,
                | at least to me: the general concept of "you have profile
                | information that replaces the compiler's generically
                | tuned heuristics". AutoFDO I'd classify as an extension
                | of that concept into a more general PGO technique; kind
                | of like ThinLTO vs LTO. Specifically, it generates the
                | "same" information to supplant compiler heuristics, but
                | is more flexible in that the samples can be fed back
                | into "arbitrary" versions of the code using normal
                | sampling techniques instead of an instrumented
                | trace. The reason sampling is better is that it more
               | easily fits into capturing data from production which is
               | much harder to accomplish for the tracing variant (due to
               | perf overheads). Additionally, because it works across
               | versions the amortized compile cost drops from 2x to 1x
               | because you only need to reseed your profile data
               | periodically.
               | 
               | I was under the impression they had switched to AutoFDO
               | across the board but maybe that's just for their cloud
               | stuff and Chrome continues to run a representative
               | workload since that path is more mature. I would guess
               | that if it's not being used already, they're exploring
               | how to make Chrome run AutoFDO for the same reason
               | everyone started using ThinLTO - it brought most of the
               | advantages while fixing the disadvantages that hampered
               | adoption.
               | 
               | And yes, while PGO is available natively, AutoFDO isn't
               | quite as smooth.
        
               | boomanaiden154 wrote:
               | I'm not sure where you're getting your information from.
               | 
               | Chrome (and many other performance-critical workloads) is
               | using instrumented PGO because it gives better
               | performance gains, not because it's a more mature path.
               | AutoFDO is only used in situations where collecting data
               | with an instrumented build is difficult.
        
               | vlovich123 wrote:
                | Last I looked, AutoFDO builds were similar in
                | performance to PGO, much as ThinLTO is to LTO. I'd say
                | that collecting data with an instrumented Chrome build
                | is extremely difficult - you're relying on your
                | synthetic benchmark environment which is very, very
                | different from the real world (eg extensions aren't
                | installed, the patterns of websites being browsed are
                | not realistic, etc). There's also a 2x compile cost
                | because you have to build Chrome twice in the exact
                | same way + you have to run a synthetic benchmark on
                | each build to generate the trace.
               | 
               | I'm just using an educated guess to say that at some
               | point in the future Chrome will switch to AutoFDO,
               | potentially using traces harvested from end user
               | computers (potentially just from their employees even to
               | avoid privacy complaints).
        
               | boomanaiden154 wrote:
               | You can make the synthetic benchmarks relatively
               | accurate, it just takes effort. The compile-time hit and
               | additional effort is often worth it for the extra couple
               | percent for important applications.
               | 
               | Performance is also pretty different on the scales that
               | performance engineers are interested in for these sorts
               | of production codes, but without the build system
               | scalability problems that LTO has. The original AutoFDO
               | paper shows an improvement of 10.5%->12.5% going from
               | AutoFDO to instrumented PGO. That is pretty big. It's
               | probably even bigger with newer instrumentation based
               | techniques like CSPGO.
               | 
               | They also mention the exact reasons that AutoFDO will not
               | perform as well, with issues in debug info and losing
               | profile accuracy due to sampling inaccuracy.
               | 
               | I couldn't find any numbers for Chrome, but I am
               | reasonably certain that they have tried both and continue
               | to use instrumented PGO for the extra couple percent.
               | There are other pieces of the Chrome ecosystem
               | (specifically the ChromeOS kernel) that are already
               | optimized using sampling-based profiling. It's been a
               | while since I last talked to the Chromium toolchain
               | people about this though. I also remember hearing them
               | benchmark FEPGO vs IRPGO at some point and concluding
               | that IRPGO was better.
        
               | c0balt wrote:
                | Yeah, and last time I checked, nixpkgs also patches
                | GCC/clang to ensure determinism. Many compilers and
                | toolchains by default want to, e.g., embed build
                | information that may leak from the build env in a non-
                | deterministic/non-reproducible manner.
        
               | munificent wrote:
               | Yup. Even so much as inserting the build timestamp into
               | the generated executable (which is strangely common)
               | causes havoc with build caching.
        
             | Sophira wrote:
             | Plus, these models are entirely black boxes. Even given
             | weights, we don't know how to look at them and meaningfully
             | tell what's happening - and not only that, but training
             | these models is likely not cheap at all.
             | 
             | Stable output is how we can verify that attacks like the
             | one described in Reflections on Trusting Trust[0] don't
             | happen.
             | 
             | [0] https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_198
             | 4_Ref...
        
             | a_t48 wrote:
             | NVCC CUDA builds were nondeterministic last time I checked,
             | it made certain things (trying to get very clever with
             | generating patches) difficult. This was also hampered by
             | certain libraries (maybe GTSAM?) wanting to write __DATE__
             | somewhere in the build output, creating endlessly changing
             | builds.
        
               | sigmoid10 wrote:
               | In parallel computing you run into nondeterminism pretty
               | quickly anyways - especially with CUDA because of
               | undetermined execution order and floating point accuracy.
        
               | a_t48 wrote:
               | Yes, at runtime. Compiling CUDA doesn't require a GPU,
               | though, and doesn't really use "parallel computing". I
               | think CUDA via clang gets this right and will produce the
               | same build every time - it was purely an NVCC issue.
        
             | dheera wrote:
             | LLMs can be deterministic if you set the random seed and
             | pin it to a certain version of the weights.
             | 
             | My bigger concern would be bugs in the machine code would
             | be very, very difficult to track down.
        
             | azinman2 wrote:
             | Just fix the random seed :)
        
             | skybrian wrote:
              | I'm amused by the possibility of a compiler having a flag
              | to set a random seed (with a fixed default, of course).
             | 
             | If you hit a compiler bug, you could try a different seed
             | to see what happens.
             | 
             | Or how about a code formatter with a random seed?
             | 
             | Tool developers could run unit tests with a different seed
             | until they find a bug - or hide the problem by finding a
             | lucky seed for which you have no provable bugs :)
             | 
             | Edit:
             | 
             | Or how about this: we write a compiler as a
             | nondeterministic algorithm where every output is correct,
             | but they are optimized differently depending on an input
             | vector of choices. Then use machine learning techniques to
             | find the picks that produce the best output.
        
           | pton_xd wrote:
           | > It is normally not a necessary feature of a compiler to be
           | determistic. A compiler should be correct against a
           | specification.
           | 
           | That sounds like a nightmare. Optimizing code to play nice
           | with black-box heuristic compilers like V8's TurboFan is,
           | already in fact, a continual maintenance nightmare.
           | 
           | If you don't care about performance, non-deterministic
           | compilation is probably "good enough." See TurboFan.
        
           | stavros wrote:
           | LLMs are deterministic. We inject randomness after the fact,
           | just because we don't like our text being deterministic. Turn
           | temperature to 0 and you're good.
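            | 
            | A sketch of what that looks like in practice with the
            | Hugging Face API (model id and prompt are illustrative;
            | do_sample=False is greedy decoding, i.e. temperature 0, so
            | the same weights and prompt give the same tokens - modulo
            | the floating-point/hardware caveats discussed below):
            | 
            |     from transformers import (AutoModelForCausalLM,
            |                               AutoTokenizer)
            | 
            |     name = "facebook/llm-compiler-7b"  # illustrative
            |     tok = AutoTokenizer.from_pretrained(name)
            |     model = AutoModelForCausalLM.from_pretrained(name)
            | 
            |     inputs = tok("int sq(int x) { return x * x; }",
            |                  return_tensors="pt")
            |     out = model.generate(**inputs, max_new_tokens=64,
            |                          do_sample=False)  # greedy
            |     print(tok.decode(out[0]))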
        
             | waldrews wrote:
              | But temperature 0 LLMs don't exhibit the emergent
             | phenomena we like, even in apparently non-creative tasks.
             | The randomness is, in some sense, a cheap proxy for an
             | infeasible search over all completion sequences, much like
             | simulated annealing with zero temperature is a search for a
             | local optimum but adding randomness makes it explore
             | globally and find more interesting possibilities.
        
               | sebzim4500 wrote:
               | Sure but you could add pseudo random noise instead and
               | get the same behavior while retaining determinism.
        
               | refulgentis wrote:
               | Temperature is at ~1.2 in this thread, here's some 0.0:
               | 
               | - Yes, temperature 0.0 is less creative.
               | 
               | - Injecting pseudo-random noise to get deterministic
               | creative outputs is "not even wrong", in the Wolfgang
               | Pauli sense. It's fixing something that isn't broken,
               | with something that can't fix it, that if it could, would
               | be replicating the original behavior - more simply, it's
               | proposing non-deterministic determinism.
               | 
                | - Temperature 0.0, in practice, is still an LLM. No
                | emergent phenomena, in the sense "emergent phenomena" is
                | used with LLMs, go missing. Many, many, many
                | applications use this.
               | 
               | - In simplistic scenarios, on very small models, 0.0
               | could get stuck literally repeating the same token.
               | 
                | - There's a whole other layer of e.g. repeat
               | penalties/frequency penalties and such that are used
               | during inference to limit this. Only OpenAI and llama.cpp
               | expose repeat/frequency.
               | 
                | - Temperature 0.0 is still non-deterministic on e.g.
                | OpenAI, though _substantially_ the same, and even _the_
                | same most of the time. It's hard to notice differences.
                | (Reproducible builds require extra engineering effort,
                | the same way ensuring temperature = 0.0 is _truly_
                | deterministic requires engineering effort.)
               | 
               | - Pedantically, only temperature 0.0 _at the same seed_
               | (initial state) is deterministic.
        
             | Sophira wrote:
             | Even then, though, the output could change drastically
             | based on a single change to the input (such as a comment).
             | 
             | That's not something you want in a compiler.
        
           | wakawaka28 wrote:
           | It is very important for a compiler to be deterministic.
           | Otherwise you can't validate the integrity of binaries! We
           | already have issues with reproducibility without adding this
           | shit in the mix.
        
             | riedel wrote:
              | Reproducible builds are an edge case that requires
              | deterministic compilation, for sure. But profile-based
              | optimisation or linker address randomisation are sometimes
              | also useful. Why rule out one thing for the other?
              | Normally you can easily turn optimisation on and off
              | depending on your need. Just do -O0 if you want
              | determinism. But normally you should not rely on it (also
              | at execution time).
        
         | vessenes wrote:
         | Thank you for the summary. My memory of SOTA on disassembly
          | about a year ago was sub-30% accuracy, so this is a
         | significant step forward.
         | 
         | I do think the idea of a 90%+-ish forward and backward
         | assembler LLM is pretty intriguing. There's bound to be a lot
         | of uses for it; especially if you're of the mind that to get
         | there it would have to have learned a lot about computers in
         | the foundation model training phase.
         | 
         | Like, you'd definitely want to have those weights somehow baked
         | into a typical coding assistant LLM, and of course you'd be
         | able to automate round one of a lot of historical archiving
         | projects that would like to get compilable modern code but only
         | have a binary, you'd be able to turn some PDP-1 code into
         | something that would compile on a modern machine, ... you'd
         | probably be able to leverage it into building chip simulations
         | / code easily, it would be really useful for writing Verilog,
         | (maybe), anyway, the use cases seem pretty broad to me.
        
         | boomanaiden154 wrote:
         | Sure, performance is more interesting, but it's significantly
         | harder.
         | 
         | With code size, you just need to run the code through the
         | compiler and you have a deterministic measurement for
         | evaluation.
         | 
         | Performance has no such metric. Benchmarks are expensive and
         | noisy. Cost models seem like a promising direction, but they
         | aren't really there yet.
        
         | joshuanapoli wrote:
         | Maybe they are thinking about embedding a program generator and
         | execution environment into their LLM inferencing loop in a
         | tighter way. The model invents a program that guides the output
         | in a specific/algorithmic way, tailored to the prompt.
        
         | swyx wrote:
          | some comments like this make me want to subscribe to you for
         | all your future comments. thanks for doing the hard work of
         | summarizing and taking the bold step of sharing your thoughts
         | in public. i wish more HNers were like you.
        
         | sigmoid10 wrote:
         | >compilers behaving in a deterministic and predictable way is
         | an important fundamental of pipelines. LLMs are inherently
         | unpredictable, and so using an LLM for compilation /
         | decompilation -- even an LLM that has 99.99% accuracy
         | 
          | You're confusing different concepts here. An LLM is technically
          | not unpredictable by itself (at least the ones we are talking
          | about here; there are different problems with beasts like GPT-4
          | [1]). The "randomness" of LLMs you are probably experiencing
          | stems from the autoregressive completion, which samples from
          | probabilities for a temperature T>0 (which is very common
          | because it makes sense in chat applications). But there is
          | nothing that prevents you from simply choosing greedy sampling,
          | which would make your output 100% deterministic and
          | reproducible. That is particularly useful for
          | disassembling/decompiling and has the chance to vastly improve
          | over existing tools, because it is common knowledge that they
          | are often not the sharpest tools and humans are much better at
          | piecing together working code.
         | 
          | The other question here is accuracy for compiling. For that it
          | is important whether the LLM can follow a specification
          | correctly. Because once you write unspecified behaviour, your
          | code is fair game for other compilers as well. So the real
          | question is how well it follows the spec and how good it is at
          | dealing with situations where normal compilers would flounder.
         | 
         | [1] https://152334h.github.io/blog/non-determinism-in-gpt-4/
        
           | yowlingcat wrote:
           | This was a big unlock for me -- I recall antirez saying the
           | same thing in a comment replying to me when I asked a similar
           | question about potential LLM features in Redis [1].
           | 
           | [1] https://news.ycombinator.com/item?id=39617370
        
           | barrkel wrote:
           | Determinism for any given input isn't an interesting metric
           | though. Inputs are always different, or else you could just
           | replace it with a lookup function. What's important is
           | reliability of the output given a distribution of inputs, and
            | that's where LLMs are unreliable. Temperature sampling can be
            | a technique to _improve_ reliability, particularly when
            | things get into repetitive loops - though usually it's to
            | increase creativity.
        
         | torginus wrote:
         | LLMs are probably great at this. You can break down the code
         | into functions or basic blocks. You can use the LLM to
         | decompile them and then test if the decompilation results match
         | the code when executed. You'll probably get this right after a
          | few tries. Then you can train your model with the successful
          | decompilation results so your model will get better.
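          | 
          | A rough sketch of that loop (file names and test inputs are
          | illustrative; equivalence here is only checked behaviourally
          | on sample inputs, which is cheap but weaker than a proof):
          | 
          |     import subprocess
          | 
          |     def same_behaviour(orig_bin, cand_src, inputs):
          |         subprocess.run(["clang", "-O2", cand_src,
          |                         "-o", "cand"], check=True)
          |         run = lambda b, x: subprocess.run(
          |             [b, x], capture_output=True).stdout
          |         return all(run(orig_bin, x) == run("./cand", x)
          |                    for x in inputs)
          | 
          |     # keep only verified pairs for further training
          |     pairs = [(asm, src) for asm, src in
          |              [("f1.s", "f1_decompiled.c")]
          |              if same_behaviour("./orig", src, ["1", "2"])]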
        
       | mmphosis wrote:
       | https://xcancel.com/AIatMeta/status/1806361623831171318
        
       | chad1n wrote:
        | As usual, Twitter is impressed by this, but I'm very skeptical:
        | the chance of it breaking your program is pretty high. The thing
        | that makes optimizations so hard to write is that they have to
        | match the behavior of the unoptimized code (unless you have UB),
        | which is something that LLMs will probably struggle with since
        | they can't exactly understand the code and execution tree.
        
         | ramon156 wrote:
         | Let's be real, at least 40% of those comments are bots
        
           | extheat wrote:
            | People simply have no idea what they're talking about. It's
            | just jumping onto the latest hype train. My first impression
            | here, per the name, was that it was actually some sort of
            | compiler in and of itself -- i.e. programming language in
            | and pure machine code or some other IR out. It's got bits
            | and pieces of that here and there, but that's not what it
            | really is at all. It's more of a predictive engine for an
            | optimizer, and not a very generalized one at that.
           | 
           | What would be more interesting is training a large model on
           | pure (code, assembly) pairs like a normal translation task.
           | Presumably a very generalized model would be good at even
           | doing the inverse: given some assembly, write code that will
           | produce the given assembly. Unlike human language there is a
           | finite set of possible correct answers here and you have the
           | convenience of being able to generate synthetic data for
           | cheap. I think optimizations would arise as a natural side
            | effect this way: if there are multiple trees of possible
            | generations (like choosing between logits in an LLM) you
            | could try different branches to see which is smaller in
            | terms of byte code or faster in terms of execution.
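            | 
            | Generating that kind of synthetic (source, assembly) data
            | really is cheap - a sketch, assuming any C source (e.g.
            | from csmith/yarpgen) and an installed clang; paths are
            | illustrative:
            | 
            |     import subprocess
            | 
            |     def make_pairs(src):
            |         pairs = []
            |         for opt in ["-O0", "-O1", "-O2", "-Os"]:
            |             asm = subprocess.run(
            |                 ["clang", "-S", opt, "-o", "-", src],
            |                 capture_output=True, text=True,
            |                 check=True).stdout
            |             pairs.append((open(src).read(), opt, asm))
            |         return pairs
            | 
            |     dataset = make_pairs("random_program.c")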
        
             | quonn wrote:
             | > Presumably a very generalized model would be good at even
             | doing the inverse: given some assembly, write code that
             | will produce the given assembly.
             | 
             | ChatGPT does this, unreliably.
        
             | hughleat wrote:
             | It can emulate the compiler (IR + passes -> IR or ASM).
             | 
             | > What would be more interesting is training a large model
             | on pure (code, assembly) pairs like a normal translation
             | task.
             | 
             | It is that.
             | 
             | > Presumably a very generalized model would be good at even
             | doing the inverse: given some assembly, write code that
             | will produce the given assembly.
             | 
              | It has been trained to disassemble. It is much, much better
             | than other models at that.
        
         | solarexplorer wrote:
         | If I understand correctly, the AI is only choosing the
         | optimization passes and their relative order. Each individual
         | optimization step would still be designed and verified
         | manually, and maybe even proven to be correct mathematically.
        
           | boomanaiden154 wrote:
           | Right, it's only solving phase ordering.
           | 
           | In practice though, correctness even over ordering of hand-
           | written passes is difficult. Within the paper they describe a
           | methodology to evaluate phase orderings against a small test
           | set as a smoke test for correctness (PassListEval) and
           | observe that ~10% of the phase orderings result in assertion
           | failures/compiler crashes/correctness issues.
           | 
           | You will end up with a lot more correctness issues adjusting
           | phase orderings like this than you would using one of the
           | more battle-tested default optimization pipelines.
           | 
           | Correctness in a production compiler is a pretty hard
           | problem.
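            | 
            | For concreteness, "only solving phase ordering" looks
            | roughly like this (pass list illustrative, as if proposed
            | by the model; a stock opt applies it and a tiny test
            | binary is the smoke check, much weaker than real
            | correctness):
            | 
            |     import subprocess
            | 
            |     passes = "sroa,instcombine,simplifycfg,gvn,dce"
            | 
            |     subprocess.run(["opt", "-passes=" + passes,
            |                     "input.ll", "-o", "opt.bc"], check=True)
            |     subprocess.run(["clang", "opt.bc", "-o", "t"],
            |                    check=True)
            |     ok = subprocess.run(["./t"]).returncode == 0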
        
             | hughleat wrote:
             | There are two models.
             | 
              | - The foundation model is pretrained on asm and IR, then
              | trained to emulate the compiler (IR + passes -> IR or asm).
              | 
              | - The FTD model is fine-tuned for solving phase ordering
              | and disassembling.
             | 
             | FTD is there to demo capabilities. We hope people will fine
             | tune for other optimisations. It will be much, much cheaper
             | than starting from scratch.
             | 
             | Yep, correctness in compilers is a pain. Auto-tuning is a
             | very easy way to break a compiler.
        
         | Lockal wrote:
         | As this LLM operates on LLVM intermediate representation
         | language, the result can be fed into
         | https://alive2.llvm.org/ce/ and formally verified. For those
         | who don't know what to print there: here is an example of C++
         | spaceship operator: https://alive2.llvm.org/ce/z/YJPr84 (try to
         | replace -1 with -2 there to break). This is kind of a Swiss
         | knife for LLVM developers, they often start optimizations with
         | this tool.
         | 
          | What they missed is mentioning verification (they probably
          | don't know about alive2) and a comparison with other compilers.
          | It is very likely that LLM Compiler "learned" from GCC and,
          | with huge computational effort, simply generates what GCC can
          | do out of the box.
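          | 
          | Checking a single IR -> IR rewrite with alive2's translation
          | validation tool looks roughly like this (assuming a local
          | build of alive-tv; it either proves the target refines the
          | source or prints a counterexample, though it can time out,
          | notably around floating point):
          | 
          |     import subprocess
          | 
          |     r = subprocess.run(["alive-tv", "before.ll", "after.ll"],
          |                        capture_output=True, text=True)
          |     # look for "Transformation seems to be correct"
          |     print(r.stdout)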
        
           | boomanaiden154 wrote:
           | I'm reasonably certain the authors are aware of alive2.
           | 
           | The problem with using alive2 to verify LLM based compilation
           | is that alive2 isn't really designed for that. It's an
           | amazing tool for catching correctness issues in LLVM, but
           | it's expensive to run and will time out reasonably often,
           | especially on cases involving floating point. It's explicitly
           | designed to minimize the rate of false-positive correctness
           | issues to serve the primary purpose of alerting compiler
           | developers to correctness issues that need to be fixed.
        
             | hughleat wrote:
             | Yep, we tried it :-) These were exactly the problems we had
             | with it.
        
           | boomanaiden154 wrote:
           | I'm not sure it's likely that the LLM here learned from gcc.
           | The size optimization work here is focused on learning phase
           | orderings for LLVM passes/the LLVM pipeline, which wouldn't
           | be at all applicable to gcc.
           | 
           | Additionally, they train approximately half on assembly and
           | half on LLVM-IR. They don't talk much about how they generate
           | the dataset other than that they generated it from the
           | CodeLlama dataset, but I would guess they compile as much
           | code as they can into LLVM-IR and then just lower that into
           | assembly, leaving gcc out of the loop completely for the vast
           | majority of the compiler specific training.
        
             | hughleat wrote:
             | Yep! No GCC on this one. And yep, that's not far off how
             | the pretraining data was gathered - but with random
             | optimisations to give it a bit of variety.
        
               | boomanaiden154 wrote:
               | Do you have more information on how the dataset was
               | constructed?
               | 
               | It seems like somehow build systems were invoked given
               | the different targets present in the final version?
               | 
               | Was it mostly C/C++ (if so, how did you resolve missing
               | includes/build flags), or something else?
        
               | hughleat wrote:
               | We plan to have a peer reviewed version of the paper
               | where we will probably have more details on that.
                | Otherwise we can't give any more details than are in the
                | paper or post, etc. without going through legal, which
                | takes ages. Science is getting harder to do :-(
        
           | moffkalast wrote:
           | > C++ spaceship operator
           | 
           | > (A <=> B) < 0 is true if A < B
           | 
           | > (A <=> B) > 0 is true if A > B
           | 
           | > (A <=> B) == 0 is true if A and B are equal/equivalent.
           | 
              | TIL of the spaceship operator. Was this added as an April
              | Fools' joke?
        
             | samatman wrote:
             | This is one of the oldest computer operators in the game:
             | the arithmetic IF statement from FORTRAN.
             | 
             | It's useful for stable-sorting collections with a single
                | test. Also, overloading <=> for a type gives all
             | comparison operators "for free": ==, !=, <, <=, >=, >
        
               | pklausler wrote:
               | It also has no good definition for its semantics when
               | presented with a NaN.
        
               | sagarm wrote:
               | You would use a partial ordering: https://en.cppreference
               | .com/w/cpp/utility/compare/partial_or...
        
               | pklausler wrote:
               | How would that apply to Fortran's arithmetic IF
               | statement? It goes to one label for a negative value, or
               | to a second label for a zero, or to a third label for
               | positive. A NaN is in none of these categories.
        
               | moffkalast wrote:
               | I mean maybe I'm missing something but it seems like it
               | behaves exactly the same way as subtraction? At least for
               | integers it's definitely the same, for floats I imagine
               | it might handle equals better?
        
               | samatman wrote:
               | C++ has operator overloading, so you can define the
               | spaceship for any class, and get every comparison
               | operator from the fallback definitions, which use `<=>`
               | in some obvious ways.
        
             | sagarm wrote:
             | The three-way comparison operator just needs to return a
             | ternary, and many comparisons boil down to integer
             | subtraction. strcmp is also defined this way.
             | 
             | In C++20 the compiler will automatically use the spaceship
             | operator to implement other comparisons if it is available,
             | so it's a significant convenience.
        
         | bbor wrote:
         | AFAIK this is a heuristic, not a category. The underlying
         | grammar would be preserved.
         | 
         | Personally I thought we were way too close to perfect to make
         | meaningful progress on compilation, but that's probably just
         | naivete
        
           | boomanaiden154 wrote:
           | I would not say we are anywhere close to perfect in
           | compilation.
           | 
           | Even just looking at inlining for size, there are multiple
           | recent studies showing ~10+% improvement
           | (https://dl.acm.org/doi/abs/10.1145/3503222.3507744,
           | https://arxiv.org/abs/2101.04808).
           | 
           | There is a massive amount of headroom, and even tiny bits
           | still matter as ~0.5% gains on code size, or especially
           | performance, can be huge.
        
         | cec wrote:
         | Hey! The idea isn't to replace the compiler with an LLM, the
         | tech is not there yet. Where we see value is in using these
         | models to guide an existing compiler. E.g. orchestrating
         | optimization passes. That way the LLM won't break your code,
         | nor will the compiler (to the extent that your compiler is free
          | from bugs, which can be tricky to detect - cf Sec 3.1 of our
         | paper).
        
           | verditelabs wrote:
           | I've done some similar LLM compiler work, obviously not on
            | Meta's scale: teaching an LLM to do optimization by feeding
            | an encoder/decoder model pairs of -O0 and -O3 code. Even at
            | my small scale I managed to get the LLM to spit out the
            | correct optimization every once in a while.
           | 
           | I think there's a lot of value in LLM compilers to
           | specifically be used for superoptimization where you can
           | generate many possible optimizations, verify the correctness,
           | and pick the most optimal one. I'm excited to see where y'all
           | go with this.
        
             | viraptor wrote:
             | Thank you for freeing me from one of my to-do projects. I
             | wanted to do a similar autoencoder with optimisations. Did
             | you write about it anywhere? I'd love to read the details.
        
               | verditelabs wrote:
               | No writeup, but the code is here:
               | 
               | https://github.com/SuperOptimizer/supercompiler
               | 
               | There's code there to generate unoptimized / optimized
               | pairs via C generators like yarpgen and csmith, then
               | compile, train, inference, and disassemble the results
        
             | hughleat wrote:
             | Yes! An AI building a compiler by learning from a super-
             | optimiser is something I have wanted to do for a while now
             | :-)
        
           | moffkalast wrote:
           | > The idea isn't to replace the compiler with an LLM, the
           | tech is not there yet
           | 
            | What do you mean the tech isn't there _yet_? Why would it
            | ever even go in that direction? I mean we do those kinds of
           | things for shits and giggles but for any practical use? I
           | mean come on. From fast and reliable to glacial and not even
           | working a quarter of the time.
           | 
           | I guess maybe if all compiler designers die in a freak
           | accident and there's literally nobody to replace them, then
           | we'll have to resort to that after the existing versions
           | break.
        
           | swyx wrote:
            | then maybe don't name it "LLM Compiler", just "Compiler
           | Guidance with LLMs" or "LLM-aided Compiler optimization" or
           | something - will get much more to the point without
           | overpromising
        
             | nickpsecurity wrote:
             | Yeah, the name was misleading. I thought it was going to be
             | source to object translation maybe with techniques like how
             | they translate foreign languages.
        
         | namaria wrote:
         | This feels like going insane honestly. It's like reading that
         | people are super excited about using bouncing castles to mix
         | concrete.
        
       | ldjkfkdsjnv wrote:
       | I love this company. Advancing ai and keeping the rest of us in
       | the loop.
        
         | LoganDark wrote:
         | I hate the company (Facebook), but I still think them having
         | been publicly releasing a bunch of the research they've been
         | doing (and models they've been making) has been a net good for
         | almost everybody, at least in terms of exploring the field of
         | LLMs.
        
         | ein0p wrote:
         | My love for Meta is strictly confined to FAIR and the PyTorch
         | team. The rest of the company is basically cancer.
        
         | Slyfox33 wrote:
         | Is this a bot comment?
        
       | muglug wrote:
        | Unlike many other AI-themed papers at Meta, this one omits any
        | mention of the model's output being used at Instagram, Facebook
        | or Meta. Research is great! But it doesn't seem all that
        | actionable today.
        
         | boomanaiden154 wrote:
         | This would be difficult to deploy as-is in production.
         | 
         | There are correctness issues mentioned in the paper regarding
         | adjusting phase orderings away from the well-trodden
         | O0/O1/O2/O3/Os/Oz path. Their methodology works for a research
         | project quite well, but I personally wouldn't trust it in
         | production. While some obvious issues can be caught by a small
         | test suite and unit tests, there are others that won't be, and
         | that's really risky in production scenarios.
         | 
         | There are also some practical software engineering things like
         | deployment in the compiler. There is actually tooling in
         | upstream LLVM to do this
         | (https://www.youtube.com/watch?v=mQu1CLZ3uWs), but running
         | models on a GPU would be difficult and I would expect CPU
         | inference to massively blow up compile times.
        
       | soist wrote:
       | How do they verify the output preserves semantics of the input?
        
         | hughleat wrote:
         | For the disassembler we round trip. x86 ->(via model) IR ->(via
         | clang) x86. If they are identical then the IR is correct. Could
         | be correct even if not identical, but then you need to check.
         | 
         | For the auto-tuning, we suggest the best passes to use in LLVM.
          | We take some care to weed out bad passes, but LLVM has bugs.
          | This is common to any auto-tuner.
         | 
         | We train it to emulate the compiler. The compiler does that
         | better already. We do it because it helps the LLM understand
         | the compiler better and it auto-tunes better as a result.
         | 
         | We hope people will use this model to fine-tune for other
          | heuristics. E.g. an inliner which accepts the IR of the caller
         | and callee to decide profitability. We think things like that
         | will be vastly cheaper for people if they can start from LLM
         | Compiler. Training LLMs from scratch is expensive :-)
         | 
         | IMO, right now, AI should be used to decide profitability not
         | correctness.
        
           | soloist11 wrote:
           | Have you guys applied this work internally to optimize Meta's
           | codebase?
        
       | zitterbewegung wrote:
       | Some previous work in the space is at
       | https://github.com/albertan017/LLM4Decompile
        
       | LoganDark wrote:
       | Reading the title, I thought this was a tool for optimizing and
       | disassembling LLMs, not an LLM designed to optimize and
       | disassemble. Seeing it's just a model is a little disappointing
       | in comparison.
        
       | jameshart wrote:
       | Pretty sure I remember trading 300 creds for a Meta Technologies
       | Neural Optimizer and Disassembler in one of the early _Deus Ex_
       | games.
        
       | nothrowaways wrote:
        | It is so funny that Meta has to post it on X.
        
         | rising-sky wrote:
         | https://www.threads.net/@aiatmeta/post/C8ubaKupPwC
        
       | 0x1ceb00da wrote:
       | Wouldn't "Compiler LLM" be a more accurate name than "LLM
       | Compiler"?
        
         | hughleat wrote:
         | Never let a computer scientist name anything :-)
        
       | Havoc wrote:
       | I don't understand the purpose of this. Feels like a task for
       | function calling and sending it to an actual compiler.
       | 
       | Is there an obvious use case I'm missing?
        
         | singularity2001 wrote:
          | GPT-6 can write software directly (as assembly) instead of
          | writing C first.
         | 
         | Lots of training data for binary, and it can train itself by
         | seeing if the program does what it expects it to do.
        
           | dunefox wrote:
           | Is this GPT-6 in the room with us now?
        
         | killerstorm wrote:
         | This is not a product, it's a research project.
         | 
         | They don't expect you to use this.
         | 
         | Applications might require further research. And the main
         | takeaway might be not "here's a tool to generate code", but
         | "LLMs are able to understand binary code, and thus we can train
         | them to do ...".
        
       | zellyn wrote:
       | I continue to be fascinated about what the next qualitative
       | iteration of models will be, marrying the language processing and
       | broad knowledge of LLMs with an ability to reason rigorously.
       | 
       | If I understand correctly, this work (or the most obvious
        | productionized version of it) is similar to the work DeepMind
       | released a while back: the LLM is essentially used for
       | "intuition"---to pick the approach---and then you hand off to
       | something mechanical/rigorous.
       | 
       | I think we're going to see a huge growth in that type of system.
       | I still think it's kind of weird and cool that our meat brains
       | with spreading activation can (with some amount of
       | effort/concentration) switch over into math mode and manipulate
       | symbols and inferences rigorously.
        
       ___________________________________________________________________
       (page generated 2024-06-29 23:01 UTC)