[HN Gopher] Meta LLM Compiler: neural optimizer and disassembler
___________________________________________________________________
Meta LLM Compiler: neural optimizer and disassembler
Author : foobazgt
Score : 231 points
Date : 2024-06-28 11:12 UTC (1 days ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| HanClinto wrote:
| Huh. This is a very... "interesting" application for an LLM. I'm
| not the brightest crayon in the box, but if anyone else would
| like to follow along with my non-expert opinion as I read through
| the paper, here's my take on it.
|
| It's pretty important for compilers / decompilers to be reliable
| and accurate -- compilers behaving in a deterministic and
| predictable way is an important fundamental of pipelines.
|
| LLMs are inherently unpredictable, and so using an LLM for
| compilation / decompilation -- even an LLM that has 99.99%
| accuracy -- feels a bit odd to include as a piece in my build
| pipeline.
|
| That said, let's look at the paper and see what they did.
|
| They essentially started with CodeLlama, and then went further to
| train the model on three tasks -- one primary, and two
| downstream.
|
| The first task is compilation: given input code and a set of
| compiler flags, can we predict the output assembly? Given the
| inability to verify correctness without using a traditional
| compiler, this feels like it's of limited use on its own.
| However, training a model on this as a primary task enables a
| couple of downstream tasks. Namely:
|
| The second task (and first downstream task) is compiler flag
| prediction / optimization to predict / optimize for smaller
| assembly sizes. It's a bit disappointing that they only seem to
| be able to optimize for assembly size (and not execution speed),
| but it's not without its uses. Because the output of this task
| (compiler flags) is then passed to a deterministic function (a
| traditional compiler), the instability of the LLM is mitigated.
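|
| A minimal sketch of that pattern in Python (suggest_flags is a
| hypothetical wrapper around the model, and clang is assumed to
| be on PATH): only the flag list comes from the LLM, and the
| deterministic compiler does the actual code generation.
|
|     import subprocess
|
|     def suggest_flags(source_path: str) -> list[str]:
|         """Hypothetical: ask the model for flags for this file."""
|         return ["-Oz"]  # placeholder for the model's suggestion
|
|     def build(source_path: str, out_path: str) -> None:
|         flags = suggest_flags(source_path)  # LLM output: advisory
|         cmd = ["clang", *flags, "-c", source_path, "-o", out_path]
|         subprocess.run(cmd, check=True)     # deterministic step
|
|     build("foo.c", "foo.o")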
|
| The third task (second downstream task) is decompilation. This is
| not the first time that LLMs have been trained to do better
| decompilation -- however, because of the pretraining that they
| did on the primary task, they feel that this provides some
| advantages over previous approaches. Sadly, they only compare LLM
| Compiler to Code Llama and GPT-4 Turbo, and not against any other
| LLMs fine-tuned for the decompilation task, so it's difficult to
| see in context how much better their approach is.
|
| Regarding the verifiability of the disassembly approach, the
| authors note that there are issues regarding correctness. So the
| authors employ round-tripping -- recompiling the decompiled code
| (using the same compiler flags) to verify correctness / exact-
| match. This still puts accuracy at around 45% (if I understand
| their numbers correctly), so it's not entirely trustworthy yet,
| but it might still be useful (especially if used alongside a
| traditional decompiler, with this model's outputs only used when
| they are verifiably correct).
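|
| Roughly, the check looks like this (decompile() is a hypothetical
| hook into the model, shown here as if it emits LLVM IR; only an
| exact match is trusted, which is sufficient but not necessary for
| correctness):
|
|     import subprocess
|
|     def decompile(asm_path: str) -> str:
|         """Hypothetical: the model lifts the assembly to IR."""
|         raise NotImplementedError
|
|     def round_trips(asm_path: str, flags: list[str]) -> bool:
|         with open("lifted.ll", "w") as f:
|             f.write(decompile(asm_path))
|         # Re-lower the model's IR with the same compiler flags.
|         cmd = ["clang", *flags, "-S", "lifted.ll", "-o", "lifted.s"]
|         subprocess.run(cmd, check=True)
|         # Trust the output only on an exact match with the original.
|         return open("lifted.s").read() == open(asm_path).read()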
|
| Overall I'm happy to see this model released, as it seems like an
| interesting use-case. I may need to read more, but at first blush
| I'm not immediately excited by the possibilities that this
| unlocks. Most of all, I would like to see whether these methods
| could be extended to optimize for performance -- not just
| assembly size.
| riedel wrote:
| It is normally not a necessary feature of a compiler to be
| deterministic. A compiler should be correct against a
| specification. If the specification allows indeterminism, a
| compiler should be able to exploit it. I remember the story
| of the Sather-K compiler that did things differently based on
| the phase of the moon.
| munificent wrote:
| It's technically correct that a language specification is
| rarely precise enough to require compiler output to be
| deterministic.
|
| But it's pragmatically true that engineers will want to
| murder you if your compiler is non-deterministic. All sorts
| of build systems, benchmark harnesses, supply chain
| validation tools, and other bits of surrounding ecosystem
| will shit the bed if the compiler doesn't produce bitwise
| identical output on the same input and compiler flags.
| foobazgt wrote:
| Can vouch for this, having fixed non-determinism bugs in a
| compiler. Nobody is happy if your builds aren't
| reproducible. You'll also suffer crazy performance problems
| as everything downstream rebuilds randomly and all your
| build caches randomly miss.
| SushiHippie wrote:
| NixOS with its nixpkgs [0] and cache [1] would also not
| work if compilers weren't reproducible. Though they won't
| use something like PGO or certain optimization flags, as
| these would very likely lead to unreproducible builds. For
| example, most distros ship a PGO-optimized build of Python
| while NixOS does not.
|
| [0] https://github.com/nixos/nixpkgs
|
| [1] https://cache.nixos.org/
| boomanaiden154 wrote:
| PGO can be used in such situations, but the profile needs
| to be checked in. Same code + same profile -> same binary
| (assuming the compiler is deterministic, which is tested
| quite extensively).
|
| There are several big projects that use PGO (like
| Chrome), and you can get a deterministic build at
| whatever revision using PGO as the profiles are checked
| in to the repository.
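|
| For reference, a minimal sketch of that instrumented-PGO loop
| with clang (front-end instrumentation; the merged .profdata is
| the artifact that gets checked in):
|
|     import os, subprocess
|
|     def run(cmd, env=None):
|         subprocess.run(cmd, check=True, env=env)
|
|     # 1. Build with instrumentation.
|     run(["clang", "-O2", "-fprofile-instr-generate",
|          "app.c", "-o", "app"])
|
|     # 2. Run a representative workload to collect a raw profile.
|     run(["./app"], env={**os.environ,
|                         "LLVM_PROFILE_FILE": "app.profraw"})
|
|     # 3. Merge raw profiles; app.profdata is what gets checked in.
|     run(["llvm-profdata", "merge", "-o", "app.profdata",
|          "app.profraw"])
|
|     # 4. Rebuild with the checked-in profile: same code + same
|     #    profile (+ a deterministic compiler) -> same binary.
|     run(["clang", "-O2", "-fprofile-instr-use=app.profdata",
|          "app.c", "-o", "app"])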
| vlovich123 wrote:
| It's called autofdo although I've struggled to get it
| working well in Rust.
| boomanaiden154 wrote:
| It's not called AutoFDO. AutoFDO refers to a specific
| sampling-based profiling technique out of Google
| (https://dl.acm.org/doi/abs/10.1145/2854038.2854044).
| Sometimes people will refer to that as PGO though (with
| PGO and FDO being somewhat synonymous, but PGO seeming to
| be the preferred term in the open source LLVM world).
| Chrome specifically uses instrumented PGO which is very
| much not AutoFDO.
|
| PGO works just fine in Rust and has support built into
| the compiler (https://doc.rust-lang.org/rustc/profile-
| guided-optimization....).
| vlovich123 wrote:
| I wasn't trying to conflate the two. PGO traditionally
| meant a trace build, but as a term it's pretty generic;
| to me it refers to the general concept of "you have
| profile information that replaces the generically tuned
| heuristics that the compiler uses". AutoFDO I'd classify
| as an extension of that concept into a more general PGO
| technique; kind of like ThinLTO vs LTO. Specifically, it
| generates the "same" information to supplant compiler
| heuristics, but is more flexible in that the sample can
| be fed back into "arbitrary" versions of the code using
| normal sampling techniques instead of an instrumented
| trace. The reason sampling is better is that it more
| easily fits into capturing data from production which is
| much harder to accomplish for the tracing variant (due to
| perf overheads). Additionally, because it works across
| versions the amortized compile cost drops from 2x to 1x
| because you only need to reseed your profile data
| periodically.
|
| I was under the impression they had switched to AutoFDO
| across the board but maybe that's just for their cloud
| stuff and Chrome continues to run a representative
| workload since that path is more mature. I would guess
| that if it's not being used already, they're exploring
| how to make Chrome run AutoFDO for the same reason
| everyone started using ThinLTO - it brought most of the
| advantages while fixing the disadvantages that hampered
| adoption.
|
| And yes, while PGO is available natively, AutoFDO isn't
| quite as smooth.
| boomanaiden154 wrote:
| I'm not sure where you're getting your information from.
|
| Chrome (and many other performance-critical workloads) is
| using instrumented PGO because it gives better
| performance gains, not because it's a more mature path.
| AutoFDO is only used in situations where collecting data
| with an instrumented build is difficult.
| vlovich123 wrote:
| Last I looked, AutoFDO builds were similar in performance
| to PGO, much as ThinLTO is to LTO. I'd say that collecting
| data with an instrumented Chrome build is extremely
| difficult - you're relying on a synthetic benchmark
| environment which is very, very different from the real
| world (eg extensions aren't installed, the pattern of
| websites being browsed is not realistic, etc). There's also a 2x
| compile cost because you have to build Chrome twice in
| the exact same way + you have to run a synthetic
| benchmark on each build to generate the trace.
|
| I'm just using an educated guess to say that at some
| point in the future Chrome will switch to AutoFDO,
| potentially using traces harvested from end user
| computers (potentially just from their employees even to
| avoid privacy complaints).
| boomanaiden154 wrote:
| You can make the synthetic benchmarks relatively
| accurate, it just takes effort. The compile-time hit and
| additional effort is often worth it for the extra couple
| percent for important applications.
|
| Performance is also pretty different on the scales that
| performance engineers are interested in for these sorts
| of production codes, but without the build system
| scalability problems that LTO has. The original AutoFDO
| paper shows an improvement of 10.5%->12.5% going from
| AutoFDO to instrumented PGO. That is pretty big. It's
| probably even bigger with newer instrumentation based
| techniques like CSPGO.
|
| They also mention the exact reasons that AutoFDO will not
| perform as well, with issues in debug info and losing
| profile accuracy due to sampling inaccuracy.
|
| I couldn't find any numbers for Chrome, but I am
| reasonably certain that they have tried both and continue
| to use instrumented PGO for the extra couple percent.
| There are other pieces of the Chrome ecosystem
| (specifically the ChromeOS kernel) that are already
| optimized using sampling-based profiling. It's been a
| while since I last talked to the Chromium toolchain
| people about this though. I also remember hearing them
| benchmark FEPGO vs IRPGO (frontend- vs IR-level
| instrumentation) at some point and concluding that IRPGO
| was better.
| c0balt wrote:
| Yeah, and nixpkgs also, last time I checked, does patch
| GCC/Clang to ensure determinism. Many compilers and
| toolchains by default want to, e.g., embed build
| information that may leak from the build env in a non-
| deterministic / non-reproducible manner.
| munificent wrote:
| Yup. Even so much as inserting the build timestamp into
| the generated executable (which is strangely common)
| causes havoc with build caching.
| Sophira wrote:
| Plus, these models are entirely black boxes. Even given
| weights, we don't know how to look at them and meaningfully
| tell what's happening - and not only that, but training
| these models is likely not cheap at all.
|
| Stable output is how we can verify that attacks like the
| one described in Reflections on Trusting Trust[0] don't
| happen.
|
| [0] https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_198
| 4_Ref...
| a_t48 wrote:
| NVCC CUDA builds were nondeterministic last time I checked,
| it made certain things (trying to get very clever with
| generating patches) difficult. This was also hampered by
| certain libraries (maybe GTSAM?) wanting to write __DATE__
| somewhere in the build output, creating endlessly changing
| builds.
| sigmoid10 wrote:
| In parallel computing you run into nondeterminism pretty
| quickly anyway - especially with CUDA, because of
| nondeterministic execution order and floating-point
| rounding.
| a_t48 wrote:
| Yes, at runtime. Compiling CUDA doesn't require a GPU,
| though, and doesn't really use "parallel computing". I
| think CUDA via clang gets this right and will produce the
| same build every time - it was purely an NVCC issue.
| dheera wrote:
| LLMs can be deterministic if you set the random seed and
| pin it to a certain version of the weights.
|
| My bigger concern would be that bugs in the machine code
| would be very, very difficult to track down.
| azinman2 wrote:
| Just fix the random seed :)
| skybrian wrote:
| I'm amused by the possibility of a compiler having a flag
| to set a random seed (with a fixed default, of course).
|
| If you hit a compiler bug, you could try a different seed
| to see what happens.
|
| Or how about a code formatter with a random seed?
|
| Tool developers could run unit tests with a different seed
| until they find a bug - or hide the problem by finding a
| lucky seed for which you have no provable bugs :)
|
| Edit:
|
| Or how about this: we write a compiler as a
| nondeterministic algorithm where every output is correct,
| but they are optimized differently depending on an input
| vector of choices. Then use machine learning techniques to
| find the picks that produce the best output.
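|
| A toy version of that, with compile_with_choices() as a
| hypothetical hook (faked here so the sketch runs); an ML model
| would propose the choice vector instead of random search:
|
|     import random
|
|     def compile_with_choices(src: str, choices: list[int]) -> bytes:
|         # Hypothetical: every choice vector yields a correct
|         # binary, just optimized differently. Faked here.
|         return bytes(100 + sum(choices))
|
|     def search(src: str, n: int = 32, trials: int = 200):
|         best, best_size = None, float("inf")
|         for _ in range(trials):
|             choices = [random.randint(0, 1) for _ in range(n)]
|             size = len(compile_with_choices(src, choices))
|             if size < best_size:     # keep the smallest output
|                 best, best_size = choices, size
|         return best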
| pton_xd wrote:
| > It is normally not a necessary feature of a compiler to be
| determistic. A compiler should be correct against a
| specification.
|
| That sounds like a nightmare. Optimizing code to play nice
| with black-box heuristic compilers like V8's TurboFan is
| already, in fact, a continual maintenance nightmare.
|
| If you don't care about performance, non-deterministic
| compilation is probably "good enough." See TurboFan.
| stavros wrote:
| LLMs are deterministic. We inject randomness after the fact,
| just because we don't like our text being deterministic. Turn
| temperature to 0 and you're good.
| waldrews wrote:
| But temperature-0 LLMs don't exhibit the emergent
| phenomena we like, even in apparently non-creative tasks.
| The randomness is, in some sense, a cheap proxy for an
| infeasible search over all completion sequences, much like
| simulated annealing with zero temperature is a search for a
| local optimum but adding randomness makes it explore
| globally and find more interesting possibilities.
| sebzim4500 wrote:
| Sure but you could add pseudo random noise instead and
| get the same behavior while retaining determinism.
| refulgentis wrote:
| Temperature is at ~1.2 in this thread, here's some 0.0:
|
| - Yes, temperature 0.0 is less creative.
|
| - Injecting pseudo-random noise to get deterministic
| creative outputs is "not even wrong", in the Wolfgang
| Pauli sense. It's fixing something that isn't broken,
| with something that can't fix it, that if it could, would
| be replicating the original behavior - more simply, it's
| proposing non-deterministic determinism.
|
| - Temperature 0.0, in practice, is still an LLM. No
| emergent phenomena, in the sense "emergent phenomena" is
| used with LLMs, go missing. Many, many applications use
| this.
|
| - In simplistic scenarios, on very small models, 0.0
| could get stuck literally repeating the same token.
|
| - There's a whole other layer of e.g. repeat
| penalties/frequency penalties and such that are used
| during inference to limit this. Only OpenAI and llama.cpp
| expose repeat/frequency penalties.
|
| - Temperature 0.0 is still non-deterministic on e.g.
| OpenAI, though _substantially_ the same, and even _the_
| same most of the time. It's hard to notice differences.
| (Reproducible builds require extra engineering effort,
| the same way ensuring temperature = 0.0 is _truly_
| deterministic requires engineering effort.)
|
| - Pedantically, only temperature 0.0 _at the same seed_
| (initial state) is deterministic.
| Sophira wrote:
| Even then, though, the output could change drastically
| based on a single change to the input (such as a comment).
|
| That's not something you want in a compiler.
| wakawaka28 wrote:
| It is very important for a compiler to be deterministic.
| Otherwise you can't validate the integrity of binaries! We
| already have issues with reproducibility without adding this
| shit in the mix.
| riedel wrote:
| Reproducible builds are an edge case that requires
| deterministic compilation, for sure. But profile-based
| optimisation or linker address randomisation are sometimes
| also useful. Why rule out one thing for the other?
| Normally you can easily turn optimisation on and off
| depending on your need. Just do -O0 if you want
| determinism. But normally you should not rely on it (also
| at execution time).
| vessenes wrote:
| Thank you for the summary. My memory of SOTA on disassembly
| about a year ago was sub-30% accuracy, so this is a
| significant step forward.
|
| I do think the idea of a 90%+-ish forward and backward
| assembler LLM is pretty intriguing. There's bound to be a lot
| of uses for it; especially if you're of the mind that to get
| there it would have to have learned a lot about computers in
| the foundation model training phase.
|
| Like, you'd definitely want to have those weights somehow baked
| into a typical coding assistant LLM. Of course you'd be able to
| automate round one of a lot of historical archiving projects
| that would like to get compilable modern code but only have a
| binary -- you'd be able to turn some PDP-1 code into something
| that would compile on a modern machine. You'd probably be able
| to leverage it into building chip simulations / code easily, it
| would be really useful for writing Verilog (maybe)... anyway,
| the use cases seem pretty broad to me.
| boomanaiden154 wrote:
| Sure, performance is more interesting, but it's significantly
| harder.
|
| With code size, you just need to run the code through the
| compiler and you have a deterministic measurement for
| evaluation.
|
| Performance has no such metric. Benchmarks are expensive and
| noisy. Cost models seem like a promising direction, but they
| aren't really there yet.
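|
| E.g. the size signal is just a subprocess call away -- here via
| binutils' size tool (llvm-size prints the same Berkeley format),
| assuming the object file has already been produced:
|
|     import subprocess
|
|     def text_size(obj_path: str) -> int:
|         """Code bytes as reported by `size` (Berkeley format)."""
|         p = subprocess.run(["size", obj_path], check=True,
|                            capture_output=True, text=True)
|         # header:   text    data     bss     dec     hex filename
|         text, _data, _bss, *_ = p.stdout.splitlines()[1].split()
|         return int(text)
|
| Same input, same flags, same number every time -- which is what
| makes it usable as a training/evaluation signal.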
| joshuanapoli wrote:
| Maybe they are thinking about embedding a program generator and
| execution environment into their LLM inferencing loop in a
| tighter way. The model invents a program that guides the output
| in a specific/algorithmic way, tailored to the prompt.
| swyx wrote:
| some comments like this make me want to subscribe to you for
| all your future comments. thanks for doing the hard work of
| summarizing and taking the bold step of sharing your thoughts
| in public. i wish more HNers were like you.
| sigmoid10 wrote:
| >compilers behaving in a deterministic and predictable way is
| an important fundamental of pipelines. LLMs are inherently
| unpredictable, and so using an LLM for compilation /
| decompilation -- even an LLM that has 99.99% accuracy
|
| You're confusing different concepts here. An LLM is technically
| not unpredictable by itself (at least the ones we are talking
| about here; there are different problems with beasts like GPT-4
| [1]). The "randomness" of LLMs you are probably experiencing
| stems from the autoregressive completion, which samples from
| probabilities for a temperature T>0 (which is very common
| because it makes sense in chat applications). But there is
| nothing that prevents you from simply choosing greedy sampling,
| which would make your output 100% deterministic and
| reproducible. That is particularly useful for
| disassembling/decompiling and has the chance to vastly improve
| over existing tools, because it is common knowledge that they
| are often not the sharpest tools and humans are much better at
| piecing together working code.
|
| The other question here is accuracy for compiling. For that it
| is important whether the llm can follow a specification
| correctly. Because once you write unspecified behaviour, your
| code is fair game for other compilers as well. So the real
| question is how well it follows the spec and how good it is at
| dealing with situations where normal compilers would flounder.
|
| [1] https://152334h.github.io/blog/non-determinism-in-gpt-4/
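|
| Concretely, greedy decoding is just a generation-time switch; a
| sketch with Hugging Face transformers (the model id is an assumed
| placeholder, and as [1] notes, GPU kernels/batching can still
| leak tiny numerical differences in practice):
|
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     MODEL = "facebook/llm-compiler-7b"  # assumed id; any causal LM
|     tok = AutoTokenizer.from_pretrained(MODEL)
|     model = AutoModelForCausalLM.from_pretrained(MODEL)
|
|     inputs = tok("define i32 @f(i32 %x) {", return_tensors="pt")
|     out = model.generate(**inputs, max_new_tokens=64,
|                          do_sample=False)  # greedy: argmax each step
|     print(tok.decode(out[0]))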
| yowlingcat wrote:
| This was a big unlock for me -- I recall antirez saying the
| same thing in a comment replying to me when I asked a similar
| question about potential LLM features in Redis [1].
|
| [1] https://news.ycombinator.com/item?id=39617370
| barrkel wrote:
| Determinism for any given input isn't an interesting metric
| though. Inputs are always different, or else you could just
| replace it with a lookup function. What's important is
| reliability of the output given a distribution of inputs, and
| that's where LLMs are unreliable. Temperature sampling can be
| a technique to _improve_ reliability, particularly when things
| get into repetitive loops - though usually it's used to
| increase creativity.
| torginus wrote:
| LLMs are probably great at this. You can break down the code
| into functions or basic blocks. You can use the LLM to
| decompile them and then test if the decompilation results match
| the code when executed. You'll probably get this right after a
| few tries. Then you can train your model with the successful
| decompilation results so your model will get better.
| mmphosis wrote:
| https://xcancel.com/AIatMeta/status/1806361623831171318
| chad1n wrote:
| As usual, Twitter is impressed by this, but I'm very skeptical:
| the chance of it breaking your program is pretty high. The thing
| that makes optimizations so hard to write is that they have to
| match the behavior of the unoptimized code (unless you have UB),
| which is something that LLMs will probably struggle with since
| they can't exactly understand the code and its execution tree.
| ramon156 wrote:
| Let's be real, at least 40% of those comments are bots
| extheat wrote:
| People simply have no idea what they're talking about. It's
| just jumping onto the latest hype train. My first impression
| from the name was that it was actually some sort of compiler
| in and of itself -- i.e. programming language in, and pure
| machine code or some other IR out. It's got bits and pieces
| of that here and there, but that's not what it really is at
| all. It's more of a predictive engine for an optimizer, and
| not a very generalized one at that.
|
| What would be more interesting is training a large model on
| pure (code, assembly) pairs like a normal translation task.
| Presumably a very generalized model would be good at even
| doing the inverse: given some assembly, write code that will
| produce the given assembly. Unlike human language there is a
| finite set of possible correct answers here and you have the
| convenience of being able to generate synthetic data for
| cheap. I think optimizations would arise as a natural side
| effect this way: if there are multiple possible generation
| branches (like choosing between logits in an LLM) you could
| try different ones to see which is smaller in terms of byte
| code or faster in terms of execution.
| quonn wrote:
| > Presumably a very generalized model would be good at even
| doing the inverse: given some assembly, write code that
| will produce the given assembly.
|
| ChatGPT does this, unreliably.
| hughleat wrote:
| It can emulate the compiler (IR + passes -> IR or ASM).
|
| > What would be more interesting is training a large model
| on pure (code, assembly) pairs like a normal translation
| task.
|
| It is that.
|
| > Presumably a very generalized model would be good at even
| doing the inverse: given some assembly, write code that
| will produce the given assembly.
|
| It has been trained to disassemble. It is much, much better
| than other models at that.
| solarexplorer wrote:
| If I understand correctly, the AI is only choosing the
| optimization passes and their relative order. Each individual
| optimization step would still be designed and verified
| manually, and maybe even proven to be correct mathematically.
| boomanaiden154 wrote:
| Right, it's only solving phase ordering.
|
| In practice though, correctness even over ordering of hand-
| written passes is difficult. Within the paper they describe a
| methodology to evaluate phase orderings against a small test
| set as a smoke test for correctness (PassListEval) and
| observe that ~10% of the phase orderings result in assertion
| failures/compiler crashes/correctness issues.
|
| You will end up with a lot more correctness issues adjusting
| phase orderings like this than you would using one of the
| more battle-tested default optimization pipelines.
|
| Correctness in a production compiler is a pretty hard
| problem.
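|
| In LLVM terms a candidate ordering is just a -passes= string
| handed to opt, so a crash-only smoke test (a far cry from the
| paper's PassListEval, which also catches correctness issues via
| a test set) can be as small as:
|
|     import glob
|     import subprocess
|
|     PASSES = "instcombine,gvn,simplifycfg"  # candidate ordering
|
|     def survives(passes: str, ir_files) -> bool:
|         for f in ir_files:
|             r = subprocess.run(["opt", f"-passes={passes}", f,
|                                 "-S", "-o", "/dev/null"])
|             if r.returncode != 0:   # assertion failure / crash
|                 return False
|         return True                 # necessary, not sufficient
|
|     print(survives(PASSES, glob.glob("tests/*.ll")))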
| hughleat wrote:
| There are two models.
|
| - the foundation model is pretrained on asm and IR, then
| trained to emulate the compiler (IR + passes -> IR or asm)
|
| - the FTD model is fine-tuned for solving phase ordering and
| disassembling
|
| FTD is there to demo capabilities. We hope people will fine
| tune for other optimisations. It will be much, much cheaper
| than starting from scratch.
|
| Yep, correctness in compilers is a pain. Auto-tuning is a
| very easy way to break a compiler.
| Lockal wrote:
| As this LLM operates on the LLVM intermediate representation
| (IR), the result can be fed into
| https://alive2.llvm.org/ce/ and formally verified. For those
| who don't know what to put there: here is an example of the C++
| spaceship operator: https://alive2.llvm.org/ce/z/YJPr84 (try
| replacing -1 with -2 there to break it). This is kind of a
| Swiss Army knife for LLVM developers; they often start
| optimizations with this tool.
|
| What they missed is any mention of verification (they probably
| don't know about alive2) and a comparison with other compilers.
| It is very likely that LLM Compiler "learned" from GCC and,
| with huge computational effort, simply generates what GCC can
| do out of the box.
| boomanaiden154 wrote:
| I'm reasonably certain the authors are aware of alive2.
|
| The problem with using alive2 to verify LLM based compilation
| is that alive2 isn't really designed for that. It's an
| amazing tool for catching correctness issues in LLVM, but
| it's expensive to run and will time out reasonably often,
| especially on cases involving floating point. It's explicitly
| designed to minimize the rate of false-positive correctness
| issues to serve the primary purpose of alerting compiler
| developers to correctness issues that need to be fixed.
| hughleat wrote:
| Yep, we tried it :-) These were exactly the problems we had
| with it.
| boomanaiden154 wrote:
| I'm not sure it's likely that the LLM here learned from gcc.
| The size optimization work here is focused on learning phase
| orderings for LLVM passes/the LLVM pipeline, which wouldn't
| be at all applicable to gcc.
|
| Additionally, they train approximately half on assembly and
| half on LLVM-IR. They don't talk much about how they generate
| the dataset other than that they generated it from the
| CodeLlama dataset, but I would guess they compile as much
| code as they can into LLVM-IR and then just lower that into
| assembly, leaving gcc out of the loop completely for the vast
| majority of the compiler specific training.
| hughleat wrote:
| Yep! No GCC on this one. And yep, that's not far off how
| the pretraining data was gathered - but with random
| optimisations to give it a bit of variety.
| boomanaiden154 wrote:
| Do you have more information on how the dataset was
| constructed?
|
| It seems like somehow build systems were invoked given
| the different targets present in the final version?
|
| Was it mostly C/C++ (if so, how did you resolve missing
| includes/build flags), or something else?
| hughleat wrote:
| We plan to have a peer-reviewed version of the paper
| where we will probably have more details on that.
| Otherwise we can't give any more details than in the
| paper or post, etc. without going through legal, which
| takes ages. Science is getting harder to do :-(
| moffkalast wrote:
| > C++ spaceship operator
|
| > (A <=> B) < 0 is true if A < B
|
| > (A <=> B) > 0 is true if A > B
|
| > (A <=> B) == 0 is true if A and B are equal/equivalent.
|
| TIL of the spaceship operator. Was this added as an April
| Fools' joke?
| samatman wrote:
| This is one of the oldest computer operators in the game:
| the arithmetic IF statement from FORTRAN.
|
| It's useful for stable-sorting collections with a single
| test. Also, overloading <=> for a type gives all
| comparison operators "for free": ==, !=, <, <=, >=, >
| pklausler wrote:
| It also has no good definition for its semantics when
| presented with a NaN.
| sagarm wrote:
| You would use a partial ordering: https://en.cppreference
| .com/w/cpp/utility/compare/partial_or...
| pklausler wrote:
| How would that apply to Fortran's arithmetic IF
| statement? It goes to one label for a negative value, or
| to a second label for a zero, or to a third label for
| positive. A NaN is in none of these categories.
| moffkalast wrote:
| I mean maybe I'm missing something but it seems like it
| behaves exactly the same way as subtraction? At least for
| integers it's definitely the same, for floats I imagine
| it might handle equals better?
| samatman wrote:
| C++ has operator overloading, so you can define the
| spaceship for any class, and get every comparison
| operator from the fallback definitions, which use `<=>`
| in some obvious ways.
| sagarm wrote:
| The three-way comparison operator just needs to return a
| three-valued result (negative, zero, or positive), and many
| comparisons boil down to integer subtraction. strcmp is also
| defined this way.
|
| In C++20 the compiler will automatically use the spaceship
| operator to implement other comparisons if it is available,
| so it's a significant convenience.
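|
| The convention is just "negative, zero, or positive". In Python
| terms (no <=>, but the old cmp protocol had the same shape):
|
|     def three_way(a, b) -> int:
|         # < 0 if a < b, 0 if equal, > 0 if a > b; the same
|         # contract as strcmp and C++'s operator<=>.
|         return (a > b) - (a < b)
|
|     assert three_way(1, 2) < 0 and three_way("b", "a") > 0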
| bbor wrote:
| AFAIK this is a heuristic, not a category. The underlying
| grammar would be preserved.
|
| Personally I thought we were way too close to perfect to make
| meaningful progress on compilation, but that's probably just
| naivete
| boomanaiden154 wrote:
| I would not say we are anywhere close to perfect in
| compilation.
|
| Even just looking at inlining for size, there are multiple
| recent studies showing ~10+% improvement
| (https://dl.acm.org/doi/abs/10.1145/3503222.3507744,
| https://arxiv.org/abs/2101.04808).
|
| There is a massive amount of headroom, and even tiny bits
| still matter as ~0.5% gains on code size, or especially
| performance, can be huge.
| cec wrote:
| Hey! The idea isn't to replace the compiler with an LLM, the
| tech is not there yet. Where we see value is in using these
| models to guide an existing compiler. E.g. orchestrating
| optimization passes. That way the LLM won't break your code,
| nor will the compiler (to the extent that your compiler is free
| from bugs, which can be tricky to detect - cf. Sec 3.1 of our
| paper).
| verditelabs wrote:
| I've done some similar LLM compiler work, obviously not on
| Meta's scale, teaching an LLM to do optimization by feeding
| an encoder/decoder model pairs of -O0 and -O3 code, and even
| at my small scale I managed to get the LLM to spit out the
| correct optimization every once in a while.
|
| I think there's a lot of value in LLM compilers being used
| specifically for superoptimization, where you can generate
| many possible optimizations, verify their correctness, and
| pick the most optimal one. I'm excited to see where y'all
| go with this.
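|
| That best-of-N loop is simple to sketch; propose_optimized,
| verify and code_size are hypothetical hooks standing in for the
| model, a checker (tests, round-trip, alive2, ...) and a size
| measurement:
|
|     def propose_optimized(ir: str) -> str:
|         """Hypothetical: one sampled attempt from the model."""
|         raise NotImplementedError
|
|     def verify(original_ir: str, candidate_ir: str) -> bool:
|         """Hypothetical: tests / round-trip / formal checker."""
|         raise NotImplementedError
|
|     def code_size(ir: str) -> int:
|         """Hypothetical: compile the IR and measure .text size."""
|         raise NotImplementedError
|
|     def superoptimize(ir: str, n: int = 16):
|         candidates = [propose_optimized(ir) for _ in range(n)]
|         verified = [c for c in candidates if verify(ir, c)]
|         return min(verified, key=code_size, default=None)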
| viraptor wrote:
| Thank you for freeing me from one of my to-do projects. I
| wanted to do a similar autoencoder with optimisations. Did
| you write about it anywhere? I'd love to read the details.
| verditelabs wrote:
| No writeup, but the code is here:
|
| https://github.com/SuperOptimizer/supercompiler
|
| There's code there to generate unoptimized / optimized
| pairs via C generators like yarpgen and csmith, then
| compile, train, run inference, and disassemble the results.
| hughleat wrote:
| Yes! An AI building a compiler by learning from a super-
| optimiser is something I have wanted to do for a while now
| :-)
| moffkalast wrote:
| > The idea isn't to replace the compiler with an LLM, the
| tech is not there yet
|
| What do you mean the tech isn't there _yet_, why would it
| ever even go in that direction? I mean, we do those kinds of
| things for shits and giggles, but for any practical use? I
| mean, come on. From fast and reliable to glacial and not even
| working a quarter of the time.
|
| I guess maybe if all compiler designers die in a freak
| accident and there's literally nobody to replace them, then
| we'll have to resort to that after the existing versions
| break.
| swyx wrote:
| then maybe don't name it "LLM Compiler", just "Compiler
| Guidance with LLMs" or "LLM-aided Compiler Optimization" or
| something - it would be much more to the point without
| overpromising
| nickpsecurity wrote:
| Yeah, the name was misleading. I thought it was going to be
| source-to-object translation, maybe with techniques like the
| ones used to translate foreign languages.
| namaria wrote:
| This feels like going insane honestly. It's like reading that
| people are super excited about using bouncing castles to mix
| concrete.
| ldjkfkdsjnv wrote:
| I love this company. Advancing ai and keeping the rest of us in
| the loop.
| LoganDark wrote:
| I hate the company (Facebook), but I still think their
| publicly releasing a bunch of the research they've been doing
| (and the models they've been making) has been a net good for
| almost everybody, at least in terms of exploring the field of
| LLMs.
| ein0p wrote:
| My love for Meta is strictly confined to FAIR and the PyTorch
| team. The rest of the company is basically cancer.
| Slyfox33 wrote:
| Is this a bot comment?
| muglug wrote:
| Unlike many other AI-themed papers at Meta, this one omits any
| mention of the model output getting used at Instagram, Facebook
| or Meta. Research is great! But it doesn't seem all that
| actionable today.
| boomanaiden154 wrote:
| This would be difficult to deploy as-is in production.
|
| There are correctness issues mentioned in the paper regarding
| adjusting phase orderings away from the well-trodden
| O0/O1/O2/O3/Os/Oz path. Their methodology works for a research
| project quite well, but I personally wouldn't trust it in
| production. While some obvious issues can be caught by a small
| test suite and unit tests, there are others that won't be, and
| that's really risky in production scenarios.
|
| There are also some practical software engineering issues,
| like deploying the model inside the compiler. There is
| actually tooling in upstream LLVM to do this
| (https://www.youtube.com/watch?v=mQu1CLZ3uWs), but running
| models on a GPU would be difficult and I would expect CPU
| inference to massively blow up compile times.
| soist wrote:
| How do they verify the output preserves semantics of the input?
| hughleat wrote:
| For the disassembler we round trip. x86 ->(via model) IR ->(via
| clang) x86. If they are identical then the IR is correct. Could
| be correct even if not identical, but then you need to check.
|
| For the auto-tuning, we suggest the best passes to use in LLVM.
| We take some effort to weed out bad passes, but LLVM has bugs.
| This is common to any auto-tuner.
|
| We train it to emulate the compiler. The compiler does that
| better already. We do it because it helps the LLM understand
| the compiler better and it auto-tunes better as a result.
|
| We hope people will use this model to fine-tune for other
| heuristics. E.g. an inliner which accepts the IR of the caller
| and callee to decide profitability. We think things like that
| will be vastly cheaper for people if they can start from LLM
| Compiler. Training LLMs from scratch is expensive :-)
|
| IMO, right now, AI should be used to decide profitability not
| correctness.
| soloist11 wrote:
| Have you guys applied this work internally to optimize Meta's
| codebase?
| zitterbewegung wrote:
| Some previous work in the space is at
| https://github.com/albertan017/LLM4Decompile
| LoganDark wrote:
| Reading the title, I thought this was a tool for optimizing and
| disassembling LLMs, not an LLM designed to optimize and
| disassemble. Seeing it's just a model is a little disappointing
| in comparison.
| jameshart wrote:
| Pretty sure I remember trading 300 creds for a Meta Technologies
| Neural Optimizer and Disassembler in one of the early _Deus Ex_
| games.
| nothrowaways wrote:
| It is so funny that Meta has to post it on X.
| rising-sky wrote:
| https://www.threads.net/@aiatmeta/post/C8ubaKupPwC
| 0x1ceb00da wrote:
| Wouldn't "Compiler LLM" be a more accurate name than "LLM
| Compiler"?
| hughleat wrote:
| Never let a computer scientist name anything :-)
| Havoc wrote:
| I don't understand the purpose of this. Feels like a task for
| function calling and sending it to an actual compiler.
|
| Is there an obvious use case I'm missing?
| singularity2001 wrote:
| GPT-6 can write software directly (as assembly) instead of
| writing C first.
|
| Lots of training data for binary, and it can train itself by
| seeing if the program does what it expects it to do.
| dunefox wrote:
| Is this GPT-6 in the room with us now?
| killerstorm wrote:
| This is not a product, it's a research project.
|
| They don't expect you to use this.
|
| Applications might require further research. And the main
| takeaway might not be "here's a tool to generate code", but
| "LLMs are able to understand binary code, and thus we can train
| them to do ...".
| zellyn wrote:
| I continue to be fascinated by what the next qualitative
| iteration of models will be, marrying the language processing and
| broad knowledge of LLMs with an ability to reason rigorously.
|
| If I understand correctly, this work (or the most obvious
| productionized version of it) is similar to the work DeepMind
| released a while back: the LLM is essentially used for
| "intuition"---to pick the approach---and then you hand off to
| something mechanical/rigorous.
|
| I think we're going to see a huge growth in that type of system.
| I still think it's kind of weird and cool that our meat brains
| with spreading activation can (with some amount of
| effort/concentration) switch over into math mode and manipulate
| symbols and inferences rigorously.
___________________________________________________________________
(page generated 2024-06-29 23:01 UTC)