[HN Gopher] Diffusion on syntax trees for program synthesis
       ___________________________________________________________________
        
       Diffusion on syntax trees for program synthesis
        
       Author : pouwerkerk
       Score  : 377 points
       Date   : 2024-06-04 00:52 UTC (22 hours ago)
        
 (HTM) web link (tree-diffusion.github.io)
 (TXT) w3m dump (tree-diffusion.github.io)
        
       | sakras wrote:
       | This is very cool! My first thought is: can this be applied to
       | converting raster graphics to vector graphics (eg PNG to SVG)?
       | Seems like a very similar problem, though probably much more
       | computationally expensive.
        
         | szvsw wrote:
         | > We apply our approach to inverse graphics tasks, where our
         | model learns to convert images into programs that produce those
         | images.
         | 
         | I would argue that at least on a philosophical level, this is,
         | definitionally, the process of converting raster graphics to
          | vector graphics, as long as you buy the premise that the
         | difference between the two is simply that vector gfx is a
         | programmatic/imperative representation of image generation,
         | while raster is a data structure/declarative representation of
         | images.
         | 
         | In other words, simply put, raster images are just arrays,
         | vector images are sequences of instructions. Raster images are
         | naturally much closer to the "raw" data needed by the output
         | mechanism, while vector images require a much more complex
         | interpreter.
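          | 
          | To make that concrete, here's a toy sketch (nothing to do
          | with the paper's actual encoding): the same image as a
          | raster array versus a tiny "vector program" of instructions
          | that an interpreter has to replay.
          | 
          |     # raster: a plain array of samples
          |     raster = [
          |         [0, 0, 1, 1],
          |         [0, 0, 1, 1],
          |         [1, 1, 0, 0],
          |         [1, 1, 0, 0],
          |     ]
          | 
          |     # "vector": instructions for an interpreter to execute
          |     vector = [
          |         ("rect", {"x": 2, "y": 0, "w": 2, "h": 2}),
          |         ("rect", {"x": 0, "y": 2, "w": 2, "h": 2}),
          |     ]
          | 
          |     def rasterize(instrs, size=4):
          |         img = [[0] * size for _ in range(size)]
          |         for op, p in instrs:   # only "rect" in this toy
          |             for y in range(p["y"], p["y"] + p["h"]):
          |                 for x in range(p["x"], p["x"] + p["w"]):
          |                     img[y][x] = 1
          |         return img
          | 
          |     assert rasterize(vector) == raster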
        
           | adrianmonk wrote:
           | Or, raster and vector images are philosophically _the same
           | thing_. The only difference is that vector has more
            | operations than raster. Raster just has "draw unit square at
           | integer coordinates".
        
             | szvsw wrote:
             | Yep, I agree with this - tried to hint at that when I said
             | "vector images require a much more complex interpreter".
        
             | DougBTX wrote:
             | On the other hand, A Pixel Is Not A Little Square[0] would
              | disagree: a raster is a grid of samples of a continuous
              | function.
             | 
             | [0] http://alvyray.com/Memos/CG/Microsoft/6_pixel.pdf
        
               | code_biologist wrote:
               | Though I agree with the point that paper makes (it makes
               | a good case that the little square mental model of a
               | pixel is mostly inappropriate) it does seem focused on
               | imaging and does not mention places where the little
               | square mental model is appropriate.
               | 
               | Pictures of cats, subpixel rendered text, or company logo
               | SVGs as displayed on a web page are point samples and not
               | little squares.
               | 
               | User interfaces are good examples of often being little
               | squares. Calling HN's beige background a point sampled
               | discrete representation of an underlying continuous
               | function seems pretty tortured -- to me it seems like a
               | bunch of beige little squares.
        
           | andybak wrote:
           | > vector gfx is a programmatic/imperative representation of
           | image generation, while raster is a data
           | structure/declarative representation of images.
           | 
           | This seems a bit off to me.
           | 
            | Aside from oddities such as PostScript, most vector formats
           | are canonical examples of declarative code. The distinction
           | is more about what is being represented rather than how it is
           | represented.
        
       | pmayrgundter wrote:
       | It's funny, this kind of subtree mutation was looked at pretty
       | deeply by Koza and Adami in the 90s under the rubric of Genetic
       | Algorithms, but with a slightly different optimization function
       | 
        | There's one ref in the paper, from 2000, to GAs for fast
        | generation of program trees, but that's missing the main show.
        | 
        | Hope they're reading this and dig into those guys' work.
        
         | 29athrowaway wrote:
         | You can also say backpropagation is the chain rule from
         | centuries ago.
        
           | elijahbenizzy wrote:
            | Backpropagation is just an application of the chain rule --
           | cool that we all learned it in high school!
        
           | telotortium wrote:
           | It is a computationally clever application of the chain rule
           | to minimize the amount of computation needed to compute
           | gradients for all parameters in the network.
        
             | om8 wrote:
             | > to minimize the amount of computation
             | 
             | IMO backprop is the most trivial implementation of
              | differentiation in neural networks. Do you know an easier
              | way to compute gradients, even one with larger overhead?
              | If so, please share it.
        
               | QuadmasterXLII wrote:
               | My first forays into making neural networks used
               | replacement rules to modify an expression tree until all
               | the "D" operators went away, but that takes exponential
               | complexity in network depth if you aren't careful. Finite
               | differences is linear in number of parameters, as is
               | differentiation by Dual Numbers
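                | 
                | For the curious, the dual-number version fits in a
                | few lines (toy, scalar-only sketch) -- the catch
                | being one forward pass per input, hence linear in
                | the parameter count:
                | 
                |     class Dual:
                |         # carry a value and its derivative together
                |         def __init__(self, v, d=0.0):
                |             self.v, self.d = v, d
                |         def __add__(self, o):
                |             o = o if isinstance(o, Dual) else Dual(o)
                |             return Dual(self.v + o.v, self.d + o.d)
                |         def __mul__(self, o):
                |             o = o if isinstance(o, Dual) else Dual(o)
                |             return Dual(self.v * o.v,
                |                         self.d * o.v + self.v * o.d)
                | 
                |     def f(x):
                |         return x * x + x   # f'(x) = 2x + 1
                | 
                |     x = Dual(3.0, 1.0)     # seed dx/dx = 1
                |     print(f(x).v, f(x).d)  # 12.0 7.0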
        
               | eli_gottlieb wrote:
               | Backprop is the application of dynamic programming to the
               | chain rule for total derivatives, which sounds trivial
               | only in retrospect.
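                | 
                | A toy scalar version of that framing (my own
                | illustration): one pass in reverse topological
                | order, so each node's adjoint is computed once and
                | then reused by every parent:
                | 
                |     class Node:
                |         def __init__(self, v, parents=()):
                |             self.v = v
                |             self.parents = list(parents)
                |             self.grad = 0.0
                | 
                |     def mul(a, b):
                |         return Node(a.v * b.v,
                |                     [(a, b.v), (b, a.v)])
                | 
                |     def add(a, b):
                |         return Node(a.v + b.v,
                |                     [(a, 1.0), (b, 1.0)])
                | 
                |     def backward(out):
                |         order, seen = [], set()
                |         def visit(n):
                |             if id(n) not in seen:
                |                 seen.add(id(n))
                |                 for p, _ in n.parents:
                |                     visit(p)
                |                 order.append(n)
                |         visit(out)
                |         out.grad = 1.0
                |         for n in reversed(order):
                |             for p, local in n.parents:
                |                 # chain rule, adjoints memoized
                |                 p.grad += local * n.grad
                | 
                |     x, y = Node(2.0), Node(3.0)
                |     z = add(mul(x, y), x)    # z = x*y + x
                |     backward(z)
                |     print(x.grad, y.grad)    # 4.0 2.0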
        
               | Jensson wrote:
                | You can do forward propagation. Humans typically find
                | forward-mode easier than backward.
        
               | tripzilch wrote:
               | since you asked ... how about Monte Carlo with Gibbs
               | sampling?
        
           | pmayrgundter wrote:
           | I totally didn't realize this until these comments. Neat!
           | 
            | I went digging in Wikipedia... the Backpropagation article was
            | created in 2005, and yet its derivation from the chain rule
            | wasn't mentioned until 2014, through a borrow from the German
            | article
           | 
           | https://en.wikipedia.org/w/index.php?title=Backpropagation&o.
           | ..
        
         | tithe wrote:
         | Are these the references?
         | 
         | - https://web.archive.org/web/20021224053225/http://smi-
         | web.st...
         | 
         | - https://www.genetic-programming.com/jkpdf/tr1314.pdf
        
         | verdverm wrote:
         | Some more recent alternatives to Koza's GP use some very
         | different search mechanisms. FFX & PGE are both very fast.
         | 
         | https://seminars.math.binghamton.edu/ComboSem/worm-chiu.pge_...
         | 
         | https://arxiv.org/pdf/2209.09675
         | 
         | I authored PGE and have thought that RL, and more recently
         | diffusion techniques, might help these algos. All of the algos
         | need better ways to guide the search, or help it get unstuck
         | from local optima, which happens surprisingly fast. Most work
         | in GP / EVO is about avoiding premature convergence.
        
         | teruakohatu wrote:
          | These kinds of Genetic Algorithms are still being researched in
         | academia. I attended a seminar a couple of years ago on the
         | subject. It's still a total dead end imho.
        
           | Chabsff wrote:
           | I used to (early 00s) be super big into GA and GP until a
           | professor of mine at the time described the whole class of
           | algorithms as "Marginally better than brute forcing". That
           | really resonated with my experience, and was just too spot-on
           | to ignore.
        
             | bionhoward wrote:
              | GP finds good working trees for complex real-world problems
              | in a single night on a consumer GPU, trees that brute force
              | would definitely need (millions of) years to find. Quite a
              | margin!
        
         | sandos wrote:
         | Wow, what a flash from the past! Was playing around a LOT with
         | GP around that time, and the name Koza is certainly familiar. I
          | even think I did some semi-similar things, as opposed to my
          | normal approach, which was simplistic but inefficient in that
          | it generated lots of invalid code.
        
         | vidarh wrote:
         | Genetic _Programming_ [1], specifically. I have both his two
          | bricks from '92 and '94 (Genetic Programming: On the
          | Programming of Computers by Means of Natural Selection, and
          | Genetic Programming II: Automatic Discovery of Reusable
         | Programs). I've not read his two later ones.
         | 
          | The big problems they seemed to get stuck on were partially
          | doing it _fast enough_, and partially ending up with a result
          | that was comprehensible. The latter in particular seems to be
          | far better with LLMs. You tended to spend a lot of time
          | reorganising and pruning trees to get something you could
          | decipher, and so the primary, and too limited, value seemed to
          | be algorithms where you could invest a lot of resources into
          | finding more optimal versions of very small/compact algorithms
          | you could justify spending time on. But the challenge there is
          | that there are often _so many_ far lower-hanging fruits in most
          | code bases that few people get to the point where it's worth
          | trying.
         | 
         | I still love the idea at a conceptual level...
         | 
         | [1] https://www.genetic-programming.com/johnkoza.html
        
         | pmayrgundter wrote:
          | In my last comment I suggested the authors may not be familiar
          | with Koza+Adami, but I didn't realize the corresponding author
          | is Stuart Russell, co-author with Peter Norvig of "Artificial
          | Intelligence: A Modern Approach"... "The authoritative, most-
          | used AI textbook, adopted by over 1500 schools," according to
          | their site. https://aima.cs.berkeley.edu/
         | 
         | Whoops!
        
       | pamelafox wrote:
       | I would love to see them try this with the Processing/P5Js
        | libraries. That's what the ASTs reminded me of. It could
       | potentially be used to help students trying to figure out how to
       | fix their programs. I used AST-based hints for my ProcessingJS
       | courses on Khan Academy, but I handwrote the AST patterns for
       | those hints.
        
       | dwlg00 wrote:
       | I'm failing to see how this is novel. It looks like they're doing
       | diffusion on a representation system for 2D graphics, which is
        | very different from an actual program (they do address this
       | limitation to be fair)
        
         | revalo wrote:
         | Yeah, this is true! These are more like expressions rather than
         | programs. We were mostly following the language used by
         | previous work, https://arxiv.org/abs/1906.04604
        
           | hobofan wrote:
           | Couldn't this be used to do HTML generation from designs?
           | Especially when combined with multiple viewport sizes at the
           | same time, generating a fluid HTML layout would be pretty
           | awesome.
        
       | dinobones wrote:
       | I don't understand the "magic" here.
       | 
       | In a traditional approach, you would generate random images,
       | calculate some distance metric, then use some optimization method
       | like simulated annealing to minimize the distance.
       | 
       | I get that the difference between the image representations is
        | being optimized here, but how is it possible that changing tokens
       | in a program is differentiable?
        
         | revalo wrote:
         | Changing tokens in a program is not differentiable. For me, the
         | key idea is that you can train a neural model to suggest edits
         | to programs by randomly mutating nodes. And when you run this
         | neural model, you get to make edits that are syntactically
         | correct (i.e., a number will only replace a number etc.)
         | according to a context-free grammar.
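          | 
          | A toy illustration of just the mutation part (a made-up
          | grammar here, much simpler than the one we actually use):
          | regrow a randomly chosen subtree, and whatever comes out
          | still parses.
          | 
          |     import random
          | 
          |     # toy grammar:
          |     #   expr := ("circle", r)
          |     #         | ("union", expr, expr)
          |     #         | ("move", dx, dy, expr)
          |     def random_expr(depth=0):
          |         if depth > 2 or random.random() < 0.4:
          |             return ("circle", random.randint(1, 5))
          |         if random.random() < 0.5:
          |             return ("union", random_expr(depth + 1),
          |                     random_expr(depth + 1))
          |         return ("move", random.randint(-4, 4),
          |                 random.randint(-4, 4),
          |                 random_expr(depth + 1))
          | 
          |     CHILDREN = {"circle": (), "union": (1, 2),
          |                 "move": (3,)}
          | 
          |     def paths(e, prefix=()):
          |         yield prefix
          |         for i in CHILDREN[e[0]]:
          |             yield from paths(e[i], prefix + (i,))
          | 
          |     def replace(e, path, new):
          |         if not path:
          |             return new
          |         i = path[0]
          |         return (e[:i]
          |                 + (replace(e[i], path[1:], new),)
          |                 + e[i + 1:])
          | 
          |     def mutate(e):
          |         # an expr replaces an expr, so the result is
          |         # always grammatical
          |         p = random.choice(list(paths(e)))
          |         return replace(e, p, random_expr())
          | 
          |     prog = random_expr()
          |     print(prog)
          |     print(mutate(prog))   # "noised" version of prog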
        
       | montyanderson wrote:
        | This is fascinating. I've been trying to envisage how the new
        | language models will have a deeper or lower-level role in
        | software production than simple code generation.
        
         | passion__desire wrote:
          | I think browsers could be the next iteration. Website backends
          | will have premade flows, e.g. transfer money from my account
          | to another account, etc. And through fluid UIs, the website
          | will collect the info needed and the necessary approvals before
          | flow submission. AI-based browser DOM manipulation.
        
       | artninja1988 wrote:
        | Surprised to see Stuart Russell's name on this, as I thought he
        | was fully consumed by the doomsday cult. Although he's the last
        | author, so he's probably only on it because he's the head of the
        | lab.
        
         | samatman wrote:
         | A lot of doomers work on AI. While frowning, and shaking their
         | heads very gravely, so you know they don't approve.
        
           | optimalsolver wrote:
           | If we don't get to AGI first, the bad guys will.
        
             | sgt101 wrote:
             | hang on, what if...
             | 
             | we're the bad guys?
        
         | ngruhn wrote:
          | I haven't heard anyone make a sane case against the doomsday
         | argument. Only attacks.
        
           | ImHereToVote wrote:
           | You can't be scared while you laugh at someone. So laughing
            | is a good thing to do to stave off fear.
        
         | robxorb wrote:
         | What's with the last author/first author thing in science
         | papers?
         | 
         | I've read several times that the author listed last is usually
         | the most significant contributor, and the first author the
         | least significant, due to some kind of tradition around modesty
         | plus favourably introducing new names. (Which then of course
         | doesn't work, if everyone knows it's happening...)
         | 
         | Here, you've interpreted it as the reverse, and by that I mean
         | in the sensible way - did you not know about the tradition, or
         | are you wrong? And how can we know either way for sure?
        
           | optimalsolver wrote:
           | >the author listed last is usually the most significant
           | contributor
           | 
           | Where did you read that?
           | 
           | That's definitely not the case in machine learning.
        
             | robxorb wrote:
             | I've read it in several places over the years, and just had
             | a search now to find a reference to cite. Here's a PubMed
             | paper on the subject:
             | 
             | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3010799/
             | 
             | (And note how points 1 through 4 all quite conflict with
             | each other!)
        
               | pama wrote:
                | If it helps, think of the first author as the lead
               | engineer or the CEO, and the last author as the board or
               | the VC. In some areas of science (or some teams) the last
               | author is closer to a CEO, in others closer to a VC (they
               | almost always have the powers of the board). This picture
               | does not contradict the guideline in the reference you
               | shared. Typically, most of the work and writing of the
               | paper is done by the first author, though sometimes, for
               | example when a student gives up, only most of the writing
               | of the paper. The main ideas may be from the first author
               | or from the last author, and rarely in between, but the
               | sorting typically goes by amount of labor/contributions
               | on the task. In some narrow subfields, including most
               | math, sorting is alphabetical or random.
        
               | robxorb wrote:
               | That does help, thanks! Very intuitive analogy. Maybe
               | this kind of organisational structure is a kind of
               | natural archetype people gravitate towards when coming
               | together to break new ground.
        
           | fabian2k wrote:
           | Conventions vary by field, but within a specific field
           | they're usually pretty consistent. In natural sciences
           | (except large physics papers) the convention is that the
           | first author is the one doing most of the practical work. The
           | last author is the PI (principal investigator) of the group
           | who had a hand in designing the experiments and oversaw the
           | research. Now, the latter can mean anything from barely doing
           | any work on the paper to being deeply involved in the
           | research.
           | 
           | If you're reading papers most of the time the last author is
           | more meaningful to you because they're the senior researcher,
           | you know their research interests and what kind of papers
           | they produce. The first authors are PhD students and
           | PostDocs, they change much more often.
        
             | robxorb wrote:
             | Thank you, that makes my apparently half-formed prior
             | understanding make a lot more sense. Seems the path forward
              | is researching the greater context of the last author's
             | work, and of course the common convention for each field.
             | 
              | E.g., as a sibling commented, this convention may not be
              | common in ML research; I wonder, then, if it may be
              | something less common in emergent fields, where researchers
              | are generally younger and less likely to follow older
              | traditions.
        
       | behnamoh wrote:
       | How is it different from genetic algorithms that mutate the
       | syntax tree until the target output is achieved?
        
         | Karellen wrote:
         | It's different in the same way that using an LLM instead of a
         | traditional Markov chain is a different way of generating text.
          | You're still predicting one word at a time to hopefully
         | end up with plausible sentences/paragraphs, but the difference
         | is in how you model the training dataset, and how you use that
         | model to make each next choice in your live application.
        
       | gastonmorixe wrote:
        | Could diffusion work at the binary level? I mean, could we train
        | a diffusion model to generate the final binary of a program given
        | a prompt? An AST may be better, but the binary, I feel, is
        | extremely easy to at least test quickly for whether it works or
        | not. Though there may be a lot of drawbacks, if this is possible
        | I can't wait until we ask "give me an app that does that" and the
        | diffusion model starts generating all the bytes of the app to do
        | the job. Just wondering.
        
         | dcreater wrote:
          | That would be mind-blowing. Why go through all the
          | intermediary steps, especially through Python and JS, when you
          | can generate machine code directly?
        
           | treyd wrote:
           | Interpretability, reasoning about the generated code,
           | portability, etc. Probably also demands a larger model with a
           | higher cost of training (and likely more training data) to
           | target a more unstructured "language" like machine code.
        
           | eternauta3k wrote:
           | If your model is error-prone, having control structures,
           | types and other compile-time checks is very valuable. It's
           | harder to constrain arbitrary machine code to make something
           | sensible.
        
             | mejutoco wrote:
             | Intuitively it makes sense, but I am not fully convinced
                | about this. You could give it only a few registers and
                | discard invalid operations for certain registers, or
                | plainly known-invalid operations.
        
               | treyd wrote:
               | But that doesn't stop it from generating code that
               | segfaults.
        
               | mejutoco wrote:
               | That is the same problem as generating python that blows
               | up, no? (assuming it is tested in a sandbox)
        
         | fulafel wrote:
         | Editing with feedback from program output, like in this work,
         | could be more closely applicable if you first disassemble the
         | binary and have it edit things in the assembly language AST,
         | then reassemble. This would result in a higher likelihood of
         | creating a valid program.
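          | 
          | A rough sketch of that round trip using capstone/keystone
          | (assuming those libraries; this hand-waves away relocations,
          | labels and layout, which is where it gets hard):
          | 
          |     from capstone import Cs, CS_ARCH_X86, CS_MODE_64
          |     from keystone import Ks, KS_ARCH_X86, KS_MODE_64
          | 
          |     md = Cs(CS_ARCH_X86, CS_MODE_64)
          |     ks = Ks(KS_ARCH_X86, KS_MODE_64)
          | 
          |     # xor rdi, rdi; mov rax, 0x3c; syscall
          |     code = bytes.fromhex("4831ff48c7c03c0000000f05")
          | 
          |     # binary -> assembly (a flat list here, not a real AST)
          |     asm = ["%s %s" % (i.mnemonic, i.op_str)
          |            for i in md.disasm(code, 0x1000)]
          | 
          |     # stand-in for the learned edit: tweak an immediate
          |     asm[1] = asm[1].replace("0x3c", "0x2a")
          | 
          |     # assembly -> binary again
          |     new_code, _ = ks.asm("; ".join(asm))
          |     new_code = bytes(new_code)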
        
       | zelphirkalt wrote:
       | This sounds more similar to what people have done with Racket and
       | hint generation for MOOCs. Not sure which university it is again,
       | but I saw a presentation about how they generate hints for
       | students by mutating the syntax tree and analyzing how they had
       | to modify it, to get to a target solution. It was presented at
       | some RacketCon, maybe a decade ago already. Perhaps that
        | knowledge of how to do it can be combined with newer machine
       | learning approaches?
       | 
       | EDIT: I found the talk:
       | https://invidious.baczek.me/watch?v=ijyFC36kVis
        
       | aquarius0 wrote:
       | The application to inverse graphics tasks reminds me of this
       | paper which was released one week earlier:
       | https://arxiv.org/abs/2405.15306
        
       | flakiness wrote:
       | The PDF is super slow to render, I guess because it contains
       | commands from programmatically generated figures. It gives it a
        | kind of academic-paper feel I miss these days.
       | 
       | https://arxiv.org/pdf/2405.20519
        
       | lwansbrough wrote:
       | I'd like to see it with SDFs!
        
         | grondilu wrote:
         | Please elaborate. Are you thinking of approximating the
         | distance function with an algebraic expression, with algebra
         | itself being the "programming language"?
        
           | Philpax wrote:
           | You can represent arbitrary shapes through the composition of
           | SDFs: https://iquilezles.org/articles/distfunctions/
           | 
           | These can be treated as parameterised nodes in a tree,
           | similar to what's happening here. It follows that there may
           | be a possible adaptation of this to SDF composition, such
           | that you can give it a shape and have it produce the SDF
           | nodes + composition required to produce that shape.
           | 
           | Most existing approaches to SDFs with NNs have the NN itself
           | take on the role of the SDF (i.e. given a point, it predicts
           | the distance), so there's a compelling opportunity here to
           | build a system that can produce spatial representations from
           | existing imagery without NN inference at render-time.
           | 
           | I imagine adding the third dimension to the problem makes it
           | much harder, though! I'll have to give the paper a read to
           | determine how coupled to 2D their current approach is.
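            | 
            | As a toy 2D illustration (my sketch, not from the paper):
            | the "program" is just a tree of parameterised SDF nodes,
            | and rendering it gives you something to compare against
            | the target image.
            | 
            |     import math
            | 
            |     def circle(cx, cy, r):
            |         return lambda x, y: math.hypot(x - cx,
            |                                        y - cy) - r
            | 
            |     def box(cx, cy, hw, hh):
            |         def d(x, y):
            |             qx = abs(x - cx) - hw
            |             qy = abs(y - cy) - hh
            |             return (math.hypot(max(qx, 0), max(qy, 0))
            |                     + min(max(qx, qy), 0))
            |         return d
            | 
            |     def union(a, b):
            |         return lambda x, y: min(a(x, y), b(x, y))
            | 
            |     def subtract(a, b):
            |         return lambda x, y: max(a(x, y), -b(x, y))
            | 
            |     # the "program": a tree of parameterised SDF nodes
            |     shape = subtract(box(0, 0, 4, 3), circle(2, 0, 2))
            | 
            |     # rasterise to compare against a target image
            |     img = [["#" if shape(x, y) < 0 else "."
            |             for x in range(-6, 7)]
            |            for y in range(-5, 6)]
            |     print("\n".join("".join(r) for r in img))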
        
       | grondilu wrote:
       | The application to graphics is interesting. It seems to me that
       | current image generation models struggle with stylized pictures
       | ("ligne claire" in comics, geometric shapes and so on). After all
       | this kinds of pictures should be easy to encode in vectoriel
       | formats (like SVG), which are basically programming languages.
        
       | machiaweliczny wrote:
        | I had an idea about doing something similar based on the
        | DiffusER paper. One would need to model code edits as an
        | algebra: similar to add char, replace char or delete char, but
        | with operations like define func, remove func, define var, etc.
        | I am undereducated to do it myself but have a feeling it could
        | work.
        
         | machiaweliczny wrote:
         | I will have to dig into this paper as it looks like exactly
         | this. I wonder if they use closures to limit valid operations
         | space.
         | 
          | The only thing I didn't understand about making it happen was
          | how to connect it well to a description of the desired program
          | or edit.
          | 
          | BTW, my idea was to train it by corrupting programs available
          | on GitHub (adding noise via some random valid ops and then
          | removing it to recover the original program). Probably best
          | done where commit N-1 is treated as noise and you move back
          | toward commit 0.
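          | 
          | A toy sketch of the training-pair generation I mean (it
          | only perturbs integer constants; real noise would need
          | structural edits like the func/var ops above):
          | 
          |     import ast, random
          | 
          |     def corrupt(source, steps=3):
          |         # "forward process": a few random but still
          |         # syntactically valid edits
          |         tree = ast.parse(source)
          |         consts = [n for n in ast.walk(tree)
          |                   if isinstance(n, ast.Constant)
          |                   and isinstance(n.value, int)]
          |         for node in random.sample(
          |                 consts, min(steps, len(consts))):
          |             node.value += random.choice([-2, -1, 1, 2])
          |         return ast.unparse(tree)  # Python 3.9+
          | 
          |     original = "def area(w, h):\n    return w * h + 1"
          |     noisy = corrupt(original)
          |     # (noisy, original) is one training pair: the model
          |     # learns to undo the corruption
          |     print(noisy)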
        
       | can16358p wrote:
       | I wonder how this would apply to compiler/interpreter
       | optimizations.
       | 
       | Is it possible that it can "disect" some parts of the execution,
       | perhaps at assembly level, and come up with optimizations
       | specific to the compiled code without changing the output (I mean
       | expected program output, not emitted binary), that modern
       | compilers have not deterministically come up with?
        
         | bastawhiz wrote:
         | I expect the answer is "no". I wouldn't expect a tool like this
         | to "discover" assembly without being trained on the compiled
         | output. The model has no notion of how or where the code runs.
         | 
         | After decades of compiler research and super compilers chugging
         | away, we're sort of at a point where discovering novel
         | optimizations with results that are more than a smidge of
         | improvement is almost impossibly unlikely. Compilers today are
         | _really good_.
         | 
         | That said, I think the value that something like this might
         | have is being able to optimize the _intent_ of the code. If it
         | can determine that I 'm sorting some numbers, it can rewrite my
         | code to use a faster sorting algorithm that has the same
         | functional properties. If I'm storing data that never gets
         | used, it can stop storing it. It has a view of the code at a
         | level above what the compiler sees, with an understanding not
         | just of _what_ is being done, but _why_.
        
           | xavxav wrote:
           | > After decades of compiler research and super compilers
           | chugging away, we're sort of at a point where discovering
           | novel optimizations with results that are more than a smidge
           | of improvement is almost impossibly unlikely. Compilers today
           | are really good.
           | 
           | I agree when it comes to peephole optimizations, but there's
           | still a lot of juice left in exploiting language guarantees
           | (immutability, non-aliasing, data-parallelism), however most
           | compiler developer energy is spent propping up C/C++ and
           | consequently optimizations are developed with those languages
           | in mind.
        
         | gergo_barany wrote:
         | This is called superoptimization:
         | https://en.wikipedia.org/wiki/Superoptimization
         | 
         | There are people applying synthesis techniques to
         | superoptimization. So something like this would possibly apply.
        
         | hoosieree wrote:
         | My dissertation worked on a similar problem. I used obfuscation
         | to build a large dataset from a small set of ground-truth
         | functions. Then built a model to classify unseen obfuscated
         | binary code to the nearest of the known functions.
         | 
         | The application I had in mind during the research was anti-
         | malware static analysis, but optimization is really just the
         | flipside of obfuscation. Something I'd like to try in the
         | future is a diffusion model that treats obfuscation as "noise"
         | to be removed.
         | 
         | One thing I learned is that optimizing compilers produce very
         | regular output. After normalizing addresses, the "vocabulary"
         | size of basic blocks ends up pretty small, like ~2000 tokens.
         | Certain "phrases" correlate with the original source code's
         | semantics regardless of how much obfuscation you add on top.
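          | 
          | Roughly the flavour of the normalization step (a toy regex
          | version, not the actual pipeline):
          | 
          |     import re
          | 
          |     def normalize(block):
          |         # collapse concrete addresses/immediates so that
          |         # structurally identical blocks share one token
          |         # sequence
          |         out = []
          |         for line in block:
          |             line = re.sub(r"0x[0-9a-fA-F]+", "IMM", line)
          |             line = re.sub(r"\[rip \+ IMM\]", "[MEM]", line)
          |             out.append(line)
          |         return tuple(out)
          | 
          |     block = [
          |         "mov rax, [rip + 0x2f10]",
          |         "cmp rax, 0x40",
          |         "jne 0x401a3c",
          |     ]
          |     print(normalize(block))
          |     # ('mov rax, [MEM]', 'cmp rax, IMM', 'jne IMM')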
        
       | whereismyacc wrote:
        | There's been talk in the past about GitHub adding integrations
        | with common build tools (automatically?). What if you could
        | compile every LLVM-compiled project on GitHub and run a diffusion
       | model over the intermediate representations?
        
         | daralthus wrote:
         | what's the output?
        
           | whereismyacc wrote:
           | I guess the output would be an llvm intermediate
           | representation that you can compile down and run, right? I'm
           | stretching pretty far past my knowledge here.
        
             | Philpax wrote:
             | I think the question - at least, for me - is "what do you
             | expect to do with this system?" What will you output with
             | your diffusion model?
        
       | JHonaker wrote:
       | Markov Chain Monte Carlo for program synthesis isn't exactly
       | novel. The most immediate reference I thought of is Josh
       | Tenenbaum's [1].
       | 
       | There's also a lot of demos in WebPPL (web probabilistic
       | programming language)[2] like [3] for the synthesis of 3D space-
       | ships. I highly recommend their associated books on The Design
       | and Implementation of Probabilistic Programming Languages [4] and
       | Probabilistic Models of Cognition [5].
       | 
       | I also highly recommend taking a look at the publications of the
       | MIT Probabilistic Computing Project [6].
       | 
       | [1] Human-level concept learning through probabilistic program
       | induction.
       | https://www.cs.cmu.edu/~rsalakhu/papers/LakeEtAl2015Science....
       | 
       | [2] http://webppl.org/
       | 
       | [3] https://dritchie.github.io/web-procmod/
       | 
       | [4] https://dippl.org/
       | 
       | [5] http://probmods.org/
       | 
       | [6] http://probcomp.csail.mit.edu/
        
       | ofou wrote:
        | Here's a GPT-4o summary of the paper
       | 
       | https://www.emergentmind.com/papers/2405.20519
        
       | goexploration wrote:
       | The beam search idea is interesting. Curious to know if beam
       | search for reverse diffusion has been done before.
       | 
       | Could anyone clarify how they integrate beam search with the
        | reverse diffusion -- do they sample m > k nodes from a reverse
       | diffusion step and expand only the top k nodes?
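        | 
        | My guess at the general shape of it (stand-in propose/score
        | functions, with the "program" reduced to a single integer, so
        | this is only the skeleton of the loop, not the paper's
        | method):
        | 
        |     import heapq, random
        | 
        |     # stand-ins: a "program" is an int, rendering is the
        |     # identity, and the target "image" is 42
        |     def propose(prog):
        |         return prog + random.choice([-3, -1, 1, 3])
        | 
        |     def score(prog, target):
        |         return abs(prog - target)   # lower = closer
        | 
        |     def beam_denoise(target, init, k=4, m=16, steps=25):
        |         beam = [init]
        |         for _ in range(steps):
        |             # sample m candidate edits across the beam...
        |             cands = [propose(p)
        |                      for p in beam
        |                      for _ in range(m // len(beam))]
        |             # ...but keep only the k best-scoring ones
        |             beam = heapq.nsmallest(
        |                 k, cands, key=lambda p: score(p, target))
        |         return beam[0]
        | 
        |     print(beam_denoise(42, 0))   # lands at or near 42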
        
       ___________________________________________________________________
       (page generated 2024-06-04 23:01 UTC)