[HN Gopher] Program Synthesis and Large Language Models
       ___________________________________________________________________
        
       Program Synthesis and Large Language Models
        
       Author : ChristofferSN
       Score  : 39 points
       Date   : 2024-12-12 08:56 UTC (3 days ago)
        
 (HTM) web link (cacm.acm.org)
 (TXT) w3m dump (cacm.acm.org)
        
       | bjornsing wrote:
       | I'm beginning to see LLMs more as a solution to the code reuse
       | problem: they can implement anything that's common enough to
       | appear a number of times in their training data.
       | 
        | If you're hoping they will soon implement entirely novel
        | complex systems based on a loose specification, I think
        | you'll be disappointed.
        
         | danenania wrote:
         | This kind of begs the question though in the "there's nothing
         | new under the sun" sense. Maybe your particular task is novel,
         | but are its component parts and algorithms novel? In some
         | cases, perhaps some of them are, but I think that's more the
         | exception than the rule.
         | 
         | LLMs aren't just tape recorders. They can mix and match and
         | adapt. What percentage of the code you write do you think is
         | truly original?
        
       | golly_ned wrote:
       | This article compares the hardness of program synthesis to the
       | utility of chatbots to show chatbots fall short.
       | 
       | The difference is we don't expect chatbots to be 100% right but
       | we do expect program synthesis to be 100% right. For chatbots,
       | 99% is amazing in terms of utility. That other 1% is really hard
       | to get.
       | 
        | Given English's limited ability to robustly specify a
        | program, which is why the constraints for program synthesis
        | must be formal specifications, the author is committing a
        | category error by comparing two incommensurable solutions
        | to two distinct problems.
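        | 
        | To make the contrast concrete: "sort this list" is an
        | English spec; a formal spec is a checkable predicate. A
        | minimal sketch in Python (the predicate below is my
        | illustration, not something from the article):
        | 
        |     # A formal spec for sorting, as a checkable predicate:
        |     # output must be ordered AND a permutation of the input.
        |     from collections import Counter
        | 
        |     def satisfies_sort_spec(inp: list, out: list) -> bool:
        |         ordered = all(a <= b for a, b in zip(out, out[1:]))
        |         permutation = Counter(inp) == Counter(out)
        |         return ordered and permutation
        | 
        |     assert satisfies_sort_spec([3, 1, 2], [1, 2, 3])
        |     # a sorted-looking but wrong output is rejected
        |     # mechanically:
        |     assert not satisfies_sort_spec([3, 1, 2], [1, 2, 2])
        | 
        | Program synthesis is judged against predicates like that;
        | chatbot output is judged by eyeballing.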
        
         | jebarker wrote:
         | > For chatbots, 99% is amazing in terms of utility.
         | 
          | Is it really, though? If it's 99% of generated tokens,
          | then 1 out of every 100 tokens being wrong is not great
          | for code generation, since you'll usually generate far
          | more than 100 tokens. But let's suppose instead that it's
          | 1 in 100 whole functions or programs that's wrong.
          | Unless there's a way to automatically verify when a
          | function is wrong, that error rate basically makes full
          | automation, e.g. automatically fixing GitHub issues, out
          | of the question.
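          | 
          | Back-of-envelope, assuming (unrealistically) independent
          | failures at a flat 1% per-function rate:
          | 
          |     # P(whole program correct) = 0.99 ** n_functions
          |     for n in (1, 10, 50, 100):
          |         print(n, round(0.99 ** n, 3))
          |     # -> 1 0.99, 10 0.904, 50 0.605, 100 0.366
          | 
          | At 100 functions you're below a coin flip for a fully
          | correct program, which is why skipping verification isn't
          | an option.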
         | 
          | So it seems like at 99% we're left with a system that
          | requires constant vigilance from the user to check its
          | work. This is a far cry from the promise of true program
          | synthesis.
         | 
         | What would be amazingly useful is a 99% reliable system that
         | can tell you when it can't answer the question correctly
         | instead of providing a wrong answer.
        
           | simonw wrote:
           | If you expect to be able to prompt a chatbot and then not
           | TEST that the code it wrote for you works, you're going to
           | have a bad time.
           | 
            | If you're willing to put in that QA work yourself, the
            | productivity boost you can get from them is immense.
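            | 
            | Even a tiny harness changes the economics. A sketch
            | (slugify here is a stand-in for whatever the model
            | handed back, not real chatbot output):
            | 
            |     def slugify(title: str) -> str:
            |         # pretend this body came back from the model
            |         return "-".join(title.lower().split())
            | 
            |     # cheap QA pass before trusting it
            |     assert slugify("Hello World") == "hello-world"
            |     assert slugify("  padded   input ") == "padded-input"
            |     assert slugify("") == ""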
           | 
           | And honestly, working with the best human programmer in the
           | world won't guarantee that the software they produce is the
           | exact software that you wanted. Details are always lost in
           | the communication between the person who wants the software
            | and the person who builds it.
           | 
           | What matters most is the dialog and speed of iteration that
           | you can achieve to converge on the right product.
        
             | jebarker wrote:
              | I think the reason I haven't felt the productivity
              | boost is that my work is mostly debugging and fixing
              | existing code in a complicated codebase; rarely is it
              | writing new greenfield code. LLMs have so far
              | struggled to be useful in this setting, as it's more
              | like the QA part of the workflow you're suggesting.
        
       | mondrian wrote:
       | Tl;dr (a) program synthesis is provably hard and (b) English is
       | too ambiguous as a specification language. Therefore we're
       | nowhere close.
       | 
        | Ultimately not a convincing argument. The same reasoning
        | would show that a human cannot write a program to spec.
        | That may be strictly true, but it isn't interesting.
        
       | lmeyerov wrote:
        | It's frustrating whenever folks throw up proof complexity
        | as the reason LLMs can't work. If most programs most people
        | want to write map onto predictable, verifiable
        | abstractions, or if we recognize that almost no software is
        | verified to begin with, then the world is already moving on
        | irrespective of personal hobby horses here. Shocker: much
        | of the world we interact with every day already runs on
        | PHP, JavaScript, and untyped Python that is not verified,
        | not type checked, and has repos overflowing with bugs and
        | CVEs.
       | 
        | Prof. Dawn Song IMO has been articulating a more productive
        | view. LLMs are generating half the new code on GitHub
        | anyway, so lean in: use this as an opportunity to make it
        | easy for new code to use formal methods where before it
        | would have been too hard. Progress will happen either way,
        | and at least this way we have a shot at bringing
        | verifiability into more user code.
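        | 
        | One lightweight version of that (my sketch, not Song's
        | actual proposal): have the model emit a machine-checkable
        | property alongside the code and let a property-based tester
        | hammer it, e.g. with the hypothesis library under pytest:
        | 
        |     from hypothesis import given
        |     from hypothesis import strategies as st
        | 
        |     def dedupe(xs: list) -> list:
        |         # suppose this body came from an LLM
        |         return list(dict.fromkeys(xs))
        | 
        |     @given(st.lists(st.integers()))
        |     def test_dedupe(xs):
        |         out = dedupe(xs)
        |         assert len(out) == len(set(out))  # no duplicates
        |         assert set(out) == set(xs)        # nothing lost
        |         # first-occurrence order preserved
        |         assert out == [x for i, x in enumerate(xs)
        |                        if x not in xs[:i]]
        | 
        | Not a proof, but far more than the zero checking most of
        | that PHP/JS/Python gets today.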
        
         | jebarker wrote:
          | Do you have any links to Prof. Song talking or writing
          | about this? Sounds interesting.
        
         | qsort wrote:
         | I don't think this is in direct contradiction with the article.
         | This was written in response to ideas like:
         | 
         | (a) the machine is so powerful that you can just tell it what
         | to do and it does it with no supervision
         | 
         | or
         | 
          | (b) we can ditch programming languages and just use
          | English as the "source of truth": you would read, modify,
          | and check in the English the way you do source code
          | rather than compiled object code.
         | 
         | (a) is very much yet to be seen, and (b) is just monumentally
         | stupid. Productive uses of LLMs for code are either as a super-
         | autocomplete or as a pair programmer, which doesn't make
         | programming languages obsolete, much less computer science
         | education as a whole. This is even called out in TFA:
         | 
         | '''Even if chatbots should become a useful tool for
         | programmers, their existence cannot mean the end of
         | programming. It may well be the case that this combination of a
         | built-in repository of code examples and the ability to easily
         | retrieve code examples based on it can be used for generating
         | ideas and for forms of "pair programming" as recently studied
          | by Bird et al. [2].
         | 
         | But this is not program synthesis, rather it gives software
         | using LLMs the role of more modest and possibly useful tools.
         | We are now seeing a flurry of interest in combining traditional
         | approaches to program analysis with LLMs, and if anything, this
          | points to the limitations inherent in both approaches
          | [3,8].'''
        
           | lmeyerov wrote:
            | I'm mostly responding to the middle third of the
            | article, which IMO interprets the word "correct" in an
            | academic way that is unanchored, to a rounding error,
            | from how software is actually written:
           | 
           | > "Even simple versions of the approach are intractable in
           | the very precise sense of computational complexity theory if
           | we are concerned about generating code that is correct."
            | (Then five paragraphs about that being PSPACE-complete.)
           | 
           | The notion of correctness used in the article is a verified
            | property. My point is that probably less than 1% of the
           | programmers on this site have ever verified their code in
           | that sense, especially since they left a university setting.
           | Instead, most commercial developers write tests based on
           | their domain knowledge as developers and some limited
           | understanding of their company's business domain... if at
           | all. Which LLMs can also do.
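            | 
            | Concretely, the bar in practice is the example-based
            | style below (a sketch of the genre, not code from
            | anywhere in particular), not a verified property:
            | 
            |     import unittest
            | 
            |     def apply_discount(price: float, pct: float) -> float:
            |         return round(price * (1 - pct / 100), 2)
            | 
            |     class TestDiscount(unittest.TestCase):
            |         # a few handpicked cases from domain knowledge,
            |         # not a proof over all inputs
            |         def test_typical(self):
            |             self.assertEqual(apply_discount(100.0, 25), 75.0)
            |         def test_zero(self):
            |             self.assertEqual(apply_discount(80.0, 0), 80.0)
            | 
            | LLMs already write tests in exactly this idiom.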
        
       ___________________________________________________________________
       (page generated 2024-12-15 23:00 UTC)