[HN Gopher] How to tackle unreliability of coding assistants
       ___________________________________________________________________
        
       How to tackle unreliability of coding assistants
        
       Author : ingve
       Score  : 70 points
       Date   : 2023-11-29 07:56 UTC (15 hours ago)
        
 (HTM) web link (martinfowler.com)
 (TXT) w3m dump (martinfowler.com)
        
       | DonaldPShimoda wrote:
       | I'm just waiting for people to catch on that using an inherently
       | unreliable tool that cannot gauge its own reliability to generate
       | code or answer fact-based queries is a fool's errand. But I
       | expect we'll spend a _lot_ of effort on  "verifying" the results
       | before we just give up entirely.
        
         | onethought wrote:
         | Don't humans fit that definition? We've managed okay for 1000s
         | of years under those conditions.
        
           | bena wrote:
           | They do not. A human can verify its own reliability.
        
             | ryanklee wrote:
             | A human cannot do this self-sufficiently. This is why we
             | work so hard to implement risk mitigation measures and
             | checks and balances on changes.
        
               | bena wrote:
               | Yes, not all humans at all tasks all the time. And some
               | things are important enough to implement checks
               | regardless.
               | 
               | However, there's a lot we can just throw humans at and
               | trust that the thing will get complete and be correct.
               | And even with the checks and balances, we can have the
               | human perform those checks and balances. A human is a
               | pretty autonomous unit on average.
               | 
               | So far, AI can't really say "let me double check that for
               | you" for instance. You ask it a thing, it does a thing,
               | and that's it. If it's wrong, you have to tell it to do
               | the thing again, but differently.
               | 
               | In all the rush to paint these LLMs as "pretty much
               | human", we've instead taken to severely downplaying just
               | how adaptable and clever most sentient beings can be.
        
               | ryanklee wrote:
               | AIs can double check. Agent to agent confirmation and
               | validation is a thing.
               | 
               | In any case, the point is that we have learned techniques
               | to compensate for human fallibility. We will learn
               | techniques to compensate for gen AI fallibility, as well.
               | The objection that AIs can be wrong is far less a barrier
               | to the rise of their utility than is often supposed.
        
               | bena wrote:
               | You see how that's worse, right?
               | 
               | The original argument put forth was that "an inherently
               | unreliable tool cannot gauge its own reliability".
               | 
               | Someone responded that humans fit that description as
               | well.
               | 
               | I said we don't. We can and do verify our own
               | reliability. We can essentially test our assumptions
               | against the real world and fix them.
               | 
               | You then claimed we couldn't do that "self-sufficiently".
               | 
               | I responded that while that is true for some tasks, for a
               | lot of tasks, we can. That an AI can't check itself and
               | won't even try.
               | 
               | And now you're telling me that they can check against
               | each other.
               | 
               | But if you can't trust the originals, asking them if they
               | trust each other is kind of pointless. You're not really
               | doing anything more than adding another layer.
               | 
               | For example: If I put my pants on backwards, I correct
               | that myself. Without the need to check and/or balance
               | against any other person. I am self-correcting to a large
               | degree. The AI would not even know to check its pants
               | until someone told it it was wrong.
               | 
               | The objection isn't that "AIs can be wrong", the
               | objection is that AIs can't really tell the difference
               | between correct and incorrect. So everything has to be
               | checked, often with as much effort as it would take to do
               | the thing in the first place.
        
             | cornel_io wrote:
             | If that was even close to true then I would have had to
             | fire far fewer people over the years.
        
           | ryanklee wrote:
           | They do fit this definition. One result of the rise of
           | generative AI is exposing just how severely and commonly
           | people misperceive their own capabilities and the functioning
           | of their cognitive powers.
        
           | trefoiled wrote:
           | I hear this argument applied often when people bring up the
           | deficiencies of AI, and I don't find it convincing. Compare
           | an AI coding assistant to reaching out to another engineer on
           | my team as an example. If I know this engineer, I will likely
           | have an idea of their relative skill level, their familiarity
           | with the problem at hand, their propensity to suggest one
           | type of solution over another, etc. People are pretty good at
           | developing this kind of sense because we work with other
           | people constantly. The AI assistant, on the other hand, is
           | very much not like a human. I have a limited capacity to
           | understand its "thought process," and I consider myself far
           | more technical than the average person. This makes a
           | verification step troublesome, because I don't know what to
           | expect.
           | 
           | This difference is even more stark when it comes to driving
           | assistants. Video compilations of Teslas with FSD behaving
           | erratically and most importantly, unpredictably, are all over
           | the place. Experienced Tesla drivers seem to have some
           | limited ability to predict the weaknesses of the FSD package,
           | but the issue is that the driving assistant is so unlike a
           | human. I've seen multiple examples of people saying "well,
           | humans cause car crashes too," but the key difference is that
           | I have to sit behind the wheel and deal with the fact that my
           | driving assistant may or may not suddenly swerve into
           | oncoming traffic. The reasons for it doing so are likely
           | obscure to me, and this is a real problem.
        
           | abeppu wrote:
           | Humans are unreliable, but we are also under normal
           | circumstances thoroughly and continually grounded in an
           | external world whose mechanics we interact with, make
           | predictions about, and correct our beliefs about.
           | 
           | The specific way we're training coding assistants for next-
           | token-prediction would also be an incredibly difficult
           | context for humans to produce code.
           | 
           | Suppose you were dropped off in an society of aliens whose
           | perceptual, cultural and cognitive universe is meaningfully
           | different from our own; you don't have a grounding in
           | concepts of what they're trying to _do_ with their programs.
           | You receive a giant dump of reams and reams of source code,
           | in their unfamiliar script, where none of the names initially
           | mean anything to you. In the pile of training material handed
           | to you, you might find some documentation about their
           | programming language, but it's written in their (foreign,
           | weird to you) natural language, and is mixed with everythign
           | else. You never get a teacher who can answer questions, never
           | get access to a IDE/repl/interpreter/debugger/compiler, never
           | get to _run_ a program on different inputs to see its
           | outputs, never get to add a log line to peek at the program's
           | internal state, etc. After a _lot_ of training, you can often
           | predict the next symbol in a program text. But shouldn't we
           | _expect_ you to be "unreliable"? You don't have the ability
           | to run checks against the code you produce! You don't get a
           | warning if you use a variable that doesn't exist! You just
           | produce _tokens_, and get no feedback.
           | 
           | To the degree humans are reliable at coding, it's because we
           | can simulate what program execution will do, with a level of
           | abstraction which we vary in a task dependent way. You can
           | mentally step through every line in a program carefully if
           | you need to. But you can also mentally choose to trust some
           | abstraction and skip steps which you infer cannot be related
           | to some attribute or condition of interest if that
           | abstraction is upheld. The most important parts of your
           | attention are on _what the program does_. This is fully
           | hidden in the next-token-prediction scenario, which is
           | totally focused on _what tokens are used to write the
           | program_.
        
         | kazinator wrote:
         | I've written a few pieces of code with the help of AI. The
         | trick is that I don't need the help. I could see where the bugs
         | are.
         | 
         | The AI could do certain things much faster than I would be able
         | to by hand. For instance, in a certain hash table containing
         | structures, it turned out that the deletion algorithm which
         | moves elements couldn't be used because other code retains
         | (reference counted) pointers to the structures, so the
         | addresses of the object have to be stable. AI easily and
         | correctly rewrote the code from "array of struct" to "array of
         | pointer to struct" representation: the declaration of the thing
         | and all the code working on it was correctly adjusted. I can do
         | that myself, but not in two seconds.
        
         | throwuwu wrote:
         | Just wait a few months. Verifiability is the technique du jour
         | and will make its way into a variety of models soon enough.
        
           | westoncb wrote:
           | Seems like there are certain fundamental limits to what can
           | be done here though. Much of the advantage to using these
           | models is being able to come up with a vague/informal spec
           | and most of the time have it get what you mean and come up
           | with something serviceable with very little effort. If the
           | spec you have in mind to begin with is fuzzy and informal,
           | what do you use to perform verification?
           | 
           | After all, whether a result is correct or not depends on
           | whether it matches the user's desire, so verification
           | criteria must come from that source too.
           | 
           | Sure there are certain types of relatively objective
           | correctness that most of the time will line up with a user's
           | desires, but this kind of verification can never be complete
           | afaict.
        
         | coffeecantcode wrote:
         | .
        
           | agentdrek wrote:
           | is that a vi repeat? lol
        
         | willsmith72 wrote:
         | I could not disagree more. Having something generate code which
         | is 90-100% correct is extremely valuable.
         | 
         | E.g. creating a new page in an app.
         | 
         | Feed an LLM the design, the language/framework/component
         | library you're using, and get a page which is 90% of the way
         | there. Some tweaks and you're good.
         | 
         | Far, far quicker than going by hand, and often better quality
         | code than a junior developer.
         | 
         | Now I would never deploy that code without reading it,
         | understanding it, and testing it, but I would always do that
         | anyway. GPT4 is close enough to a good software engineer, that
         | to resist is it to disadvantage your business.
         | 
         | Now if you're coding for pleasure and the fun of creation and
         | creativity, then ditch the LLMs, they take some of that fun
         | away for sure. But most of the time, I'm focused on achieving
         | business outcomes quickly. That's way more productive with a
         | modern LLM
        
           | bigstrat2003 wrote:
           | I strongly disagree. Having something which is 90% reliable
           | doesn't save me any effort over just doing it myself. If I
           | have to spend the time to check everything the tool generates
           | (which I do, because LLMs are unreliable) then I may as well
           | have written it myself to start with.
           | 
           | I firmly believe that the worst kind of help is unreliable
           | help. If your help never does its job, then you know you have
           | to do everything yourself. If your help always does its job,
           | you know you can trust it. If your help only sometimes does
           | its job, you get the worst deal of all because you never know
           | what you'll get and have to check every single time.
        
             | willsmith72 wrote:
             | > If I have to spend the time to check everything the tool
             | generates (which I do, because LLMs are unreliable) then I
             | may as well have written it myself to start with.
             | 
             | Do you feel the same way about a PR from a coworker?
        
       | kazinator wrote:
       | The way you tackle the unreliability of coding assistants is to
       | know what you're doing so that you don't need a coding assistant,
       | so that you're only using it to save time and effort.
       | 
       | Roughly speaking, if you stick to using AI for writing code you
       | could have written yourself, you're okay.
        
         | NateEag wrote:
         | Though regular use of the assistant will degrade your ability
         | to program without it.
        
           | majormajor wrote:
           | I think in the same way that autocomplete does - I may not
           | have typed out `.length` or whatever it happens to be in the
           | language I used last year enough to remember a year later if
           | it's len or length or size and if it's a property or a method
           | after a year goes by without me touching that language... but
           | then it's a simple search away to refresh my memory, and
           | overall the autocomplete saves a LOT of time that makes up
           | for this.
           | 
           | Yeah, if you never knew what the code that got generated did
           | in the first place, that's not gonna apply, but if you're
           | using it as basically just a code expander for things you
           | could do the pseudocode for in your sleep, you're probably
           | gonna be ok.
        
         | digging wrote:
         | I've only used a very small amount of AI assistance (mostly
         | Anthropic's Claude), and always to learn, not to do. That is, I
         | will ask it what's happening, why are things breaking, etc. It
         | doesn't have to have the right answers, it just needs to
         | unblock me.
         | 
         | I hear it is also quite useful for _doing_ things which you
         | know extremely well but are tedious to do. Anything in between
         | is certainly a danger zone.
        
       | cleandreams wrote:
       | I have several issues with coding assistants.
       | 
       | Over time, will less skilled programmers produce more critical
       | code? I think so. At some point a jet will fall out of the sky
       | because the coding assistant wasn't correct and the whole
       | profession will have a black eye.
       | 
       | The programmers will be less skilled because the (up until
       | recently) lack of coding assistants provides a more rigorous and
       | thorough training environment. With coding assistants the
       | profession will be less intellectually selective because a wider
       | range of people can do it. I don't think this is good given how
       | critical some code is.
       | 
       | There is another related issue. Studies have shown that use of
       | google maps has resulted in a deterioration of people's ability
       | to navigate. Human mapping and navigation ability needs training
       | which it doesn't get when google maps are used. This sort of
       | thing will be an issue with coding.
        
         | SoftTalker wrote:
         | This will be an issue with all knowledge work. The machines
         | will have the knowledge and more and more we will just trust
         | them because we don't know ourselves. Google Maps is a great
         | example.
        
           | hamburga wrote:
           | I want to build a "Google Maps that doesn't make you dumber."
           | 
           | For local navigation, first and foremost. The goal is to
           | teach you how to navigate your locale, so you use it less and
           | less. You still will want to ask it for traffic updates, but
           | you talk to it like you would between locals who know all the
           | roads.
           | 
           | As a model for how to do AI in a way that enhances your
           | thinking rather than softens/replaces it.
        
         | bena wrote:
         | A wide range of people can build things. We trust only a few to
         | build jets and skyscrapers, etc.
         | 
         | I think much the same will happen with regards to programming.
         | Sure, most people will be able to bust out a simple script to
         | do X. But if you want to do a "serious task", you're going to
         | get a professional.
        
         | BurningFrog wrote:
         | I spent 10 years pair programming. It's a similar situation in
         | some ways.
         | 
         | Like, I can't know if the code my pair writes has flaws, just
         | like an AI coding assistant.
         | 
         | I've never learned so much about programming as when pairing.
         | Having someone else to ask or suggest improvements is just
         | invaluable. It's very rarely I learn something new from myself,
         | but another human will always know things I don't, just like an
         | AI.
         | 
         | Of course, you don't blindly accept the code your
         | pair/assistant writes. You have to understand it, ask
         | questions, and most of all write tests that confirms it
         | actually works.
        
       | the8472 wrote:
       | Are there tools yet that put the compiler in the loop? Maybe even
       | a check-compile-test loop that bails on the first failure and
       | then tries to refine it in the background based on what failed?
        
         | bee_rider wrote:
         | Then maybe the model could be fine-tuned on that loop, haha.
         | Could be a fun thing to try at least.
        
           | dontupvoteme wrote:
           | It's a bit of fools-errand because they train on information
           | which is no longer valid and will get stuck if you don't
           | inform them. For instance GPT cannot write a connection to an
           | openai endpoint because the API was upgraded to 1.0 and broke
           | compatibility with all the code it learned from
        
       | righthand wrote:
       | Honestly I'd rather just RTFM and code it myself than invent a
       | game to deal with these issues.
        
       | meindnoch wrote:
       | I'm happy that more and more people embrace these tools for more
       | and more critical software.
       | 
       | It keeps me employed, and even increases my rate quite a bit.
        
         | bee_rider wrote:
         | Are you an EMT?
        
       | abeppu wrote:
       | I think this is "how to think about coding assistants and your
       | task" but none of this is "tackling" their unreliability.
       | 
       | While coding assistants seem to do well in a range of situations,
       | I continue to believe that for coding specifically, merely
       | training on next-token-prediction is leaving too much on the
       | table. Yes, source code is represented as text, but computer
       | programs are an area where there's available information which is
       | _so much richer_. We can know not only the text of the program
       | but the type of every expression, which variables are in scope at
       | any point, what is the signature of a method we're trying to
       | call, etc. These assistants should be able to make predictions
       | about program _traces_, not just program source text. A step
       | further would be to guess potential loop invariants, pre/post
       | conditions, etc, confirm which are upheld by existing tests, and
       | down-weight recommending changes which introduce violations to
       | those inferred conditions.
       | 
       | ChatGPT and tab-completion assistants have both given me things
       | that are not even valid programs (e.g. will not compile, use a
       | variable that isn't actually in scope, etc). ChatGPT even told me
       | that an example it generated wasn't compiling for me b/c I wasn't
       | using a new enough version of the language, and then referenced a
       | language version which does not yet exist. All of this is
       | possible in part b/c these tools are engaging only at the level
       | of text, and are structurally isolated from the rich information
       | available inside an interpreter or debugger. "Tackling"
       | unreliability should start with reframing tasks in a way which
       | lets tools better see the causes of their failures.
        
         | jkaptur wrote:
         | I absolutely agree, but I find the situation incredibly funny.
         | There are three characters here: me, the LLM, and the compiler.
         | Two of them are robots, but they refuse to talk to each other -
         | it's up to me to tell the LLM the bad news from the compiler.
        
           | sa-code wrote:
           | This sounds like the worst game of telephone
        
           | fragmede wrote:
           | that's a frontend issue though. If you use the python
           | interpreter in ChatGPT, you can tell it to run the code until
           | it works, at which point it'll do a couple of iterations
           | before giving you code.
        
           | convolvatron wrote:
           | dont you already cut and paste your error message to google
           | like everyone else?
        
         | swatcoder wrote:
         | Indeed. These are basically tech demos at this point. A marvel
         | to see, and sometimes useful, but still extremely crude.
         | 
         | There's a lot of headroom for sophistication as a few more
         | research insights are made and an experienced tooling team
         | commits a few years to making rich multimodal/ensemble code
         | assistants that can perform smart analyses, transformations,
         | boilerplating, etc on your _project_ instead of just adding
         | some text to your file.
         | 
         | But it'll take a years of insight and labor to build that
         | system up, adapt it to different industries/uses, and prove it
         | out as engineering-ready.
         | 
         | People get caught up in the novelty of Copilot and ChatGPT and
         | then imagine that the revolution arrives when some new paper
         | comes out tomorrow (or that none will arrive because of today's
         | limitations), but the far more likely reality is that the
         | revolution paces as something more like the internet's -- real
         | and profound, but unfolding gradually over the span of decades
         | as people work hard to make new things with it.
        
           | renegade-otter wrote:
           | "Any sufficiently advanced technology is indistinguishable
           | from magic."
           | 
           | -Arthur C. Clarke
        
           | TerrifiedMouse wrote:
           | Frankly, we may be barking up the wrong tree with LLMs. Sure
           | they deliver novel and very marketable results, hence the
           | insane funding, but I can't help but feel it's really just a
           | parlor trick and there is a yet undiscovered algorithm that
           | can actually deliver the AGI that we seek - AGI that reliable
           | and precise like fictional AIs.
        
             | jrockway wrote:
             | I think a lot of progress in the field of AI comes from
             | random breakthroughs rather than evolution of existing
             | ideas. For that reason, I think AGI will probably be
             | another random breakthrough, and not just an LLM with more
             | parameters. Good if you're NVidia, bad if you're OpenAI.
             | The random breakthrough could happen at Google, or it could
             | happen in some professor's basement. There is no way to
             | know, and no way to throw money at the problem and
             | guarantee a return.
        
           | staunton wrote:
           | I think the ultimate jump in capability would be achieved by
           | making a dedicated language to be used by an LLM and training
           | it in parallel on mainly that. The problem is that to make
           | this actually useful you need to establish a new programming
           | language. And making a useful programming language that is
           | actually used by people is even more expensive and takes more
           | time than training a state-of-the-art LLM...
        
         | Night_Thastus wrote:
         | In order to really solve this, you couldn't use an LLM, at
         | least not one in any shape that we have now.
         | 
         | You'd need something that actually _understands_ the language.
         | What is a lifetime, what is scope, what are types, functions,
         | variables, etc. Something that can contextually look at
         | correct, incomplete, or broken programs and reason about what
         | they 're doing by knowing the rules that the language operates
         | on and following them to a conclusion. It would also need to
         | understand high-level design patterns and structure to not just
         | know what's happening in the literal sense "this variable is
         | being incremented" but also in a more abstract sense "this
         | variable keeps track of references because this is mixed
         | C/Python code that needs that to handle deallocation".
         | Something that recognizes patterns both within and outside of
         | the code, with appropriately fuzzy matching where needed.
         | 
         | And I think importantly you'd need to be able to query it at a
         | variety of levels. What's happening on a line-by-line basis and
         | what's happening at the high level at a given point of code.
         | One OR the other isn't sufficient.
         | 
         | That is not a simple ask. We're a long way away from something
         | that smart existing.
        
           | LeifCarrotson wrote:
           | What LLM text generation has shown is that you don't actually
           | have to understand English to generate pretty decent English.
           | You just have to have enough examples.
           | 
           | This is where the massive corpus of source code available on
           | the Internet can help generate a "LSM" (large software model)
           | if you can expose the tokens as the lexer understands them in
           | the training set.
           | 
           | If your LSM sees a trillion examples of correct usage of
           | lifetime and scope and types and so on, then in the same way
           | that an LLM trained on English grammar will emit text with
           | correct grammar as if it _understands_ English, your LSM will
           | generate software with correct syntax as if it _understands_
           | the software. Whatever the definition of  "understands" is in
           | the context of an LLM.
        
             | fragmede wrote:
             | right. we're already abstracting from English words and
             | characters into tokens, piping code through half a compiler
             | so the LSM is given the AST to train on doesn't seem all
             | that far fetched.
        
             | abeppu wrote:
             | But:
             | 
             | - natural language is flexible, computer languages are less
             | so.
             | 
             | - "pretty decent English" still includes hallucinations.
             | I've seen companies whose product demo for generating
             | marketing copy just makes up a plausible review.
             | Hallucinating methods, variables, other packages/modules
             | yields broken code.
             | 
             | - the human thought behind natural language is not feasible
             | to directly provide to a model. An IR corresponding to the
             | source of the program is feasible to provide. A trace of
             | the program executing is feasible to provide. Grounding an
             | LLM in the rich exterior world that humans talk about is
             | hard; grounding an LSM in the rich internal representations
             | accessible to an IDE or a debugger is achievable.
        
             | majormajor wrote:
             | "pretty decent english" is a pretty fuzzy bar.
             | 
             | Indeed, Chat GPT 4 and Copilot can generate "pretty decent
             | code" that will look fine to the average human coder _even
             | when it 's incorrect_ (making up methods or getting params
             | wrong or slighly missing requirements or similar).
             | 
             | The level of precision required for "pretty decent non-
             | trivial code" is much higher than prose that looks like it
             | was written by an educated human, so I share the idea that
             | if it was augmented - even in really _stupid_ ways like
             | asking the IDE if it would even compile, in the case of
             | Copilot, before suggesting it to the user - it would work
             | _much_ better at a much lower effort than increasing it 's
             | understanding implicitly by orders of magnitude.
        
             | staunton wrote:
             | > you don't actually have to understand English to generate
             | pretty decent English. You just have to have enough
             | examples.
             | 
             | I would have thought babies have been showing this beyond a
             | doubt since time immemorial.
        
           | abeppu wrote:
           | Absolutely it's not a simple ask. But research in program
           | synthesis was making interesting progress before LLMs came
           | along. I think it would be better to ask how ML can improve
           | or speed up those efforts, rather than trying to make a
           | general ML model "solve" such a complex problem as a special
           | case.
           | 
           | A step in this direction, which I've been trying to figure
           | out as a side project, and which I would love someone to
           | scoop me on and build a real thing, is to stitch an ML model
           | into into a (mini)kanren. Minikanren folk have built
           | relational interpreters for small languages but not for a
           | "real" industrial language (so far as I'm aware). These small
           | relational interpreters can generate programs given a set of
           | relational statements about them (e.g. "find f so that f(x1)
           | == y1, f(x2) == y2, ..."). Because they're actually examining
           | the full universe of programs in their small language, they
           | will eventually find all programs that satisfy the
           | constraints. But their search is in an order that's given by
           | the structure/definition of the interpreter, and so for
           | complex sets of requirements, finding the _first_ satisfying
           | example can be unacceptably slow. But what if the search were
           | based on the softmaxed outputs of an ML model, and you do a
           | least cost search? Then (roughly) in the case that beam-
           | search generation from the ML model would have produced a
           | valid answer, you still find it in roughly the same time, but
           | you _know_ it's valid. In the case where a valid answer
           | requires that in a small number of key places, we have to
           | take a choice that the model assigns low probability, then
           | we'll find it but it takes longer. And in the case that all
           | the choices needed to construct a valid answer are low-
           | probability under the model, then we're basically in the same
           | place that the vanilla minikanren search was.
        
           | dontupvoteme wrote:
           | An interface layer is 70% of the value for 10% of the work.
        
         | dxbydt wrote:
         | Generating invalid code is a big hassle. I have used ChatGPT
         | for generating R code & sometimes it refers to functions that
         | don't exist. The standard deviation of [1,2,1] is 0.577, given
         | by sd(c(1,2,1)). Sometimes I get stdev(c(1,2,3)) - there is no
         | stdev in R. Why not have a slower mode where you run the code
         | thru the R interpreter first, & only emit valid code that
         | passes that step ?
        
           | fragmede wrote:
           | The Python one that currently exists can use Pandas, but then
           | you're using Python and not R. Other language support must be
           | on their roadmap, only question is how far down the list it
           | is.
        
         | walt_grata wrote:
         | You know I've been thinking about this for a while, I honestly
         | don't think LLMs will ever be the right choice for a coding
         | assistant. Where I've had a strange idea they could help,
         | although it's completely unproven yet, is replacing what we
         | traditionally think of as a developer and code. I'm
         | envisioning, for simple application, a completely non-technical
         | person have a conversation with a coding assistant and it
         | something directly executable, think the tokens it outputs are
         | JVM bytecode or something similar. I'm sure there's innumerable
         | flaws with my thinking/approach but so far, it's been fun to
         | think and code around.
        
         | CSMastermind wrote:
         | Would be interesting to see them trained on actual syntax trees
         | from text.
         | 
         | Then maybe have a separate model trained on going from syntax
         | tree to source code.
        
           | CaptainFever wrote:
           | > Then maybe have a separate model trained on going from
           | syntax tree to source code.
           | 
           | I don't see why you need a model for this. But yes, this is a
           | very cool idea.
        
         | delta_p_delta_x wrote:
         | > these tools are engaging only at the level of text
         | 
         | Couldn't we say the same thing about almost all UNIX/Linux
         | coreutils? There's no way to get a strongly-typed array/tree of
         | folders, files, and metadata from `ls`; it has to be awk-ed and
         | sed-ed into compliance. There's no way to get a strongly-typed
         | dictionary of command-line options and what they do for pretty
         | much every coreutil; you have to invoke `something --help` and
         | then awk/sed _that_ output again.
         | 
         | These coreutils are at their core, only stringly-typed, which
         | makes life more difficult than it strictly needs to be.
         | 
         | Everything is a bag of bytes, or a stream to be manipulated.
         | This philosophy simply leaked into LLMs.
        
       | wrs wrote:
       | Several times when I've asked ChatGPT for an approach to
       | something, it has spit out code that uses an API that looks
       | perfect for my use case, but doesn't actually exist.
       | 
       | So I'm thinking someone should be building an automated product
       | manager!
        
         | jwells89 wrote:
         | Have had the same experience several times. "Yes ChatGPT, it
         | makes perfect sense for that to exist and would be wonderful if
         | it did, but unfortunately it does not."
        
         | gnulinux wrote:
         | This is what ChatGPT does to me whenever I ask anything non-
         | trivial. I find it funny people think it'll take over our jobs,
         | it simply can't even do the most basic things beyond the beaten
         | path.
        
       | anigbrowl wrote:
       | In practice I find a bigger problem is perversity - AI assistant
       | doing OK with incremental prompting, but sometimes decides to
       | just remove all comments from the code, or if asked to focus its
       | effort on one section, deletes all other code. Code assistants
       | need to be much integrated with IDEs and I think you probably
       | need 2 or 3 running in parallel, maybe more.
        
         | sevagh wrote:
         | Yes, this really irritates me.
         | 
         | Me: <prompt 1: modify this function>
         | 
         | AI assistant (either ChatGPT or Copilot-X): <attempt 1>
         | 
         | Me: <feedback>
         | 
         | AI assistant: <attempt 2>, fixed with feedback, but deleted
         | something crucial from attempt 1, for no reason at all
        
       | darkteflon wrote:
       | I feel like the work on using CFGs with LLMs should be low-
       | hanging fruit for improving code assistants, but perhaps someone
       | more knowledgeable could chime in[1], [2], [3].
       | 
       | At lot of the confabulations we see today - non-existent
       | variables, imports and APIs, out-of-scope variables, etc - would
       | seem (to me) to be meaningfully addressable with these
       | techniques.
       | 
       | Relatedly, I have gotten surprisingly great mileage out of
       | treating confabulations, in the first instance, as a user (ie, my
       | own) failure to adequately contextualise, rather than error.
       | 
       | In a sense, CFGs give you sharper tools to do that
       | contextualisation. I wonder how far the "sharper tools" approach
       | will carry us. It seems, to this interested layman, consistent
       | with Shannon's work in statistical language modelling.[4]
       | 
       | The term "prompt engineering" implies, to me, a focus on the
       | instructive aspects of prompting, which is a bit unfortunate but
       | also wholly consistent with the way I see colleagues and friends
       | trying to interact with LLMs. Perhaps we should instead call it
       | "context composition" or something, to emphasise the constructive
       | nature.
       | 
       | [1] https://github.com/outlines-dev/outlines
       | 
       | [2] https://github.com/ggerganov/llama.cpp/pull/1773
       | 
       | [3]
       | https://www.reddit.com/r/LocalLLaMA/comments/156gu8c/d_const...
       | 
       | [4] https://hedgehogreview.com/issues/markets-and-the-
       | good/artic...
        
       | firebot wrote:
       | Maybe learn to code?
        
         | sciolist wrote:
         | My good-faith interpretation of your comment is: "Maybe learn
         | to code (without external resources or tools)?" LLMs are
         | another tool and resource, just like StackOverflow, linters,
         | autocomplete, Google, etc. None of these tools are infallible,
         | but they provide value. Just like all other tools, you don't
         | need to use LLMs because of their issues - but we want them to
         | be as useful as possible - what the author is trying to do.
        
       | slalomskiing wrote:
       | Sounds like an exercise in frustration
        
       | vishnumenon wrote:
       | My current hypothesis here is that the way to make coding
       | assistants as reliable as possible is to shift the balance
       | towards making their output rely on context provided in-prompt
       | rather than information stored in LLM weights. As all the major
       | providers shift towards larger context-windows, it seems
       | increasingly viable to give the LLM the necessary docs for
       | whatever libraries are being used in the current file. I've been
       | working on an experiment in this space[0], and while it's
       | obviously bottle-necked by the size of the documentation index,
       | even a couple-hundred documentation sources seems to help a ton
       | when working with less-used languages/libraries.
       | 
       | [0]: https://indexical.dev/
        
       | tomjakubowski wrote:
       | fairly off-topic, sorry: Midjourney gets the cartoon donkey's
       | snout wrong, so wrong. It looks more like a cartoon dog. For some
       | reason I'm really bothered by it.
        
       ___________________________________________________________________
       (page generated 2023-11-29 23:00 UTC)