[HN Gopher] How to tackle unreliability of coding assistants
___________________________________________________________________
How to tackle unreliability of coding assistants
Author : ingve
Score : 70 points
Date : 2023-11-29 07:56 UTC (15 hours ago)
(HTM) web link (martinfowler.com)
(TXT) w3m dump (martinfowler.com)
| DonaldPShimoda wrote:
| I'm just waiting for people to catch on that using an inherently
| unreliable tool that cannot gauge its own reliability to generate
| code or answer fact-based queries is a fool's errand. But I
| expect we'll spend a _lot_ of effort on "verifying" the results
| before we just give up entirely.
| onethought wrote:
| Don't humans fit that definition? We've managed okay for 1000s
| of years under those conditions.
| bena wrote:
| They do not. A human can verify their own reliability.
| ryanklee wrote:
| A human cannot do this self-sufficiently. This is why we
| work so hard to implement risk mitigation measures and
| checks and balances on changes.
| bena wrote:
| Yes, not all humans at all tasks all the time. And some
| things are important enough to implement checks
| regardless.
|
| However, there's a lot we can just throw humans at and
| trust that the thing will get completed and be correct.
| And even with the checks and balances, we can have the
| human perform those checks and balances. A human is a
| pretty autonomous unit on average.
|
| So far, AI can't really say "let me double check that for
| you" for instance. You ask it a thing, it does a thing,
| and that's it. If it's wrong, you have to tell it to do
| the thing again, but differently.
|
| In all the rush to paint these LLMs as "pretty much
| human", we've instead taken to severely downplaying just
| how adaptable and clever most sentient beings can be.
| ryanklee wrote:
| AIs can double check. Agent to agent confirmation and
| validation is a thing.
|
| In any case, the point is that we have learned techniques
| to compensate for human fallibility. We will learn
| techniques to compensate for gen AI fallibility, as well.
| The objection that AIs can be wrong is far less a barrier
| to the rise of their utility than is often supposed.
| bena wrote:
| You see how that's worse, right?
|
| The original argument put forth was that "an inherently
| unreliable tool cannot gauge its own reliability".
|
| Someone responded that humans fit that description as
| well.
|
| I said we don't. We can and do verify our own
| reliability. We can essentially test our assumptions
| against the real world and fix them.
|
| You then claimed we couldn't do that "self-sufficiently".
|
| I responded that while that is true for some tasks, for a
| lot of tasks, we can. That an AI can't check itself and
| won't even try.
|
| And now you're telling me that they can check against
| each other.
|
| But if you can't trust the originals, asking them if they
| trust each other is kind of pointless. You're not really
| doing anything more than adding another layer.
|
| For example: If I put my pants on backwards, I correct
| that myself. Without the need to check and/or balance
| against any other person. I am self-correcting to a large
| degree. The AI would not even know to check its pants
| until someone told it it was wrong.
|
| The objection isn't that "AIs can be wrong", the
| objection is that AIs can't really tell the difference
| between correct and incorrect. So everything has to be
| checked, often with as much effort as it would take to do
| the thing in the first place.
| cornel_io wrote:
| If that were even close to true, then I would have had to
| fire far fewer people over the years.
| ryanklee wrote:
| They do fit this definition. One result of the rise of
| generative AI is that it has exposed just how severely and
| commonly people misperceive their own capabilities and the
| functioning of their cognitive powers.
| trefoiled wrote:
| I hear this argument applied often when people bring up the
| deficiencies of AI, and I don't find it convincing. As an
| example, compare an AI coding assistant to reaching out to
| another engineer on my team. If I know this engineer, I will likely
| have an idea of their relative skill level, their familiarity
| with the problem at hand, their propensity to suggest one
| type of solution over another, etc. People are pretty good at
| developing this kind of sense because we work with other
| people constantly. The AI assistant, on the other hand, is
| very much not like a human. I have a limited capacity to
| understand its "thought process," and I consider myself far
| more technical than the average person. This makes a
| verification step troublesome, because I don't know what to
| expect.
|
| This difference is even more stark when it comes to driving
| assistants. Video compilations of Teslas with FSD behaving
| erratically and most importantly, unpredictably, are all over
| the place. Experienced Tesla drivers seem to have some
| limited ability to predict the weaknesses of the FSD package,
| but the issue is that the driving assistant is so unlike a
| human. I've seen multiple examples of people saying "well,
| humans cause car crashes too," but the key difference is that
| I have to sit behind the wheel and deal with the fact that my
| driving assistant may or may not suddenly swerve into
| oncoming traffic. The reasons for it doing so are likely
| obscure to me, and this is a real problem.
| abeppu wrote:
| Humans are unreliable, but we are also under normal
| circumstances thoroughly and continually grounded in an
| external world whose mechanics we interact with, make
| predictions about, and correct our beliefs about.
|
| The specific way we're training coding assistants for next-
| token-prediction would also be an incredibly difficult
| context for a human to produce code in.
|
| Suppose you were dropped off in a society of aliens whose
| perceptual, cultural and cognitive universe is meaningfully
| different from our own; you don't have a grounding in
| concepts of what they're trying to _do_ with their programs.
| You receive a giant dump of reams and reams of source code,
| in their unfamiliar script, where none of the names initially
| mean anything to you. In the pile of training material handed
| to you, you might find some documentation about their
| programming language, but it's written in their (foreign,
| weird to you) natural language, and is mixed with everything
| else. You never get a teacher who can answer questions, never
| get access to an IDE/repl/interpreter/debugger/compiler, never
| get to _run_ a program on different inputs to see its
| outputs, never get to add a log line to peek at the program's
| internal state, etc. After a _lot_ of training, you can often
| predict the next symbol in a program text. But shouldn't we
| _expect_ you to be "unreliable"? You don't have the ability
| to run checks against the code you produce! You don't get a
| warning if you use a variable that doesn't exist! You just
| produce _tokens_, and get no feedback.
|
| To the degree humans are reliable at coding, it's because we
| can simulate what program execution will do, with a level of
| abstraction which we vary in a task dependent way. You can
| mentally step through every line in a program carefully if
| you need to. But you can also mentally choose to trust some
| abstraction and skip steps which you infer cannot be related
| to some attribute or condition of interest if that
| abstraction is upheld. The most important parts of your
| attention are on _what the program does_. This is fully
| hidden in the next-token-prediction scenario, which is
| totally focused on _what tokens are used to write the
| program_.
| kazinator wrote:
| I've written a few pieces of code with the help of AI. The
| trick is that I don't need the help. I can see where the bugs
| are.
|
| The AI could do certain things much faster than I would be able
| to by hand. For instance, in a certain hash table containing
| structures, it turned out that the deletion algorithm which
| moves elements couldn't be used because other code retains
| (reference counted) pointers to the structures, so the
| addresses of the objects have to be stable. The AI easily and
| correctly rewrote the code from "array of struct" to "array of
| pointer to struct" representation: the declaration of the thing
| and all the code working on it was correctly adjusted. I can do
| that myself, but not in two seconds.
| throwuwu wrote:
| Just wait a few months. Verifiability is the technique du jour
| and will make its way into a variety of models soon enough.
| westoncb wrote:
| Seems like there are certain fundamental limits to what can
| be done here though. Much of the advantage to using these
| models is being able to come up with a vague/informal spec
| and most of the time have it get what you mean and come up
| with something serviceable with very little effort. If the
| spec you have in mind to begin with is fuzzy and informal,
| what do you use to perform verification?
|
| After all, whether a result is correct or not depends on
| whether it matches the user's desire, so verification
| criteria must come from that source too.
|
| Sure there are certain types of relatively objective
| correctness that most of the time will line up with a user's
| desires, but this kind of verification can never be complete
| afaict.
| coffeecantcode wrote:
| .
| agentdrek wrote:
| is that a vi repeat? lol
| willsmith72 wrote:
| I could not disagree more. Having something generate code which
| is 90-100% correct is extremely valuable.
|
| E.g. creating a new page in an app.
|
| Feed an LLM the design, the language/framework/component
| library you're using, and get a page which is 90% of the way
| there. Some tweaks and you're good.
|
| Far, far quicker than going by hand, and often better quality
| code than a junior developer.
|
| Now I would never deploy that code without reading it,
| understanding it, and testing it, but I would always do that
| anyway. GPT-4 is close enough to a good software engineer that
| to resist it is to disadvantage your business.
|
| Now if you're coding for pleasure and the fun of creation and
| creativity, then ditch the LLMs, they take some of that fun
| away for sure. But most of the time, I'm focused on achieving
| business outcomes quickly. That's way more productive with a
| modern LLM.
| bigstrat2003 wrote:
| I strongly disagree. Having something which is 90% reliable
| doesn't save me any effort over just doing it myself. If I
| have to spend the time to check everything the tool generates
| (which I do, because LLMs are unreliable) then I may as well
| have written it myself to start with.
|
| I firmly believe that the worst kind of help is unreliable
| help. If your help never does its job, then you know you have
| to do everything yourself. If your help always does its job,
| you know you can trust it. If your help only sometimes does
| its job, you get the worst deal of all because you never know
| what you'll get and have to check every single time.
| willsmith72 wrote:
| > If I have to spend the time to check everything the tool
| generates (which I do, because LLMs are unreliable) then I
| may as well have written it myself to start with.
|
| Do you feel the same way about a PR from a coworker?
| kazinator wrote:
| The way you tackle the unreliability of coding assistants is to
| know what you're doing so that you don't need a coding assistant,
| so that you're only using it to save time and effort.
|
| Roughly speaking, if you stick to using AI for writing code you
| could have written yourself, you're okay.
| NateEag wrote:
| Though regular use of the assistant will degrade your ability
| to program without it.
| majormajor wrote:
| I think it does, in the same way that autocomplete does - I may
| not have typed out `.length` (or whatever it happens to be in
| the language I used last year) enough to remember, a year later
| without touching that language, whether it's len or length or
| size and whether it's a property or a method... but then it's a
| simple search away to refresh my memory, and overall the
| autocomplete saves a LOT of time that makes up for this.
|
| Yeah, if you never knew what the code that got generated did
| in the first place, that's not gonna apply, but if you're
| using it as basically just a code expander for things you
| could do the pseudocode for in your sleep, you're probably
| gonna be ok.
| digging wrote:
| I've only used a very small amount of AI assistance (mostly
| Anthropic's Claude), and always to learn, not to do. That is, I
| will ask it what's happening, why are things breaking, etc. It
| doesn't have to have the right answers, it just needs to
| unblock me.
|
| I hear it is also quite useful for _doing_ things which you
| know extremely well but are tedious to do. Anything in between
| is certainly a danger zone.
| cleandreams wrote:
| I have several issues with coding assistants.
|
| Over time, will less skilled programmers produce more critical
| code? I think so. At some point a jet will fall out of the sky
| because the coding assistant wasn't correct and the whole
| profession will have a black eye.
|
| The programmers will be less skilled because the (up until
| recently) lack of coding assistants provided a more rigorous and
| thorough training environment. With coding assistants the
| profession will be less intellectually selective because a wider
| range of people can do it. I don't think this is good given how
| critical some code is.
|
| There is another related issue. Studies have shown that use of
| Google Maps has resulted in a deterioration of people's ability
| to navigate. Human mapping and navigation ability needs training,
| which it doesn't get when Google Maps is used. This sort of
| thing will be an issue with coding.
| SoftTalker wrote:
| This will be an issue with all knowledge work. The machines
| will have the knowledge and more and more we will just trust
| them because we don't know ourselves. Google Maps is a great
| example.
| hamburga wrote:
| I want to build a "Google Maps that doesn't make you dumber."
|
| For local navigation, first and foremost. The goal is to
| teach you how to navigate your locale, so you use it less and
| less. You still will want to ask it for traffic updates, but
| you talk to it like you would between locals who know all the
| roads.
|
| It would be a model for how to do AI in a way that enhances
| your thinking rather than softening/replacing it.
| bena wrote:
| A wide range of people can build things. We trust only a few to
| build jets and skyscrapers, etc.
|
| I think much the same will happen with regards to programming.
| Sure, most people will be able to bust out a simple script to
| do X. But if you want to do a "serious task", you're going to
| get a professional.
| BurningFrog wrote:
| I spent 10 years pair programming. It's a similar situation in
| some ways.
|
| Like, I can't know if the code my pair writes has flaws, just
| like an AI coding assistant.
|
| I've never learned so much about programming as when pairing.
| Having someone else to ask or suggest improvements is just
| invaluable. It's very rare that I learn something new from myself,
| but another human will always know things I don't, just like an
| AI.
|
| Of course, you don't blindly accept the code your
| pair/assistant writes. You have to understand it, ask
| questions, and most of all write tests that confirm it
| actually works.
| the8472 wrote:
| Are there tools yet that put the compiler in the loop? Maybe even
| a check-compile-test loop that bails on the first failure and
| then tries to refine it in the background based on what failed?
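|
| Roughly the loop I have in mind - a minimal sketch, where
| ask_model is a hypothetical stand-in for whatever LLM API you
| use, and the test step assumes pytest is installed and the
| model was asked to include its own tests:
|
|   import subprocess, tempfile
|
|   def generate_checked(task, ask_model, max_attempts=3):
|       """Check-compile-test loop: bail on the first failure and
|       retry with the error message appended to the prompt."""
|       prompt = task
|       for _ in range(max_attempts):
|           code = ask_model(prompt)
|           with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
|               f.write(code)
|               path = f.name
|           checks = [
|               ["python", "-m", "py_compile", path],   # does it compile?
|               ["python", "-m", "pytest", "-x", path], # do tests pass?
|           ]
|           for cmd in checks:
|               result = subprocess.run(cmd, capture_output=True, text=True)
|               if result.returncode != 0:
|                   # first failure: stop, feed the error back, refine
|                   prompt = (task + "\n\nPrevious attempt:\n" + code +
|                             "\n\nIt failed with:\n" +
|                             (result.stderr or result.stdout) +
|                             "\nPlease fix it.")
|                   break
|           else:
|               return code  # compiled and tests passed
|       return None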
| bee_rider wrote:
| Then maybe the model could be fine-tuned on that loop, haha.
| Could be a fun thing to try at least.
| dontupvoteme wrote:
| It's a bit of a fool's errand because they train on information
| which is no longer valid and will get stuck if you don't
| inform them. For instance, GPT cannot write a connection to an
| OpenAI endpoint, because the API was upgraded to 1.0 and broke
| compatibility with all the code it learned from.
| righthand wrote:
| Honestly I'd rather just RTFM and code it myself than invent a
| game to deal with these issues.
| meindnoch wrote:
| I'm happy that more and more people embrace these tools for more
| and more critical software.
|
| It keeps me employed, and even increases my rate quite a bit.
| bee_rider wrote:
| Are you an EMT?
| abeppu wrote:
| I think this is "how to think about coding assistants and your
| task" but none of this is "tackling" their unreliability.
|
| While coding assistants seem to do well in a range of situations,
| I continue to believe that for coding specifically, merely
| training on next-token-prediction is leaving too much on the
| table. Yes, source code is represented as text, but computer
| programs are an area where there's available information which is
| _so much richer_. We can know not only the text of the program
| but the type of every expression, which variables are in scope at
| any point, what is the signature of a method we're trying to
| call, etc. These assistants should be able to make predictions
| about program _traces_, not just program source text. A step
| further would be to guess potential loop invariants, pre/post
| conditions, etc, confirm which are upheld by existing tests, and
| down-weight recommending changes which introduce violations to
| those inferred conditions.
|
| ChatGPT and tab-completion assistants have both given me things
| that are not even valid programs (e.g. will not compile, use a
| variable that isn't actually in scope, etc). ChatGPT even told me
| that an example it generated wasn't compiling for me b/c I wasn't
| using a new enough version of the language, and then referenced a
| language version which does not yet exist. All of this is
| possible in part b/c these tools are engaging only at the level
| of text, and are structurally isolated from the rich information
| available inside an interpreter or debugger. "Tackling"
| unreliability should start with reframing tasks in a way which
| lets tools better see the causes of their failures.
| jkaptur wrote:
| I absolutely agree, but I find the situation incredibly funny.
| There are three characters here: me, the LLM, and the compiler.
| Two of them are robots, but they refuse to talk to each other -
| it's up to me to tell the LLM the bad news from the compiler.
| sa-code wrote:
| This sounds like the worst game of telephone
| fragmede wrote:
| That's a frontend issue, though. If you use the Python
| interpreter in ChatGPT, you can tell it to run the code until
| it works; it'll do a couple of iterations before giving you
| code.
| convolvatron wrote:
| don't you already cut and paste your error message into Google
| like everyone else?
| swatcoder wrote:
| Indeed. These are basically tech demos at this point. A marvel
| to see, and sometimes useful, but still extremely crude.
|
| There's a lot of headroom for sophistication as a few more
| research insights are made and an experienced tooling team
| commits a few years to making rich multimodal/ensemble code
| assistants that can perform smart analyses, transformations,
| boilerplating, etc on your _project_ instead of just adding
| some text to your file.
|
| But it'll take years of insight and labor to build that
| system up, adapt it to different industries/uses, and prove it
| out as engineering-ready.
|
| People get caught up in the novelty of Copilot and ChatGPT and
| then imagine that the revolution arrives when some new paper
| comes out tomorrow (or that none will arrive because of today's
| limitations), but the far more likely reality is that the
| revolution paces as something more like the internet's -- real
| and profound, but unfolding gradually over the span of decades
| as people work hard to make new things with it.
| renegade-otter wrote:
| "Any sufficiently advanced technology is indistinguishable
| from magic."
|
| -Arthur C. Clarke
| TerrifiedMouse wrote:
| Frankly, we may be barking up the wrong tree with LLMs. Sure
| they deliver novel and very marketable results, hence the
| insane funding, but I can't help but feel it's really just a
| parlor trick and there is a yet undiscovered algorithm that
| can actually deliver the AGI that we seek - AGI that is
| reliable and precise like fictional AIs.
| jrockway wrote:
| I think a lot of progress in the field of AI comes from
| random breakthroughs rather than evolution of existing
| ideas. For that reason, I think AGI will probably be
| another random breakthrough, and not just an LLM with more
| parameters. Good if you're NVidia, bad if you're OpenAI.
| The random breakthrough could happen at Google, or it could
| happen in some professor's basement. There is no way to
| know, and no way to throw money at the problem and
| guarantee a return.
| staunton wrote:
| I think the ultimate jump in capability would be achieved by
| making a dedicated language to be used by an LLM and training
| it in parallel on mainly that. The problem is that to make
| this actually useful you need to establish a new programming
| language. And making a useful programming language that is
| actually used by people is even more expensive and takes more
| time than training a state-of-the-art LLM...
| Night_Thastus wrote:
| In order to really solve this, you couldn't use an LLM, at
| least not one in any shape that we have now.
|
| You'd need something that actually _understands_ the language.
| What is a lifetime, what is scope, what are types, functions,
| variables, etc. Something that can contextually look at
| correct, incomplete, or broken programs and reason about what
| they're doing by knowing the rules that the language operates
| on and following them to a conclusion. It would also need to
| understand high-level design patterns and structure to not just
| know what's happening in the literal sense "this variable is
| being incremented" but also in a more abstract sense "this
| variable keeps track of references because this is mixed
| C/Python code that needs that to handle deallocation".
| Something that recognizes patterns both within and outside of
| the code, with appropriately fuzzy matching where needed.
|
| And I think importantly you'd need to be able to query it at a
| variety of levels. What's happening on a line-by-line basis and
| what's happening at the high level at a given point of code.
| One OR the other isn't sufficient.
|
| That is not a simple ask. We're a long way away from something
| that smart existing.
| LeifCarrotson wrote:
| What LLM text generation has shown is that you don't actually
| have to understand English to generate pretty decent English.
| You just have to have enough examples.
|
| This is where the massive corpus of source code available on
| the Internet can help generate a "LSM" (large software model)
| if you can expose the tokens as the lexer understands them in
| the training set.
|
| If your LSM sees a trillion examples of correct usage of
| lifetime and scope and types and so on, then in the same way
| that an LLM trained on English grammar will emit text with
| correct grammar as if it _understands_ English, your LSM will
| generate software with correct syntax as if it _understands_
| the software. Whatever the definition of "understands" is in
| the context of an LLM.
| fragmede wrote:
| Right. We're already abstracting from English words and
| characters into tokens; piping code through half a compiler so
| that the LSM is given the AST to train on doesn't seem all
| that far-fetched.
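|
| Half of that compiler is already in the standard library; a tiny
| sketch of what "give it the AST" could mean (a toy illustration,
| not anyone's actual training pipeline):
|
|   import ast
|
|   source = "def area(r):\n    return 3.14159 * r ** 2\n"
|
|   # Parse to an AST and linearize it; a "large software model"
|   # could train on sequences like this instead of raw characters.
|   tree = ast.parse(source)
|   print(ast.dump(tree))      # Module(body=[FunctionDef(...)], ...)
|
|   # Going back from tree to source is mechanical:
|   print(ast.unparse(tree))   # recovers equivalent source text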
| abeppu wrote:
| But:
|
| - natural language is flexible, computer languages are less
| so.
|
| - "pretty decent English" still includes hallucinations.
| I've seen companies whose product demo for generating
| marketing copy just makes up a plausible review.
| Hallucinating methods, variables, other packages/modules
| yields broken code.
|
| - the human thought behind natural language is not feasible
| to directly provide to a model. An IR corresponding to the
| source of the program is feasible to provide. A trace of
| the program executing is feasible to provide. Grounding an
| LLM in the rich exterior world that humans talk about is
| hard; grounding an LSM in the rich internal representations
| accessible to an IDE or a debugger is achievable.
| majormajor wrote:
| "pretty decent english" is a pretty fuzzy bar.
|
| Indeed, GPT-4 and Copilot can generate "pretty decent code"
| that will look fine to the average human coder _even when it's
| incorrect_ (making up methods or getting params wrong or
| slightly missing requirements or similar).
|
| The level of precision required for "pretty decent non-
| trivial code" is much higher than prose that looks like it
| was written by an educated human, so I share the idea that
| if it was augmented - even in really _stupid_ ways like
| asking the IDE if it would even compile, in the case of
| Copilot, before suggesting it to the user - it would work
| _much_ better at a much lower effort than increasing its
| understanding implicitly by orders of magnitude.
| staunton wrote:
| > you don't actually have to understand English to generate
| pretty decent English. You just have to have enough
| examples.
|
| I would have thought babies have been showing this beyond a
| doubt since time immemorial.
| abeppu wrote:
| Absolutely it's not a simple ask. But research in program
| synthesis was making interesting progress before LLMs came
| along. I think it would be better to ask how ML can improve
| or speed up those efforts, rather than trying to make a
| general ML model "solve" such a complex problem as a special
| case.
|
| A step in this direction, which I've been trying to figure
| out as a side project, and which I would love someone to
| scoop me on and build a real thing, is to stitch an ML model
| into into a (mini)kanren. Minikanren folk have built
| relational interpreters for small languages but not for a
| "real" industrial language (so far as I'm aware). These small
| relational interpreters can generate programs given a set of
| relational statements about them (e.g. "find f so that f(x1)
| == y1, f(x2) == y2, ..."). Because they're actually examining
| the full universe of programs in their small language, they
| will eventually find all programs that satisfy the
| constraints. But their search is in an order that's given by
| the structure/definition of the interpreter, and so for
| complex sets of requirements, finding the _first_ satisfying
| example can be unacceptably slow. But what if the search were
| based on the softmaxed outputs of an ML model, and you do a
| least cost search? Then (roughly) in the case that beam-
| search generation from the ML model would have produced a
| valid answer, you still find it in roughly the same time, but
| you _know_ it's valid. In the case where a valid answer
| requires that in a small number of key places, we have to
| take a choice that the model assigns low probability, then
| we'll find it but it takes longer. And in the case that all
| the choices needed to construct a valid answer are low-
| probability under the model, then we're basically in the same
| place that the vanilla minikanren search was.
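|
| A crude sketch of the least-cost search I mean, with toy
| stand-ins everywhere: a hand-written next_tokens in place of the
| relational interpreter, and a uniform distribution in place of
| the trained model:
|
|   import heapq, math
|
|   TOKENS = ["x", "1", "2", "+"]
|
|   def next_tokens(prefix):
|       # stand-in for the relational interpreter: only legal continuations
|       if not prefix or prefix[-1] == "+":
|           return ["x", "1", "2"]
|       return ["+", "<end>"]
|
|   def model_logprob(prefix, tok):
|       # stand-in for the ML model; a trained scorer would go here
|       return math.log(1.0 / len(TOKENS))
|
|   def satisfies(prog, examples):
|       f = lambda x: eval(" ".join(prog), {}, {"x": x})
|       return all(f(x) == y for x, y in examples)
|
|   def guided_search(examples, max_len=5):
|       # least-cost search: cost of a prefix = -sum(log p(token))
|       heap = [(0.0, [])]
|       while heap:
|           cost, prefix = heapq.heappop(heap)
|           for tok in next_tokens(prefix):
|               if tok == "<end>":
|                   if satisfies(prefix, examples):
|                       return prefix   # first hit is *known* valid
|                   continue
|               if len(prefix) < max_len:
|                   heapq.heappush(
|                       heap,
|                       (cost - model_logprob(prefix, tok), prefix + [tok]))
|       return None
|
|   print(guided_search([(1, 3), (5, 7)]))   # finds e.g. ['2', '+', 'x']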
| dontupvoteme wrote:
| An interface layer is 70% of the value for 10% of the work.
| dxbydt wrote:
| Generating invalid code is a big hassle. I have used ChatGPT
| for generating R code & sometimes it refers to functions that
| don't exist. The standard deviation of [1,2,1] is 0.577, given
| by sd(c(1,2,1)). Sometimes I get stdev(c(1,2,3)) - there is no
| stdev in R. Why not have a slower mode where you run the code
| through the R interpreter first, & only emit valid code that
| passes that step?
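|
| Something as simple as this, bolted on outside the model, would
| already catch that (a rough sketch; it assumes Rscript is on the
| PATH):
|
|   import subprocess
|
|   def emit_if_valid(r_code, timeout=30):
|       # Run the candidate through the R interpreter; only hand it
|       # to the user if it actually runs. A missing function like
|       # stdev() makes Rscript exit non-zero, so it gets caught.
|       result = subprocess.run(["Rscript", "-e", r_code],
|                               capture_output=True, text=True,
|                               timeout=timeout)
|       return r_code if result.returncode == 0 else None
|
|   print(emit_if_valid("sd(c(1, 2, 1))"))      # kept
|   print(emit_if_valid("stdev(c(1, 2, 1))"))   # rejected: None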
| fragmede wrote:
| The Python one that currently exists can use Pandas, but then
| you're using Python and not R. Other language support must be
| on their roadmap; the only question is how far down the list it
| is.
| walt_grata wrote:
| You know, I've been thinking about this for a while, and I
| honestly don't think LLMs will ever be the right choice for a
| coding assistant. Where I've had a strange idea they could help,
| although it's completely unproven yet, is in replacing what we
| traditionally think of as a developer and code. I'm envisioning,
| for simple applications, a completely non-technical person
| having a conversation with a coding assistant and it producing
| something directly executable; think of the tokens it outputs
| as JVM bytecode or something similar. I'm sure there are
| innumerable flaws with my thinking/approach, but so far it's
| been fun to think and code around.
| CSMastermind wrote:
| It would be interesting to see them trained on actual syntax
| trees rather than raw text.
|
| Then maybe have a separate model trained on going from syntax
| tree to source code.
| CaptainFever wrote:
| > Then maybe have a separate model trained on going from
| syntax tree to source code.
|
| I don't see why you need a model for this. But yes, this is a
| very cool idea.
| delta_p_delta_x wrote:
| > these tools are engaging only at the level of text
|
| Couldn't we say the same thing about almost all UNIX/Linux
| coreutils? There's no way to get a strongly-typed array/tree of
| folders, files, and metadata from `ls`; it has to be awk-ed and
| sed-ed into compliance. There's no way to get a strongly-typed
| dictionary of command-line options and what they do for pretty
| much every coreutil; you have to invoke `something --help` and
| then awk/sed _that_ output again.
|
| These coreutils are, at their core, only stringly-typed, which
| makes life more difficult than it strictly needs to be.
|
| Everything is a bag of bytes, or a stream to be manipulated.
| This philosophy simply leaked into LLMs.
| wrs wrote:
| Several times when I've asked ChatGPT for an approach to
| something, it has spit out code that uses an API that looks
| perfect for my use case, but doesn't actually exist.
|
| So I'm thinking someone should be building an automated product
| manager!
| jwells89 wrote:
| Have had the same experience several times. "Yes ChatGPT, it
| makes perfect sense for that to exist and would be wonderful if
| it did, but unfortunately it does not."
| gnulinux wrote:
| This is what ChatGPT does to me whenever I ask anything non-
| trivial. I find it funny that people think it'll take over our
| jobs; it simply can't do even the most basic things off the
| beaten path.
| anigbrowl wrote:
| In practice I find a bigger problem is perversity - the AI
| assistant does OK with incremental prompting, but sometimes
| decides to just remove all comments from the code, or, if asked
| to focus its effort on one section, deletes all other code.
| Code assistants need to be much more tightly integrated with
| IDEs, and I think you probably need 2 or 3 running in parallel,
| maybe more.
| sevagh wrote:
| Yes, this really irritates me.
|
| Me: <prompt 1: modify this function>
|
| AI assistant (either ChatGPT or Copilot-X): <attempt 1>
|
| Me: <feedback>
|
| AI assistant: <attempt 2>, fixed with feedback, but deleted
| something crucial from attempt 1, for no reason at all
| darkteflon wrote:
| I feel like the work on using context-free grammars (CFGs) with
| LLMs should be low-hanging fruit for improving code assistants,
| but perhaps someone
| more knowledgeable could chime in[1], [2], [3].
|
| A lot of the confabulations we see today - non-existent
| variables, imports and APIs, out-of-scope variables, etc - would
| seem (to me) to be meaningfully addressable with these
| techniques.
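|
| To make that concrete, here is a toy sketch of the masking step
| (the grammar oracle and the "model" are both fake stand-ins; the
| projects linked below do this properly against a real CFG and
| real logits):
|
|   import random
|
|   VOCAB = ["x", "y", "1", "+", "*", "<end>"]
|
|   def allowed_next(prefix):
|       # stand-in for a CFG/parser state; a real one could also
|       # restrict identifiers to names actually in scope
|       if not prefix or prefix[-1] in ("+", "*"):
|           return ["x", "y", "1"]
|       return ["+", "*", "<end>"]
|
|   def model_dist(prefix):
|       # stand-in for the LLM's next-token distribution
|       return {tok: 1.0 / len(VOCAB) for tok in VOCAB}
|
|   def constrained_sample(max_len=7):
|       prefix = []
|       while len(prefix) < max_len:
|           dist = model_dist(prefix)
|           legal = allowed_next(prefix)
|           # zero out everything the grammar forbids, then sample
|           tok = random.choices(legal, [dist[t] for t in legal])[0]
|           if tok == "<end>":
|               break
|           prefix.append(tok)
|       return " ".join(prefix)
|
|   print(constrained_sample())   # e.g. "x * 1 + y" - never a
|                                 # token the grammar disallows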
|
| Relatedly, I have gotten surprisingly great mileage out of
| treating confabulations, in the first instance, as a user (ie, my
| own) failure to adequately contextualise, rather than as errors.
|
| In a sense, CFGs give you sharper tools to do that
| contextualisation. I wonder how far the "sharper tools" approach
| will carry us. It seems, to this interested layman, consistent
| with Shannon's work in statistical language modelling.[4]
|
| The term "prompt engineering" implies, to me, a focus on the
| instructive aspects of prompting, which is a bit unfortunate but
| also wholly consistent with the way I see colleagues and friends
| trying to interact with LLMs. Perhaps we should instead call it
| "context composition" or something, to emphasise the constructive
| nature.
|
| [1] https://github.com/outlines-dev/outlines
|
| [2] https://github.com/ggerganov/llama.cpp/pull/1773
|
| [3]
| https://www.reddit.com/r/LocalLLaMA/comments/156gu8c/d_const...
|
| [4] https://hedgehogreview.com/issues/markets-and-the-
| good/artic...
| firebot wrote:
| Maybe learn to code?
| sciolist wrote:
| My good-faith interpretation of your comment is: "Maybe learn
| to code (without external resources or tools)?" LLMs are
| another tool and resource, just like StackOverflow, linters,
| autocomplete, Google, etc. None of these tools are infallible,
| but they provide value. Just like all other tools, you don't
| need to use LLMs because of their issues - but we want them to
| be as useful as possible - what the author is trying to do.
| slalomskiing wrote:
| Sounds like an exercise in frustration
| vishnumenon wrote:
| My current hypothesis here is that the way to make coding
| assistants as reliable as possible is to shift the balance
| towards making their output rely on context provided in-prompt
| rather than information stored in LLM weights. As all the major
| providers shift towards larger context-windows, it seems
| increasingly viable to give the LLM the necessary docs for
| whatever libraries are being used in the current file. I've been
| working on an experiment in this space[0], and while it's
| obviously bottle-necked by the size of the documentation index,
| even a couple hundred documentation sources seem to help a ton
| when working with less-used languages/libraries.
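|
| The mechanical part is simple - the hard part is the index. A
| rough sketch of the prompt assembly (build_prompt and doc_index
| here are hypothetical names, not the actual implementation):
|
|   def build_prompt(task, imports, doc_index, budget_chars=20_000):
|       # pull relevant docs into the context window instead of
|       # relying on whatever the model memorised during training
|       context, used = [], 0
|       for name in imports:
|           doc = doc_index.get(name, "")
|           if doc and used + len(doc) <= budget_chars:
|               context.append("### Docs for " + name + "\n" + doc)
|               used += len(doc)
|       return "\n\n".join(context + ["### Task\n" + task])
|
|   prompt = build_prompt(
|       "Parse the config file and return a dict.",
|       imports=["tomllib"],
|       doc_index={"tomllib": "tomllib.load(fp): parse TOML from a "
|                             "binary file object and return a dict."},
|   )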
|
| [0]: https://indexical.dev/
| tomjakubowski wrote:
| fairly off-topic, sorry: Midjourney gets the cartoon donkey's
| snout wrong, so wrong. It looks more like a cartoon dog. For some
| reason I'm really bothered by it.
___________________________________________________________________
(page generated 2023-11-29 23:00 UTC)