[HN Gopher] How we made our AI code review bot stop leaving nitp...
___________________________________________________________________
How we made our AI code review bot stop leaving nitpicky comments
Author : dakshgupta
Score : 234 points
Date : 2024-12-18 16:31 UTC (4 days ago)
(HTM) web link (www.greptile.com)
(TXT) w3m dump (www.greptile.com)
| pama wrote:
| After failing with three reasonable ideas, they solved the
| problem with an idea that previously would have seemed unlikely
| to work (get similarity to past known nits and use a threshold of
| 3 similar nits above a certain cutoff similarity score for
| filtering). A lot of applications of ML have a similar hacky
| trial and error flavor like that. Intuition is often built in
| retrospect and may not transfer well to other domains. I would
| have guessed that finetuning would also work, but agree with the
| author that it would be more expensive and less portable across
| different models.
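|
| For illustration, a minimal sketch of that filtering rule
| (cosine similarity against past downvoted nits; the 3-match
| rule is from the post, the 0.85 cutoff is a hypothetical
| value):
|
|     import numpy as np
|
|     def cosine(a, b):
|         return float(a @ b /
|                      (np.linalg.norm(a) * np.linalg.norm(b)))
|
|     def looks_like_a_nit(comment_emb, past_nit_embs,
|                          cutoff=0.85, min_matches=3):
|         # Block the comment if it is close enough to at least
|         # `min_matches` previously downvoted nits.
|         matches = sum(cosine(comment_emb, e) >= cutoff
|                       for e in past_nit_embs)
|         return matches >= min_matches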
| dakshgupta wrote:
| Since we posted this, two camps of people reached out:
|
| Classical ML people who recommended we try training a
| classifier, possibly on the embeddings.
|
| Fine tuning platforms that recommended we try their platform.
|
| The challenge there would be gathering enough data _per
| customer_ to meaningfully capture their definition and standard
| for nit-pickiness.
| pama wrote:
| What you use now is a simple KNN classifier, and if it works
| well enough, perhaps no need to go much further. If you need
| to squeeze out a couple additional percentage points maybe
| try a different simple and robust ML classifier (random
| forest, xgboost, or a simple two layer network). All these
| methods, including your current classifier, will get better
| with additional data and minor tuning in the future.
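|
| A rough sketch of that suggestion (scikit-learn assumed; the
| arrays here are placeholders for comment embeddings and team
| downvote labels):
|
|     import numpy as np
|     from sklearn.ensemble import RandomForestClassifier
|
|     rng = np.random.default_rng(0)
|     X_train = rng.normal(size=(200, 768))      # embeddings
|     y_train = rng.integers(0, 2, size=200)     # 1 = nit
|
|     clf = RandomForestClassifier(n_estimators=200,
|                                  random_state=0)
|     clf.fit(X_train, y_train)
|
|     # Suppress new comments the model is confident are nits.
|     X_new = rng.normal(size=(5, 768))
|     keep = clf.predict_proba(X_new)[:, 1] < 0.5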
| dakshgupta wrote:
| Thank you, I will try this. I suspect we can extract some
| universal theory of nits and have a base filter to start
| with, and have it learn per-company preferences on top of
| that.
| keepingscore wrote:
| You should be able to do that already by taking all of
| your customers' nit embeddings and averaging them to
| produce a point in space that represents the universal
| nit. Embeddings are really cool, and the fact that they
| still carry meaning when averaged is one of their neat
| properties.
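|
| Something like this, assuming the embeddings are stacked into
| numpy arrays (names are illustrative):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     # Placeholder: one (n_i x dim) array of downvoted-nit
|     # embeddings per customer.
|     per_customer_nits = [rng.normal(size=(30, 768))
|                          for _ in range(5)]
|
|     universal_nit = np.vstack(per_customer_nits).mean(axis=0)
|     universal_nit /= np.linalg.norm(universal_nit)
|
|     # Cosine similarity of a new comment to the "universal nit".
|     new_comment_emb = rng.normal(size=768)
|     new_comment_emb /= np.linalg.norm(new_comment_emb)
|     nit_score = new_comment_emb @ universal_nit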
| dakshgupta wrote:
| This is a cool idea - I'll try this and add it as an
| appendix to this post.
| olddustytrail wrote:
| I don't think the prompt was reasonable. Nits are the eggs of
| headlice. Expecting the LLM to guess they meant superficial
| issues, rather than just spelling that out, is a bit weird.
|
| It's like saying "don't comment on elephants" when you mean
| don't comment on the obvious stuff.
| whitfin wrote:
| Yeah, even the title here says "nitpicky". I'm not sure if
| they tried "nitpicks" instead of "nits", but I don't know why
| you wouldn't...
| tayo42 wrote:
| Do llms struggle with other homonyms?
| Eisenstein wrote:
| They can, but I think the parent is saying that when you
| are crafting an instruction for someone or something to
| follow, you should be as direct and as clear as possible in
| order to remove the need for the actor to have to intuit a
| decision instead of act on certainty.
|
| I would start with something like: 'Avoid writing comments
| which only concern stylistic issues or which only contain
| criticisms considered pedantic or trivial.'
| Nullabillity wrote:
| [flagged]
| dakshgupta wrote:
| I would say they are useful but they aren't magic (at least
| yet) and building useful applications on top of them requires
| some work.
| WesolyKubeczek wrote:
| Not only do LLMs require some work to build anything useful,
| they are also fuzzy and nondeterministic, requiring you to
| tweak your prompting tricks once in a while, and they are
| also quite expensive.
|
| Nothing but advantages.
| dang wrote:
| " _Please don 't post shallow dismissals, especially of other
| people's work. A good critical comment teaches us something._"
|
| https://news.ycombinator.com/newsguidelines.html
| XenophileJKO wrote:
| Unless they were using a model too stupid for this operation, the
| fundamental problem is usually solvable via prompting.
|
| Usually the issue is that the models have a bias for action, so
| you need to give it an acceptable action when there isn't a good
| comment. Some other output/determination.
|
| I've seen this in many other similar applications.
| marviel wrote:
| +1 for "No Comment/Action Required" responses reducing trigger-
| happiness
| keepingscore wrote:
| This isn't my post, but I work with LLMs every day and I've seen
| this kind of instruction-ignoring behavior on Sonnet when the
| context window starts getting close to the edge.
| XenophileJKO wrote:
| I don't know, in practice there are so many potential causes
| that you have to look case by case in situations like that. I
| don't have a ton of experience with the raw Claude model
| specifically, but would anticipate you'll have the same
| problem classes.
|
| Usually it comes down to one of the following:
|
| - ambiguity and semantics (I once had a significant behavior
| difference between "suggest" and "recommend", i.e. a model
| can suggest without recommending.)
|
| - conflicting instructions
|
| - data/instruction bleeding (delimiters help, but if the span
| is too long it can lose track of what is data and what is
| instructions.)
|
| - action bias (If the task is to find code comments for
| example, even if you tell it not to, it will have a bias to
| do it as you defined the task that way.)
|
| - exceeding attention capacity (having to pay attention to
| too much or having too many instructions. This is where
| structured output or chain-of-thought type approaches help.
| They help focus attention on each step of the process and the
| related rules.)
|
| I feel like these are the ones you encounter the most.
| dakshgupta wrote:
| One-word changes impacting output are interesting but also
| quite frustrating. Especially because the patterns don't
| translate across models.
| gopher_space wrote:
| The "write like the people who wrote the info you want"
| pattern absolutely translates across models.
| sdesol wrote:
| Yes and no. I've found the order in which you give
| instructions matters for some models as well. With LLMs,
| you really need to treat them like black boxes and you
| cannot assume one prompt will work for all. It is
| honestly, in my experience, a lot of trial and error.
| dakshgupta wrote:
| This might be what we experienced. We regularly have context
| reach 30k+ tokens.
| NBJack wrote:
| Prompting only works if the training corpus used for the LLM
| has critical mass on the concept. I agree that the model should
| have alternatives (i.e. preface the comment with the tag
| 'Nit:'), but bear in mind code review nit-picking isn't exactly
| a well-researched area of study. Also to your point, given this
| was also likely trained predominantly on code, there's probably
| a lack of comment sentiment in the corpus.
| dbetteridge wrote:
| Prompt:
|
| If the comment could be omitted without affecting the code's
| functionality but is stylistic or otherwise can be ignored then
| preface the comment with
|
| NITPICK
|
| I'm guessing you've tried something like the above and then
| filtering for the preface, as you mentioned the llm being bad at
| understanding what is and isn't important.
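|
| The filtering half of that is trivial, e.g. (tag name as
| above, purely illustrative):
|
|     def drop_tagged(comments, tag="NITPICK"):
|         # Keep only comments the model did not tag as nitpicks.
|         return [c for c in comments
|                 if not c.lstrip().startswith(tag)]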
| abecedarius wrote:
| Nitpick: I'd ask for NITPICK at the end of output instead of
| the start. The model should be in a better place to make that
| decision there.
| ivanjermakov wrote:
| I find it ironic how capable LLMs have become, but they're
| still struggling with things like this. Great reminder that
| it's still text prediction at its core.
| spott wrote:
| I mean, think of LLM output as unfiltered thinking. If you
| were to make that determination would you make it before
| you had thought it through?
| Eisenstein wrote:
| I find it more helpful to keep in mind the autoregressive
| nature rather than the prediction. 'Text prediction' brings
| to mind that it is guessing what word would follow another
| word, but it is doing so much more than that.
| 'Autoregressive' brings to mind that it is using its
| previously generated output to create new output every time
| it comes up with another token. In that case you
| immediately understand that it would have to make the
| determination of severity after it has generated the
| description of the issue.
| dakshgupta wrote:
| We tried this too, better but not good enough. It also often
| labeled critical issues as nitpicks, which is unacceptable in
| our context.
| firesteelrain wrote:
| Then some of the nitpicks were actually valid?
|
| What about PossiblyWrong, then?
| fnqi8ckfek wrote:
| My boss can't stop nitpicking, and I've started telling him
| that "if your comment doesn't affect the output I'm ignoring
| it".
| marcandre wrote:
| That is such an over-simplistic criterion! Proof by absurdity:
| pass your code through an obfuscator / simplifier and
| the output won't be affected.
| fnqi8ckfek wrote:
| Ok?
|
| So if the boss's nitpick is "you must pass your code thru
| an indicator that doesn't affect the output", I'm going to
| ignore it. I don't understand what point you're trying to
| make.
| seb1204 wrote:
| My thinking too. Or similarly including some coding style and
| commenting style rules in the repository so they become part of
| the code base and get considered by the LLM
| throw310822 wrote:
| Seems easier than getting the same from my colleagues.
| mr_toad wrote:
| I wonder if it was trained on a lot of nitpicking comments from
| human reviews.
| iLoveOncall wrote:
| It's funny because I wouldn't consider the comment that they
| highlight in their post as a nitpick.
|
| Something that has an impact on the long term maintainability of
| code is definitely not nitpicky, and in the majority of cases
| "define a type" fits this category, as it makes refactors and
| extensions MUCH easier.
|
| On top of that, I think the approach they went with is a huge
| mistake. The same comment can be a nitpick on one CR but crucial
| on another; clustering them is destined to result in false
| positives and false negatives.
|
| I'm not sure I'd want to use a product to review my code for
| which 1) I cannot customize the rules, 2) it seems like the rules
| chosen by the creators are poor.
|
| To be honest I wouldn't want to use any AI-based code reviewer at
| all. We have one at work (FAANG, so something with a large
| dedicated team) and it has not once produced a useful comment and
| instead has been factually wrong many times.
| dakshgupta wrote:
| This is an important point - there is no universal
| understanding of nitpickiness. It is why we have it learn every
| new customer's ways from scratch.
| mannykannot wrote:
| This does not address the issue raised in iLoveOncall's third
| paragraph: "the same comment can be a nitpick on one CR but
| crucial on another..." In "attempt 2", you say that "the LLMs
| judgment of its own output was nearly random", which raises
| questions that go well beyond just nitpicking, up to that of
| whether the current state of the art in LLM code review is
| fit for much more than ticking the box that says "yes, we are
| doing code review."
| lupire wrote:
| If you are using an LLM for judgment, you are using it
| wrong. An LLM is good for generating suggestions,
| brainstorming, not making judgments.
|
| That's why it is called _Generative_ AI.
| mannykannot wrote:
| Indeed - and if you are doing code review without
| judgement, you are doing it wrong.
| jbirer wrote:
| Two theories:
|
| 1) This is an ego problem. Whoever is doing the development
| cannot handle being called out on certain software architecture
| / coding mistakes, so it becomes "nitpicking".
|
| 2) The software shop has a "ship out faster, cut corners"
| culture, which at that point might as well turn off the AI
| review bot.
| AgentOrange1234 wrote:
| 1 is super interesting.
|
| I've found it's nice to talk to an LLM about personal issues
| because I know it's not a real person judging me. Maybe if
| the comments were kept private with the dev, it'd be more
| just a coaching tool that didn't feel like a criticism?
| Comfy-Tinwork wrote:
| Private with the Dev, the operator & everyone who uses any
| product of the training data they build.
| trash_cat wrote:
| This is a company culture problem.
| kgeist wrote:
| What about false positives?
|
| As I see it, the solution assumes the embeddings only capture the
| form: say, if developers previously downvoted suggestions to wrap
| code in unnecessary try..catch blocks, then similar suggestions
| will be successfully blocked in the future, regardless of the
| module/class etc. (i.e. a kind of generalization)
|
| But what if enough suggestions regarding class X (or module X)
| get downvoted, and then the mechanism starts assuming class
| X/module X doesn't need review at all? I mean the case when a lot
| of such embeddings end up clustering around the class itself (or
| a function), not around the general form of the comment.
|
| How do you prevent this? Or is it unlikely to happen? The only
| metric I've found in the article is the percentage of addressed
| suggestions that made it to the end user.
| dakshgupta wrote:
| This is the biggest pitfall of this method. It's partially
| combatted by also comparing it against an upvoted set, so if a
| type of comment has been upvoted and downvoted in the past, it
| is not blocked.
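|
| One illustrative way to wire that up (the exact rule and the
| cutoffs here are hypothetical, not the actual implementation):
|
|     import numpy as np
|
|     def cosine(a, b):
|         return float(a @ b /
|                      (np.linalg.norm(a) * np.linalg.norm(b)))
|
|     def should_block(comment_emb, downvoted, upvoted,
|                      cutoff=0.85, min_matches=3):
|         def near(embs):
|             return sum(cosine(comment_emb, e) >= cutoff
|                        for e in embs)
|         # Block only if it resembles past nits and does not
|         # also resemble comments the team found valuable.
|         return (near(downvoted) >= min_matches
|                 and near(upvoted) == 0)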
| Falimonda wrote:
| fwiw, not all the mobile site's menu items work when clicked
| panarchy wrote:
| I wonder how long until we end up with questionable devs making
| spurious changes just to try and game the LLM output to give them
| a pass.
| johnfn wrote:
| I already have devs doing this to me at work and I'm not even
| an AI
| dakshgupta wrote:
| You could include a comment that says "ignore that I did
| ______" during review. As long as a human doesn't do the second
| pass (we recommend they do), that should let you slip your code
| by the AI.
| untech wrote:
| I am not sure about using UPPERCASE in prompts for emphasis. I
| feel intuitively that uppercase is less "understandable" for LLMs
| because it is more likely to be tokenized as a sequence of
| characters. I have no data to back this up, though.
| dakshgupta wrote:
| I meant for that to be more of an illustration - might do a longer
| post about the specific prompting techniques we tried.
| pedrovhb wrote:
| Here's an idea: have the LLM output each comment with a
| "severity" score ranging from 0-100 or maybe a set of possible
| values ("trivial", "small", "high"). Let it get everything off of
| its chest outputting the nitpicks but recognizing they're minor.
| Filter the output to only contain comments above a given
| threshold.
|
| It's hard to avoid thinking of a pink elephant, but easy enough
| to consciously recognize it's not relevant to the task at hand.
| iLoveOncall wrote:
| Here's an idea: read the article and realize they already tried
| exactly that.
| zahlman wrote:
| The article authors tried this technique and found it didn't
| work very well.
| jumploops wrote:
| We do something internally[0] but specifically for security
| concerns.
|
| We've found that by having the LLM provide a "severity" level
| (simply low, medium, high), we're able to filter out all the
| nitpicky feedback.
|
| It's important to note that this severity level should be
| specified at the end of the LLM's response, not the beginning or
| middle.
|
| There's still an issue of context, where the LLM will provide a
| false positive due to unseen aspects of the larger system (e.g.
| make sure to sanitize X input).
|
| We haven't found the bot to be overbearing, but mostly because we
| auto-delete past comments when changes are pushed.
|
| [0]
| https://magicloops.dev/loop/3f3781f3-f987-4672-8500-bacbeefc...
| dakshgupta wrote:
| The severity needing to be at the end was an important insight.
| It made the results much better but not quite good enough.
|
| We had it output JSON with fields {comment: string, severity:
| string} in that order.
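|
| The post-processing on that output is then just a parse and
| filter, roughly (assuming the model returns a JSON array of
| such objects; the severity values are illustrative):
|
|     import json
|
|     def filter_comments(llm_output, drop=("nit", "low")):
|         comments = json.loads(llm_output)
|         # severity comes after the comment text, so the model
|         # "decides" it only after writing the comment.
|         return [c for c in comments
|                 if c["severity"].lower() not in drop]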
| Merik wrote:
| Another variation on this is to think about tokens and
| definitions. Numbers don't have inherent meaning for your use
| case, so if you use numbers you need to provide an explicit
| definition of each rating number in the prompt. Similarly,
| and more effectively, you can use labels such as low-quality,
| medium-quality, high-quality, again providing an explicit
| definition of each label; one step further is to use explicit
| self-describing labels (along with detailed definitions) such
| as "trivial-observation-on-naming-convention" or "insightful-
| identification-on-missed-corner-case".
|
| Effectively you are turning a somewhat arbitrary numeric
| "rating" task , into a multi label classification problem
| with well defined labels.
|
| The natural evolution is to then train a BERT based
| classifier or similar on the set of labels and comments,
| which will get you a model judge that is super fast and can
| achieve good accuracy.
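|
| Sketching the label-driven filter (the label set and the
| keep/drop split below are invented for illustration):
|
|     LABELS = {
|         "trivial-observation-on-naming-convention": "drop",
|         "stylistic-preference-no-behavior-change": "drop",
|         "insightful-identification-on-missed-corner-case": "keep",
|         "likely-functional-bug": "keep",
|     }
|
|     def keep_comment(label):
|         # Unknown labels default to "keep" so nothing critical
|         # is silently dropped.
|         return LABELS.get(label, "keep") == "keep"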
| nikolayasdf123 wrote:
| > $0.45/file capped at $50/dev/month
|
| wow. this is really expensive... especially given the core of
| this technology is open source and target customers can set it
| up themselves, self-hosted
| MacsHeadroom wrote:
| A cap of less than 1% of an average developer's pay is "really
| expensive"?
| Arainach wrote:
| For a glorified linter? Absolutely.
|
| For reference, IntelliJ Ultimate - a full IDE with leading
| language support - costs that much.
| nikolayasdf123 wrote:
| who is this average developer? layoffs have been left and right
| for the last 4 years. startups are getting shut down. funding
| for gov contracts is getting cut. it is increasingly hard to
| find any job in software. many guys I know are barely making it
| and would rather spend that money on their kids or living
| expenses. "average developers" in, say, china, the philippines,
| india, vietnam, or cis countries make way less and won't
| un-frugally spend it on a 50USD subscription for something that
| may not even work well and may require a human anyways. paying
| 0.4USD "per-file" is just ridiculous. this pricing immediately
| puts people off and is a non-starter
|
| UPD: oh, you mean management will fire SWEs and replace them
| with this? well, yeah, then it makes sense to them. but the
| quality has to be good. and even then many mid to large size
| orgs I know are cutting all subscriptions (particularly per
| developer or per box) they possibly can (e.g. Microsoft,
| Datadog etc.) so even for them cost is of importance
| callamdelaney wrote:
| Does it work on real engineers?
| defrost wrote:
| A quarter of real engineers are already Civil.
| aeve890 wrote:
| What do real engineers have to do with software? /jk
| Havoc wrote:
| Think it would initially have gone better had they not used
| "nits" but rather "nitpicks", i.e. something that's in the
| dictionary that the chatbot is likely to understand.
| kgeist wrote:
| They should have also added some basic chain-of-thought
| reasoning in the prompt + asked it to add [nitpick] tag at the
| end, so that the likelihood of adding it correctly increased
| due to in-context learning. Then another pass removes all the
| nitpicks and internal reasoning steps.
| tayo42 wrote:
| > Attempt 2: LLM-as-a-judge
|
| Wouldn't this be achievable with a classifier model? Maybe even a
| combo of getting the embedding and then putting it through a
| classifier? Kind of like how GANs work.
|
| Edit: I read the article before the comment section, silly me lol
| just-another-se wrote:
| And how do you guys deal with a cold start problem then? Suppose
| the repo is new and has very few previous comments?
| croemer wrote:
| Do pooling with the average embedding of all customers
| dakshgupta wrote:
| We do use some other techniques to filter out comments that are
| almost always considered useless by teams.
|
| For the typical team size that uses us (at least 20+ engineers)
| the number of downvotes gets high enough to show results within
| a workday or two, and achieves something of a stable state
| within a week.
| dcreater wrote:
| Are you still looking for slaves, err I mean employees, who are
| going to work 80+hrs a week? Do you sponsor visas?
| jerrygoyal wrote:
| I'm in the market for PR review bots, as the nitpicking issue is
| real. So far, I have tried Coderabbit, but it adds too much noise
| to PRs, and only a very small percentage of comments are actually
| useful. I specifically instructed it to ignore nitpicks, but it
| still added such comments. Their cringy ASCII art comments make
| it even harder to take them seriously.
|
| I recently signed up for Korbit AI, but it's too soon to provide
| feedback. Honestly, I'm getting a bit fed up with experimenting
| with different PR bots.
|
| Question for the author: In what ways is your solution better
| than Coderabbit and Korbit AI?
| dakshgupta wrote:
| I haven't explored Korbit but with CodeRabbit there are a
| couple of things:
|
| 1. We are better at full codebase context, because of how we
| index the codebase like a graph and use graph search and an LLM
| to determine what other parts of the codebase should be taken
| into consideration while reviewing a diff.
|
| 2. We offer an API you can use to build custom workflows, for
| example every time a test fails in your pipeline you can pass
| the output to Greptile and it will diagnose with full codebase
| context.
|
| 3. Customers of ours that switched from CodeRabbit usually say
| Greptile has far fewer and less verbose comments. This is
| subjective, of course.
|
| That said, CodeRabbit is cheaper.
|
| Both products have free trials though, so I would recommend
| trying both and seeing which one your team prefers, which is
| ultimately what matters.
|
| I'm happy to answer more questions or do a demo for your team too
| -> daksh@greptile.com
| swells34 wrote:
| Isn't code review the one area you'd _never_ want to replace
| the programmer in? Like, that is the most important step of the
| process; it's where we break down the logic and sanity check
| the implementation. That is expressly where AI is the weakest.
| imoverclocked wrote:
| Many code review products are not AI. They are often
| collections of rules that have been specifically crafted to
| catch bad patterns. Some of the rules are stylistic in nature
| and those tend to turn people away the fastest IMHO.
|
| Many of these tools can be integrated into a local workflow
| so they will never ping you on a PR but many developers
| prefer to just write some code and let the review process
| suss things out.
| yorwba wrote:
| You probably shouldn't take a code review bot's "LGTM!" at
| face value, but if it tells you there is an issue with your
| code and it is indeed an issue with your code, it has
| successfully subbed in for your coworker who didn't get
| around to looking at it yet.
| imiric wrote:
| > if it tells you there is an issue with your code and it
| is indeed an issue with your code
|
| That's the crucial bit. If it ends up hallucinating issues
| as LLMs tend to do[1], then it just adds more work for
| human developers to confirm its report.
|
| Human devs can create false reports as well, but we're more
| likely to miss an issue than misunderstand and be confident
| about it.
|
| [1]:
| https://www.theregister.com/2024/12/10/ai_slop_bug_reports/
| dakshgupta wrote:
| I agree - at least with where technology is today, I would
| strongly discourage replacing human code review with AI.
|
| It does however serve as a good first pass, so by the time
| the human reviewer gets it, the little things have been
| addressed and they can focus on broader technical decisions.
|
| Would you want your coworkers to run their PRs through an AI
| reviewer, resolve those comments, then send it to you?
| AgentOrange1234 wrote:
| This feels like the right take.
|
| Once we have an AI starting to do something, there's an art
| to gradually adopting it in a way that makes sense.
|
| We don't lose the ability to have humans review code. We
| don't have to all use this on every PR. We don't have to
| blindly accept every comment it makes.
| jerrygoyal wrote:
| No one is blindly merging PR after AI Code Review. Think of
| review bot as an extra pair of eyes.
| bobnamob wrote:
| Yet.
|
| The things I've seen juniors try to do because "the AI said
| so" is staggering. Furthermore, I have little faith that
| the average junior gets competent enough teaching/mentoring
| that this attitude won't be a pervasive problem in 5ish
| years.
|
| Admittedly, maybe I'm getting old and this is just another
| "darn kids these days" rant
| hansvm wrote:
| If you think of the "AI" benefit as more of a linter then the
| value starts to become clear.
|
| E.g., I've never hesitated to add a linting rule requiring
| `except Exception:` in lieu of `except:` in Python, since the
| latter is very rarely required (so the necessary incantations
| to silence the linter are cheap on average) and is almost
| always a bug, prone to making shutdown/iteration/etc. harder
| than they should be. When I add
| that rule at new companies, ~90% of the violations are latent
| bugs.
|
| AI has the potential to (though I haven't seen it work well
| yet in this capacity) lint most such problematic patterns. In
| an ideal world, it'd even have the local context to know
| that, e.g., since you're handling a DB transaction and
| immediately re-throwing, the raw `except:` is appropriate, and
| not even flag it in the "review" (the AI linting), reducing
| the false positive rate. You'd still want a human review, but
| you could avoid bugging a human till the code is worth
| reviewing, or you could let the human focus on things that
| matter instead of harping on your use of low-level atomic
| fences yet again. AI has potential to improve the transition
| from "code complete" to "shipped and working."
| siscia wrote:
| Wouldn't it be simpler to just use a linter then?
| stavros wrote:
| It would be simpler, but it wouldn't be as effective.
| abenga wrote:
| How so? The build fails if linter checks don't pass. What
| does an ai add to this?
| stavros wrote:
| Much more nuanced linting rules.
| fzeroracer wrote:
| What exactly is gained from these nuanced linting rules?
| Why do they need to exist? What are they actually doing
| in your codebase other than sucking up air?
| stavros wrote:
| Are you arguing against linters? These disingenuous
| arguments are just tiring, you either accept that linters
| can be beneficial, and thus more nuanced rules are a good
| thing, or you believe linters are generally useless. This
| "LLM bad!" rhetoric is exhausting.
| fzeroracer wrote:
| No. And speaking of disingenuous arguments, you've made
| an especially absurd one here.
|
| Linters are useful. But you're arguing that you have such
| complex rules that they cannot be performed by a linter
| and thus must be offloaded to an LLM. I think that's
| wrong, and it's a classical middle management mistake.
|
| We can all agree that some rules are good. But that does
| not mean more rules are good, nor does it mean more
| complex rules are good. Not only are you integrating a
| whole extra system to support these linting rules, you
| are doing so for the sake of adding even more complex
| linting rules that cannot be automated in a way that
| prevents developer friction.
| stavros wrote:
| If you think "this variable name isn't descriptive
| enough" or "these comments don't explain the 'why' well
| enough" is constructive feedback, then we won't agree,
| and I'm very curious as to what your PR reviews look
| like.
| hansvm wrote:
| Linters suffer from a false positive/negative tradeoff
| that AI can improve. If they falsely flag things then
| developers tend to automatically ignore or silence the
| linter. If they don't flag a thing then ... well ... they
| didn't flag it, and that particular burden is pushed to
| some other human reviewer. Both states are less than
| ideal, and if you can decrease the rate of them happening
| then the tool is better in that dimension.
|
| How does AI fit into that picture then? The main benefits
| IMO are the abilities to (1) use contextual clues, (2)
| process "intricate" linting rules (implicitly, since it's
| all just text for the LLM -- this also means you can
| process many more linting rules, since things too
| complicated to be described nicely by the person writing
| a linter without too high of a false positive rate are
| unlikely to ever be introduced into the linter), and (3)
| giving better feedback when rules are broken. Some
| examples to compare and contrast:
|
| For that `except` vs `except Exception:` thing I
| mentioned, all a linter can do is check whether the
| offending pattern exists, making the ~10% of proper use
| cases just a little harder to develop. A smarter linter
| (not that I've seen one with this particular rule yet)
| could allow a bare `except:` if the exception is always
| re-raised (that being both the normal use-case in DB
| transaction handling and whatnot where you might
| legitimately want to catch everything, and also a coding
| pattern where the practice of catching everything is
| unlikely to cause the bugs it normally does). An AI
| linter can handle those edge cases automatically, not
| giving you spurious warnings for properly written DB
| transaction handling. Moreover, it can suggest a
| contextually relevant proper fix (`except BaseException:`
| to indicate to future readers that you considered the
| problems and definitely want this behavior, `except
| Exception:` to indicate that you do want to catch
| "everything" but without weird shutdown bugs, `except
| SomeSpecificException:` because the developer was just
| being lazy and would have accidentally written a new bug
| if they caught `Exception` instead, or perhaps just
| suggesting a different API if exceptions weren't a
| reasonable way to control the flow of execution at that
| point).
|
| As another example, you might have a linting rule banning
| low-level atomics (fences, seq_cst loads, that sort of
| thing). Sometimes they're useful though, and an AI linter
| could handle the majority of cases with advice along the
| lines of "the thing you're trying to do can be easily
| handled with a mutex; please remove the low-level
| atomics". Incorporating the context like that is
| impossible for normal linters.
|
| My point wasn't that you're replacing a linter with an
| AI-powered linter; it's that the tool generates the same
| sort of local, mechanical feedback a linter does -- all
| the stuff that might bog down a human reviewer and keep
| them from handling the big-picture items. If the tool is
| tuned to have a low false-positive rate then almost any
| advice it gives is, by definition, an important
| improvement to your codebase. Human reviewers will still
| be important, both in catching anything that slips
| through, and with the big-picture code review tasks.
| eterm wrote:
| Code review is also one of the last places to get adequate
| knowledge-sharing into many processes which have gone fully
| remote / asynchronous.
| righthand wrote:
| Our AI code review bot (Codacy) is just an LLM that compiles all
| linter rules and might be the most annoying useless thing. For
| example it will ding your PR for not considering Opera browser
| limitations on a backend NodeJS PR.
|
| Furthermore, in most of the code reviews I perform, I rarely
| leave commentary. There are so many frameworks and libraries
| today that solve whatever problem that, unless someone adds
| complex code or puts a file in a goofy spot, it's an instant
| approval. So an AI bot doesn't help with what is already a
| minimal, non-problem task.
| mcbishop wrote:
| Some have a smaller tighter codebase... where it's realistic to
| pursue consistent application of an internal style guide (which
| current AI seems well suited to help with (or take ownership
| of)).
| fallingknife wrote:
| A linter is well suited for this. If you have style rules too
| complex for a linter to handle I think the problem is the
| rules, not the enforcement mechanism.
| doikor wrote:
| Code style is something that should be enforceable with
| linters and formatters for the most part without any need for
| an AI.
| lupire wrote:
| That's not what style means.
| Spivak wrote:
| I've honestly considered just making a script that hits the
| checkbox automatically because we have to have "code review"
| for compliance but 99% of the time I simply don't care. Is the
| change you made gonna bring down prod, no, okay ship it.
|
| At some level if I didn't trust you to not write shitty code
| you wouldn't be on our team. I don't think I want to go all the
| way and say code review is a smell but not really needing it is
| what code ownership, good integration tests, and good qa buys
| you.
| pacifika wrote:
| Code review is a good onboarding and upskilling tool, as a
| reviewer.
| rgbrgb wrote:
| How do you bootstrap the trust without code review? Always
| confused how teams without review onboard someone new.
| ajmurmann wrote:
| Pair-programming is one option. It might seem expensive at
| first glance but usually both parties get a lot of value
| from it (new possible ideas for the teacher if the learner
| isn't entirely junior and obviously quick learning for the
| new person)
| criley2 wrote:
| I think you just accept the risk, same as not doing code
| review. If you're doing something Serious(tm) then you
| follow the serious rules to protect your serious business.
| If it doesn't actually matter that much if prod goes down
| or you release a critical vulnerability or leak client data
| or whatever, then you accept the risk. It's cheaper to push
| out code quickly and clean up the mess as long as
| stakeholders and investors are happy.
| Tainnor wrote:
| > At some level if I didn't trust you to not write shitty
| code you wouldn't be on our team.
|
| Code review isn't about preventing "shitty code" because you
| don't trust your coworkers.
|
| Of course, if 99% of the time you don't care about the code
| review then it's not going to be an effective process, but
| that's a self-fulfilling prophecy.
| jahnu wrote:
| Our approach is to leave PR comments as pointers in the areas
| where the reviewer needs to look, where it is most important,
| and they can forget the rest. I want good code reviews but
| don't want to waste anyone's time.
| minasmorath wrote:
| To be honest, I tried to work with GitHub Copilot to see
| if it could help junior devs focus in on the important parts
| of a PR and increase their efficiency and such, but... I've
| found it to be worse than useless at identifying where the
| real complexities and bug opportunities actually are.
|
| When it does manage to identify the areas of a PR with the
| highest importance to review, other problematic parts of the
| PR will go effectively unreviewed, because the juniors are
| trusting that the AI tool was 100% correct in identifying the
| problematic spots and nothing else is concerning. This is
| partially a training issue, but these tools are being
| marketed as if you can trust them 100% and so new devs
| just... do.
|
| In the worst cases I see, which happen a good 30% of the time
| at this point based on some rough napkin math, the AI directs
| junior engineers into time-wasting rabbit holes around things
| that are actually non-issues while actual issues with a PR go
| completely unnoticed. They spend ages writing defensive code
| and polyfills for things that the framework and its included
| polyfills already 100% cover. But of course, that code was
| usually AI generated too, and it's incomplete for the case
| they're trying to defend against anyway.
|
| So IDK, I still think there's value in there somewhere, but
| extracting it has been an absolute nightmare for me.
| righthand wrote:
| I do this as well for my PRs, it's a good way to leave a
| little explainer on the code especially with large merge
| requests.
| oakejp12 wrote:
| Are the junior programmers that you hire so good that you
| don't need training or commentary? I find I spend a lot of time
| reviewing MRs and leaving commentary. For instance, this past
| week, I had a junior apply the same business logic across
| classes instead of creating a class/service and injecting it as
| a dependency.
|
| This, improper handling of exceptions, missed testing cases or
| no tests at all, incomplete types, a misunderstanding of a
| nuanced business case, etc. An automatic approval would leave
| the codebase in such a dire state.
| anonzzzies wrote:
| We found that in general it's pretty hard to make the llm just
| stop without human intervention. You can see with things like
| Cline that if the LLM has to check its own work, it'll keep
| making 'improvements' in a loop: removing all comments, adding
| all comments, etc. It needs to generate something and seems
| overly eager to give you something.
| seb1204 wrote:
| Would the LLM make fewer nitpicking comments if the code base
| included coding style and commenting rules in the repository?
| extr wrote:
| I'm surprised the authors didn't try the "dumb" version of the
| solution they went with: instead of using fancy cosine similarity
| to create implicit clusters, just ask it to classify the comment
| along a few dimensions and then do your own filtering on that (or
| create your own 0-100 scoring!) Seems like you would have more
| control that way and actually derive some rich(er) data to fine
| tune on. It seems they are already almost doing this: all the
| examples in the article start with "style"!
|
| I have seen this pattern a few times actually, where you want the
| AI to mimic some heuristic humans use. You never want to ask it
| for the heuristic directly, just create the constituent data so
| you can do some simple regression or whatever on top of it and
| control the cutoff yourself.
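|
| A sketch of that "dumb" version (the dimensions, weights, and
| cutoff below are made up; the point is that the threshold
| stays in your hands):
|
|     def score(c):
|         # c is the model's structured judgment per comment.
|         s = 0
|         if c["blocks_merge"]:
|             s += 60
|         if not c["is_stylistic"]:
|             s += 20
|         s += int(20 * c["confidence"])
|         return s
|
|     comments = [  # example model outputs (shape assumed)
|         {"is_stylistic": True, "blocks_merge": False,
|          "confidence": 0.9},
|         {"is_stylistic": False, "blocks_merge": True,
|          "confidence": 0.7},
|     ]
|     kept = [c for c in comments if score(c) >= 50]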
| pacifika wrote:
| I thought the LLM would just make up the scores because of a
| lack of training.
| fnqi8ckfek wrote:
| Reading the other comments here, am I the only one thinking
| that this is just busywork...? Just get rid of the thing; it's a
| solution without a problem.
| dannersy wrote:
| I've been following this conversation and many of the
| sentiments against an AI code review bot are gone. I guess
| they're getting shadow banned?
| stavros wrote:
| I'm fairly sure there are at least a few companies that have
| the problem of needing PRs reviewed.
| Hilift wrote:
| I guess normally it's difficult to stop an employee from
| leaving nitpicky comments. But with AI you can finally take
| control.
| profsummergig wrote:
| Is there a name for the activity of trying out different
| strategies to improve the output of AI?
|
| I found this article surprisingly enjoyable and interesting,
| and I'd like to find more like it.
| thrw42A8N wrote:
| Prompt engineering
| profsummergig wrote:
| But this wasn't a different prompt that made it work better.
| Here they created a database of prior human-created comments
| with ratings of how useful the devs found them.
| keybored wrote:
| Why review bots? Why not a bot that runs as part of the
| validation suite? And then you dismiss and follow up on what you
| want?
|
| You can run that locally.
| Kwpolska wrote:
| > Please don't use HN primarily for promotion. It's ok to post
| your own stuff part of the time, but the primary use of the site
| should be for curiosity.
|
| https://news.ycombinator.com/newsguidelines.html
| mosselman wrote:
| I don't feel like this is primarily promotion. Sure it is a bit
| of a content marketing article, but still offers some
| potentially valuable insight.
| Kwpolska wrote:
| A look at OP's posting history suggests he posts a lot of
| advertising.
| Retr0id wrote:
| > LLMs (which are paid by the token)
|
| Straightforwardly true, and yet I'd never thought about it like
| this before. i.e. that there's a perverse incentive for LLM
| vendors to tune for verbose outputs. We rightly raise eyebrows at
| the idea of developers being paid per volume of code, but it's
| the default for LLMs.
| britannio wrote:
| Only while a vendor is ahead of the others. We developers will
| favour a vendor with faster inference and lower pricing.
| Retr0id wrote:
| I think the real endgame is some kind of "no win, no fee"
| arrangement. In the case of the article in the OP, it'd be if
| they billed clients per "addressed" comment. It's less clear
| how that would map onto someone selling direct access to an
| LLM, though.
| dustingetz wrote:
| then, cartels
| vergessenmir wrote:
| It's why I prefer Claude to, let's say, ChatGPT. I'm paying for
| tokens generated too.
| rgbrgb wrote:
| In the current landscape I think competition handles this
| neatly, especially since open models have almost the opposite
| incentive (training on concision is cheaper). As siblings note,
| Claude tends to be a little less verbose. Though I find they
| can all be quite concise when instructed (e.g. "just the code,
| no yapping").
| johnfn wrote:
| LLMs are paid by input token _and_ output token. LLM developers
| are just as incentivized to give quality output so you make
| more calls to the LLM.
| thomasahle wrote:
| TLDR:
|
| > Giving few-shot examples to the generator didn't work.
|
| > Using an LLM-judge (with no training) didn't work.
|
| > Using an embedding + KNN-classifier (lots of training data)
| worked.
|
| I don't know why they didn't try fine-tuning the LLM-judge, or at
| least give it some few-shot examples.
|
| But it shows that embeddings can make very simple classifiers
| work well.
| planetpluta wrote:
| > Essentially we needed to teach LLMs (which are paid by the
| token) to only generate a small number of high quality comments.
|
| The solution of filtering after the comment is generated doesn't
| seem to address the "paid by the token" piece.
| dietr1ch wrote:
| It may be a step forward in terms of gathering data and
| defining the filter, but pushing up this filter might be
| expensive unless there's enough volume to justify it.
|
| I took it as a comment on that generally models will be biased
| towards being nitty and that's something that needs to be dealt
| with as the incentives are not there to fix things at the
| origin.
| wzdd wrote:
| Previously: https://news.ycombinator.com/item?id=42465374
| hsbauauvhabzb wrote:
| I can't fathom a world where an LLM would be able to review code
| in any meaningful way -at all-.
|
| It should not substitute for a human, and probably wastes more
| effort than it saves, by a wide margin.
| hsbauauvhabzb wrote:
| And for any poor soul forced into an environment using
| AI-driven commit reviews, be sure to reply to the review
| comments with GPT output and continue the conversation,
| eventually bringing in some of the AI 'engineers' to help
| resolve the issue and waste their time back.
| dakshgupta wrote:
| I understand your skepticism and discomfort, and also agree
| that an LLM should not replace a human code reviewer.
|
| I would encourage you to try one (nearly all including ours
| have a free trial).
|
| When done right, they serve as a solid first pass and surface
| things that warrant a second look. Repeated code where there
| should be an abstraction, inconsistent patterns from other
| similar code elsewhere in the codebase, etc. Things that
| linters can't do.
|
| Would you not want your coworkers to have AI look at their PRs,
| have them address the relevant comments, and then pass it to
| you for review?
| smallpipe wrote:
| > Would you not want your coworkers to have AI look at their
| PRs, have them address the relevant comments
|
| God no. Review is where I teach more junior people about the
| code base. And where I learn from the more senior people.
| Either of us spending time making the AI happy just to be
| told we're solving the wrong problem is a ridiculous waste of
| time.
| makingstuffs wrote:
| Yeah it seems like the inverse of what is logical: Have a human
| review the code which _may_ have been written/influenced by an
| LLM. Having an LLM do code review seems about as useful as YOLO
| merging to main and praying to the Flying Spaghetti Monster...
| pcwelder wrote:
| A whole article without mentioning the name of the LLM. It's not
| like Sonnet and O1 have the same modalities.
|
| Anything you do today might become irrelevant tomorrow.
| AgentOrange1234 wrote:
| I'd be curious to hear more on Attempt 2. It sounds like the
| approach was basically to ask an LLM for a score for each
| comment. Adding specifics to this prompt might go a long way?
| Like, what specifically is the rationale for this change, is this
| likely to be a functional bug, is it a security issue, how does
| it impact maintainability over the long run, etc.; basically I
| wonder if asking about more specific criteria and trying to
| define what you mean by nits can help the LLM give you more
| reliable scores.
| dakshgupta wrote:
| That's an interesting point - we didn't try this. Now that you
| said that, I bet even clearly defining what each number on the
| scale means would help.
| iandanforth wrote:
| "As a last resort we tried machine learning ..."
|
| - Hilarious that a cutting edge solution (document embedding and
| search) from 5-6 years ago was their last resort.
|
| - Doubly hilarious that "throw more AI at it" surprised them when
| it didn't work.
| aarondia wrote:
| I find these retro blog posts from people building LLM solutions
| super useful for helping me try out new prompting, eval, etc.
| techniques. Which blog posts have you all found most useful?
| Would love a link.
| utdiscant wrote:
| "We picked the latter, which also gave us our performance metric
| - percentage of generated comments that the author actually
| addresses."
|
| This metric would go up if you leave almost no comments. Would it
| not be better to find a metric that rewards you for generating
| many comments which are addressed, not just having a high
| relevance?
|
| You even mention this challenge yourselves: "Sadly, even with all
| kinds of prompting tricks, we simply could not get the LLM to
| produce fewer nits without also producing fewer critical
| comments."
|
| If that was happening, that doesn't sound like it would be
| reflected in your performance metric.
| dakshgupta wrote:
| Good criticism that we should pay closer attention to. Someone
| else pointed this out too, and since then we've started
| tracking addressed comments per file changed as well.
| SomewhatLikely wrote:
| You could probably modify the metric to addressed comments per
| 1000 lines of code.
| lupire wrote:
| Instead of training an LLM to make comments, train the LLM to
| generate test cases and then guess the line of code that breaks
| the test.
| dakshgupta wrote:
| You might enjoy @goodside on Twitter. He's a prompt engineer at
| Scale AI and a lot of his observations and techniques are
| fascinating.
___________________________________________________________________
(page generated 2024-12-22 23:02 UTC)