[HN Gopher] How we made our AI code review bot stop leaving nitp...
       ___________________________________________________________________
        
       How we made our AI code review bot stop leaving nitpicky comments
        
       Author : dakshgupta
       Score  : 234 points
       Date   : 2024-12-18 16:31 UTC (4 days ago)
        
 (HTM) web link (www.greptile.com)
 (TXT) w3m dump (www.greptile.com)
        
       | pama wrote:
       | After failing with three reasonable ideas, they solved the
       | problem with an idea that previously would have seemed unlikely
       | to work (get similarity to past known nits and use a threshold of
       | 3 similar nits above a certain cutoff similarity score for
       | filtering). A lot of applications of ML have a similar hacky
       | trial and error flavor like that. Intuition is often built in
       | retrospect and may not transfer well to other domains. I would
       | have guessed that finetuning would also work, but agree with the
       | author that it would be more expensive and less portable across
       | different models.
        
         | dakshgupta wrote:
         | Since we posted this, two camps of people reached out:
         | 
         | Classical ML people who recommended we try training a
         | classifier, possibly on the embeddings.
         | 
         | Fine tuning platforms that recommended we try their platform.
         | 
         | The challenge there would be gathering enough data _per
         | customer_ to meaningfully capture their definition and standard
         | for nit-pickiness.
        
           | pama wrote:
           | What you use now is a simple KNN classifier, and if it works
           | well enough, perhaps no need to go much further. If you need
           | to squeeze out a couple additional percentage points maybe
           | try a different simple and robust ML classifier (random
           | forest, xgboost, or a simple two layer network). All these
           | methods, including your current classifier, will get better
           | with additional data and minor tuning in the future.
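            | 
            | A minimal sketch of that kind of baseline, assuming you
            | already have one embedding and one downvote label per past
            | comment (the file names and F1 metric are placeholders):
            | 
            |   import numpy as np
            |   from sklearn.neighbors import KNeighborsClassifier
            |   from sklearn.model_selection import cross_val_score
            | 
            |   # Assumed inputs: embedding per past comment, and a 0/1
            |   # label for whether the team downvoted it as a nit.
            |   X = np.load("comment_embeddings.npy")  # (n, dim)
            |   y = np.load("comment_was_nit.npy")     # (n,)
            | 
            |   knn = KNeighborsClassifier(n_neighbors=3,
            |                              metric="cosine")
            |   print(cross_val_score(knn, X, y, cv=5,
            |                         scoring="f1").mean())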
        
             | dakshgupta wrote:
             | Thank you, I will try this. I suspect we can extract some
             | universal theory of nits and have a base filter to start
             | with, and have it learn per-company preferences on top of
             | that.
        
               | keepingscore wrote:
                | You should be able to do that already by taking all of
                | your customers' nit embeddings and averaging them to
                | produce a point in space that represents the universal
                | nit. Embeddings are really cool, and the fact that they
                | still work when averaged is one of their nicer
                | properties.
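                | 
                | A minimal sketch of that averaging trick, assuming
                | unit-normalized embeddings (the 0.80 cutoff is an
                | arbitrary placeholder):
                | 
                |   import numpy as np
                | 
                |   # Assumed input: downvoted-nit embeddings from every
                |   # customer, stacked into one L2-normalized array.
                |   nits = np.load("all_customer_nits.npy")
                | 
                |   universal_nit = nits.mean(axis=0)
                |   universal_nit /= np.linalg.norm(universal_nit)
                | 
                |   def looks_like_a_nit(emb, threshold=0.80):
                |       # cosine similarity to the averaged centroid
                |       return float(emb @ universal_nit) >= threshold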
        
               | dakshgupta wrote:
               | This is a cool idea - I'll try this and add it as an
               | appendix to this post.
        
         | olddustytrail wrote:
         | I don't think the prompt was reasonable. Nits are the eggs of
         | headlice. Expecting the LLM to guess they meant superficial
         | issues, rather than just spelling that out, is a bit weird.
         | 
         | It's like saying "don't comment on elephants" when you mean
         | don't comment on the obvious stuff.
        
           | whitfin wrote:
           | Yeah, even the title here says "nitpicky". I'm not sure if
           | they tried "nitpicks" instead of "nits", but I don't know why
           | you wouldn't...
        
           | tayo42 wrote:
            | Do LLMs struggle with other homonyms?
        
             | Eisenstein wrote:
             | They can, but I think the parent is saying that when you
             | are crafting an instruction for someone or something to
             | follow, you should be as direct and as clear as possible in
             | order to remove the need for the actor to have to intuit a
             | decision instead of act on certainty.
             | 
             | I would start with something like: 'Avoid writing comments
             | which only concern stylistic issues or which only contain
             | criticisms considered pedantic or trivial.'
        
       | Nullabillity wrote:
       | [flagged]
        
         | dakshgupta wrote:
         | I would say they are useful but they aren't magic (at least
         | yet) and building useful applications on top of them requires
         | some work.
        
           | WesolyKubeczek wrote:
           | Not only do LLMs require some work to build anything useful,
           | they are also fuzzy and nondeterministic, requiring you to
           | tweak your prompting tricks once in a while, and they are
           | also quite expensive.
           | 
           | Nothing but advantages.
        
         | dang wrote:
         | " _Please don 't post shallow dismissals, especially of other
         | people's work. A good critical comment teaches us something._"
         | 
         | https://news.ycombinator.com/newsguidelines.html
        
       | XenophileJKO wrote:
       | Unless they were using a model too stupid for this operation, the
       | fundamental problem is usually solvable via prompting.
       | 
       | Usually the issue is that the models have a bias for action, so
        | you need to give it an acceptable action when there isn't a good
       | comment. Some other output/determination.
       | 
       | I've seen this in many other similar applications.
        
         | marviel wrote:
         | +1 for "No Comment/Action Required" responses reducing trigger-
         | happiness
        
         | keepingscore wrote:
          | This isn't my post, but I work with LLMs every day and I've
          | seen this kind of instruction-ignoring behavior on Sonnet
          | when the context window starts getting close to the edge.
        
           | XenophileJKO wrote:
           | I don't know, in practice there are so many potential causes
           | that you have to look case by case in situations like that. I
           | don't have a ton of experience with the raw Claude model
           | specifically, but would anticipate you'll have the same
           | problem classes.
           | 
           | Usually it comes down to one of the following:
           | 
           | - ambiguity and semantics (I once had a significant behavior
           | difference between "suggest" and "recommend", i.e. a model
           | can suggest without recommending.)
           | 
           | - conflicting instructions
           | 
           | - data/instruction bleeding (delimiters help, but if the span
            | is too long it can lose track of what is data and what is
           | instructions.)
           | 
           | - action bias (If the task is to find code comments for
           | example, even if you tell it not to, it will have a bias to
           | do it as you defined the task that way.)
           | 
           | - exceeding attention capacity (having to pay attention to
           | too much or having too many instructions. This is where
            | structured output or chain-of-thought approaches help.
           | They help focus attention on each step of the process and the
           | related rules.)
           | 
           | I feel like these are the ones you encounter the most.
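            | 
            | A minimal sketch of the delimiter and action-bias points,
            | with invented tags and a NO_COMMENT escape hatch:
            | 
            |   RULES = """You are reviewing a code diff.
            |   Only flag issues affecting correctness or security.
            |   If nothing qualifies, reply exactly: NO_COMMENT"""
            | 
            |   def build_prompt(diff: str) -> str:
            |       # Delimiters keep instructions and data separate;
            |       # NO_COMMENT gives the model an acceptable action
            |       # when there is nothing worth saying.
            |       return ("<instructions>\n" + RULES +
            |               "\n</instructions>\n"
            |               "<diff>\n" + diff + "\n</diff>")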
        
             | dakshgupta wrote:
              | One-word changes impacting output are interesting but
              | also quite frustrating, especially because the patterns
              | don't translate across models.
        
               | gopher_space wrote:
               | The "write like the people who wrote the info you want"
               | pattern absolutely translates across models.
        
               | sdesol wrote:
               | Yes and no. I've found the order in which you give
               | instructions matters for some models as well. With LLMs,
               | you really need to treat them like black boxes and you
               | cannot assume one prompt will work for all. It is
               | honestly, in my experience, a lot of trial and error.
        
           | dakshgupta wrote:
           | This might be what we experienced. We regularly have context
           | reach 30k+ tokens.
        
         | NBJack wrote:
         | Prompting only works if the training corpus used for the LLM
         | has critical mass on the concept. I agree that the model should
         | have alternatives (i.e. preface the comment with the tag
         | 'Nit:'), but bear in mind code review nit-picking isn't exactly
         | a well-researched area of study. Also to your point, given this
         | was also likely trained predominantly on code, there's probably
         | a lack of comment sentiment in the corpus.
        
       | dbetteridge wrote:
       | Prompt:
       | 
        | If the comment could be omitted without affecting the code's
        | functionality, but is stylistic or otherwise can be ignored, then
       | preface the comment with
       | 
       | NITPICK
       | 
       | I'm guessing you've tried something like the above and then
       | filtering for the preface, as you mentioned the llm being bad at
       | understanding what is and isn't important.
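        | 
        | The filtering half of that is then trivial; a sketch, assuming
        | one generated comment per string:
        | 
        |   def drop_nitpicks(comments: list[str]) -> list[str]:
        |       # Keep only comments the model did not tag as
        |       # stylistic; assumes the prompt asked it to preface
        |       # such comments with the literal token NITPICK.
        |       return [c for c in comments
        |               if not c.lstrip().startswith("NITPICK")]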
        
         | abecedarius wrote:
         | Nitpick: I'd ask for NITPICK at the end of output instead of
         | the start. The model should be in a better place to make that
         | decision there.
        
           | ivanjermakov wrote:
           | I find it ironic how capable LLMs have become, but they're
           | still struggling with things like this. Great reminder that
           | it's still text prediction at its core.
        
             | spott wrote:
             | I mean, think of LLM output as unfiltered thinking. If you
             | were to make that determination would you make it before
             | you had thought it through?
        
             | Eisenstein wrote:
             | I find it more helpful to keep in mind the autoregressive
             | nature rather than the prediction. 'Text prediction' brings
             | to mind that it is guessing what word would follow another
             | word, but it is doing so much more than that.
             | 'Autoregressive' brings to mind that it is using its
             | previously generated output to create new output every time
             | it comes up with another token. In that case you
             | immediately understand that it would have to make the
             | determination of severity after it has generated the
             | description of the issue.
        
         | dakshgupta wrote:
         | We tried this too, better but not good enough. It also often
         | labeled critical issues as nitpicks, which is unacceptable in
         | our context.
        
           | firesteelrain wrote:
           | Then some of the nitpicks were actually valid?
           | 
           | What about PossiblyWrong, then?
        
         | fnqi8ckfek wrote:
         | My boss can't stop nitpicking, and I've started telling him
         | that "if your comment doesn't affect the output I'm ignoring
         | it".
        
           | marcandre wrote:
            | That is such an over-simplistic criterion! Reductio ad
            | absurdum: pass your code through an obfuscator / simplifier and
           | the output won't be affected.
        
             | fnqi8ckfek wrote:
             | Ok?
             | 
              | So if the boss's nitpick is "you must pass your code
              | through an obfuscator that doesn't affect the output",
              | I'm going to
             | ignore it. I don't understand what point you're trying to
             | make.
        
         | seb1204 wrote:
          | My thinking too. Or, similarly, including some coding-style
          | and commenting-style rules in the repository so they become
          | part of the codebase and get considered by the LLM.
        
       | throw310822 wrote:
       | Seems easier than getting the same from my colleagues.
        
         | mr_toad wrote:
         | I wonder if it was trained on a lot of nitpicking comments from
         | human reviews.
        
       | iLoveOncall wrote:
       | It's funny because I wouldn't consider the comment that they
       | highlight in their post as a nitpick.
       | 
       | Something that has an impact on the long term maintainability of
        | code is definitely not nitpicky, and in the majority of cases
        | defining a type fits this category as it makes refactors and
       | extensions MUCH easier.
       | 
       | On top of that, I think the approach they went with is a huge
       | mistake. The same comment can be a nitpick on one CR but crucial
       | on another, clustering them is destined to result in false-
       | positives and false-negatives.
       | 
       | I'm not sure I'd want to use a product to review my code for
       | which 1) I cannot customize the rules, 2) it seems like the rules
       | chosen by the creators are poor.
       | 
       | To be honest I wouldn't want to use any AI-based code reviewer at
       | all. We have one at work (FAANG, so something with a large
       | dedicated team) and it has not once produced a useful comment and
       | instead has been factually wrong many times.
        
         | dakshgupta wrote:
         | This is an important point - there is no universal
          | understanding of nitpickiness. It is why we have it learn
          | every new customer's ways from scratch.
        
           | mannykannot wrote:
           | This does not address the issue raised in iLoveOncall's third
           | paragraph: "the same comment can be a nitpick on one CR but
           | crucial on another..." In "attempt 2", you say that "the LLMs
           | judgment of its own output was nearly random", which raises
           | questions that go well beyond just nitpicking, up to that of
           | whether the current state of the art in LLM code review is
           | fit for much more than ticking the box that says "yes, we are
           | doing code review."
        
             | lupire wrote:
             | If you are using an LLM for judgment, you are using it
             | wrong. An LLM is good for generating suggestions,
             | brainstorming, not making judgments.
             | 
             | That's why it is called _Generative_ AI.
        
               | mannykannot wrote:
               | Indeed - and if you are doing code review without
               | judgement, you are doing it wrong.
        
         | jbirer wrote:
         | Two theories:
         | 
         | 1) This is an ego problem. Whoever is doing the development
         | cannot handle being called out on certain software architecture
         | / coding mistakes, so it becomes "nitpicking".
         | 
         | 2) The software shop has a "ship out faster, cut corners"
         | culture, which at that point might as well turn off the AI
         | review bot.
        
           | AgentOrange1234 wrote:
           | 1 is super interesting.
           | 
           | I've found it's nice to talk to an LLM about personal issues
           | because I know it's not a real person judging me. Maybe if
           | the comments were kept private with the dev, it'd be more
           | just a coaching tool that didn't feel like a criticism?
        
             | Comfy-Tinwork wrote:
             | Private with the Dev, the operator & everyone who uses any
             | product of the training data they build.
        
           | trash_cat wrote:
           | This is a company culture problem.
        
       | kgeist wrote:
       | What about false positives?
       | 
       | As I see it, the solution assumes the embeddings only capture the
       | form: say, if developers previously downvoted suggestions to wrap
       | code in unnecessary try..catch blocks, then similar suggestions
       | will be successfully blocked in the future, regardless of the
       | module/class etc. (i.e. a kind of generalization)
       | 
       | But what if enough suggestions regarding class X (or module X)
       | get downvoted, and then the mechanism starts assuming class
       | X/module X doesn't need review at all? I mean the case when a lot
       | of such embeddings end up clustering around the class itself (or
       | a function), not around the general form of the comment.
       | 
        | How do you prevent this? Or is it unlikely to happen? The only
       | metric I've found in the article is the percentage of addressed
       | suggestions that made it to the end user.
        
         | dakshgupta wrote:
         | This is the biggest pitfall of this method. It's partially
         | combatted by also comparing it against an upvoted set, so if a
         | type of comment has been upvoted and downvoted in the past, it
         | is not blocked.
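          | 
          | Roughly, a sketch of such a filter, assuming unit-normalized
          | embeddings (the cutoff value is a placeholder, not the
          | production number):
          | 
          |   import numpy as np
          | 
          |   def should_block(emb, downvoted, upvoted,
          |                    cutoff=0.85, min_matches=3):
          |       # emb: embedding of the candidate comment;
          |       # downvoted/upvoted: arrays of past comment
          |       # embeddings for this customer.
          |       down_hits = int(np.sum(downvoted @ emb >= cutoff))
          |       up_hits = int(np.sum(upvoted @ emb >= cutoff))
          |       # Block only if it resembles several past nits
          |       # and nothing similar was ever upvoted.
          |       return down_hits >= min_matches and up_hits == 0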
        
       | Falimonda wrote:
       | fwiw, not all the mobile site's menu items work when clicked
        
       | panarchy wrote:
       | I wonder how long until we end up with questionable devs making
       | spurious changes just to try and game the LLM output to give them
       | a pass.
        
         | johnfn wrote:
         | I already have devs doing this to me at work and I'm not even
         | an AI
        
         | dakshgupta wrote:
         | You could include a comment that says "ignore that I did
         | ______" during review. As long as a human doesn't do the second
         | pass (we recommend they do), that should let you slip your code
         | by the AI.
        
       | untech wrote:
       | I am not sure about using UPPERCASE in prompts for emphasis. I
       | feel intuitively that uppercase is less "understandable" for LLMs
       | because it is more likely to be tokenized as a sequence of
       | characters. I have no data to back this up, though.
        
         | dakshgupta wrote:
         | I meant for that to be more an illustration - might do a longer
         | post about the specific prompting techniques we tried.
        
       | pedrovhb wrote:
       | Here's an idea: have the LLM output each comment with a
       | "severity" score ranging from 0-100 or maybe a set of possible
       | values ("trivial", "small", "high"). Let it get everything off of
       | its chest outputting the nitpicks but recognizing they're minor.
       | Filter the output to only contain comments above a given
       | threshold.
       | 
       | It's hard to avoid thinking of a pink elephant, but easy enough
       | to consciously recognize it's not relevant to the task at hand.
        
         | iLoveOncall wrote:
         | Here's an idea: read the article and realize they already tried
         | exactly that.
        
         | zahlman wrote:
         | The article authors tried this technique and found it didn't
         | work very well.
        
       | jumploops wrote:
       | We do something internally[0] but specifically for security
       | concerns.
       | 
        | We've found that by having the LLM provide a "severity" level
       | (simply low, medium, high), we're able to filter out all the
       | nitpicky feedback.
       | 
       | It's important to note that this severity level should be
       | specified at the end of the LLM's response, not the beginning or
       | middle.
       | 
       | There's still an issue of context, where the LLM will provide a
       | false positive due to unseen aspects of the larger system (e.g.
       | make sure to sanitize X input).
       | 
       | We haven't found the bot to be overbearing, but mostly because we
       | auto-delete past comments when changes are pushed.
       | 
       | [0]
       | https://magicloops.dev/loop/3f3781f3-f987-4672-8500-bacbeefc...
        
         | dakshgupta wrote:
         | The severity needing to be at the end was an important insight.
         | It made the results much better but not quite good enough.
         | 
          | We had it output a JSON object with fields {comment: string,
          | severity: string}, in that order.
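          | 
          | A sketch of the filtering side, assuming the model returns
          | that JSON shape with low/medium/high severities (which
          | levels to keep is a judgment call):
          | 
          |   import json
          | 
          |   def keep_comment(raw: str,
          |                    keep=("medium", "high")) -> bool:
          |       # The schema puts "comment" before "severity", so
          |       # the severity token is generated only after the
          |       # comment text itself has been written out.
          |       data = json.loads(raw)
          |       return data.get("severity", "").lower() in keep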
        
           | Merik wrote:
           | Another variation on this is to think about tokens and
           | definitions. Numbers don't have inherent meaning for your use
           | case, so if you use numbers you need to provide an explicit
           | definition of each rating number in the prompt. Similarly,
            | and more effective, is to use labels such as low-quality,
           | medium-quality, high-quality, and again providing an explicit
           | definition of the label; one step further is to use explicit
           | self describing label (along with detailed definition) such
           | as "trivial-observation-on-naming-convention" or "insightful-
           | identification-on-missed-corner-case".
           | 
           | Effectively you are turning a somewhat arbitrary numeric
           | "rating" task , into a multi label classification problem
           | with well defined labels.
           | 
           | The natural evolution is to then train a BERT based
           | classifier or similar on the set of labels and comments,
           | which will get you a model judge that is super fast and can
           | achieve good accuracy.
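            | 
            | A sketch of what those self-describing labels might look
            | like in the prompt (the label names here are invented):
            | 
            |   LABELS = {
            |       "trivial-style-or-naming-note":
            |           "stylistic preference; code works either way",
            |       "missed-corner-case":
            |           "an input or state the change mishandles",
            |       "likely-bug":
            |           "behavior probably differs from the intent",
            |   }
            | 
            |   def label_definitions() -> str:
            |       # Inline the definitions so the model classifies
            |       # against named categories rather than guessing
            |       # what a bare number is supposed to mean.
            |       lines = [f"- {k}: {v}" for k, v in LABELS.items()]
            |       return "\n".join(lines)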
        
       | nikolayasdf123 wrote:
       | > $0.45/file capped at $50/dev/month
       | 
        | Wow, this is really expensive... especially given that the core
        | of this technology is open source and target customers can set
        | it up themselves, self-hosted.
        
         | MacsHeadroom wrote:
         | A cap of less than 1% of an average developer's pay is "really
         | expensive"?
        
           | Arainach wrote:
           | For a glorified linter? Absolutely.
           | 
           | For reference, IntelliJ Ultimate - a full IDE with leading
           | language support - costs that much.
        
           | nikolayasdf123 wrote:
            | Who is this average developer? Layoffs have been happening
            | left and right for the last 4 years, startups are getting
            | shut down, funding for gov contracts is getting cut, and it
            | is increasingly hard to find any job in software. Many guys
            | I know are barely making it and would rather spend that
            | money on their child or on living expenses. "Average
            | developers" in, say, China, the Philippines, India, Vietnam,
            | or CIS countries make way too little to un-frugally spend
            | it on a 50 USD subscription for something that may not even
            | work well and may require a human anyway. Paying 0.4 USD
            | "per file" is just ridiculous; this pricing immediately puts
            | people off and is a non-starter.
           | 
            | UPD: oh, you mean management will fire SWEs and replace
            | them with this? Well, yeah, then it makes sense to them.
            | But the quality has to be good, and even then many mid- to
            | large-size orgs I know are cutting every subscription they
            | possibly can (particularly per-developer or per-box ones,
            | e.g. Microsoft, Datadog, etc.), so even for them cost is of
            | importance.
        
       | callamdelaney wrote:
       | Does it work on real engineers?
        
         | defrost wrote:
         | A quarter of real engineers are already Civil.
        
         | aeve890 wrote:
         | What do real engineers have to do with software? /jk
        
       | Havoc wrote:
       | Think it would initially have gone better had they not used
       | ,,nits" but rather nitpicks. ie something that's in the
       | dictionary that the chatbot is likely to understand
        
         | kgeist wrote:
         | They should have also added some basic chain-of-thought
         | reasoning in the prompt + asked it to add [nitpick] tag at the
         | end, so that the likelihood of adding it correctly increased
         | due to in-context learning. Then another pass removes all the
         | nitpicks and internal reasoning steps.
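          | 
          | A sketch of that second pass, assuming the reasoning is
          | wrapped in a <reasoning> tag and nitpicks end with a
          | [nitpick] tag (both conventions invented here):
          | 
          |   def strip_internal(comment: str) -> str | None:
          |       # Drop tagged nitpicks entirely; otherwise drop the
          |       # trailing reasoning block and keep the comment.
          |       if comment.rstrip().endswith("[nitpick]"):
          |           return None
          |       return comment.split("<reasoning>")[0].rstrip()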
        
       | tayo42 wrote:
       | > Attempt 2: LLM-as-a-judge
       | 
        | Wouldn't this be achievable with a classifier model? Maybe even
        | a combo of getting the embedding and then putting it through a
        | classifier? Kind of like how GANs work.
       | 
       | Edit: I read the article before the comment section, silly me lol
        
       | just-another-se wrote:
       | And how do you guys deal with a cold start problem then? Suppose
       | the repo is new and has very few previous comments?
        
         | croemer wrote:
         | Do pooling with the average embedding of all customers
        
         | dakshgupta wrote:
         | We do use some other techniques to filter out comments that are
         | almost always considered useless by teams.
         | 
         | For the typical team size that uses us (at least 20+ engineers)
         | the number of downvotes gets high enough to show results within
         | a workday or two, and achieves something of a stable state
         | within a week.
        
       | dcreater wrote:
       | Are you still looking for slaves, err I mean employees, who are
       | going to work 80+hrs a week? Do you sponsor visas?
        
       | jerrygoyal wrote:
       | I'm in the market for PR review bots, as the nitpicking issue is
       | real. So far, I have tried Coderabbit, but it adds too much noise
       | to PRs, and only a very small percentage of comments are actually
       | useful. I specifically instructed it to ignore nitpicks, but it
       | still added such comments. Their cringy ASCII art comments make
       | it even harder to take them seriously.
       | 
       | I recently signed up for Korbit AI, but it's too soon to provide
       | feedback. Honestly, I'm getting a bit fed up with experimenting
       | with different PR bots.
       | 
       | Question for the author: In what ways is your solution better
       | than Coderabbit and Korbit AI?
        
         | dakshgupta wrote:
         | I haven't explored Korbit but with CodeRabbit there are a
         | couple of things:
         | 
         | 1. We are better at full codebase context, because of how we
         | index the codebase like a graph and use graph search and an LLM
         | to determine what other parts of the codebase should be taken
         | into consideration while reviewing a diff.
         | 
         | 2. We offer an API you can use to build custom workflows, for
         | example every time a test fails in your pipeline you can pass
         | the output to Greptile and it will diagnose with full codebase
         | context.
         | 
         | 3. Customers of ours that switched from CodeRabbit usually say
          | Greptile has far fewer and less verbose comments. This is
          | subjective, of course.
         | 
         | That said, CodeRabbit is cheaper.
         | 
          | Both products have free trials though, so I would recommend
         | trying both and seeing which one your team prefers, which is
         | ultimately what matters.
         | 
          | I'm happy to answer more questions or do a demo for your team
          | too -> daksh@greptile.com
        
         | swells34 wrote:
         | Isn't code review the one area you'd _never_ want to replace
         | the programmer in? Like, that is the most important step of the
          | process; it's where we break down the logic and sanity check
         | the implementation. That is expressly where AI is the weakest.
        
           | imoverclocked wrote:
           | Many code review products are not AI. They are often
           | collections of rules that have been specifically crafted to
           | catch bad patterns. Some of the rules are stylistic in nature
           | and those tend to turn people away the fastest IMHO.
           | 
           | Many of these tools can be integrated into a local workflow
           | so they will never ping you on a PR but many developers
           | prefer to just write some code and let the review process
           | suss things out.
        
           | yorwba wrote:
           | You probably shouldn't take a code review bot's "LGTM!" at
           | face value, but if it tells you there is an issue with your
           | code and it is indeed an issue with your code, it has
           | successfully subbed in for your coworker who didn't get
           | around to looking at it yet.
        
             | imiric wrote:
             | > if it tells you there is an issue with your code and it
             | is indeed an issue with your code
             | 
             | That's the crucial bit. If it ends up hallucinating issues
             | as LLMs tend to do[1], then it just adds more work for
             | human developers to confirm its report.
             | 
             | Human devs can create false reports as well, but we're more
             | likely to miss an issue than misunderstand and be confident
             | about it.
             | 
             | [1]:
             | https://www.theregister.com/2024/12/10/ai_slop_bug_reports/
        
           | dakshgupta wrote:
           | I agree - at least with where technology is today, I would
           | strongly discourage replacing human code review with AI.
           | 
           | It does however serve as a good first pass, so by the time
           | the human reviewer gets it, the little things have been
           | addressed and they can focus on broader technical decisions.
           | 
           | Would you want your coworkers to run their PRs through an AI
           | reviewer, resolve those comments, then send it to you?
        
             | AgentOrange1234 wrote:
             | This feels like the right take.
             | 
             | Once we have an AI starting to do something, there's an art
             | to gradually adopting it in a way that makes sense.
             | 
             | We don't lose the ability to have humans review code. We
             | don't have to all use this on every PR. We don't have to
             | blindly accept every comment it makes.
        
           | jerrygoyal wrote:
           | No one is blindly merging PR after AI Code Review. Think of
           | review bot as an extra pair of eyes.
        
             | bobnamob wrote:
             | Yet.
             | 
             | The things I've seen juniors try to do because "the AI said
             | so" is staggering. Furthermore, I have little faith that
             | the average junior gets competent enough teaching/mentoring
             | that this attitude won't be a pervasive problem in 5ish
             | years.
             | 
             | Admittedly, maybe I'm getting old and this is just another
             | "darn kids these days" rant
        
           | hansvm wrote:
           | If you think of the "AI" benefit as more of a linter then the
           | value starts to become clear.
           | 
           | E.g., I've never hesitated to add a linting rule requiring
           | `except Exception:` in lieu of `except:` in Python, since the
           | latter is very rarely required (so the necessary incantations
           | to silence the linter are cheap on average) and since the
           | former is almost always a bug prone to making
           | shutdown/iteration/etc harder than they should be. When I add
           | that rule at new companies, ~90% of the violations are latent
           | bugs.
           | 
           | AI has the potential to (though I haven't seen it work well
           | yet in this capacity) lint most such problematic patterns. In
           | an ideal world, it'd even have the local context to know
           | that, e.g., since you're handling a DB transaction and
           | immediately re-throwing the raw `except:` is appropriate and
           | not even flag it in the "review" (the AI linting), reducing
           | the false positive rate. You'd still want a human review, but
           | you could avoid bugging a human till the code is worth
           | reviewing, or you could let the human focus on things that
           | matter instead of harping on your use of low-level atomic
           | fences yet again. AI has potential to improve the transition
           | from "code complete" to "shipped and working."
        
             | siscia wrote:
             | Wouldn't be simpler to just use a linter then?
        
               | stavros wrote:
               | It would be simpler, but it wouldn't be as effective.
        
               | abenga wrote:
               | How so? The build fails if linter checks don't pass. What
               | does an ai add to this?
        
               | stavros wrote:
               | Much more nuanced linting rules.
        
               | fzeroracer wrote:
               | What exactly is gained from these nuanced linting rules?
               | Why do they need to exist? What are they actually doing
               | in your codebase other than sucking up air?
        
               | stavros wrote:
               | Are you arguing against linters? These disingenuous
               | arguments are just tiring, you either accept that linters
               | can be beneficial, and thus more nuanced rules are a good
               | thing, or you believe linters are generally useless. This
               | "LLM bad!" rhetoric is exhausting.
        
               | fzeroracer wrote:
               | No. And speaking of disingenuous arguments, you've made
               | an especially absurd one here.
               | 
               | Linters are useful. But you're arguing that you have such
               | complex rules that they cannot be performed by a linter
               | and thus must be offloaded to an LLM. I think that's
               | wrong, and it's a classical middle management mistake.
               | 
               | We can all agree that some rules are good. But that does
               | not mean more rules are good, nor does it mean more
               | complex rules are good. Not only are you integrating a
               | whole extra system to support these linting rules, you
               | are doing so for the sake of adding even more complex
               | linting rules that cannot be automated in a way that
               | prevents developer friction.
        
               | stavros wrote:
               | If you think "this variable name isn't descriptive
               | enough" or "these comments don't explain the 'why' well
               | enough" is constructive feedback, then we won't agree,
               | and I'm very curious as to what your PR reviews look
               | like.
        
               | hansvm wrote:
               | Linters suffer from a false positive/negative tradeoff
               | that AI can improve. If they falsely flag things then
               | developers tend to automatically ignore or silence the
               | linter. If they don't flag a thing then ... well ... they
               | didn't flag it, and that particular burden is pushed to
               | some other human reviewer. Both states are less than
               | ideal, and if you can decrease the rate of them happening
               | then the tool is better in that dimension.
               | 
               | How does AI fit into that picture then? The main benefits
               | IMO are the abilities to (1) use contextual clues, (2)
               | process "intricate" linting rules (implicitly, since it's
               | all just text for the LLM -- this also means you can
               | process many more linting rules, since things too
               | complicated to be described nicely by the person writing
               | a linter without too high of a false positive rate are
               | unlikely to ever be introduced into the linter), and (3)
               | giving better feedback when rules are broken. Some
               | examples to compare and contrast:
               | 
               | For that `except` vs `except Exception:` thing I
               | mentioned, all a linter can do is check whether the
               | offending pattern exists, making the ~10% of proper use
               | cases just a little harder to develop. A smarter linter
               | (not that I've seen one with this particular rule yet)
               | could allow a bare `except:` if the exception is always
               | re-raised (that being both the normal use-case in DB
               | transaction handling and whatnot where you might
               | legitimately want to catch everything, and also a coding
               | pattern where the practice of catching everything is
               | unlikely to cause the bugs it normally does). An AI
               | linter can handle those edge cases automatically, not
               | giving you spurious warnings for properly written DB
               | transaction handling. Moreover, it can suggest a
               | contextually relevant proper fix (`except BaseException:`
               | to indicate to future readers that you considered the
               | problems and definitely want this behavior, `except
               | Exception:` to indicate that you do want to catch
               | "everything" but without weird shutdown bugs, `except
               | SomeSpecificException:` because the developer was just
               | being lazy and would have accidentally written a new bug
               | if they caught `Exception` instead, or perhaps just
               | suggesting a different API if exceptions weren't a
               | reasonable way to control the flow of execution at that
               | point).
               | 
               | As another example, you might have a linting rule banning
               | low-level atomics (fences, seq_cst loads, that sort of
               | thing). Sometimes they're useful though, and an AI linter
               | could handle the majority of cases with advice along the
               | lines of "the thing you're trying to do can be easily
               | handled with a mutex; please remove the low-level
               | atomics". Incorporating the context like that is
               | impossible for normal linters.
               | 
               | My point wasn't that you're replacing a linter with an
               | AI-powered linter; it's that the tool generates the same
               | sort of local, mechanical feedback a linter does -- all
               | the stuff that might bog down a human reviewer and keep
               | them from handling the big-picture items. If the tool is
               | tuned to have a low false-positive rate then almost any
               | advice it gives is, by definition, an important
               | improvement to your codebase. Human reviewers will still
               | be important, both in catching anything that slips
               | through, and with the big-picture code review tasks.
        
           | eterm wrote:
           | Code review is also one of the last places to get adequate
           | knowledge-sharing into many processes which have gone fully
           | remote / asynchronous.
        
       | righthand wrote:
       | Our AI code review bot (Codacy) is just an LLM that compiles all
       | linter rules and might be the most annoying useless thing. For
       | example it will ding your PR for not considering Opera browser
       | limitations on a backend NodeJS PR.
       | 
       | Furthermore most of the code reviews I perform, rarely do I ever
       | really leave commentary. There are so many frameworks and
       | libraries today that solve whatever problem, unless someone adds
       | complex code or puts a file in a goofy spot, it's an instant
       | approval. So an AI bot doesn't help something which is a minimal
       | non-problem task.
        
         | mcbishop wrote:
         | Some have a smaller tighter codebase... where it's realistic to
         | pursue consistent application of an internal style guide (which
         | current AI seems well suited to help with (or take ownership
         | of)).
        
           | fallingknife wrote:
           | A linter is well suited for this. If you have style rules too
           | complex for a linter to handle I think the problem is the
           | rules, not the enforcement mechanism.
        
           | doikor wrote:
           | Code style is something that should be enforceable with
           | linters and formatters for the most part without any need for
           | an AI.
        
             | lupire wrote:
             | That's not what style means.
        
         | Spivak wrote:
         | I've honestly considered just making a script that hits the
         | checkbox automatically because we have to have "code review"
         | for compliance but 99% of the time I simply don't care. Is the
         | change you made gonna bring down prod, no, okay ship it.
         | 
         | At some level if I didn't trust you to not write shitty code
         | you wouldn't be on our team. I don't think I want to go all the
         | way and say code review is a smell but not really needing it is
         | what code ownership, good integration tests, and good qa buys
         | you.
        
           | pacifika wrote:
            | Code review is a good onboarding and upskilling tool, as a
           | reviewer.
        
           | rgbrgb wrote:
           | How do you bootstrap the trust without code review? Always
           | confused how teams without review onboard someone new.
        
             | ajmurmann wrote:
             | Pair-programming is one option. It might seem expensive at
             | first glance but usually both parties get a lot of value
             | from it (new possible ideas for the teacher if the learner
             | isn't entirely junior and obviously quick learning for the
             | new person)
        
             | criley2 wrote:
             | I think you just accept the risk, same as not doing code
             | review. If you're doing something Serious(tm) then you
             | follow the serious rules to protect your serious business.
             | If it doesn't actually matter that much if prod goes down
             | or you release a critical vulnerability or leak client data
             | or whatever, then you accept the risk. It's cheaper to push
             | out code quickly and cleanup the mess as long as
             | stakeholders and investors are happy.
        
           | Tainnor wrote:
           | > At some level if I didn't trust you to not write shitty
           | code you wouldn't be on our team.
           | 
           | Code review isn't about preventing "shitty code" because you
           | don't trust your coworkers.
           | 
           | Of course, if 99% of the time you don't care about the code
           | review then it's not going to be an effective process, but
           | that's a self-fulfilling prophecy.
        
         | jahnu wrote:
         | Our approach is to leave PR comments as pointers in the areas
         | where the reviewer needs to look, where it is most important,
         | and they can forget the rest. I want good code reviews but
         | don't want to waste anyone's time.
        
           | minasmorath wrote:
            | To be honest, I tried to work with GitHub Copilot to see
           | if it could help junior devs focus in on the important parts
           | of a PR and increase their efficiency and such, but... I've
           | found it to be worse than useless at identifying where the
           | real complexities and bug opportunities actually are.
           | 
           | When it does manage to identify the areas of a PR with the
           | highest importance to review, other problematic parts of the
           | PR will go effectively unreviewed, because the juniors are
           | trusting that the AI tool was 100% correct in identifying the
           | problematic spots and nothing else is concerning. This is
           | partially a training issue, but these tools are being
           | marketed as if you can trust them 100% and so new devs
           | just... do.
           | 
           | In the worst cases I see, which happen a good 30% of the time
           | at this point based on some rough napkin math, the AI directs
           | junior engineers into time-wasting rabbit holes around things
           | that are actually non-issues while actual issues with a PR go
           | completely unnoticed. They spend ages writing defensive code
           | and polyfills for things that the framework and its included
            | polyfills already 100% cover. But of course, that code was
           | usually AI generated too, and it's incomplete for the case
           | they're trying to defend against anyway.
           | 
           | So IDK, I still think there's value in there somewhere, but
           | extracting it has been an absolute nightmare for me.
        
           | righthand wrote:
           | I do this as well for my PRs, it's a good way to leave a
           | little explainer on the code especially with large merge
           | requests.
        
         | oakejp12 wrote:
         | Are the junior programmers that you hire that good that you
         | don't need training or commentary? I find I spend a lot of time
         | reviewing MRs and leaving commentary. For instance, this past
         | week, I had a junior apply the same business logic across
         | classes instead of creating a class/service and injecting it as
         | a dependency.
         | 
         | This, improper handling of exceptions, missed testing cases or
         | no tests at all, incomplete types, a misunderstanding of a
         | nuanced business case, etc. An automatic approval would leave
         | the codebase in such a dire state.
        
       | anonzzzies wrote:
       | We found that in general it's pretty hard to make the llm just
       | stop without human intervention. You can see with things like
       | Cline that if the llm has to check its own work, it'll keep
       | making 'improvements' in a loop; removing all comments, adding
       | all comments etc. It needs to generate something and seems overly
       | helpful to give you something.
        
         | seb1204 wrote:
          | Would the LLM make fewer nitpicking comments if the code base
         | included coding style and commenting rules in the repository?
        
       | extr wrote:
       | I'm surprised the authors didn't try the "dumb" version of the
        | solution they went with: instead of using fancy cosine
        | similarity to create implicit clusters, just ask it to classify
        | the comment
       | along a few dimensions and then do your own filtering on that (or
       | create your own 0-100 scoring!) Seems like you would have more
       | control that way and actually derive some rich(er) data to fine
       | tune on. It seems they are already almost doing this: all the
       | examples in the article start with "style"!
       | 
       | I have seen this pattern a few times actually, where you want the
       | AI to mimic some heuristic humans use. You never want to ask it
        | for the heuristic directly, just create the constituent data so
       | you can do some simple regression or whatever on top of it and
       | control the cutoff yourself.
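        | 
        | A sketch of keeping that cutoff on your side of the fence (the
        | dimension names and weights are invented):
        | 
        |   WEIGHTS = {"correctness": 3.0, "security": 3.0,
        |              "maintainability": 1.0, "style": 0.2}
        | 
        |   def score(dimensions: dict[str, int]) -> float:
        |       # dimensions: per-axis 0-10 ratings parsed from the
        |       # model's output; the weighting and the cutoff live
        |       # in plain code where they are easy to tune.
        |       return sum(WEIGHTS.get(k, 0.0) * v
        |                  for k, v in dimensions.items())
        | 
        |   def keep(dims: dict[str, int], cutoff=15.0) -> bool:
        |       return score(dims) >= cutoff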
        
         | pacifika wrote:
          | I thought the LLM would just make up the scores because of a
          | lack of training.
        
       | fnqi8ckfek wrote:
        | Reading from the other comments here, am I the only one thinking
       | that this is just busywork...? Just get rid of the thing, it's a
       | solution without a problem.
        
         | dannersy wrote:
         | I've been following this conversation and many of the
         | sentiments against an AI code review bot are gone. I guess
         | they're getting shadow banned?
        
         | stavros wrote:
         | I'm fairly sure there are at least a few companies that have
         | the problem of needing PRs reviewed.
        
         | Hilift wrote:
         | I guess normally it's difficult to stop an employee from
         | leaving nitpicky comments. But with AI you can finally take
         | control.
        
       | profsummergig wrote:
       | Is there a name for the activity of trying out different
       | strategies to improve the output of AI?
       | 
        | I found this article surprisingly enjoyable and interesting,
        | and I'd like to find more like it.
        
         | thrw42A8N wrote:
         | Prompt engineering
        
           | profsummergig wrote:
           | But this wasn't a different prompt that made it work better.
           | Here they created a database of prior human created comments
           | with dev ratings of how good the devs found them.
        
       | keybored wrote:
       | Why review bots. Why not a bot that runs as part of the
       | validation suite? And then you dismiss and follow up on what you
       | want?
       | 
       | You can run that locally.
        
       | Kwpolska wrote:
       | > Please don't use HN primarily for promotion. It's ok to post
       | your own stuff part of the time, but the primary use of the site
       | should be for curiosity.
       | 
       | https://news.ycombinator.com/newsguidelines.html
        
         | mosselman wrote:
         | I don't feel like this is primarily promotion. Sure it is a bit
         | of a content marketing article, but still offers some
         | potentially valuable insight.
        
           | Kwpolska wrote:
           | A look at OP's posting history suggests he posts a lot of
           | advertising.
        
       | Retr0id wrote:
       | > LLMs (which are paid by the token)
       | 
       | Straight-forwardly true and yet I'd never thought about it like
       | this before. i.e. that there's a perverse incentive for LLM
       | vendors to tune for verbose outputs. We rightly raise eyebrows at
       | the idea of developers being paid per volume of code, but it's
       | the default for LLMs.
        
         | britannio wrote:
         | Only while a vendor is ahead of the others. We developers will
         | favour a vendor with faster inference and lower pricing.
        
           | Retr0id wrote:
           | I think real endgame is some kind of "no win no fee"
           | arrangement. In the case of the article in the OP, it'd be if
           | they billed clients per "addressed" comment. It's less clear
           | how that would map onto someone selling direct access to an
           | LLM, though.
        
           | dustingetz wrote:
           | then, cartels
        
         | vergessenmir wrote:
          | It's why I prefer Claude to, let's say, ChatGPT. I'm paying
          | for tokens generated too.
        
         | rgbrgb wrote:
         | In the current landscape I think competition handles this
         | neatly, especially since open models have almost the opposite
         | incentive (training on concision is cheaper). As siblings note,
         | Claude tends to be a little less verbose. Though I find they
         | can all be quite concise when instructed (e.g. "just the code,
         | no yapping").
        
         | johnfn wrote:
         | LLMs are paid by input token _and_ output token. LLM developers
         | are just as incentivized to give quality output so you make
         | more calls to the LLM.
        
       | thomasahle wrote:
       | TLDR:
       | 
       | > Giving few-shot examples to the generator didn't work.
       | 
       | > Using an LLM-judge (with no training) didn't work.
       | 
       | > Using an embedding + KNN-classifier (lots of training data)
       | worked.
       | 
       | I don't know why they didn't try fine-tuning the LLM-judge, or at
       | least give it some few-shot examples.
       | 
       | But it shows that embeddings can make very simple classifiers
       | work well.
        
       | planetpluta wrote:
       | > Essentially we needed to teach LLMs (which are paid by the
       | token) to only generate a small number of high quality comments.
       | 
       | The solution of filtering after the comment is generated doesn't
       | seem to address the "paid by the token" piece.
        
         | dietr1ch wrote:
         | It may be a step forward in terms of gathering data and
         | defining the filter, but pushing up this filter might be
         | expensive unless there's enough volume to justify it.
         | 
          | I took it as a comment that models will generally be biased
          | towards being nitpicky, and that's something that needs to be
          | dealt with, as the incentives are not there to fix things at
          | the origin.
        
       | wzdd wrote:
       | Previously: https://news.ycombinator.com/item?id=42465374
        
       | hsbauauvhabzb wrote:
       | I can't fathom a world where an LLM would be able to review code
       | in any meaningful way -at all-.
       | 
        | It should not substitute a human, and probably wastes more effort
       | than it solves by a wide margin.
        
         | hsbauauvhabzb wrote:
         | And for any poor soul forced into an environment using AI
         | driven commit reviews, be sure you reply to the commit reviews
         | with gpt output and continue the conversation, eventually bring
         | in some of the AI 'engineers' to help resolve the issue, waste
         | their time back.
        
         | dakshgupta wrote:
         | I understand your skepticism and discomfort, and also agree
         | that an LLM should not replace a human code reviewer.
         | 
         | I would encourage you to try one (nearly all including ours
         | have a free trial).
         | 
         | When done right, they serve as a solid first pass and surface
         | things that warrant a second look. Repeated code where there
         | should be an abstraction, inconsistent patterns from other
         | similar code elsewhere in the codebase, etc. Things that
         | linters can't do.
         | 
         | Would you not want your coworkers to have AI look at their PRs,
         | have them address the relevant comments, and then pass it to
         | you for review?
        
           | smallpipe wrote:
           | > Would you not want your coworkers to have AI look at their
           | PRs, have them address the relevant comments
           | 
           | God no. Review is where I teach more junior people about the
           | code base. And where I learn from the more senior people.
           | Either of us spending time making the AI happy just to be
           | told we're solving the wrong problem is a ridiculous waste of
           | time.
        
         | makingstuffs wrote:
         | Yeah it seems like the inverse of what is logical: Have a human
         | review the code which _may_ have been written/influenced by an
         | LLM. Having an LLM do code review seems about as useful as YOLO
         | merging to main and praying to the Flying Spaghetti Monster...
        
       | pcwelder wrote:
       | A whole article without mentioning the name of the LLM. It's not
       | like Sonnet and O1 have the same modalities.
       | 
       | Anything you do today might become irrelevant tomorrow.
        
       | AgentOrange1234 wrote:
       | I'd be curious to hear more on Attempt 2. It sounds like the
       | approach was basically to ask an llm for a score for each
       | comment. Adding specifics to this prompt might go a long way?
       | Like, what specifically is the rationale for this change, is this
       | likely to be a functional bug, is it a security issue, how does
       | it impact maintainability over the long run, etc.; basically I
       | wonder if asking about more specific criteria and trying to
       | define what you mean by nits can help the LLM give you more
       | reliable scores.
        
         | dakshgupta wrote:
         | That's an interesting point - we didn't try this. Now that you
         | said that, I bet even clearly defining what each number on the
         | scale means would help.
        
       | iandanforth wrote:
       | "As a last resort we tried machine learning ..."
       | 
       | - Hilarious that a cutting edge solution (document embedding and
       | search) from 5-6 years ago was their last resort.
       | 
       | - Doubly hilarious that "throw more AI at it" surprised them when
       | it didn't work.
        
       | aarondia wrote:
       | I find these retro blog posts from people building LLM solutions
       | super useful for helping me try out new prompting, eval, etc.
       | techniques. Which blog posts have you all found most useful?
       | Would love a link.
        
       | utdiscant wrote:
       | "We picked the latter, which also gave us our performance metric
       | - percentage of generated comments that the author actually
       | addresses."
       | 
       | This metric would go up if you leave almost no comments. Would it
       | not be better to find a metric that rewards you for generating
       | many comments which are addressed, not just having a high
       | relevance?
       | 
       | You even mention this challenge yourselves: "Sadly, even with all
       | kinds of prompting tricks, we simply could not get the LLM to
       | produce fewer nits without also producing fewer critical
       | comments."
       | 
       | If that was happening, that doesn't sound like it would be
       | reflected in your performance metric.
        
         | dakshgupta wrote:
         | Good criticism that we should pay closer attention to. Someone
          | else pointed this out too, and since then we've started
          | tracking addressed comments per file changed as well.
        
         | SomewhatLikely wrote:
         | You could probably modify the metric to addressed comments per
         | 1000 lines of code.
        
       | lupire wrote:
       | Instead of training an LLM to make comments, train the LLM to
       | generate test cases and then guess the line of code that breaks
       | the test.
        
         | dakshgupta wrote:
         | You might enjoy @goodside on Twitter. He's a prompt engineer at
         | Scale AI and a lot of his observations and techniques are
         | fascinating.
        
       ___________________________________________________________________
       (page generated 2024-12-22 23:02 UTC)