[HN Gopher] Codeball - AI-powered code review
___________________________________________________________________
Codeball - AI-powered code review
Author : lladnar
Score : 56 points
Date : 2022-05-27 18:38 UTC (4 hours ago)
(HTM) web link (codeball.ai)
(TXT) w3m dump (codeball.ai)
| sanity31415 wrote:
| "Download free money" sounds like a scam.
| videlov wrote:
| It's just poking fun at other websites being too serious.
| videlov wrote:
| Creator of Codeball here, somebody beat us to sharing it :).
|
 | Codeball is the result of a hack-week at Sturdy - we were
 | thinking about ways to reduce the waiting-for-code-review time
 | and were curious about exactly how predictable the entire
 | process is. It turned out to be very predictable!
|
| Happy to answer any questions.
| dchichkov wrote:
 | Hi. I tried creating the same service about 5 years back ;)
 | articoder.com ;) I was digging at it for a few months, but the
 | natural language processing of the time was not up to the
 | task...
|
| Good to know that now it is doable in a week, with such good
| precision! Or do you have humans in the backend ;) ?
|
 | How do you compare yourself to PullRequest? (They've been
 | digging at it for 5 years as well and recently folded.) [Fun
 | fact: we were interviewed in the same YC batch, which always
 | makes me wonder if YC liked the idea enough to have it
 | implemented by another team ;)]
| videlov wrote:
| It's really cool to hear that others have thought about this
| too!
|
| >How do you compare yourself to PullRequest
|
 | So it turns out that most code contributions nowadays get
 | merged without fixes or feedback during the review (about
 | 2/3). I think this is because of the increased focus on
 | continuous delivery and shipping small & often. Codeball's
 | purpose is to identify and approve those 'easy' PRs so that
 | humans get to deal with the trickier ones. The cool part
 | about it is being blocked less.
| emeraldd wrote:
| Is your model trained per language?
|
 | Without something that semantically understands the code under
 | review (which all but requires general AI, or at the least a
 | strong static analyzer), this risks doing nothing more than
 | adding noise to the process, or worse, leading to certain
 | groups of developers effectively being given a free pass.
| sidlls wrote:
| I know for a fact I would not want to automate many of the
| predictable aspects of code reviews at any job I've ever had.
| This is because many of the predictable aspects of code review
| are due to poor review practices. Things like rubber-stamping
| review requests with a certain change size (e.g. lines of code,
| number of files), surface-level reviews (e.g. "yep this doesn't
| violate our style guidelines that the linter can't catch"), and
| similar items.
|
| A proper code review isn't simply catching API or style errors
| --it seeks to understand how the change affects the
| architecture and structure of the existing code. I'm sure AI
 | can help with that, and for a broad class of changes it's
 | likely somewhat to very predictable--but I'm skeptical that it
 | is predictable for enough use cases to make it worth spending
 | money on, for now.
|
 | Put another way: "approves code reviews a human would have
 | approved" isn't exactly the standard I'd want automated reviews
 | to aspire to. Human approval, in my experience, mostly doesn't
 | indicate a good-quality review.
| visarga wrote:
 | Maybe the AI approach is still useful. I am thinking of
 | analysing the AST to measure the impact of a code change, or
 | the complexity of the various components of the project. Some
 | kind of graph analysis could measure complexity and
 | maintainability at the project level.
| videlov wrote:
 | My thesis is that a tool like Codeball would reduce the amount
 | of rubber-stamping. Thing is, many devs aim to ship code "small
 | and often", which inevitably leads to fast, pattern-matching
 | style reviews. If software can reliably deal with those,
 | humans can focus their energy on the tricky ones. Kind of
 | like how using a code formatter eliminates that type of
 | discussion and lets people focus on semantics.
| cobbal wrote:
| How much will you be charging for the adversarial network to
| allow someone to get any PR approved? ;)
| gombosg wrote:
 | I'm a bit skeptical here. We should ask the question: why are we
 | reviewing code in the first place? This sparks hot debates on HN
 | every now and then, because reviews are not just automated
 | checks but part of the engineering culture, which is a defining
 | part of any company or eng department.
|
| PR reviews are a way of learning from each other, keeping up with
| how the codebase evolves, sharing progress and ideas, giving
 | feedback and asking questions. For example, at $job we approve
 | ~90% of PRs, with various levels of pleas, suggestions, nitpicks
 | and questions. We approve because of trust (each PR contains a
 | demo video of a working feature or fix) and so as not to block
 | each other,
| but there might be important feedback or suggestions given among
| the comments. A "rubber stamp bot" would be hard to train in such
| a review system and simply misses the point of what reviews are
| about.
|
| What happens if there is a mistake (hidden y2k bomb, deployment
| issue, incident, regression, security bug, bad database
| migration, wrong config) in a PR that passes a human review? At a
| toxic company you get finger pointing, but with a healthy team,
| people can learn a lot when something bad passes a review. But
 | you can't discuss anything with a nondeterministic review bot.
| There's no responsibility there.
|
| Another question is the review culture. If this app is trained on
| some repo (whether PRs were approved or not), past reviews
| reflect the review culture of the company. What happens when a
 | black-box AI takes that over? Is it going to train itself on its
 | own reviews? People and review culture can be changed, but a
 | black-box AI is hard to change in a predictable way.
|
| I'd rather set up code conventions, automated linters (i.e.
| deterministic checks) etc. than have a review bot allow code into
 | production. Or just let go of PR reviews altogether; there were
 | some articles shared on HN about that recently. :)
| apugoneappu wrote:
| Explanation of results for non-ML folks (results on the default
| supabase repo shown on the homepage):
|
 | Codeball's precision is 0.99. That simply means that 99% of the
 | PRs that Codeball predicted as approvable were actually
 | approved. In layman's terms, if Codeball says that a PR is
 | approvable, you can be 99% sure that it is.
 |
 | But recall is 48%, meaning that only 48% of the actually
 | approved PRs were predicted to be approvable. So Codeball
 | incorrectly flagged 52% of the approvable PRs as un-approvable,
 | just to be safe.
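 |
 | To make the numbers concrete, here is a minimal sketch; the
 | counts below are hypothetical, chosen only to reproduce the
 | two metrics:
 |
 |     # precision/recall from a hypothetical confusion matrix
 |     tp = 96    # PRs Codeball approved that humans also approved
 |     fp = 1     # PRs Codeball approved that humans did not
 |     fn = 104   # human-approved PRs Codeball left to humans
 |     precision = tp / (tp + fp)   # 96/97 ~= 0.99
 |     recall = tp / (tp + fn)      # 96/200 = 0.48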
|
 | So Codeball is like a strict bartender who only serves you when
 | they are absolutely sure you're old enough. You may well be of
 | age, but Codeball's not serving you.
| Der_Einzige wrote:
| A LOT of ML applications should be exactly like this.
|
 | I want systems that "flag" things with low recall but ultra,
 | ultra high precision. Many times we get exactly the opposite -
 | which is far worse!
| l33t2328 wrote:
| That's still super useful.
|
 | I'm assuming most PRs are approvable. If that's the case, then
 | this should cut down on time spent doing reviews by a lot.
| sabujp wrote:
 | is this just (not) approving, or is it actually providing
 | automated feedback and suggestions for what needs to be fixed?
| videlov wrote:
| It is like a first-line reviewer. It approves contributions
| that it is really confident are good and leaves the rest to
 | humans. So basically it saves developers time and context
 | switching.
| mikeryan wrote:
| Is there no marker that can be provided to indicate why it
| failed or even a line number?
|
| Can't tell if it's something like formatting and code style
| or "bad code" or what. Even as a first line reviewer I can't
| tell if this is valuable or not without any details on why it
| would approve something.
|
 | The PRs it would approve here were all super minor. You could
 | probably get a similar number of these approved just by checking
 | lines of code changed + "has it been linted".
|
| It's really hard to tell if this is valuable or not yet.
| videlov wrote:
| You are making a very good point. Right now it can't give
 | such an indication, because it is a black-box model. There are
 | hundreds of inputs that go in (e.g. characteristics of the
| code, how much the author has worked with this code in the
| past, how frequently this part of the code changes) and the
| output is how confident the model is that the contribution
| is safe to merge.
|
| With that said, there are ways of exposing more details to
| developers. For example, scoring is done per-file, and
| Codeball can tell you which files it was not confident in.
| apugoneappu wrote:
 | Maybe I'm mistaken, but how is it a code review without looking
 | at the actual code? (It's not listed as an input feature on the
 | 'how' page.)
| videlov wrote:
 | It does look at the code at a meta level, in particular at
 | whether the kind of change in the PR has previously been
 | objected to or corrected afterwards. It creates perceptual
 | hashes out of the code changes, which are used as categorical
 | variables that go into the neural net.
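 |
 | For a rough illustration, here is a simhash-style sketch of
 | what such a perceptual hash over a diff could look like - a
 | sketch of the general idea, not the exact implementation:
 |
 |     import hashlib
 |     import re
 |
 |     def diff_simhash(diff_text, bits=32):
 |         # similar diffs should end up only a few bits apart
 |         weights = [0] * bits
 |         for token in re.findall(r"\w+", diff_text):
 |             h = int(hashlib.md5(token.encode()).hexdigest(), 16)
 |             for i in range(bits):
 |                 weights[i] += 1 if (h >> i) & 1 else -1
 |         # the resulting bucket can act as a categorical input
 |         return sum(1 << i for i, w in enumerate(weights) if w > 0)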
|
| Deriving features about the code contributions is probably the
| most challenging aspect of the project so far.
| anonred wrote:
 | So I dry-ran it against a tiny open source repo I maintain, and
 | it worked on exactly 0 of the last 50 PRs. For example, it didn't
| auto-approve a PR that was just deleting stale documentation
| files... The idea sounds nice, but the execution is a bit lacking
| right now.
| moffkalast wrote:
| I don't really get the point of it either, since it just
 | approves PRs. I know when my PR is mergeable; you don't have to
 | tell me that. What I need is some feedback, since that's what
| code review is for.
|
| Any linter is more useful than this.
| mdaniel wrote:
| I can't tell if this is a joke or not
| videlov wrote:
| It is definitely not a joke. This started off as scratching our
| own itch in answering 'how predictable are programmers?' but it
| turned out to be really useful, so we made a site.
| mdaniel wrote:
| Good to know; is there an example (other than its own GH
| Action repo) to see what it has approved?
|
| Given that it's a model, is there a feedback mechanism
| through which one could advise it (or you) of false
| positives?
|
| I would be thrilled to see what it would have said about:
| https://gitlab.com/gitlab-org/gitlab/-/merge_requests/76318
| (q.v. https://news.ycombinator.com/item?id=30872415)
| videlov wrote:
| It would have said nothing. The model's idea is to identify
| the bulk of easy / safe contributions that get approved
| without objection and let the humans focus on the tricky
| ones (like the example above).
|
| On the site you can give it GitHub repos where it will test
| the last 50 PRs and show you what it would have done (false
 | negatives and false positives included). You can also give
 | it a link to an individual PR, but GitLab is not yet
 | supported.
| zegl wrote:
 | I've tried to reproduce #76318 as best I could (using a
| fork of the CE version of GitLab).
| https://github.com/zegl/gitlabhq-cve-test/pull/1
|
 | Codeball did not approve the PR!
 | https://codeball.ai/prediction/8cc54ce2-9f50-4e5c-9a16-3bc48...
| jldugger wrote:
 | if len(diff) > 500:  # lines
 |     return "Looks good to me"
 | time.sleep(86400)
 | return "+1"
| [deleted]
| danielmarkbruce wrote:
| Looks awesome.
|
| Tone down the marketing page :) This page makes it sound like a
| non-serious person built the tool.
|
| How about: "Codeball approves Pull Requests that a human would
| approve. Reduce waiting for reviews, save time and money."
|
| And make the download button: "Download"
| donkarma wrote:
| I would never use something like this. Seems to me that it's just
| a heuristic based on the char diff count. I made a simple repo
| that has a shell script that does rm -rf
| /usr/old_files_and_stuff, added a space next to the first slash
| and it was approved, which is dangerous. If I need to manually
 | verify it anyway for stuff like this, why would I use it?
| iamnafets wrote:
| I generally feel the same way, but just to steel man the
| argument: would your manual code review process have caught
| this issue?
|
| Sometimes we compare new things against their hypothetical
| ideal rather than the status quo. The latter is significantly
| more tractable.
| tehsauce wrote:
| I would be a bit concerned about adversarial attacks with this.
 | I'm sure someone will be able to come up with an innocent-
 | looking PR that the system will always approve but is actually
 | malicious. Then any repo which auto-approves PRs with this could
| be vulnerable.
| videlov wrote:
 | There are 3 categories of predictors that the model takes into
 | account; here are some examples: (1) the code complexity and
 | its perceptual hash, (2) the author and their track record in
 | the repository, and (3) the author's past involvement in the
 | specific files being modified.
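 |
 | To make that concrete, here is a hypothetical per-PR feature
 | record; all names and values below are invented for
 | illustration:
 |
 |     features = {
 |         # (1) the change itself
 |         "files_changed": 3,
 |         "lines_added": 12,
 |         "diff_simhash_bucket": 714,  # perceptual hash, categorical
 |         # (2) the author's track record in the repository
 |         "author_merged_prs": 87,
 |         "author_approval_rate": 0.94,
 |         # (3) the author's history with the touched files
 |         "author_commits_to_files": 21,
 |         "days_since_author_touched": 14,
 |     }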
|
 | With that said, an adversarial attack from somebody within the
 | team/organisation would be very difficult to detect.
| codeenlightener wrote:
 | This is an exciting direction for AI code tools! I'm curious to
 | see code review tools that give developers feedback on
 | non-approved code, which I think serves an important purpose of
 | code review: building a shared understanding of technical
 | standards.
|
| On a related note, I'm working on https://denigma.app, which is
| an AI that tries to explain code, giving a second opinion on what
| it looks like it does. One company said they found it useful for
 | code review. Maybe how clear an AI's explanation is can serve as
 | a decent metric of code quality.
| mchusma wrote:
| I think I like this better expressed as a linter than a code
| reviewer. Maybe it doesn't sell as well. But giving this to devs
| to help them make better PRs and have more confidence in
| approval? Good. Skipping code review? Bad.
|
 | In my experience, most "issues" in code review are not technical
 | errors but business logic errors, and most of the time there is
 | not even enough context in the code to know what the right
 | answer is. It is in a PM's or salesperson's head.
| Imnimo wrote:
| >Codeball uses a Multi-layer Perceptron classifier neural network
 | as its prediction model. The model takes hundreds of inputs in its
| input layer, has two hidden layers and a single output scoring
| the likelihood a Pull Request would be approved.
|
| Really bringing out the big guns here!
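 |
 | For reference, that whole architecture is a few lines of
 | scikit-learn. Everything below (feature count, layer sizes,
 | decision threshold) is a placeholder, not Codeball's actual
 | configuration:
 |
 |     import numpy as np
 |     from sklearn.neural_network import MLPClassifier
 |
 |     # placeholder data: 1000 PRs x 300 features, random labels
 |     rng = np.random.default_rng(0)
 |     X = rng.normal(size=(1000, 300))
 |     y = rng.integers(0, 2, size=1000)    # 1 = PR was approved
 |
 |     # two hidden layers, single probabilistic output
 |     model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
 |     model.fit(X, y)
 |     p_approve = model.predict_proba(X[:5])[:, 1]
 |     auto_approve = p_approve > 0.99      # act only when confident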
| eyelidlessness wrote:
| This is a neat idea but gives me pause. Thinking about how it
| would work in projects I maintain, it would either:
|
 | - be over-confident, providing negative value, because the
 | proportion of PRs which get an "LGTM" is extraordinarily low,
 | and my increasingly deep familiarity with the code and areas of
 | risk makes me even more suspicious when something looks that
 | safe
|
| - never gain confidence in any PR, providing no value
|
| I can't think of a scenario where I'd use this for these
| projects. But I can certainly imagine it in the abstract, under
| circumstances where baseline safety of changes is much higher.
___________________________________________________________________
(page generated 2022-05-27 23:00 UTC)