[HN Gopher] We Can Just Measure Things
       ___________________________________________________________________
        
       We Can Just Measure Things
        
       Author : tosh
       Score  : 52 points
       Date   : 2025-06-17 11:15 UTC (2 days ago)
        
 (HTM) web link (lucumr.pocoo.org)
 (TXT) w3m dump (lucumr.pocoo.org)
        
       | ToucanLoucan wrote:
       | Still RTFA but this made me rage:
       | 
       | > In fact, we as engineers are quite willing to subject each
       | others to completely inadequate tooling, bad or missing
       | documentation and ridiculous API footguns all the time. "User
       | error" is what we used to call this, nowadays it's a "skill
       | issue". It puts the blame on the user and absolves the creator,
       | at least momentarily. For APIs it can be random crashes if you
       | use a function wrong
       | 
       | I recently implemented Microsoft's MSAL authentication on iOS
       | which includes as you might expect a function that retrieves the
       | authenticated accounts. Oh sorry, I said function, but there's
       | two actually: one that retrieves one account, and one that
       | retrieves multiple accounts, which is odd but harmless enough
       | right?
       | 
       | Wrong, because whoever designed this had an absolutely galaxy
       | brained moment and decided if you try and retrieve one account
       | when multiple accounts are signed in, instead of, oh I dunno,
       | just returning an error message, or perhaps returning the most
       | recently used account, no no no, what we should do in that case
       | is _throw an exception and crash the fucking app._
       | 
       | I just. Why. Why would you design anything this way!? I can't
       | fathom any situation you would use the one-account function in
       | when the multi-account one does the exact same fucking thing,
       | notably WITHOUT the potential to cause a CRASH, and just returns
       | a set of one, and further, why then if you were REALLY INTENT ON
       | making available one that only returned one, it wouldn't itself
       | just call the other function and return Accounts.first.
       | 
       | </ rant>
        
         | layer8 wrote:
         | How is an exception different from "returning an error
         | message"?
        
           | dewey wrote:
           | Seems like the main differentiator is that one crashed and
           | one doesn't. Unrelated to error message or exception.
        
             | johnmaguire wrote:
             | I'm not sure I understand how both occurred at once.
             | Typically an uncaught exception will result in a crash, but
             | this would generally be considered an error at the call
             | site (i.e. failing to handle error conditions.)
        
             | layer8 wrote:
             | I understood "crashing" as them not catching the exception.
             | 
             | Most functions can fail, and any user-facing app has to be
             | prepared for it so that it behaves gracefully towards the
             | user. In that sense I agree that the error reporting
             | mechanism doesn't matter. It's unclear though what the
             | difference was for the GP.
        
           | ToucanLoucan wrote:
           | For one: terminating execution
           | 
           | More importantly: why is having more than one account an
           | "exception" at all? That's not an error or fail condition, at
           | least in my mind. I wouldn't call our use of the framework an
           | edge case by any means, it opens a web form in which one puts
           | authentication details, passes through the flow, and then we
           | are given authentication tokens and the user data we need.
           | It's not unheard of for more than one account to be returned
           | (especially on our test devices which have many) and I get
           | the one-account function not being suitable for handling
           | that, my question is... why even have it then, when the
           | multi-account one performs the exact same function, better,
           | without an extra error condition that might arise?
        
             | TOGoS wrote:
             | > why is having more than one account an "exception" at
             | all? That's not an error or fail condition
             | 
             | It is if the caller is expecting there to be exactly one
             | account.
             | 
             | This is why I generally like to return a set of things from
             | any function that might possibly return zero or more than
             | one things. Fewer special cases that way.
             | 
             | But if the API of the function is to return one, then you
             | either give one at random, which is probably not right, or
             | throw an exception. And with the latter, the person
             | programming the caller will be nudged towards using the
             | other API, which is probably what they should have done
             | anyway, and then, as you say, the returns-one-account
             | function should probably just not exist at all.
        
               | lazide wrote:
               | Chances are, the initial function was written when the
               | underlying auth backend only supported a single account
               | (structurally), and most clients were using that method.
               | 
               | Then later on, it was figured out that multiple accounts
               | per credential set (?!?) needed to be supported, but the
               | original clients still needed to be supported.
               | 
               | And either no one could afree on a sane convention if
               | this happened (like returning the first from the list),
               | or someone was told to 'just do it'.
               | 
               | So they made the new call, migrated _themselves_ , and
               | put in a uncaught exception in the old place (can't put
               | any other type there without breaking the API) and blam -
               | ticket closed.
               | 
               | Not that I've ever seen that happen before, of course.
               | 
               | Oh, and since the multi-account functionality is
               | obviously new and probably quite rare at first, it could
               | be years before anyone tracks down whoever is
               | responsible, if ever.
        
               | layer8 wrote:
               | There's no good way to solve this, though. Returning an
               | arbitrary account can have unpredictable consequences as
               | well if it isn't the expected one. It's a compatibility
               | break either way.
        
               | lazide wrote:
               | Exactly, which is probably why a better 'back
               | compatibility' change couldn't be agreed on.
               | 
               | But there is a way that closes your ticket fast and will
               | compile!
        
               | layer8 wrote:
               | Sure, but not introducing the ability to be logged into
               | multiple accounts isn't the best choice as well.
               | Arguably, throwing an exception upon multiple logins for
               | the old API is the lesser evil overall.
        
               | ToucanLoucan wrote:
               | > There's no good way to solve this, though.
               | 
               | Yes there is! Just get rid of it. It's useless. The re-
               | implementation from using one to the other was barely a
               | few moments of work, and even if you want to say "well
               | that's a breaking change" I mean, yeah? Then break it. I
               | would be far less annoyed if a function was just removed
               | and Xcode went "hey this is pointed at nothing, gotta
               | sort that" rather than letting it run in a way that turns
               | the use of authentication functionality into a landmine.
        
               | lazide wrote:
               | I take it you've never had to support a widely used
               | publicly available API?
               | 
               | You might be bound to support these calls for many, many
               | years.
        
             | kfajdsl wrote:
             | > For one: terminating execution
             | 
             | Seems like you should have a generic error handler that
             | will at a minimum catch unexpected, unhandled exceptions
             | with a 'Something went wrong' toast or similar?
        
             | zahlman wrote:
             | > For one: terminating execution
             | 
             | Not if you handle the exception properly.
             | 
             | > why is having more than one account an "exception" at
             | all? That's not an error or fail condition, at least in my
             | mind.
             | 
             | Because you explicitly asked for "the" account, and your
             | request is based on a false premise.
             | 
             | >why even have it then, when the multi-account one performs
             | the exact same function, better, without an extra error
             | condition that might arise?
             | 
             | Because other users of the library _explicitly want_ that
             | to be an error condition, and would rather not write the
             | logic for it themselves.
             | 
             | Performance could factor into it, too, depending on
             | implementation details that obviously I know nothing about.
             | 
             | Or for legacy reasons as described in
             | https://news.ycombinator.com/item?id=44321644 .
        
         | Jabrov wrote:
         | "crash the app" sounds like the app's problem (ie. not handling
         | exceptions properly) as opposed to the design of the API. It
         | doesn't seem that unreasonable to throw an exception if
         | unexpected conditions are hit? Also, more likely than not,
         | there is probably an explicit reason that an exception is
         | thrown here instead of something else.
        
         | raincole wrote:
         | > nowadays it's a "skill issue"
         | 
         | > throw an exception and crash the fucking app
         | 
         | Yes, if your app crashes when a third-party API throws an
         | exception, it's a "skill issue" of you. This comment is an
         | example why sometimes blaming the user's skill issue is valid.
        
         | jiggawatts wrote:
         | At the risk of being an amateur psychologist, your approach
         | feels like that of a front end developer used to a forgiving
         | programming model with the equivalent of the old BASIC
         | programming language statement ON EFROR RESUME NEXT.
         | 
         | Server side APIs and _especially_ authentication APIs tend
         | towards the "fail fast" approach. When APIs are accidentally
         | mis-used this is treated either as a compiler error or a
         | deliberate crash to let the developer know. Silent failures are
         | verboten for entire categories of circumstances.
         | 
         | There's a gradient of: silent success, silent failure, error
         | codes you can ignore, exceptions you can't, runtime panic, and
         | compilation error.
         | 
         | That you can't even tell the qualitative difference between the
         | last half of that list is why I'm thinking you're primarily a
         | JavaScript programmer where only the first two in the list
         | exist for the most part.
        
       | lostdog wrote:
       | A lot of the "science" we do is experimenting on bunches of
       | humans, giving them surveys, and treating the result as
       | objective. How many places can we do much better by surveying a
       | specific AI?
       | 
       | It may not be objective, but at least it's consistent, and it
       | reflects something about the default human position.
       | 
       | For example, there are no good ways of measuring the amount of
       | technical debt in a codebase. It's such a fuzzy question that
       | only subjective measures work. But what if we show the AI one
       | file at a time, ask "Rate, 1-10, the comprehensibility,
       | complexity, and malleability of this code," and then average
       | across the codebase. Then we get measure of tech debt, which we
       | can compare over time to measure if it's rising or falling. The
       | AI makes subjective measurements consistent.
       | 
       | This essay gives such a cool new idea, while only scratching the
       | surface.
        
         | delusional wrote:
         | > it reflects something about the default human position
         | 
         | No it doesn't. Nothing that comes out of an LLM reflects
         | anything except the corpus it was trained on and the sampling
         | method used. That definitionally true, since those are the very
         | things it is a product of.
         | 
         | You get NO subjective or objective insight from asking the AI
         | about "technical debt" you only get an opaque statistical
         | metric that you can't explain.
        
           | BriggyDwiggs42 wrote:
           | If you knew that the model never changed it might be very
           | helpful, but most of the big providers constantly mess with
           | their models.
        
             | cwillu wrote:
             | Even if you used a local copy of a model, it would still
             | just be a semi-quantitative version of "everyone knows
             | <thing-you-don't-have-a-grounded-argument-for>"
        
             | layer8 wrote:
             | Their performance also varies depending on load (concurrent
             | users).
        
               | BriggyDwiggs42 wrote:
               | Dear god does it really? That's very funny.
        
       | layer8 wrote:
       | We can just measure things, but then there's Goodhart's law.
       | 
       | With the proposed way of measuring code quality, it's also
       | unclear how comparable the resulting numbers would be between
       | different projects. If one project has more essential complexity
       | than another project, it's bound to yield a worse score, even if
       | the code quality is on par.
        
         | Marazan wrote:
         | I would argue you can't compare between projects due to the
         | reasons you state. But you can try and improve the metrics
         | within a single project.
         | 
         | Cycolmatic complexity is a terrible metric to obsesses over yet
         | in a project I was on it was undeniably true that the newer
         | code written by more experienced Devs was both subjectively
         | nicer and also had lower cycolmatic complexity than the older
         | code worked on by a bunch of juniors (some of the juniors had
         | then become some of the experienced Devs who wrote the newer
         | code)
        
           | layer8 wrote:
           | > But you can try and improve the metrics within a single
           | project.
           | 
           | Yes. But it means that it doesn't let you assess code
           | quality, only (at best) changes in code quality. And it's
           | difficult as soon as you add or remove functionality, because
           | then it isn't strictly speaking the same project anymore, as
           | you may have increased or decreased the essential complexity.
           | What you _can_ assess is whether a pure refactor improves or
           | worsens a project's amenibility to AI coding.
        
       | elktown wrote:
       | I think this is advertisement for an upcoming product. Sure, join
       | the AI gold rush, but at least be transparent about it.
        
         | falcor84 wrote:
         | Even if he does have some aspiration to make money by
         | operationalizing this (which I didn't sense that he does), what
         | Armin describes there is something that's almost trivial to
         | implement a basic version of yourself in under an hour.
        
           | elktown wrote:
           | > which I didn't sense that he does
           | 
           | I'd take a wager.
        
             | the_mitsuhiko wrote:
             | If your wager is that I will build an AI code quality
             | measuring tool then you will lose it. I'm not advertising
             | anything here, I'm just playing with things.
        
               | elktown wrote:
               | > code quality measuring tool
               | 
               | I didn't, just an AI tool in general.
        
       | GardenLetter27 wrote:
       | I'm really skeptical of using current LLMs for judging codebases
       | like this. Just today I got Gemini to solve a tricky bug, but it
       | only worked after providing it more debug output after solving
       | some of it myself.
       | 
       | The first time I tried without the deeper output, it "solved" it
       | by writing a load of code that failed in loads of other ways, and
       | ended up not even being related to the actual issue.
       | 
       | Like you can be certain it'll give you some nice looking metrics
       | and measurements - but how do you know if they're accurate?
        
         | the_mitsuhiko wrote:
         | > I'm really skeptical of using current LLMs for judging
         | codebases like this.
         | 
         | I'm not necessarily convinced that the current generation of
         | LLMs are overly amazing at this, but they definitely are very
         | good at measuring inefficiency of tooling and problematic APIs.
         | That's not all the issues, but it can at least be useful to
         | evaluate some classes of problems.
        
         | falcor84 wrote:
         | What do you mean that it "ended up not even being related to
         | the actual issue"? If you give it a failing test suite to turn
         | green and it does, then either its solution is indeed related
         | to the issue, or your tests are incomplete; so you improve the
         | tests and try again, right? Or am I missing something?
        
           | GardenLetter27 wrote:
           | It made the other tests fail, I wasn't using it in agent
           | mode, just trying to debug the issue.
           | 
           | The issue is that it can happily go down the completely wrong
           | path and report exactly the same as though it's solved the
           | problem.
        
           | cmrdporcupine wrote:
           | I explain this in sibling-node comment but I've caught Claude
           | multiple times in the last week just inserting special-case
           | kludges to make things "pass", without actually successfully
           | fixing the underlying problem that the test was checking for.
           | 
           | Just outright "if test-is-running { return success; }" level
           | stuff.
           | 
           | Not kidding. 3 or 4 times in the past week.
           | 
           | Thinking of cancelling my subscription, but I also find it
           | kind of... entertaining?
        
             | jiggawatts wrote:
             | I just realised that this is probably a side-effect of a
             | faulty training regime. I've heard several industry heads
             | say that programming is "easy" to generate synthetic data
             | for and is also amenable to training methods that teach the
             | AI to increase the pass rate of unit tests.
             | 
             | So... it did.
             | 
             | It made the tests pass.
             | 
             | "Job done boss!"
        
         | cmrdporcupine wrote:
         | I have mixed results but one of the more disturbing things I've
         | found Claude doing is that when confronted with a failing test
         | case, and not being able to solve a tricky problem.. just
         | writing a kludge into the code that identifies that here's a
         | test running, and makes it pass. But only for that case.
         | Basically, totally cheating.
         | 
         | You have to be super careful and review _everything_ because if
         | you don 't you can find your code littered with this strange
         | mix of seeming brilliance which makes you complacent... and
         | total Junior SWE behaviour or just outright negligence.
         | 
         | That, or recently, it's just started declaring victory and
         | claiming to have fixed things, even when the test continues to
         | fail. Totally trying to gaslight me.
         | 
         | I swear I wasn't seeing this kind of thing two weeks ago, which
         | makes me wonder if Anthropic has been turning some dials...
        
           | quesera wrote:
           | > _identifies that here 's a test running, and makes it pass.
           | But only for that case_
           | 
           | My team refers to this as a "VW Bugfix".
        
           | alwa wrote:
           | I also feel like I've seen a lot more of these over the past
           | week or two, whereas I don't remember noticing it at all
           | before then.
           | 
           | It feels like it's become grabbier and less able to stay in
           | its lane: ask for a narrow thing, and next thing you know
           | it's running hog wild across the codebase shoehorning in
           | half-cocked major architectural changes you never asked for.
           | 
           | Then it smugly announces success, even when it runs the tests
           | and sees them fail. "Let me test our fix" / [tests fail] /
           | [accurately summarizes the way the tests are failing] /
           | "Great! The change is working now!"
        
             | cmrdporcupine wrote:
             | Yes, or I've seen lately "a few unrelated tests are failing
             | [actually same test as before] but the core problem is
             | solved."
             | 
             | After leaving a trail of mess all over.
             | 
             | Wat?
             | 
             | Someone is changing some weights and measures over at
             | Anthropic and it's not appreciated.
        
       | yujzgzc wrote:
       | Another, related benefit of LLMs in this situation is that we can
       | observe their hallucinations and use them for design. I've come
       | up with a couple situations where I saw Copilot hallucinate a
       | method, and I agreed that that method should've been there. It
       | helps confirm whether the naming of things makes sense too.
       | 
       | What's ironic about this is that the very things that TFA points
       | out are needed for success (test coverage, debuggability, a way
       | to run locally etc) are exactly the things that typical LLMs
       | themselves lack.
        
         | crazygringo wrote:
         | I've found LLM's to be _extremely_ helpful in naming and
         | general function /API design, where there a lot of different
         | ways to express combinations of parameters.
         | 
         | I know what seems natural to _me_ but that 's because I'm
         | extremely familiar with the internal workings of the project.
         | LLM's seem to be very good at coming with names that are just
         | descriptive enough but not too long, and most importantly
         | follow "general conventions" from similar projects that I may
         | not be aware of. I can't count the number of times an LLM has
         | given me a name for a function that I've thought, oh of course,
         | that's a much clearer name that what I was using. And I thought
         | I was already pretty good at naming things...
        
       ___________________________________________________________________
       (page generated 2025-06-19 23:01 UTC)