[HN Gopher] We Can Just Measure Things
___________________________________________________________________
We Can Just Measure Things
Author : tosh
Score : 52 points
Date : 2025-06-17 11:15 UTC (2 days ago)
(HTM) web link (lucumr.pocoo.org)
(TXT) w3m dump (lucumr.pocoo.org)
| ToucanLoucan wrote:
| Still RTFA but this made me rage:
|
| > In fact, we as engineers are quite willing to subject each
| other to completely inadequate tooling, bad or missing
| documentation and ridiculous API footguns all the time. "User
| error" is what we used to call this, nowadays it's a "skill
| issue". It puts the blame on the user and absolves the creator,
| at least momentarily. For APIs it can be random crashes if you
| use a function wrong
|
| I recently implemented Microsoft's MSAL authentication on iOS
| which includes as you might expect a function that retrieves the
| authenticated accounts. Oh sorry, I said function, but there's
| two actually: one that retrieves one account, and one that
| retrieves multiple accounts, which is odd but harmless enough,
| right?
|
| Wrong, because whoever designed this had an absolutely galaxy
| brained moment and decided if you try and retrieve one account
| when multiple accounts are signed in, instead of, oh I dunno,
| just returning an error message, or perhaps returning the most
| recently used account, no no no, what we should do in that case
| is _throw an exception and crash the fucking app._
|
| I just. Why. Why would you design anything this way!? I can't
| fathom any situation you would use the one-account function in
| when the multi-account one does the exact same fucking thing,
| notably WITHOUT the potential to cause a CRASH, and just returns
| a set of one; and further, if you were REALLY INTENT ON offering
| one that only returned one, why it wouldn't itself just call the
| other function and return Accounts.first.
|
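| For the record, a minimal sketch of what I'd have expected the
| one-account variant to be (hypothetical names, not the real MSAL
| signatures):
|
|   struct Account { let username: String }
|
|   // Hypothetical stand-in for the multi-account MSAL call.
|   func allAccounts() throws -> [Account] { [] }
|
|   // The single-account variant could just delegate and return the
|   // first account (or nil), instead of throwing when several are
|   // signed in.
|   func currentAccount() throws -> Account? {
|       try allAccounts().first
|   }
|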
| </rant>
| layer8 wrote:
| How is an exception different from "returning an error
| message"?
| dewey wrote:
| Seems like the main differentiator is that one crashes and
| one doesn't. Unrelated to error message or exception.
| johnmaguire wrote:
| I'm not sure I understand how both occurred at once.
| Typically an uncaught exception will result in a crash, but
| this would generally be considered an error at the call
| site (i.e. failing to handle error conditions).
| layer8 wrote:
| I understood "crashing" as them not catching the exception.
|
| Most functions can fail, and any user-facing app has to be
| prepared for it so that it behaves gracefully towards the
| user. In that sense I agree that the error reporting
| mechanism doesn't matter. It's unclear though what the
| difference was for the GP.
| ToucanLoucan wrote:
| For one: terminating execution
|
| More importantly: why is having more than one account an
| "exception" at all? That's not an error or fail condition, at
| least in my mind. I wouldn't call our use of the framework an
| edge case by any means: it opens a web form in which one enters
| authentication details, passes through the flow, and then we
| are given authentication tokens and the user data we need.
| It's not unheard of for more than one account to be returned
| (especially on our test devices which have many) and I get
| the one-account function not being suitable for handling
| that, my question is... why even have it then, when the
| multi-account one performs the exact same function, better,
| without an extra error condition that might arise?
| TOGoS wrote:
| > why is having more than one account an "exception" at
| all? That's not an error or fail condition
|
| It is if the caller is expecting there to be exactly one
| account.
|
| This is why I generally like to return a set of things from
| any function that might possibly return zero or more than
| one thing. Fewer special cases that way.
|
| But if the API of the function is to return one, then you
| either give one at random, which is probably not right, or
| throw an exception. And with the latter, the person
| programming the caller will be nudged towards using the
| other API, which is probably what they should have done
| anyway, and then, as you say, the returns-one-account
| function should probably just not exist at all.
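|
| To sketch that contract (hypothetical names, nothing from the
| actual MSAL API): if the function's job is to return exactly one,
| surface the ambiguity as a typed error rather than picking one at
| random:
|
|   struct Account { let username: String }
|
|   enum AccountError: Error {
|       case noAccount
|       case multipleAccounts(count: Int)
|   }
|
|   // Throws unless there is exactly one account to return.
|   func theAccount(from accounts: [Account]) throws -> Account {
|       switch accounts.count {
|       case 0: throw AccountError.noAccount
|       case 1: return accounts[0]
|       default: throw AccountError.multipleAccounts(count: accounts.count)
|       }
|   }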
| lazide wrote:
| Chances are, the initial function was written when the
| underlying auth backend only supported a single account
| (structurally), and most clients were using that method.
|
| Then later on, it was figured out that multiple accounts
| per credential set (?!?) needed to be supported, but the
| original clients still needed to be supported.
|
| And either no one could agree on a sane convention if
| this happened (like returning the first from the list),
| or someone was told to 'just do it'.
|
| So they made the new call, migrated _themselves_, and
| put in an uncaught exception in the old place (can't put
| any other type there without breaking the API) and blam -
| ticket closed.
|
| Not that I've ever seen that happen before, of course.
|
| Oh, and since the multi-account functionality is
| obviously new and probably quite rare at first, it could
| be years before anyone tracks down whoever is
| responsible, if ever.
| layer8 wrote:
| There's no good way to solve this, though. Returning an
| arbitrary account can have unpredictable consequences as
| well if it isn't the expected one. It's a compatibility
| break either way.
| lazide wrote:
| Exactly, which is probably why a better 'backward
| compatibility' change couldn't be agreed on.
|
| But there is a way that closes your ticket fast and will
| compile!
| layer8 wrote:
| Sure, but not introducing the ability to be logged into
| multiple accounts isn't the best choice either.
| Arguably, throwing an exception upon multiple logins for
| the old API is the lesser evil overall.
| ToucanLoucan wrote:
| > There's no good way to solve this, though.
|
| Yes there is! Just get rid of it. It's useless. Reimplementing
| from one function to the other was barely a few moments of
| work, and even if you want to say "well
| that's a breaking change" I mean, yeah? Then break it. I
| would be far less annoyed if a function was just removed
| and Xcode went "hey this is pointed at nothing, gotta
| sort that" rather than letting it run in a way that turns
| the use of authentication functionality into a landmine.
| lazide wrote:
| I take it you've never had to support a widely used
| publicly available API?
|
| You might be bound to support these calls for many, many
| years.
| kfajdsl wrote:
| > For one: terminating execution
|
| Seems like you should have a generic error handler that
| will at a minimum catch unexpected, unhandled exceptions
| with a 'Something went wrong' toast or similar?
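|
| As a rough sketch of that (with `fetchAccounts`, `render`, and
| `showToast` as hypothetical stand-ins):
|
|   import Foundation
|
|   func fetchAccounts() throws -> [String] { ["alice"] }
|   func render(_ accounts: [String]) { print(accounts) }
|   func showToast(_ message: String) { print(message) }
|
|   do {
|       let accounts = try fetchAccounts()
|       render(accounts)
|   } catch {
|       // Last resort: show a generic message instead of letting an
|       // unhandled error take the whole app down.
|       showToast("Something went wrong: \(error.localizedDescription)")
|   }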
| zahlman wrote:
| > For one: terminating execution
|
| Not if you handle the exception properly.
|
| > why is having more than one account an "exception" at
| all? That's not an error or fail condition, at least in my
| mind.
|
| Because you explicitly asked for "the" account, and your
| request is based on a false premise.
|
| > why even have it then, when the multi-account one performs
| the exact same function, better, without an extra error
| condition that might arise?
|
| Because other users of the library _explicitly want_ that
| to be an error condition, and would rather not write the
| logic for it themselves.
|
| Performance could factor into it, too, depending on
| implementation details that obviously I know nothing about.
|
| Or for legacy reasons as described in
| https://news.ycombinator.com/item?id=44321644 .
| Jabrov wrote:
| "crash the app" sounds like the app's problem (ie. not handling
| exceptions properly) as opposed to the design of the API. It
| doesn't seem that unreasonable to throw an exception if
| unexpected conditions are hit? Also, more likely than not,
| there is probably an explicit reason that an exception is
| thrown here instead of something else.
| raincole wrote:
| > nowadays it's a "skill issue"
|
| > throw an exception and crash the fucking app
|
| Yes, if your app crashes when a third-party API throws an
| exception, that's a "skill issue" on your part. This comment is
| an example of why blaming the user's skill is sometimes valid.
| jiggawatts wrote:
| At the risk of being an amateur psychologist, your approach
| feels like that of a front end developer used to a forgiving
| programming model with the equivalent of the old BASIC
| programming language statement ON ERROR RESUME NEXT.
|
| Server side APIs and _especially_ authentication APIs tend
| towards the "fail fast" approach. When APIs are accidentally
| misused, this is treated either as a compiler error or a
| deliberate crash to let the developer know. Silent failures are
| verboten for entire categories of circumstances.
|
| There's a gradient of: silent success, silent failure, error
| codes you can ignore, exceptions you can't, runtime panic, and
| compilation error.
|
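| Rough Swift stand-ins for points on that gradient (all names
| hypothetical); each step makes misuse harder to ignore:
|
|   // Error code you can silently drop.
|   @discardableResult
|   func saveQuietly(_ text: String) -> Bool { !text.isEmpty }
|
|   // An error you can't ignore: `throws` forces `try` at the call
|   // site, and the compiler rejects code that omits it.
|   struct EmptyInput: Error {}
|   func save(_ text: String) throws {
|       guard !text.isEmpty else { throw EmptyInput() }
|   }
|
|   // Runtime panic: a deliberate crash on caller misuse.
|   func saveOrDie(_ text: String) {
|       precondition(!text.isEmpty, "empty input is a caller bug")
|   }
|
|   saveQuietly("")        // failure vanishes silently
|   _ = try? save("")      // caller must at least acknowledge the error
|   // saveOrDie("")       // would terminate the process
|   // save(42)            // would not compile: type error
|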
| That you can't even tell the qualitative difference between the
| last half of that list is why I'm thinking you're primarily a
| JavaScript programmer where only the first two in the list
| exist for the most part.
| lostdog wrote:
| A lot of the "science" we do is experimenting on bunches of
| humans, giving them surveys, and treating the result as
| objective. In how many places could we do much better by
| surveying a specific AI?
|
| It may not be objective, but at least it's consistent, and it
| reflects something about the default human position.
|
| For example, there are no good ways of measuring the amount of
| technical debt in a codebase. It's such a fuzzy question that
| only subjective measures work. But what if we show the AI one
| file at a time, ask "Rate, 1-10, the comprehensibility,
| complexity, and malleability of this code," and then average
| across the codebase? Then we get a measure of tech debt, which
| we can track over time to see whether it's rising or falling. The
| AI makes subjective measurements consistent.
|
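| A minimal sketch of that loop (`rateWithLLM` is a hypothetical
| stand-in for whatever model call and rating prompt you use):
|
|   import Foundation
|
|   // Assumed to return a 1-10 rating for one file's comprehensibility,
|   // complexity, and malleability; stubbed out here.
|   func rateWithLLM(_ source: String) -> Double { 5.0 }
|
|   func techDebtScore(files: [URL]) -> Double {
|       let scores = files.compactMap { url -> Double? in
|           guard let source = try? String(contentsOf: url, encoding: .utf8)
|           else { return nil }
|           return rateWithLLM(source)
|       }
|       guard !scores.isEmpty else { return 0 }
|       // The average is the codebase-level score to track over time.
|       return scores.reduce(0, +) / Double(scores.count)
|   }
|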
| This essay gives such a cool new idea, while only scratching the
| surface.
| delusional wrote:
| > it reflects something about the default human position
|
| No it doesn't. Nothing that comes out of an LLM reflects
| anything except the corpus it was trained on and the sampling
| method used. That's definitionally true, since those are the very
| things it is a product of.
|
| You get NO subjective or objective insight from asking the AI
| about "technical debt" you only get an opaque statistical
| metric that you can't explain.
| BriggyDwiggs42 wrote:
| If you knew that the model never changed it might be very
| helpful, but most of the big providers constantly mess with
| their models.
| cwillu wrote:
| Even if you used a local copy of a model, it would still
| just be a semi-quantitative version of "everyone knows
| <thing-you-don't-have-a-grounded-argument-for>"
| layer8 wrote:
| Their performance also varies depending on load (concurrent
| users).
| BriggyDwiggs42 wrote:
| Dear god does it really? That's very funny.
| layer8 wrote:
| We can just measure things, but then there's Goodhart's law.
|
| With the proposed way of measuring code quality, it's also
| unclear how comparable the resulting numbers would be between
| different projects. If one project has more essential complexity
| than another project, it's bound to yield a worse score, even if
| the code quality is on par.
| Marazan wrote:
| I would argue you can't compare between projects due to the
| reasons you state. But you can try and improve the metrics
| within a single project.
|
| Cyclomatic complexity is a terrible metric to obsess over, yet
| in a project I was on it was undeniably true that the newer
| code written by more experienced devs was both subjectively
| nicer and also had lower cyclomatic complexity than the older
| code worked on by a bunch of juniors (some of the juniors had
| since become some of the experienced devs who wrote the newer
| code).
| layer8 wrote:
| > But you can try and improve the metrics within a single
| project.
|
| Yes. But it means that it doesn't let you assess code
| quality, only (at best) changes in code quality. And it's
| difficult as soon as you add or remove functionality, because
| then it isn't strictly speaking the same project anymore, as
| you may have increased or decreased the essential complexity.
| What you _can_ assess is whether a pure refactor improves or
| worsens a project's amenability to AI coding.
| elktown wrote:
| I think this is an advertisement for an upcoming product. Sure,
| join the AI gold rush, but at least be transparent about it.
| falcor84 wrote:
| Even if he does have some aspiration to make money by
| operationalizing this (which I didn't sense), what Armin
| describes there is something you could implement a basic
| version of yourself in under an hour.
| elktown wrote:
| > which I didn't sense that he does
|
| I'd take a wager.
| the_mitsuhiko wrote:
| If your wager is that I will build an AI code quality
| measuring tool then you will lose it. I'm not advertising
| anything here, I'm just playing with things.
| elktown wrote:
| > code quality measuring tool
|
| I didn't, just an AI tool in general.
| GardenLetter27 wrote:
| I'm really skeptical of using current LLMs for judging codebases
| like this. Just today I got Gemini to solve a tricky bug, but it
| only worked after I provided more debug output, having solved
| part of it myself.
|
| The first time I tried without the deeper output, it "solved" it
| by writing a load of code that failed in loads of other ways, and
| ended up not even being related to the actual issue.
|
| Like you can be certain it'll give you some nice looking metrics
| and measurements - but how do you know if they're accurate?
| the_mitsuhiko wrote:
| > I'm really skeptical of using current LLMs for judging
| codebases like this.
|
| I'm not necessarily convinced that the current generation of
| LLMs are overly amazing at this, but they definitely are very
| good at measuring inefficiency of tooling and problematic APIs.
| That doesn't cover all the issues, but it can at least be useful
| for evaluating some classes of problems.
| falcor84 wrote:
| What do you mean that it "ended up not even being related to
| the actual issue"? If you give it a failing test suite to turn
| green and it does, then either its solution is indeed related
| to the issue, or your tests are incomplete; so you improve the
| tests and try again, right? Or am I missing something?
| GardenLetter27 wrote:
| It made the other tests fail; I wasn't using it in agent
| mode, just trying to debug the issue.
|
| The issue is that it can happily go down the completely wrong
| path and report exactly the same as though it's solved the
| problem.
| cmrdporcupine wrote:
| I explain this in a sibling comment, but I've caught Claude
| multiple times in the last week just inserting special-case
| kludges to make things "pass", without actually successfully
| fixing the underlying problem that the test was checking for.
|
| Just outright "if test-is-running { return success; }" level
| stuff.
|
| Not kidding. 3 or 4 times in the past week.
|
| Thinking of cancelling my subscription, but I also find it
| kind of... entertaining?
| jiggawatts wrote:
| I just realised that this is probably a side-effect of a
| faulty training regime. I've heard several industry heads
| say that programming is "easy" to generate synthetic data
| for and is also amenable to training methods that teach the
| AI to increase the pass rate of unit tests.
|
| So... it did.
|
| It made the tests pass.
|
| "Job done boss!"
| cmrdporcupine wrote:
| I have mixed results, but one of the more disturbing things I've
| found Claude doing is that when confronted with a failing test
| case it can't solve, it just writes a kludge into the code that
| detects that a test is running and makes it pass. But only for
| that case. Basically, totally cheating.
|
| You have to be super careful and review _everything_, because if
| you don't you can find your code littered with this strange
| mix of seeming brilliance which makes you complacent... and
| total Junior SWE behaviour or just outright negligence.
|
| That, or recently, it's just started declaring victory and
| claiming to have fixed things, even when the test continues to
| fail. Totally trying to gaslight me.
|
| I swear I wasn't seeing this kind of thing two weeks ago, which
| makes me wonder if Anthropic has been turning some dials...
| quesera wrote:
| > _identifies that here's a test running, and makes it pass.
| But only for that case_
|
| My team refers to this as a "VW Bugfix".
| alwa wrote:
| I also feel like I've seen a lot more of these over the past
| week or two, whereas I don't remember noticing it at all
| before then.
|
| It feels like it's become grabbier and less able to stay in
| its lane: ask for a narrow thing, and next thing you know
| it's running hog wild across the codebase shoehorning in
| half-cocked major architectural changes you never asked for.
|
| Then it smugly announces success, even when it runs the tests
| and sees them fail. "Let me test our fix" / [tests fail] /
| [accurately summarizes the way the tests are failing] /
| "Great! The change is working now!"
| cmrdporcupine wrote:
| Yes, or I've seen lately "a few unrelated tests are failing
| [actually same test as before] but the core problem is
| solved."
|
| After leaving a trail of mess all over.
|
| Wat?
|
| Someone is changing some weights and measures over at
| Anthropic and it's not appreciated.
| yujzgzc wrote:
| Another, related benefit of LLMs in this situation is that we can
| observe their hallucinations and use them for design. I've come
| across a couple of situations where I saw Copilot hallucinate a
| method, and I agreed that that method should've been there. It
| helps confirm whether the naming of things makes sense too.
|
| What's ironic about this is that the very things that TFA points
| out are needed for success (test coverage, debuggability, a way
| to run locally, etc.) are exactly the things that typical LLMs
| themselves lack.
| crazygringo wrote:
| I've found LLMs to be _extremely_ helpful in naming and
| general function/API design, where there are a lot of different
| ways to express combinations of parameters.
|
| I know what seems natural to _me_, but that's because I'm
| extremely familiar with the internal workings of the project.
| LLMs seem to be very good at coming up with names that are just
| descriptive enough but not too long, and most importantly
| follow "general conventions" from similar projects that I may
| not be aware of. I can't count the number of times an LLM has
| given me a name for a function and I've thought, oh of course,
| that's a much clearer name than what I was using. And I thought
| I was already pretty good at naming things...
___________________________________________________________________
(page generated 2025-06-19 23:01 UTC)