[HN Gopher] It's not enough for a program to work - it has to work for the right reasons
___________________________________________________________________
It's not enough for a program to work - it has to work for the
right reasons
Author : BerislavLopac
Score : 107 points
Date : 2024-10-16 15:20 UTC (7 hours ago)
(HTM) web link (buttondown.com)
(TXT) w3m dump (buttondown.com)
| BerislavLopac wrote:
| > How do I know whether my tests are passing because they're
| properly testing correct code or because they're failing to test
| incorrect code?
|
| One mechanism to verify that is by running a mutation testing [0]
| tool. They are available for many languages; mutmut [1] is a
| great example for Python.
|
| [0] https://en.wikipedia.org/wiki/Mutation_testing
|
| [1] https://mutmut.readthedocs.io
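As an illustration of what a mutation-testing tool automates (all names below are invented for the sketch): copy the code under test with one small "mutant" edit and check that the suite notices. A mutant that survives marks behaviour the tests never pin down.

```python
# Hand-rolled sketch of the idea a tool like mutmut automates:
# apply one small "mutant" edit to the code under test and check
# that at least one test now fails. A surviving mutant means the
# tests never exercised that behaviour. All names are illustrative.

def clamp(x, lo, hi):
    return max(lo, min(x, hi))

def clamp_mutant(x, lo, hi):
    # Mutation: drop the upper bound (min(x, hi) -> x), the kind of
    # small edit a mutation tool makes automatically.
    return max(lo, x)

def suite_passes(fn):
    # True if every check passes for the given implementation.
    return all([
        fn(5, 0, 10) == 5,    # in range
        fn(-3, 0, 10) == 0,   # below range
        fn(42, 0, 10) == 10,  # above range -- this check kills the mutant
    ])

assert suite_passes(clamp)             # real code: green
assert not suite_passes(clamp_mutant)  # mutant "killed": tests have teeth
```

Remove the third check and the mutant survives, which is exactly the signal that the suite was passing without ever testing the upper bound.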
| layer8 wrote:
| That's basically the approach mentioned in the article's
| paragraph starting with "A broader technique I follow is _make
| it work, make it break_."
| bunderbunder wrote:
| If there's one thing that engineer engineers have considered
| standard practice for ever and ever, but software engineers
| seem to still not entirely grok, it's destructive testing.
|
| I see this a lot with performance measurement, for example. A
| team will run small-scale benchmarks, and then try to
| estimate how a system will scale by linearly extrapolating
| those results. I don't think I've ever seen it work out well
| in practice. Nothing scales linearly forever, and there's no
| reliable way to know when and how it will break down unless
| you actually push it to the point of breaking down.
| erik_seaberg wrote:
| One of the things I like about cloud is that it's
| relatively easy to spin up an isolated full-scale
| environment and find out where prod's redline probably is.
| On-prem hardware might have different bottlenecks.
| johnnyanmac wrote:
| I think it's because companies tend to segregate
| engineering from testing/QA. There are more features to work
| on; stress testing is QA's job and those tickets will come
| later. Engineers will do some basic common tests to make
| sure it functions as expected, but aren't given the time
| nor tools to really dig in and ensure it's truly robust.
|
| It also reflects the domain. For mission critical code
| there better be 10 different layers of red lines between
| development and shipping. For web code, care for stuff like
| performance and even correctness can fall by the wayside.
| pron wrote:
| Yep. And for Java: https://pitest.org
| computersuck wrote:
| Website not quite loading.. HN hug of death?
| ilrwbwrkhv wrote:
| Buttondown is a great non-success. Ergo they are a good
| company.
| hwayne wrote:
| I like buttondown because I can directly contact the
| developer when I have problems. Some downsides to small
| companies, lots of upsides too.
| JohnMakin wrote:
| There are few things that terrify me more nowadays at this point
| in my career than spending a lot of time writing something and
| setting it up, only to turn it on for the first time and it works
| without any issues.
| sudhirj wrote:
| Oh god this is such a nightmare. It takes much longer to build
| something that works on the first try, because then I have to
| force-simulate a mistake to make sure things were actually
| correct in the first place.
|
| Test Driven Development had a fix for this, which I used to do
| back in the day when I was evangelical about the one true way
| to write software. You wrote a test that failed, and added or
| wrote code only to make that test pass. Never add any code
| except to make a failing test pass.
|
| It didn't guarantee 100% correct software, of course, but it
| prevented you from gaslighting yourself for being too awesome.
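The red-green loop described above can be sketched without any framework; the function and its test are made up for illustration:

```python
# Red-green in miniature, with plain asserts standing in for a
# test framework. Names are illustrative.

# Step 1 (red): write the test first.
def test_word_count():
    assert word_count("one two  three") == 3
    assert word_count("") == 0

# A deliberately wrong stub lets the test fail for the right
# reason (a wrong answer rather than a NameError).
def word_count(text):
    return -1

try:
    test_word_count()
    raise RuntimeError("the new test never failed -- suspicious!")
except AssertionError:
    pass  # red confirmed: the test is able to fail

# Step 2 (green): add only enough code to make the test pass.
def word_count(text):
    return len(text.split())

test_word_count()  # green
```

The point of watching the red step is exactly the article's: it proves the test is actually connected to the code it claims to exercise.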
| ipaddr wrote:
| Tests are like the burning sun in your eyes after you wake up
| from a night of drinking.
|
| I prefer separating it into writing some code, making it
| functionally work on screen, and writing tests. I usually
| cover cases in step 2, but when you add something new later it
| is nice to have step 3.
| norir wrote:
| Yes and the first thing I might ask is "how can I break this?"
| If I can't easily break it with a small change, I've probably
| missed something.
| twic wrote:
| "If it ain't broke, open it up and see what makes it so bloody
| special." -- The BOFH
| kellymore wrote:
| Site is buggy
| mobeigi wrote:
| Works fine for me?
| teddyh wrote:
| > _This is why test-driven development gurus tell people to write
| a failing test first._
|
| To be precise, it's one of the big reasons, but it's far from the
| _only_ reason to write the test first.
| klabb3 wrote:
| I'm increasingly of the opinion that TDD is only as good as
| your system is testable.
|
| This means that the time of writing your first test is too
| late. It's part of the core business logic architecture - the
| whiteboard stage.
|
| If you can make it testable, TDD isn't just good practice -
| it's what you _want to do_ because it's so natural. Similar to
| how unit tests are already natural when you write hermetic code
| (like say a string formatter).
|
| If, OTOH, your business logic is inseparable from prod
| databases, files, networking, current time & time zone, etc,
| then TDD and tests in general are both cumbersome to write and
| simultaneously deliver much less value (as in finding errors)
| per test-case. Controversially, I think that for a spaghetti
| code application, tests are quite useless and largely
| ritualistic.
|
| The only way I know how to design such testable systems (or
| subsystems) is through the "functional core - imperative shell"
| pattern. Not necessarily religious adherence to "no side
| effects", but isolation is a must.
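A minimal sketch of that pattern (the domain and names are invented for illustration): the decision is a pure function over plain values, and only a thin shell touches the real clock or I/O.

```python
from datetime import datetime, timezone

# Functional core: a pure decision over plain values. Testing it
# needs no database, no network, and no clock mocking.
def is_subscription_active(expires_at: datetime, now: datetime) -> bool:
    return now < expires_at

# Imperative shell: the only layer that touches the real world.
def check_subscription(expires_at: datetime) -> bool:
    return is_subscription_active(expires_at, datetime.now(timezone.utc))

# Core tests just pass in whatever timestamps they need.
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t1 = datetime(2024, 6, 1, tzinfo=timezone.utc)
assert is_subscription_active(expires_at=t1, now=t0)
assert not is_subscription_active(expires_at=t0, now=t1)
```

Everything hard to control (time, network, storage) is pushed to the shell, so the test surface of the core stays hermetic.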
| TechDebtDevin wrote:
| This is my problem, I don't worry about tests until I'm
| already putting the marinara sauce on my main functions.
| pdimitar wrote:
| > _If, OTOH, your business logic is inseparable from prod
| databases, files, networking, current time & time zone, etc,
| then TDD and tests in general are both cumbersome to write
| and simultaneously delivers much less value (as in finding
| errors) per test-case. Controversially, I think that for a
| spaghetti code application tests are quite useless and are
| largely ritualistic._
|
| I don't disagree with this and I have found it to be quite
| true -- though IMO it still has to be said that you can mock
| / isolate a lot of stuff, system time included. I am guessing
| you already accounted for that when you said that tests can
| become cumbersome to write, and I agree. But we should still
| try, because there are projects where you can't ever get a
| truly isolated system to test. F.ex. I recently finished a
| contract where I had to write a server for dispatching SMS
| jobs to the right per-tenant & per-data-center instances of
| the actual connected-to-the-telco-network SMS servers; the
| dev environment was practically useless because the servers
| there did not emit half the events my application needed to
| function properly, so I had to record the responses from the
| production servers and use them as mocks in my dev env tests.
|
| Did the tests succeed? Sure they did, but ultimately they
| gave me almost no confidence. :/
|
| But yeah, anyway, I agree with your premise, I just think
| that we should still go the extra mile to reduce entropy and
| chaos as much as we can. Because nobody likes being woken up
| to fight a fire in production.
| treflop wrote:
| Writing for reusability also tends to make software testable
| from my experience. If you make excessively involved units of
| code, you can't test, but you also can't re-use.
|
| And I'm big on reusability because I'm lazy. If requirements
| change, I'd rather tweak than rebuild.
| maxbond wrote:
| > It's not enough for a program to work, it has to work for the
| right reasons. Code working for the wrong reasons is code that's
| going to break when you least expect it.
|
| This reminds me of the recent discussion of gettiers[1]. That
| article focused on Gettier bugs, but this passage discusses what
| you might call Gettier features.
|
| Something that's gotten me before is Python's willingness to
| interpret a comma as a tuple. So instead of:
| my_event.set()
|
| I wrote: my_event,set()
|
| Which was syntactically correct, equivalent to:
| _ = (my_event, set())
|
| The auto formatter does insert a space though, which helps.
| Maybe it could be made to transform it as I did above; that
| would make it screamingly obvious.
|
| [1a] https://jsomers.net/blog/gettiers
|
| [1b] https://news.ycombinator.com/item?id=41840390
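The slip is easy to reproduce; the stray comma makes a perfectly legal expression statement that Python evaluates and throws away:

```python
import threading

my_event = threading.Event()

# Typo: builds and discards the tuple (my_event, set())
# instead of calling the method.
my_event,set()
assert not my_event.is_set()  # the event was never set

# Intended call.
my_event.set()
assert my_event.is_set()
```

Linters that flag expression statements with no effect should catch the first form.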
| HL33tibCe7 wrote:
| Your font and/or eyesight might need attention!
| settsu wrote:
| Tell me you're a 20-something engineer without telling me
| you're a 20-something engineer.
| HL33tibCe7 wrote:
| Wrong
| lcnPylGDnU4H9OF wrote:
| The implication being that older programmers would be
| entirely unconcerned with one's eyesight and the effect
| that reading a small font could have on such? Somehow that
| seems a bit backwards. People don't know what they have
| until it's gone.
| maxbond wrote:
| You know what, I do use a small font size in my editor. I
| like to see a lot of code at once. And if memory serves I
| spotted this in the browser, where I do the opposite.
|
| I'll have to look into hyper legible monospace fonts. Or
| maybe I'll just use Atkinson and deal with the variable
| spacing.
| nomel wrote:
| My editor uses a different color for comma and period.
| csours wrote:
| I was expecting to find the word Gettier in the text.
|
| My comment on that Gettier post:
|
| Puttiers: When a junior engineer fixes something, but a
| different error is returned, so they cannot tell if progress
| was made or not.
|
| https://news.ycombinator.com/item?id=41850429
| praptak wrote:
| So we have an analogy:
|
| accidentally working app : correct app :: Gettier "not
| knowledge" JTB : proper knowledge JTB
|
| Is it possible to backport the program analogy back into the
| realm of philosophy? I'm dreaming of a philosophy paper along
| the lines of "Knowledge is JTB with proper testing".
| hansvm wrote:
| One pattern that eliminates a lot of such bugs is never using
| any name that's a keyword or common name in any mainstream
| programming language. The existence of `def set...` in your
| code was already asking for trouble, and you were unlucky
| enough to find it.
| maxbond wrote:
| I agree, but this is a method on an object in the standard
| library!
|
| https://docs.python.org/3/library/asyncio-sync.html#asyncio....
|
| Could've been called fire() or activate(), perhaps. This is
| also the kind of problem lints are really good for. I
| wouldn't be surprised if there were a lint for this already
| (I haven't checked).
| dan-robertson wrote:
| One general way I like to think about this is that most software
| you use has passed through some filter - it needed to be complete
| enough for people to use it, people needed to find it somehow (eg
| through marketing), etc. If you have some fixed amount of
| resources to spend on making that software, there is a point
| where investing more of them in reducing bugs harms one's chances
| of passing the filter more than it helps. In particularly
| competitive markets you are likely to find that the most popular
| software is relatively buggy (because it won by spending more on
| marketing or features) and you are often more likely to be using
| that software (for eg interoperability reasons) too.
| TeMPOraL wrote:
| Conversely, the occasional success Open Source tooling has is
| in large part due to it _not competing_, therefore not being
| forced by competitive pressure to spend ~all resources on
| marketing and ~nil on development. I'm not sure where
| computing would be today if _all_ software was marketing-
| driven, but I guess nowhere near as far as it is now.
| talldayo wrote:
| > I'm not sure where computing would be today if all software
| was marketing-driven
|
| Basically just look at the 80s and early 90s. Video games, C
| compilers, NAS software, operating systems and hardware sales
| were all almost entirely marketing driven. Before any serious
| Open Source revolution, you paid for almost any code that was
| perceived to have value. Functionality built-in was not
| something people took for granted.
|
| Open Source won not because you can't market it (in fact, you
| can - it's just that nobody is paid to do it), but because
| it's free. The ultimate victory Linux wielded over its
| contemporaries was that you could host a web server without
| paying out the ass to do it. It turned out to be so
| competitive that it pretty much decimated the market for
| commercial OSes with word-of-mouth alone. It's less about
| their neglect of marketing tactics and more a reflection of
| the resentment for the paid solutions at the time.
| sciencesama wrote:
| Nvidia?
| foobar8495345 wrote:
| In my regressions, I make sure I include an "always fail" test,
| to make sure the test infrastructure is capable of correctly
| flagging it.
| joeyagreco wrote:
| could you give a concrete example of what you mean by this?
| maxbond wrote:
| Not GP but when I feel like I'm going crazy I insert an
| "assert False" test into my test suite. It's a good way to
| reveal when you're testing a cached version of your code for
| some reason (for instance integration tests using Docker
| Compose that aren't picking up your changes because you've
| forgotten to specify --build or your .dockerignore is
| misconfigured).
|
| But I delete it when I'm done.
| singron wrote:
| We once accidentally made a change to a python project test
| suite that caused it to successfully run none of the tests.
| Then we broke some stuff but the tests kept "passing".
|
| It's a little difficult to productionize an always_fail test
| since you do actually want the test suite to succeed. You
| could affirmatively test that you have non-zero passing
| tests, which is I think what we did. If you have an
| always_fail test, you could check that that's your only
| failure, but you have to be super careful that your test
| suite doesn't stop after a failure.
| maxbond wrote:
| Maybe you could ignore that test by default, and then write
| a shell script to run your tests in two stages. First you
| run only the should-fail test(s) and assert that they fail.
| Then you can run your actual tests.
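With a toy runner standing in for a real framework (all names invented), the two-stage check might look like:

```python
# Stage 1 proves the harness can see a failure at all; stage 2 runs
# the real suite and insists it is non-empty. A toy runner stands in
# for pytest et al.; names are illustrative.

def run_tests(tests):
    """Run each test, returning the number that failed."""
    failures = 0
    for test in tests:
        try:
            test()
        except AssertionError:
            failures += 1
    return failures

def canary_always_fail():
    assert False, "canary"

def test_addition():
    assert 1 + 1 == 2

# Stage 1: the should-fail test must fail, or the harness is broken
# (e.g. it is silently collecting zero tests).
assert run_tests([canary_always_fail]) == 1, "harness cannot detect failure"

# Stage 2: the real, affirmatively non-empty suite must pass.
real_tests = [test_addition]
assert real_tests, "no tests collected"
assert run_tests(real_tests) == 0
```

The non-empty assertion covers the "successfully ran none of the tests" failure mode described above.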
| SoftTalker wrote:
| Sounds like the old George Carlin one-liner. Or maybe
| it's a two-liner:
|
| The following statement is true.
|
| The preceding statement is false.
| robotresearcher wrote:
| Even older than George Carlin. The Liar Paradox is
| documented from at least 400BC.
|
| https://en.m.wikipedia.org/wiki/Liar_paradox
| maxbond wrote:
| I have to imagine it's about as old as propositional
| logic (so, as old as the hills).
|
| I most closely associate it with Godel and his work on
| incompleteness.
| marcosdumay wrote:
| > We once accidentally made a change to a python project
| test suite that caused it to successfully run none of the
| tests.
|
| That shouldn't be an easy mistake to make.
|
| Your test code should be clearly marked, and better yet
| slightly separated from the rest of the code. Also, there
| should be some feedback about the number of tests that ran.
|
| And yeah, I know Python doesn't help you make those things.
| rzzzt wrote:
| This opens up a philosophical can of worms. Does the test pass
| when it fails? Is it marked green or red?
| heisgone wrote:
| You want both. To test green and red pixels.
| TeMPOraL wrote:
| So basically you want yellow? As it's what you get when you
| start testing red and green subpixels simultaneously.
| jerf wrote:
| Not only philosophical, it can come out in the code too. I've
| written a number of testing packages over the years, and it's
| a rare testing platform that can assert that some sort of
| test failure assertion "correctly" fails without _some_ sort
| of major hoop jumping, usually having to run that test in an
| isolated OS process and parse the output of that process.
|
| This isn't a complaint; it's too marginal and weird a test
| case to complain about, and the separate OS process is always
| there as a fallback solution.
| RajT88 wrote:
| I once had a customer tell me how great Akamai WAF was,
| because it never had false positives. (My company's WAF
| solution had many.)
|
| Is that actually desirable? This article articulates my exact gut
| feeling.
| lo_zamoyski wrote:
| As the Dijkstrian expression goes, testing shows the presence of
| bugs, not their absence. Unit tests can show that a bug exists,
| but they cannot show you that there are no bugs, save for the
| particular cases tested and even then, only in a behaviorist sort
| of way (meaning, your buggy code may still produce the expected
| output for tested cases). For that, you need to be able to
| _prove_ your code possesses certain properties.
|
| Type systems and various forms of static analysis are going to
| increasingly shape the future of software development, I think.
| Large software systems especially become practically impossible
| to work with and impossible to verify and test without types.
| shahzaibmushtaq wrote:
| The author is simply talking about the most common testing
| types[0] but in a more philosophical way.
|
| [0] https://www.perfecto.io/resources/types-of-testing
| RangerScience wrote:
| Colleagues: If the code works, it's good!
|
| Me: Hmmm.
|
| Managers, a week later: We're starting everyone on a 50% on-call
| rotation because there are so many bugs that the business is on
| fire.
|
| Anyway, now I get upset and ask them to define "works", which...
| they haven't been able to do yet.
| peterldowns wrote:
| Haven't seen it mentioned here in the comments so I'll throw in
| -- this is one of the best uses for code coverage tooling. When
| I'm trying to make sure something really works, I'll start with a
| failing testcase, get it passing, and then also use coverage to
| make sure that the testcase is actually exercising the logic I
| expect. I'll also use the coverage measured when running the
| entire suite to make sure that I'm hitting all the corner cases
| or edges that I _thought_ I was hitting.
|
| I never measure coverage percentage as a goal, I don't even
| bother turning it on in CI, but I do use it locally as part of my
| regular debugging and hardening workflow. Strongly recommend
| doing this if you haven't before.
|
| I'm spoiled in that the golang+vscode integration works really
| well and can highlight executed code in my editor in a fast
| cycle; if you're using different tools, it might be harder to try
| out and benefit from it.
| hinkley wrote:
| I don't mind coverage in CI except when someone fails builds
| based on reductions in coverage percent, because it ends up
| squashing refactoring and we want people doing more of that not
| less.
|
| Sometimes very well covered code is dead code. If it has higher
| coverage than the rest of the project, then deleting it removes,
| for example, 1000 lines of code at 99% coverage, which could
| reduce the overall figure by 0.1%.
|
| And even if it wasn't 99% when you started, rewriting modules
| often involves first adding pinning tests, so replacing 1000
| lines with 200 new could first raise the coverage percent and
| then drop it again at the end.
|
| There are some things in CI/CD that should be charts not
| failures and this is one.
___________________________________________________________________
(page generated 2024-10-16 23:01 UTC)