[HN Gopher] It's not enough for a program to work - it has to wo...
       ___________________________________________________________________
        
       It's not enough for a program to work - it has to work for the
       right reasons
        
       Author : BerislavLopac
       Score  : 107 points
       Date   : 2024-10-16 15:20 UTC (7 hours ago)
        
 (HTM) web link (buttondown.com)
 (TXT) w3m dump (buttondown.com)
        
       | BerislavLopac wrote:
       | > How do I know whether my tests are passing because they're
       | properly testing correct code or because they're failing to test
       | incorrect code?
       | 
       | One mechanism to verify that is by running a mutation testing [0]
       | tool. They are available for many languages; mutmut [1] is a
       | great example for Python.
       | 
       | [0] https://en.wikipedia.org/wiki/Mutation_testing
       | 
       | [1] https://mutmut.readthedocs.io
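        | 
        | For illustration, a hand-rolled sketch of the kind of mutant
        | such tools generate automatically (hypothetical code, not
        | mutmut's actual output):
        | 
        |     def is_adult(age):
        |         return age >= 18
        | 
        |     # A typical mutant flips one operator:
        |     def is_adult_mutant(age):
        |         return age > 18
        | 
        |     # This test passes against BOTH versions, so the mutant
        |     # "survives" -- revealing that the boundary case
        |     # age == 18 is never exercised:
        |     assert is_adult(21)
        |     assert not is_adult(12)
        | 
        | A surviving mutant means your tests couldn't tell correct
        | code from broken code, which is exactly the failure mode the
        | article describes.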
        
         | layer8 wrote:
         | That's basically the approach mentioned in the article's
         | paragraph starting with "A broader technique I follow is _make
         | it work, make it break_."
        
           | bunderbunder wrote:
            | If there's one thing that engineer-engineers have
            | considered standard practice forever and ever, but
            | software engineers seem to still not entirely grok, it's
            | destructive testing.
           | 
           | I see this a lot with performance measurement, for example. A
           | team will run small-scale benchmarks, and then try to
           | estimate how a system will scale by linearly extrapolating
           | those results. I don't think I've ever seen it work out well
           | in practice. Nothing scales linearly forever, and there's no
           | reliable way to know when and how it will break down unless
           | you actually push it to the point of breaking down.
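            | 
            | A toy sketch of "push it until it breaks" in Python (the
            | workload is a hypothetical stand-in for the real system):
            | 
            |     import time
            | 
            |     def workload(n):
            |         # stand-in; real code would drive the actual
            |         # system at scale n
            |         sorted(range(n, 0, -1))
            | 
            |     for n in [10_000, 100_000, 1_000_000, 10_000_000]:
            |         t0 = time.perf_counter()
            |         workload(n)
            |         dt = time.perf_counter() - t0
            |         print(f"n={n:>10,}: {dt:.3f}s, "
            |               f"{dt / n * 1e6:.3f} us/item")
            | 
            | If the per-item cost climbs as n grows, extrapolating
            | linearly from the small runs would have been misleading.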
        
             | erik_seaberg wrote:
             | One of the things I like about cloud is that it's
             | relatively easy to spin up an isolated full-scale
             | environment and find out where prod's redline probably is.
             | On-prem hardware might have different bottlenecks.
        
             | johnnyanmac wrote:
              | I think it's because companies tend to segregate
              | engineering from testing/QA. There are more features to
              | work on; stress testing is QA's job, and those tickets
              | will come later. Engineers will run some basic common
              | tests to make sure it functions as expected, but aren't
              | given the time or tools to really dig in and ensure it's
              | truly robust.
             | 
             | It also reflects the domain. For mission critical code
             | there better be 10 different layers of red lines between
             | development and shipping. For web code, care for stuff like
             | performance and even correctness can fall by the wayside.
        
         | pron wrote:
         | Yep. And for Java: https://pitest.org
        
       | computersuck wrote:
       | Website not quite loading.. HN hug of death?
        
         | ilrwbwrkhv wrote:
         | Buttondown is a great non-success. Ergo they are a good
         | company.
        
           | hwayne wrote:
           | I like buttondown because I can directly contact the
           | developer when I have problems. Some downsides to small
           | companies, lots of upsides too.
        
       | JohnMakin wrote:
        | There are few things that terrify me more nowadays, at this
        | point in my career, than spending a lot of time writing
        | something and setting it up, only to turn it on for the first
        | time and have it work without any issues.
        
         | sudhirj wrote:
          | Oh god this is such a nightmare. It takes much longer to
          | build something that works on the first try, because then I
          | have to force-simulate a mistake to make sure things were
          | actually correct in the first place.
          | 
          | Test Driven Development had a fix for this, which I used to
          | practice back in the day when I was evangelical about the
          | one true way to write software. You wrote a test that
          | failed, and added code only to make that test pass. Never
          | add any code except to make a failing test pass.
         | 
          | It didn't guarantee 100% correct software, of course, but it
          | prevented you from gaslighting yourself into believing you
          | were too awesome.
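          | 
          | A minimal red/green sketch in pytest style (slugify is a
          | made-up example, not from the article):
          | 
          |     # Step 1: write the failing test first ("red").
          |     def test_slugify():
          |         assert slugify("Hello, World!") == "hello-world"
          | 
          |     # Step 2: write just enough code to pass ("green").
          |     import re
          | 
          |     def slugify(text):
          |         return re.sub(r"[^a-z0-9]+", "-",
          |                       text.lower()).strip("-")
          | 
          | Seeing the test fail first proves the test can fail, which
          | is the point of the ritual.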
        
           | ipaddr wrote:
            | Tests are like the burning sun in your eyes when you wake
            | up after a night of drinking.
            | 
            | I prefer separating the steps: writing some code down,
            | making it functionally work on screen, and then writing
            | tests. I usually cover the cases in step 2, but when you
            | add something new later it is nice to have step 3.
        
         | norir wrote:
         | Yes and the first thing I might ask is "how can I break this?"
         | If I can't easily break it with a small change, I've probably
         | missed something.
        
         | twic wrote:
         | "If it ain't broke, open it up and see what makes it so bloody
         | special." -- The BOFH
        
       | kellymore wrote:
       | Site is buggy
        
         | mobeigi wrote:
         | Works fine for me?
        
       | teddyh wrote:
       | > _This is why test-driven development gurus tell people to write
       | a failing test first._
       | 
       | To be precise, it's one of the big reasons, but it's far from the
       | _only_ reason to write the test first.
        
         | klabb3 wrote:
         | I'm increasingly of the opinion that TDD is only as good as
         | your system is testable.
         | 
          | This means that the time you write your first test is
          | already too late to think about testability. It belongs in
          | the core business logic architecture - the whiteboard stage.
         | 
         | If you can make it testable, TDD isn't just good practice -
         | it's what you _want to do_ because it's so natural. Similar to
         | how unit tests are already natural when you write hermetic code
         | (like say a string formatter).
         | 
         | If, OTOH, your business logic is inseparable from prod
         | databases, files, networking, current time & time zone, etc,
         | then TDD and tests in general are both cumbersome to write and
          | simultaneously deliver much less value (as in finding errors)
         | per test-case. Controversially, I think that for a spaghetti
         | code application tests are quite useless and are largely
         | ritualistic.
         | 
         | The only way I know how to design such testable systems (or
         | subsystems) is through the "functional core - imperative shell"
         | pattern. Not necessarily religious adherence to "no side
         | effects", but isolation is a must.
        
           | TechDebtDevin wrote:
           | This is my problem, I don't worry about tests until I'm
           | already putting the marinara sauce on my main functions.
        
           | pdimitar wrote:
           | > _If, OTOH, your business logic is inseparable from prod
           | databases, files, networking, current time & time zone, etc,
           | then TDD and tests in general are both cumbersome to write
            | and simultaneously deliver much less value (as in finding
           | errors) per test-case. Controversially, I think that for a
           | spaghetti code application tests are quite useless and are
           | largely ritualistic._
           | 
           | I don't disagree with this and I have found it to be quite
           | true -- though IMO it still has to be said that you can mock
           | / isolate a lot of stuff, system time included. I am guessing
           | you already accounted for that when you said that tests can
            | become cumbersome to write, and I agree. But we should
            | still try, because there are projects where you can't ever
            | get a truly isolated system to test. For example, I
            | recently finished a contract where I had to write a server
            | for dispatching SMS jobs to the right per-tenant & per-
            | data-center instances of the actual connected-to-the-
            | telco-network SMS servers; the dev environment was
            | practically useless because the servers there did not emit
            | half the events my application needed to function
            | properly, so I had to record the responses from the
            | production servers and use them as mocks in my dev env
            | tests.
           | 
            | Did the tests succeed? Sure they did, but ultimately they
            | gave me almost no confidence. :/
           | 
           | But yeah, anyway, I agree with your premise, I just think
           | that we should still go the extra mile to reduce entropy and
           | chaos as much as we can. Because nobody likes being woken up
           | to fight a fire in production.
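            | 
            | For the system-time case specifically, a minimal sketch
            | with unittest.mock (seconds_since is a made-up example):
            | 
            |     import time
            |     from unittest import mock
            | 
            |     def seconds_since(start):
            |         return time.time() - start
            | 
            |     # Freeze the clock so the test is deterministic:
            |     with mock.patch("time.time", return_value=1000.0):
            |         assert seconds_since(990.0) == 10.0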
        
           | treflop wrote:
           | Writing for reusability also tends to make software testable
           | from my experience. If you make excessively involved units of
           | code, you can't test, but you also can't re-use.
           | 
           | And I'm big on reusability because I'm lazy. If requirements
           | change, I rather tweak than rebuild.
        
       | maxbond wrote:
       | > It's not enough for a program to work, it has to work for the
       | right reasons. Code working for the wrong reasons is code that's
       | going to break when you least expect it.
       | 
       | This reminds me of the recent discussion of gettiers[1]. That
       | article focused on Gettier bugs, but this passage discusses what
       | you might call Gettier features.
       | 
        | Something that's gotten me before is Python's willingness to
        | interpret a comma as a tuple. So instead of:
        | 
        |     my_event.set()
        | 
        | I wrote:
        | 
        |     my_event,set()
        | 
        | which was syntactically correct, equivalent to:
        | 
        |     _ = (my_event, set())
        | 
        | The auto-formatter does insert a space, though, which helps.
        | Maybe it could be made to transform it as I did above; that
        | would make it screamingly obvious.
       | 
       | [1a] https://jsomers.net/blog/gettiers
       | 
       | [1b] https://news.ycombinator.com/item?id=41840390
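        | 
        | A runnable demonstration of the trap, using threading.Event
        | for the sake of the example:
        | 
        |     import threading
        | 
        |     my_event = threading.Event()
        |     my_event,set()  # typo: builds and discards the tuple
        |                     # (my_event, set()); no error raised
        |     assert not my_event.is_set()  # the event was never set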
        
         | HL33tibCe7 wrote:
         | Your font and/or eyesight might need attention!
        
           | settsu wrote:
           | Tell me you're a 20-something engineer without telling me
           | you're a 20-something engineer.
        
             | HL33tibCe7 wrote:
             | Wrong
        
             | lcnPylGDnU4H9OF wrote:
             | The implication being that older programmers would be
             | entirely unconcerned with one's eyesight and the effect
             | that reading a small font could have on such? Somehow that
             | seems a bit backwards. People don't know what they have
             | until it's gone.
        
           | maxbond wrote:
           | You know what, I do use a small font size in my editor. I
           | like to see a lot of code at once. And if memory serves I
           | spotted this in the browser, where I do the opposite.
           | 
           | I'll have to look into hyper legible monospace fonts. Or
           | maybe I'll just use Atkinson and deal with the variable
           | spacing.
        
           | nomel wrote:
           | My editor uses a different color for comma and period.
        
         | csours wrote:
         | I was expecting to find the word Gettier in the text.
         | 
         | My comment on that Gettier post:
         | 
         | Puttiers: When a junior engineer fixes something, but a
         | different error is returned, so they cannot tell if progress
         | was made or not.
         | 
         | https://news.ycombinator.com/item?id=41850429
        
         | praptak wrote:
         | So we have an analogy:
         | 
         | accidentally working app : correct app :: Gettier "not
         | knowledge" JTB : proper knowledge JTB
         | 
         | Is it possible to backport the program analogy back into the
         | realm of philosophy? I'm dreaming of a philosophy paper along
         | the lines of "Knowledge is JTB with proper testing".
        
         | hansvm wrote:
         | One pattern that eliminates a lot of such bugs is never using
         | any name that's a keyword or common name in any mainstream
         | programming language. The existence of `def set...` in your
         | code was already asking for trouble, and you were unlucky
         | enough to find it.
        
           | maxbond wrote:
           | I agree, but this is a method on an object in the standard
           | library!
           | 
            | https://docs.python.org/3/library/asyncio-sync.html#asyncio....
           | 
           | Could've been called fire() or activate(), perhaps. This is
           | also the kind of problem lints are really good for. I
            | wouldn't be surprised if there was a lint for this already
           | (I haven't checked).
        
       | dan-robertson wrote:
       | One general way I like to think about this is that most software
       | you use has passed through some filter - it needed to be complete
       | enough for people to use it, people needed to find it somehow (eg
       | through marketing), etc. If you have some fixed amount of
       | resources to spend on making that software, there is a point
       | where investing more of them in reducing bugs harms one's chances
       | of passing the filter more than it helps. In particularly
       | competitive markets you are likely to find that the most popular
       | software is relatively buggy (because it won by spending more on
       | marketing or features) and you are often more likely to be using
       | that software (for eg interoperability reasons) too.
        
         | TeMPOraL wrote:
          | Conversely, the occasional success Open Source tooling has
          | is in large part due to it _not competing_, and therefore
          | not being forced by competitive pressure to spend ~all
          | resources on marketing and ~nil on development. I'm not sure
          | where computing would be today if _all_ software was
          | marketing-driven, but I guess nowhere near as far as it is
          | now.
        
           | talldayo wrote:
           | > I'm not sure where computing would be today if all software
           | was marketing-driven
           | 
           | Basically just look at the 80s and early 90s. Video games, C
           | compilers, NAS software, operating systems and hardware sales
           | were all almost entirely marketing driven. Before any serious
           | Open Source revolution, you paid for almost any code that was
           | perceived to have value. Functionality built-in was not
           | something people took for granted.
           | 
           | Open Source won not because you can't market it (in fact, you
           | can - it's just that nobody is paid to do it), but because
            | it's free. The ultimate victory Linux wielded over its
           | contemporaries was that you could host a web server without
           | paying out the ass to do it. It turned out to be so
           | competitive that it pretty much decimated the market for
           | commercial OSes with word-of-mouth alone. It's less about
           | their neglect of marketing tactics and more a reflection of
           | the resentment for the paid solutions at the time.
        
       | sciencesama wrote:
        | Nvidia?
        
       | foobar8495345 wrote:
        | In my regression suites, I make sure I include an "always
        | fail" test, to verify that the test infrastructure is capable
        | of correctly flagging it.
        
         | joeyagreco wrote:
         | could you give a concrete example of what you mean by this?
        
           | maxbond wrote:
           | Not GP but when I feel like I'm going crazy I insert an
           | "assert False" test into my test suite. It's a good way to
           | reveal when you're testing a cached version of your code for
           | some reason (for instance integration tests using Docker
           | Compose that aren't picking up your changes because you've
           | forgotten to specify --build or your .dockerignore is
           | misconfigured).
           | 
           | But I delete it when I'm done.
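            | 
            | Concretely, something like this (pytest will collect any
            | test_* function):
            | 
            |     def test_canary_delete_me():
            |         # If the suite still passes with this present,
            |         # you are not running the code you think you are.
            |         assert False, "canary: the runner saw this file"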
        
           | singron wrote:
            | We once accidentally made a change to a Python project's
            | test suite that caused it to successfully run none of the
            | tests. Then we broke some stuff, but the tests kept
            | "passing".
           | 
           | It's a little difficult to productionize an always_fail test
           | since you do actually want the test suite to succeed. You
           | could affirmatively test that you have non-zero passing
           | tests, which is I think what we did. If you have an
           | always_fail test, you could check that that's your only
           | failure, but you have to be super careful that your test
           | suite doesn't stop after a failure.
        
             | maxbond wrote:
             | Maybe you could ignore that test by default, and then write
             | a shell script to run your tests in two stages. First you
             | run only the should-fail test(s) and assert that they fail.
             | Then you can run your actual tests.
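              | 
              | For what it's worth, pytest can express "this test must
              | fail" directly via a strict xfail marker, which avoids
              | the two-stage script:
              | 
              |     import pytest
              | 
              |     # Passes only if the body fails; if the canary
              |     # ever "succeeds", the whole suite goes red.
              |     @pytest.mark.xfail(strict=True,
              |                        reason="canary: must fail")
              |     def test_canary():
              |         assert False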
        
               | SoftTalker wrote:
               | Sounds like the old George Carlin one-liner. Or maybe
               | it's a two-liner:
               | 
               | The following statement is true.
               | 
                | The preceding statement is false.
        
               | robotresearcher wrote:
               | Even older than George Carlin. The Liar Paradox is
               | documented from at least 400BC.
               | 
               | https://en.m.wikipedia.org/wiki/Liar_paradox
        
               | maxbond wrote:
               | I have to imagine it's about as old as propositional
               | logic (so, as old as the hills).
               | 
               | I most closely associate it with Godel and his work on
               | incompleteness.
        
             | marcosdumay wrote:
             | > We once accidentally made a change to a python project
             | test suite that caused it to successfully run none of the
             | tests.
             | 
             | That shouldn't be an easy mistake to make.
             | 
              | Your test code should be clearly marked, and better
              | still if slightly separated from the rest of the code.
              | Also, there should be some feedback about the number of
              | tests that ran.
             | 
             | And yeah, I know Python doesn't help you make those things.
        
         | rzzzt wrote:
         | This opens up a philosophical can of worms. Does the test pass
         | when it fails? Is it marked green or red?
        
           | heisgone wrote:
           | You want both. To test green and red pixels.
        
             | TeMPOraL wrote:
             | So basically you want yellow? As it's what you get when you
             | start testing red and green subpixels simultaneously.
        
           | jerf wrote:
           | Not only philosophical, it can come out in the code too. I've
           | written a number of testing packages over the years, and it's
           | a rare testing platform that can assert that some sort of
           | test failure assertion "correctly" fails without _some_ sort
           | of major hoop jumping, usually having to run that test in an
           | isolated OS process and parse the output of that process.
           | 
           | This isn't a complaint; it's too marginal and weird a test
           | case to complain about, and the separate OS process is always
           | there as a fallback solution.
        
       | RajT88 wrote:
        | I had a customer once complain about how great Akamai's WAF
        | was, because it never had false positives. (My company's WAF
        | solution had many.)
       | 
       | Is that actually desirable? This article articulates my exact gut
       | feeling.
        
       | lo_zamoyski wrote:
        | As the Dijkstrian expression goes, testing shows the presence
        | of bugs, not their absence. Unit tests can show that a bug
        | exists, but they cannot show you that there are no bugs, save
        | for the particular cases tested -- and even then, only in a
        | behaviorist sort of way (meaning, your buggy code may still
        | produce the expected output for the tested cases). For that,
        | you need to be able to _prove_ your code possesses certain
        | properties.
       | 
       | Type systems and various forms of static analysis are going to
       | increasingly shape the future of software development, I think.
       | Large software systems especially become practically impossible
       | to work with and impossible to verify and test without types.
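        | 
        | A toy illustration of the difference: a unit test samples a
        | few inputs, while a type checker such as mypy rejects the
        | unhandled-None case for every possible input before anything
        | runs (find_user is a made-up example):
        | 
        |     from typing import Optional
        | 
        |     def find_user(user_id: int) -> Optional[str]:
        |         users = {1: "ada"}
        |         return users.get(user_id)
        | 
        |     name = find_user(42)
        |     # mypy forces the None branch to be handled before
        |     # .upper() is reachable:
        |     if name is not None:
        |         print(name.upper())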
        
       | shahzaibmushtaq wrote:
       | The author is simply talking about the most common testing
       | types[0] but in a more philosophical way.
       | 
       | [0] https://www.perfecto.io/resources/types-of-testing
        
       | RangerScience wrote:
       | Colleagues: If the code works, it's good!
       | 
       | Me: Hmmm.
       | 
        | Managers, a week later: We're starting everyone on a 50% on-
        | call rotation because there are so many bugs that the business
        | is on fire.
       | 
       | Anyway, now I get upset and ask them to define "works", which...
       | they haven't been able to do yet.
        
       | peterldowns wrote:
       | Haven't seen it mentioned here in the comments so I'll throw in
       | -- this is one of the best uses for code coverage tooling. When
       | I'm trying to make sure something really works, I'll start with a
       | failing testcase, get it passing, and then also use coverage to
       | make sure that the testcase is actually exercising the logic I
       | expect. I'll also use the coverage measured when running the
       | entire suite to make sure that I'm hitting all the corner cases
       | or edges that I _thought_ I was hitting.
       | 
       | I never measure coverage percentage as a goal, I don't even
       | bother turning it on in CI, but I do use it locally as part of my
       | regular debugging and hardening workflow. Strongly recommend
       | doing this if you haven't before.
       | 
       | I'm spoiled in that the golang+vscode integration works really
       | well and can highlight executed code in my editor in a fast
       | cycle; if you're using different tools, it might be harder to try
       | out and benefit from it.
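        | 
        | With Python tooling the same workflow looks roughly like
        | this, using coverage.py's API (commonly it's driven from the
        | CLI as `coverage run -m pytest` / `coverage report`;
        | run_my_tests is a placeholder for however you invoke the
        | suite):
        | 
        |     import coverage
        | 
        |     cov = coverage.Coverage()
        |     cov.start()
        |     run_my_tests()  # placeholder: drive the tests here
        |     cov.stop()
        |     # Show which lines the tests never touched:
        |     cov.report(show_missing=True)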
        
         | hinkley wrote:
         | I don't mind coverage in CI except when someone fails builds
         | based on reductions in coverage percent, because it ends up
         | squashing refactoring and we want people doing more of that not
         | less.
         | 
          | Sometimes very well covered code is dead code. If it has
          | higher coverage than the rest of the project, then deleting
          | it removes, say, 1000 lines of code at 99% coverage, which
          | could reduce the overall figure by 0.1%.
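          | 
          | A worked example with hypothetical numbers: a 10,000-line
          | project at 90% overall containing a dead 1,000-line module
          | covered at 99%:
          | 
          |     total, covered = 10_000, 9_000  # 90.0% overall
          |     dead, dead_cov = 1_000, 990     # module to delete
          |     after = (covered - dead_cov) / (total - dead)
          |     print(f"{after:.1%}")  # 89.0% -- deleting well-tested
          |                            # dead code DROPS the number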
         | 
          | And even if it wasn't at 99% when you started, rewriting
          | modules often involves first adding pinning tests, so
          | replacing 1000 lines with 200 new ones could first raise the
          | coverage percent and then drop it again at the end.
         | 
          | There are some things in CI/CD that should be charts, not
          | build failures, and this is one of them.
        
       ___________________________________________________________________
       (page generated 2024-10-16 23:01 UTC)