[HN Gopher] Introducing Qodo Cover: Automate Test Coverage
       ___________________________________________________________________
        
       Introducing Qodo Cover: Automate Test Coverage
        
       Author : timbilt
       Score  : 12 points
       Date   : 2024-12-04 17:10 UTC (5 hours ago)
        
 (HTM) web link (www.qodo.ai)
 (TXT) w3m dump (www.qodo.ai)
        
       | m3kw9 wrote:
       | Why can't i just use cursor to just "generate tests" instead?
        
         | timbilt wrote:
         | > validates each test to ensure it runs successfully, passes,
         | and increases code coverage
         | 
          | This seems to be based on the open source cover-agent
          | project, which implements Meta's TestGen-LLM paper.
         | https://www.qodo.ai/blog/we-created-the-first-open-source-im...
         | 
         | After generating each test, it's automatically run -- it needs
         | to pass and increase coverage, otherwise it's discarded.
         | 
         | This means you're guaranteed to get working tests that aren't
         | repetitions of existing tests. You just need to do a quick
         | review to check that they aren't doing something strange and
         | they're good to go.
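          | 
          | Roughly, the loop is something like this (just a sketch
          | with made-up helper names, not the actual cover-agent
          | API):
          | 
          |     # Keep only candidate tests that run, pass, and
          |     # raise coverage above the current baseline.
          |     def build_suite(generate_test, run_with_test,
          |                     baseline_coverage, max_attempts=20):
          |         kept, coverage = [], baseline_coverage
          |         for _ in range(max_attempts):
          |             # ask the LLM for one new candidate test,
          |             # given the tests already accepted
          |             candidate = generate_test(kept)
          |             passed, new_cov = run_with_test(candidate)
          |             if not passed or new_cov <= coverage:
          |                 continue  # discard failing / no-gain tests
          |             kept.append(candidate)
          |             coverage = new_cov
          |         return kept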
        
           | torginus wrote:
            | What's the reasoning behind generating tests until they pass?
           | Isn't the point of tests to discover erroneous corner cases?
           | 
           | What purpose does this serve besides the bragging rights of
           | 'we need 90% coverage otherwise Sonarqube fails the build'?
        
             | timbilt wrote:
              | Unit tests are more commonly written to future-proof code
             | from issues down the road, rather than to discover existing
             | bugs. A code base with good test coverage is considered
             | more maintainable -- you can make changes without worrying
             | that it will break something in an unexpected place.
             | 
             | I think automating test coverage would be really useful if
             | you needed to refactor a legacy project -- you want to be
             | sure that as you change the code, the existing
             | functionality is preserved. I could imagine running this to
             | generate tests and get to good coverage before starting the
             | refactor.
        
               | HideousKojima wrote:
                | >Unit tests are more commonly written to future-proof
               | code from issues down the road, rather than to discover
               | existing bugs. A code base with good test coverage is
               | considered more maintainable -- you can make changes
               | without worrying that it will break something in an
               | unexpected place.
               | 
               | The problem is a lot of unit tests could accurately be
               | described as testing "that the code does what the code
               | does." If the future changes to your code also require
               | you to modify your tests (which they likely will) then
               | your tests are largely useless. And if tests for parts of
                | your code that you _aren't_ changing start failing when
               | you make code changes, that means you made terrible
               | design decisions in the first place that led to your code
                | being too tightly coupled (or having too many side
                | effects, or relying on global mutable state).
               | 
               | Integration tests are far, far more useful than unit
               | tests. A good type system and avoiding the bad design
               | patterns I mentioned handle 95% of what unit tests could
               | conceivably be useful for.
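                | 
                | To illustrate (TaxCalculator / StandardRates here
                | are made up):
                | 
                |     from unittest.mock import Mock
                | 
                |     # "The code does what the code does": this
                |     # just re-asserts the internal wiring, so any
                |     # refactor breaks it without finding a bug.
                |     def test_total_calls_rate_lookup():
                |         rates = Mock()
                |         rates.lookup.return_value = 0.2
                |         TaxCalculator(rates).total(100)
                |         rates.lookup.assert_called_once_with(100)
                | 
                |     # Integration-style: asserts only observable
                |     # behavior, so it survives internal refactors.
                |     def test_total_includes_tax():
                |         calc = TaxCalculator(StandardRates())
                |         assert calc.total(100) == 120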
        
               | torginus wrote:
                | I disagree. In my experience, poorly designed tests test
               | implementation rather than behavior. To test behavior you
               | must know what is actually supposed to happen when the
               | user presses a button.
               | 
                | One of the issues with getting high coverage is that
                | tests often end up being written against the
                | implementation rather than against desired outcomes.
               | 
               | Why is this an issue? As you mentioned, testing is useful
               | for future proofing codebases and making sure changing
               | the code doesn't break existing use cases.
               | 
                | When tests look for desired behavior, this usually means
               | that unless the spec changes, all tests should pass.
               | 
                | The problem is when you test implementation - suppose you
                | do a refactoring or cleanup, or extend the code to support
                | future use cases - the tests start failing. Clearly
               | something must be changed in the tests - but what? Which
               | cases encode actual important rules about how the code
               | should behave, and which ones were just tautologically
               | testing that the code did what it did?
               | 
               | This introduces murkiness and diminishes the value of
               | tests.
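                | 
                | A contrived example (LRUCache is hypothetical):
                | 
                |     # Behavior test: encodes the spec, so it only
                |     # fails when the spec is violated.
                |     def test_put_then_get_returns_latest():
                |         cache = LRUCache(capacity=2)
                |         cache.put("a", 1)
                |         cache.put("a", 2)
                |         assert cache.get("a") == 2
                | 
                |     # Implementation test: pins internal storage.
                |     # Swapping the plain dict for an OrderedDict
                |     # fails it even though behavior is unchanged.
                |     def test_internal_dict_layout():
                |         cache = LRUCache(capacity=2)
                |         cache.put("a", 1)
                |         assert cache._entries == {"a": 1}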
        
       | swyx wrote:
       | congrats team! we just had Itamar back on the pod who
       | reintroduced Qodo, AlphaCodium and teased Qodo Cover:
       | https://www.latent.space/p/bolt
        
       | foundry27 wrote:
       | First off, congratulations folks! It's never easy getting a new
       | product off the ground, and I wish you the best of luck. So
       | please don't take this as anything other than genuine
       | constructive criticism as a potential customer: generating tests
       | to increase coverage is a misunderstanding of the point of
       | collecting code coverage metrics, and businesses that depend on
       | getting verification activities right will know this when they
       | evaluate your product.
       | 
       | A high-quality test passes when the functionality of the software
       | under test is consistent with the design intent of that software.
       | If the software doesn't do the Right Thing, the test must fail.
       | It's why TDD is effective: you're essentially specifying the
       | intent and then implementing code against it, like a self-
       | verifying requirements specification. When we look at Qodo tests
        | in the GitHub PRs you've linked, it's argued that a high-quality
       | test is defined as one that:
       | 
       | 1. Executes successfully
       | 
       | 2. Passes all assertions
       | 
       | 3. Increases overall code coverage
       | 
       | 4. Tests previously uncovered behaviors (as specified in the LLM
       | prompt)
       | 
       | So, given source code for a project as input, a hypothetical
       | "perfect AI" built into Qodo that always writes a high-quality
       | test would (naturally!) _never fail_ to write a passing test for
       | that code; the semantics of the code would be perfectly encoded
       | in the test. If the code had a defect, it follows logically that
       | optimizing the quality of your AI for the metrics Qodo is aiming
       | for will actually LOWER the probability of finding that defect!
       | The generated test would have successfully managed to validate
       | the code against itself, enshrining defective behavior as
       | correct. It's easy to say that higher code coverage is good, more
       | maintainable, etc., but this outcome is actually the exact
       | opposite of maintainable and actively undermines confidence in
       | the code under test and the ability to refactor.
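        | 
        | To make the failure mode concrete (a hypothetical example):
        | 
        |     # Code under test, with an off-by-one defect: the spec
        |     # says the discount applies at 100 or more, but the code
        |     # uses a strict comparison.
        |     def discounted(total):
        |         return total * 0.9 if total > 100 else total
        | 
        |     # A coverage-driven generated test runs, passes, and
        |     # covers the boundary -- so it meets all four criteria
        |     # above -- yet it enshrines the defect: per the spec,
        |     # this call should return 90.0.
        |     def test_discounted_at_boundary():
        |         assert discounted(100) == 100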
       | 
       | There are better ways to do this, and you've got competitors who
       | are already well on the way to doing them using a diverse range
       | of inputs besides code. It boils down to answering two questions:
       | 
        | 1. Can a technique be applied so that an LLM, with or without
       | explicit specifications and understanding of developer
       | intentions, will reliably reconstruct the intended behavior of
       | code?
       | 
        | 2. Can a technique be applied so that tests generated by an LLM
       | truly verify the specific behaviors the LLM was prompted to test,
       | as opposed to writing a valid test but not the one that was asked
       | for?
        
       ___________________________________________________________________
       (page generated 2024-12-04 23:02 UTC)