[HN Gopher] Show HN: Checksum - generate and maintain end-to-end...
___________________________________________________________________
Show HN: Checksum - generate and maintain end-to-end tests using AI
Hey HN! I'm Gal, co-founder at Checksum (https://checksum.ai).
Checksum is a tool for automatically generating and maintaining
end-to-end tests using AI.

I cut my teeth in applied ML in 2016 at a maritime tech company
called TSG, based in Israel. While I was there, I worked on a cool
product that used machine learning to detect suspicious vehicles.
Radar data is pretty tough for humans to parse, but a great fit for
AI - and it worked very well for detecting smugglers, terrorist
activity, and that sort of thing.

In 2021, after a few years working in big tech (Lyft, Google), I
joined a YC company, Seer (W21), as CTO. This is where I
experienced the unique pain of trying to keep end-to-end tests in a
good state. The app was quite featureful, and it was a struggle to
get and maintain good test coverage. Like the suspicious maritime
vehicle problem I had previously encountered, building and
maintaining E2E tests had all the markings of a problem where
machines could outperform humans. In early user interviews, it also
became clear that this problem wasn't one that simply went away as
organizations grew past the startup phase - it got even more
tangled and unpleasant.

We've been building the product for a little over a year now, and
it's been interesting to learn that some problems were surprisingly
easy and others unusually tough. To get the data we need to train
our models, we use the same underlying technology that tools like
Fullstory and Hotjar use, and it works quite well. We're also able
to get good tests from relatively few user sessions (in most cases,
fewer than 200). Right now, the models are really good at improving
test coverage for featureful web apps that don't have much coverage
(i.e., generating and maintaining a bunch of new tests), but making
existing tests better has been a tougher nut to crack. We don't yet
have as much of a place in organizations where test coverage is
great but test quality is medium-to-poor, though we're keen to
develop in that direction.

We're still early and spend basically all of our time working with
a small handful of design partners (mostly medium-sized startups
struggling with test coverage), but it felt like time to share with
the HN community. Thanks so much, happy to answer any questions,
and excited to hear your thoughts!
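
To make the data-collection piece above a bit more concrete:
Checksum hasn't named the recorder it uses, but rrweb is an
open-source session-replay library in the same family as the
Fullstory/Hotjar approach. A minimal client-side capture sketch
(the "/sessions" endpoint and batching interval are made up for
illustration) might look roughly like this:

    // Illustrative sketch only - not Checksum's actual collector.
    import { record } from 'rrweb';

    const buffer: unknown[] = [];

    // Capture DOM snapshots, clicks, scrolls and inputs as a
    // stream of timestamped events.
    record({
      emit(event) {
        buffer.push(event);
      },
      maskAllInputs: true, // redact typed values on the client
    });

    // Ship batched events to a collection endpoint every 5s.
    setInterval(() => {
      if (buffer.length === 0) return;
      navigator.sendBeacon(
        '/sessions', JSON.stringify(buffer.splice(0)));
    }, 5000);
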
Author : Bootstrapper909
Score : 65 points
Date : 2023-04-19 15:05 UTC (7 hours ago)
| sachuin23 wrote:
| How is the product different from the other test generation
| tools? How do you check if they are testing the intended
| behavior? My experience with automated testing solutions has
| been lukewarm so far.
| Bootstrapper909 wrote:
| I agree! My experience with test generation tools was also
| lukewarm, which is why we founded Checksum.
|
| > How is the product different from the other test generation
| tools
|
| We train our models based on real user sessions. So our tests:
|
| 1. Are completely auto-generated
|
| 2. Achieve high coverage of real user flows, including edge
| cases
|
| 3. Are automatically maintained and executed with our models,
| so they are less flaky
|
| > How do you check if they are testing the intended behavior
|
| Our models are trained on many real sessions, so they learn how
| your website (and others) should behave. In that sense, it's
| similar to a manual QA tester who can detect bugs. To cover
| functionality that isn't obvious from the UI, we are now looking
| at adding LLMs to parse code, but most of the functionality can
| be inferred from the UI.
| [deleted]
| johnsillings wrote:
| I know Gal has the link in plaintext above, but for folks who
| want to check out the homepage, it's here: https://checksum.ai
| mackeyja92 wrote:
| Does it only support web? What about react native mobile apps?
| Bootstrapper909 wrote:
| We're currently focusing on web apps.
|
| There's nothing "specific" in the underlying model that
| prevents it from testing mobile. It's just a matter of focus at
| the current time.
| 8organicbits wrote:
| I'm always suspicious of tests when test coverage is the main
| metric. I've seen developers write tests that don't really check
| anything but run all the code paths. I've also seen tests that
| check every bit of output, which end up being brittle.
|
| How well do the tests hold up over time, and how well are the
| tests validating the contract of the code instead of just
| historical behavior and quirks?
| Bootstrapper909 wrote:
| That's a great question!
|
| We actually use real user sessions to train our model, so when
| I use the term coverage, I mean covering as many real user
| behaviors as possible.
|
| We collect data in a privacy-focused way, essentially
| anonymizing all sensitive information; we don't need to know
| user-specific context, only the main flow.
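|
| To make "anonymizing" concrete (Checksum hasn't published its
| scheme, so this is only an illustrative sketch): the idea is
| that field values get scrubbed on the client while the shape of
| the flow is preserved.
|
|     // Hypothetical event shape, for illustration only.
|     type UiEvent = {
|       selector: string;
|       action: 'click' | 'input';
|       value?: string;
|     };
|
|     // Scrub typed values before any event leaves the browser,
|     // keeping only the length so the flow itself is preserved.
|     function anonymize(event: UiEvent): UiEvent {
|       if (event.action === 'input' && event.value !== undefined) {
|         return { ...event, value: '*'.repeat(event.value.length) };
|       }
|       return event;
|     }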
| johnhenning wrote:
| If this is trained on user sessions, how would the model
| learn to generate tests for edge cases that wouldn't
| necessarily show up in the training data?
| varunjain99 wrote:
| Congrats! Definitely think QA farms can be automated using AI!
| Can you explain more about which parts of Checksum use AI?
|
| Is it the identification of user sessions that are good
| candidates to turn into tests? Is it the generation of test
| specifications in some DSL / Cucumber / Selenium / etc.?
| Bootstrapper909 wrote:
| It's all of the above, but more specifically:
|
| 1. We use AI to analyze user patterns and find common paths and
| edge cases, basically building a representation of your UX in a
| DB.
|
| 2. We then use that DB to train another ML model that learns
| how to use your app the same way a user does. Given a certain
| page and user context, the model can complete UX flows.
|
| 3. Finally, we learn to generate assertions, run the tests, and
| convert the model's actions from step 2 into proper Playwright
| or Cypress tests.
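|
| As a rough sketch of step 3 (the action schema below is made
| up; Checksum's internal representation isn't public), turning a
| learned flow into Playwright code is essentially a mapping from
| abstract actions to page calls:
|
|     // Hypothetical learned-action format, for illustration only.
|     type Action =
|       | { kind: 'goto'; url: string }
|       | { kind: 'click'; selector: string }
|       | { kind: 'fill'; selector: string; value: string }
|       | { kind: 'expectVisible'; selector: string };
|
|     // Emit a Playwright spec body from a learned flow.
|     function toPlaywright(name: string, actions: Action[]): string {
|       const body = actions.map((a) => {
|         if (a.kind === 'goto')
|           return `  await page.goto('${a.url}');`;
|         if (a.kind === 'click')
|           return `  await page.click('${a.selector}');`;
|         if (a.kind === 'fill')
|           return `  await page.fill('${a.selector}', '${a.value}');`;
|         return `  await expect(page.locator('${a.selector}'))` +
|           '.toBeVisible();';
|       });
|       return [`test('${name}', async ({ page }) => {`, ...body, '});']
|         .join('\n');
|     }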
| jtambunt wrote:
| Congratulations on all the progress you've made! We are all
| learning as we're building and talking to users. I know for my
| team, E2E/Integration testing is our main priority (over unit
| tests), and maintaining E2E tests is definitely a struggle. I
| imagine this problem is even more of an issue for larger
| codebases, so I see why you're going after medium-sized startups
| where the product isn't completely rebuilt every few months.
| Bootstrapper909 wrote:
| Thanks for your kind words! Yes, many teams struggle with that
| (and I have in the past). The essence of our mission is to
| allow dev teams to focus on progressing on their roadmap and
| goals instead of wrestling with tests.
|
| Feel free to sign up for a demo if that's a priority for your
| team. Even if it's just to chat and connect.
| JohnFriel wrote:
| This is a really compelling idea - but I'm having a little
| trouble making the leap from the high level description to what
| it would mean for my projects in more concrete terms. Would it be
| possible to show off some example tests that the model generated
| and maybe even a story about how the generated tests caught a bug
| before the code made it to production?
| Bootstrapper909 wrote:
| Our landing page at checksum.ai has a video of a test in the
| hero section. We added some graphics (e.g. the green
| checkmark), but the steps executed are real tests that we
| generated.
|
| But the tl;dr is:
|
| 1. We learn how to use your app based on real sessions (we
| remove sensitive information on the client side)
|
| 2. We train a model on this data
|
| 3. We connect this model to a browser and generate Playwright
| or Cypress tests
|
| The end result is code written in Playwright or Cypress. You
| can edit and run the tests regularly.
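|
| To give a concrete (made-up) example of the kind of artifact
| you end up with - an ordinary Playwright spec you could edit
| like any hand-written test - it would look something like:
|
|     import { test, expect } from '@playwright/test';
|
|     // Example of the *kind* of output described above, not an
|     // actual Checksum-generated test; selectors and URLs are
|     // invented.
|     test('user can sign up and reach the dashboard',
|         async ({ page }) => {
|       await page.goto('https://app.example.com/signup');
|       await page.fill('input[name="email"]', 'qa@example.com');
|       await page.fill('input[name="password"]', 'not-a-real-pw');
|       await page.click('button[type="submit"]');
|
|       // Assertion inferred from how real sessions end this flow.
|       await expect(
|         page.getByRole('heading', { name: 'Dashboard' })
|       ).toBeVisible();
|     });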
| ezekg wrote:
| > Our impact on performance is non-existent as we use battle-
| tested open source tools used by Fortune 500 companies
|
| What does that mean, exactly? Just because it's open source and
| used by F500s doesn't mean it can't have performance issues.
| Bootstrapper909 wrote:
| That's a fair comment, and I guess we are missing an "AND"
| there.
|
| 1. We (and others) have tested our tools' impact on memory,
| CPU, and network performance, and found only negligible impact,
| even on slower/older devices.
|
| 2. Also, they are used by F500 companies and have wide
| adoption, which indicates that other well-established devs have
| run the same tests and decided to move forward.
|
| We'll work on the language there to clarify.
| hiatus wrote:
| Noticed a couple small typos in the marketing copy
|
| > Our impact on pefromence is non-existant as we use battle-
| tested open source tools used by Fortune 500 companies
| johnsillings wrote:
| Thank you for calling those out - will get that fixed right up!
| execore-1 wrote:
| Awesome idea! Excited to see where this goes
| shyamkumar7 wrote:
| Thank you! Appreciate the kind words.
| therealrifath wrote:
| This is dope. A couple of feedback points from a technical
| person who is a potential customer (I could be wrong on these!):
|
| -- I think the name doesn't sell me or even most people because
| "checksum" is more of a security/crypto term. When I saw the HN
| post say Checksum I didn't think it was going to be about end-to-
| end tests. I thought it was going to be some crypto thing. Maybe
| a name like "Tested" or "Covered" is going to click better with
| the potential customer.
|
| -- I don't feel like the demo video gives me a clear sense of
| what this product is doing. I could also be misunderstanding the
| product. It might help more if the demo showed the following
| (ideally in less than 5-10s, or most users might tune out):
|
| 1. A quick setup step for Checksum
|
| 2. A set of generated tests
|
| 3. Passing tests
|
| Seeing those steps would give me, as an end user, the feeling
| of "wow, this must be something I can quickly set up that will
| make me feel like I have test coverage out of the box."
| Bootstrapper909 wrote:
| Thanks for your feedback. It definitely makes sense and we'll
| incorporate it!
___________________________________________________________________
(page generated 2023-04-19 23:02 UTC)