[HN Gopher] AI-powered conversion from Enzyme to React Testing Library
___________________________________________________________________
AI-powered conversion from Enzyme to React Testing Library
Author : GavCo
Score : 167 points
Date : 2024-06-19 09:51 UTC (13 hours ago)
(HTM) web link (slack.engineering)
(TXT) w3m dump (slack.engineering)
| AmalgatedAmoeba wrote:
| The conversion is between two testing libraries for React. Not to
| be too cynical (this sort of work seems to me like a pretty good
| niche for LLMs), but I don't think I'd be that far off of 80%
| with just vim macros...
| dgan wrote:
| You really think you could achieve 80% success rate with just
| syntactic transformations, while the article says they only
| reached 45% success rate with fine-grained AST transformations?
|
| I am no vim hater, but allow me to cast a large, fat doubt on
| your comment!
| Dr_Birdbrain wrote:
| You might be underestimating vim ;)
|
| Key point is that vim macros are interactive. You don't just
| write a script that runs autonomously, you say "ok, for the
| next transformation do this macro. Oh wait, except for that,
| in the next 500 lines do this other thing." You write the
| macro, then the next macro, adjust on the fly.
| pqb wrote:
| That sounds interesting! Would you mind sharing some links
| to the articles or videos that focus on this possibility?
| duffmancd wrote:
| This Vimcast (http://vimcasts.org/episodes/converting-
| markdown-to-structur...) recording is an example of a
| quite complex macro for converting (a specific file's)
| markdown to HTML. At the beginning of the video you see
| that they save the macro to the "a" register. You can
| record macros of similar complexity to each of the other
| letters of the alphabet, to get some idea of the maximum
| complexity (though I tend to stick to about 3 or less in
| a single session).
| rlupi wrote:
| From the article:
|
| > Our initiative began with a monumental task of converting
| more than 15,000 Enzyme test cases, which translated to
| more than 10,000 potential engineering hours
|
| That's a lot of editing: about 40 minutes per test case.
| gray_-_wolf wrote:
| Out of curiosity, can you drop into edit session during the
| macro? It has been some time since I last used vim, so I do not
| recall, but in emacs you can record a macro along the lines
| of "do A, do B, drop to edit session letting user do
| whatever, do C, do D". Is that possible with vim macros?
| suplix wrote:
| Just use --- calculating... --- 2 macros.
| slaymaker1907 wrote:
| I don't think so since you need to leave edit mode to
| terminate the macro.
| AmalgatedAmoeba wrote:
| Fair enough :) It was very much an exaggeration. But I do
| wonder how far "dumb" text editing would go in this scenario.
| And, more importantly, whether it wouldn't be faster overall
| than writing a tool that still requires humans to go through
| its output and clean/fix it up.
| rottc0dd wrote:
| I agree. Once I had to write a Groovy conf out of Java library
| constructors and setters, and vim macros were really good for
| that.
| stefan_ wrote:
| Not to mention the possible savings if you just don't switch to
| whatever the latest testing framework your resume-driven
| developers want. 100% time savings!
| hermanradtke wrote:
| Enzyme is abandoned and doesn't work on newer versions of
| React. Many teams are doing this conversion for their React
| apps.
| stefan_ wrote:
| Gee, if "many teams" want to spend their time migrating
| their _unit-test framework_ and _unit tests_ because their
| _frontend framework_ hit version 18, I suppose that's their
| prerogative.
|
| Far be it from me to applaud Teams, but it seems Slack's lunch
| is being eaten by people who are busy building things on the
| corpse of Skype, not churning through churn incarnate.
| morgante wrote:
| I think you're significantly underestimating the complexity of
| automatic transforms. It's not like they didn't try writing
| codemods first, and vim macros aren't more powerful than
| codemods.
| __jonas wrote:
| Seems like a reasonable approach. I wonder if it took less time
| than it would have taken to build some rule-based codemod script
| that operates on the AST, but I assume it did.
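| For illustration, a toy rule-based codemod in jscodeshift might
| look like this (hypothetical rule; the real Enzyme-to-RTL rules
| would be far more involved):
|
    // Toy jscodeshift transform: rewrite `shallow(...)` to `render(...)`.
    import type { API, FileInfo } from 'jscodeshift';

    export default function transform(file: FileInfo, api: API): string {
      const j = api.jscodeshift;
      const root = j(file.source);
      // Find every call to `shallow` and rename the callee.
      root
        .find(j.CallExpression, { callee: { name: 'shallow' } })
        .forEach((path) => {
          path.node.callee = j.identifier('render');
        });
      return root.toSource();
    }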
| AdamN wrote:
| It also potentially takes less context, allowing a more junior
| engineer or somebody who doesn't know much about the
| language/library to implement the change.
| morgante wrote:
| If you read the source article[0], they tried a rule-based
| approach first and the complexity exploded.
|
| [0] https://slack.engineering/balancing-old-tricks-with-new-
| feat...
| ec109685 wrote:
| The rules-based pass made the job of the LLM easier, so it was
| a worthwhile part of the project.
| semanser wrote:
| I'm working on a similar project (DepsHub) where LLMs are used to
| make major library updates as smooth as possible. While it
| doesn't work in 100% cases, it really helps to minimize all the
| noise while keeping your project up to date. I'm not surprised
| Slack decided to go this way as well.
| larodi wrote:
| Another proof this probabilistic stochastic approach works on the
| prediction/token level, but not on the semantic level, where it
| needs a discreet system. This essentially reminds me of a RAG
| setup and is similar in its nature.
|
| Perhaps reiterating my previous sentiment that such application
| of LLMs together with discreet structures brings/hides much
| more value than chatbots, which will soon be considered mere
| console UIs.
| eru wrote:
| You probably mean discrete, not discreet?
| robertlagrant wrote:
| Sssh! Keep it under your hat.
| anymouse123456 wrote:
| Misleading title. Maybe try this one?
|
| "Slack uses ASTs to convert test code from Enzyme to React with
| 22% success rate"
|
| This article is a poor summary of the actual article on Slack's
| engineering blog, which is at least linked [0].
|
| [0] https://slack.engineering/balancing-old-tricks-with-new-
| feat...
|
| [updated]
| doubloon wrote:
| Reminds me of the Hitchhiker's Guide ... they had to figure out
| the right question to ask.
| dvh wrote:
| Is it true that they give out free BMWs in Moscow? Yes, it is
| true! But it's not Moscow but St. Petersburg. And it's not
| BMWs but Ladas. And they don't give them out, they steal
| them.
| Joker_vD wrote:
| So all in all, still mostly accurate news reporting, as news
| reports go.
| dloranc wrote:
| This is a variation of the famous jokes about Radio Yerevan,
| very popular in former Soviet states.
|
| I live in Poland and I know this version:
|
| Is it true that they give away cars on Red Square?
|
| Radio Yerevan answers: not cars, only bicycles, not on Red
| Square, but near the Warsaw station, and they don't give
| them away, they steal them.
| zer00eyz wrote:
| The slack blog is for engineers. It's PR to hire in talent.
|
| The InfoQ article is for your C-suite types. It's the spin on
| what buzzwords they should be using with their peers and to
| make a splash in the market.
|
| NFT, Crypto, Cloud, Microservices, SaaS, Podcasts (the first
| time around, when there wasn't video), Web 2.0, the ad market
| (DoubleClick, pre-Google), the Dot Com Bubble...
|
| I'm sure I missed a few hype cycles in there.
|
| Both articles show how deep we are in the stink of this one.
|
| I'm tired of swimming in the bullshit. It keeps getting in my
| mouth.
| ben_jones wrote:
| Thus the conclusion should be to heavily scrutinize future
| infoq.com articles, and perhaps future articles by the same
| author, Eran Stiller.
|
| We shouldn't detach responsibility from the publisher and
| author.
| nl wrote:
| Except the article turns out to be accurate (see
| https://news.ycombinator.com/item?id=40728179)
|
| So I guess the publisher and author should get credit. I'll
| leave others to discuss the misleading comment...
| phillipcarter wrote:
| Hah. A lot of tech folks like to trash journalists (it's
| fine, I get it, there are legitimate reasons) ... but then
| misread source content that the journalist interpreted
| better/correctly.
| nl wrote:
| _Actually_ the Infoq article is more correct than this comment!
|
| The comment: ""Slack uses ASTs to convert test code from Enzyme
| to React with 22% success rate""
|
| To quote[1], this 22% comes from this part:
|
| > We examined the conversion rates of approximately 2,300
| individual test cases spread out within 338 files. Among these,
| approximately 500 test cases were successfully converted,
| executed, and passed. This highlights how effective AI can be,
| leading to a significant saving of 22% of developer time. It's
| important to note that this 22% time saving represents only the
| documented cases where the test case passed.
|
| So that 22% rate is 22% saving of developer time, measured on a
| sample. No reasonable reading of that makes it a "22% success
| rate".
|
| Over the whole set of tests:
|
| > This strategic pivot, and the integration of both AST and AI
| technologies, helped us achieve the remarkable 80% conversion
| success rate, based on selected files, demonstrating the
| complementary nature of these approaches and their combined
| efficacy in addressing the challenges we faced.
|
| and
|
| > Our benchmark for quality was set by the standards achieved
| by the frontend developers based on our quality rubric that
| covers imports, rendering methods, JavaScript/TypeScript logic,
| and Jest assertions. We aimed to match their level of quality.
| The evaluation revealed that 80% of the content within these
| files was accurately converted, while the remaining 20%
| required manual intervention.
|
| (So I guess the "80% conversion success rate" is this
| percentage of files?)
|
| The Infoq title "Slack Combines ASTs with Large Language Models
| to Automatically Convert 80% of 15,000 Unit Tests" certainly
| more accurately reflects the underlying article than this
| comment.
|
| Edit: they do have a diagram that talks about 22% of the subset
| of manually inspected files being 100% complete. This doesn't
| appear to be what Slack considers their success rate because
| they manually inspect files anyway.
|
| [1] https://slack.engineering/balancing-old-tricks-with-new-
| feat...
| moefh wrote:
| > No reasonable reading of that makes it a "22% success
| rate".
|
| Well, 500/2300 is 22%, so calling it 22% seems pretty
| reasonable.
|
| From what I get from the rest, the remaining 78% of tests (the
| ones that failed to convert) were "80% accurately converted";
| I guess they had some metric for measuring that.
|
| So it looks like it depends on how you interpret
| "automatically converted 80%". If it's taken to mean "80%
| could be used without manual intervention", then it's clearly
| false. If you take it to mean "it required manual
| intervention on just 20% of the contents to be usable", then
| it's reasonable.
| dang wrote:
| We changed the URL to that from
| https://www.infoq.com/news/2024/06/slack-automatic-test-
| conv.... Thanks!
| anymouse123456 wrote:
| The actual efficiency claim (which is also likely incorrect) is
| inverted from the original article, "We examined the conversion
| rates of approximately 2,300 individual test cases spread out
| within 338 files. Among these, approximately 500 test cases were
| successfully converted, executed, and passed. This highlights how
| effective AI can be, leading to a significant saving of 22% of
| developer time."
|
| Reading that leads me to believe that 22% of the conversions
| succeeded and someone at Slack is making up numbers about
| developer time.
| taylorfinley wrote:
| > approximately 500 test cases were successfully converted,
| executed, and passed.
|
| How many of these passing tests are still actually testing
| anything? Do they test the tests?
| BigJono wrote:
| > 500 test cases were successfully converted, executed, and
| passed.
|
| Wonder what "successfully converted" means? A converted test
| executing and passing doesn't tell you whether it's still
| testing the same thing as before.
| Phemist wrote:
| That suggests a test suite to test the test suite, which in
| turn would suggest another test suite to test the test-suite-
| testing test suite...
|
| In the end, it is test suites all the way down.
| eru wrote:
| You could evaluate the quality of your test suite with
| something other than a test suite.
| netdevnet wrote:
| such as?
| cdogl wrote:
| A developer who knows the code and will own the
| consequences can review and merge it - easy. Just not
| sure why the LLM needed to get involved in the first
| place.
| michaelt wrote:
| Presumably because they've got 500 test cases?
| eru wrote:
| Well, you need someone to write the code, and someone
| (else) to review the code.
|
| In this situation you would replace the author with the
| LLM, but leave the reviewer as human.
|
| It's not as pointless as you make it out to be. You still
| save one human.
| ZeroGravitas wrote:
| You can run mutation tests that intentionally seed errors
| into the codebase (flipping a Boolean "and" to "or", for
| example) and rerun the tests.
|
| A good test suite will catch these errors. Something that
| is effectively a noop to get to green will not.
| lvncelot wrote:
| There actually is a way to test your test suite: mutation
| tests [1]. Basically, you change the covered codebase
| (invert an if statement, change a variable, etc.) and
| expect the tests to then fail. If the tests actually
| survive the mutations, they might not be good enough.
|
| [1] https://en.wikipedia.org/wiki/Mutation_testing
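| A tiny sketch of the idea, as a hypothetical Jest example:
|
    // Original: 18 counts as an adult.
    const isAdult = (age: number) => age >= 18;
    // Mutant: `>=` flipped to `>`. A suite with no boundary test
    // stays green against this mutant; the test below kills it.
    test('18 counts as adult', () => {
      expect(isAdult(18)).toBe(true);
    });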
| Scea91 wrote:
| Also given that there are huge differences in complexity
| between tests, how do we know that the successful 22% are not
| just trivial one-liner tests?
|
| Thinking about the test suite in my current project, there is a
| clear Pareto distribution, with the majority of tests being
| simple or almost trivial.
| desdenova wrote:
| Someone will still have to review and validate all the tests,
| which may take more time than rewriting the code.
| robertlagrant wrote:
| Yep. Honestly, I don't know when I last read an InfoQ article
| all the way through. Too painful.
| madeofpalk wrote:
| If you're reviewing the tests (and not just the results)
| after conversion by automation, wouldn't someone else also
| review the tests converted by a person rewriting them
| manually?
| imtringued wrote:
| This is something people seemingly don't grasp about LLMs.
| If you want "Human alignment", then you will need humans in
| the loop either way.
| jacobr1 wrote:
| For a test that is already passing, the validation is already
| there. The human is reviewing the PR, but reviewing a diff is
| much less time intensive when you can presume it basically
| works, and you are just assessing whether the logic makes
| sense relative to the prior code.
| AmalgatedAmoeba wrote:
| Yes, the 80% claim comes from taking 9 tests converted by both
| the conversion tool and humans, and comparing the quality -
| "80% of the content within these files was accurately
| converted, while the remaining 20% required manual
| intervention." Not sure what to make of it since they claim
| only 16% of files get fully converted.
| lucianbr wrote:
| Was it the 20% of the code that requires 80% of the time to
| write?
| mminer237 wrote:
| I guess they believe the files that didn't get fully
| converted got like 76% converted on average?
| jmull wrote:
| > saving considerable developer time of at least 22% of 10,000
| hours
|
| I wonder how much time or money it would take to just update
| Enzyme to support react 18? (fork, or, god forbid, by supporting
| development of the actual project).
|
| Nah, let's play with LLMs instead, and retask all the frontend
| teams in the company to rewriting unit tests to a new framework
| we won't support either.
|
| I guess when you're swimming in pools of money there's no need to
| do reasonable things.
| __jonas wrote:
| When I read 'React Testing Library' I thought they had added an
| official testing library to the React project, which would have
| been fantastic and a worthwhile migration target for sure. Sad
| that it's just another third-party one, which also might one
| day stop supporting newer React versions.
| zer00eyz wrote:
| >> by supporting development of the actual project
|
| You mean good engineering.
|
| >> play with LLMs instead
|
| You mean good on the resume.
|
| >> I guess when you're swimming in pools of money there's no
| need to do reasonable things.
|
| We don't. Reasonable sailed when FB/Google/AMZN started "giving
| back" to the community... They leaked out all their greatest
| examples of Conway's Law and we lapped it up like water at a
| desert oasis.
|
| The thing is, these technologies have massive downsides if you
| aren't Google/FB/Amazon... But we're out here busy singing
| their praises and making sure jr devs are all pre-trained out
| of college for the FA(not this one)ANG lifestyle.
|
| Think about how much react being public saves Facebook on
| onboarding a new dev.
| bastawhiz wrote:
| Enzyme is kind of dead, so it would mean picking up sponsorship
| and maintainership (indefinitely) rather than a one-off project
| to convert to the official testing library for the ecosystem.
| Hasu wrote:
| react-testing-library isn't the "official testing library"
| for React: it isn't made by the React team, and Testing
| Library provides testing libraries for other frameworks too.
|
| It's just a change from an outdated, unmaintained testing
| library to a more 'modern', well-maintained library. There
| are also some philosophical differences in the testing
| approach.
| bastawhiz wrote:
| Monster Energy is the official energy drink of NASCAR but
| that doesn't mean NASCAR manufactures energy drinks. As
| best as I can tell, RTL is the only testing framework
| mentioned in the React docs, so that's pretty "official"
| jmull wrote:
| Well, it might get a lot less dead with a small fraction of
| the resources spent on this project.
|
| > indefinitely
|
| You'll note they've switched to another open source framework
| which has the same potential to fail without
| support/resources. They've kicked the can down the road, but
| are now accruing the same technical debt that led to this
| effort, exactly as before. Since that technical debt will
| inevitably turn into a real expenditure of resources, they
| are stuck with expenses indefinitely, however they do it.
| Though I think it's pretty obvious that one way is a lot
| cheaper and less disruptive to the business than the other.
|
| (BTW, if they were concerned with indefinite expenses, you
| might also wonder why they want to continue to build their
| product on the shifting sands that are react, or pursue a
| unit testing strategy that is so tightly coupled to specific
| versions of their UI framework. These are "fuck-the-cost"
| decisions, both short term and long term.)
| bastawhiz wrote:
| In fact, enzyme didn't support the previous version of
| React either, except for the grace of some random guy who
| wrote a driver to make it work. Airbnb, who built and
| maintained enzyme, abandoned it. There's (afaik) no way to
| add React 18 support without major changes to the enzyme
| core. So not only is this a problem that will plague them
| indefinitely if they don't switch (that is, dealing with their
| test framework not supporting a recent version); avoiding a
| one-time cost of rewriting some tests also means adopting a
| project that they don't own and didn't start.
|
| > Since that technical debt will inevitably turn into a
| real expenditure of resources, they are stuck with expenses
| indefinitely, however they do it.
|
| I simply can't see how becoming the maintainer of a testing
| framework to rewrite it to add support for the last two
| versions of the library it no longer works with is a
| comparable investment to the ongoing cost of maintaining
| your own unit tests. That's like if Docker became abandoned
| and didn't update to support modern kernels so you decided
| it's better to become the maintainer of Docker instead of
| switching to Podman.
| jmull wrote:
| It's a unit test framework though, not a suite of
| containerization software and services. Maintained mostly
| by one person for years.
| SloopJon wrote:
| Slack's blog post links to "Enzyme is dead. Now what?"[1], in
| which Wojciech Maj said, "A couple of tireless evenings later,
| @wojtekmaj/enzyme-adapter-react-17 was born." Now, he was
| building on someone else's pull request, and he also said that
| adapting to React 18 would have required a huge rework. Still,
| I'm thinking that @slack/enzyme-adapter-react-18 just might
| have taken less than 10,000 hours.
|
| Then again, the idea of a testing framework that is so tightly
| coupled that it breaks with every version is foreign to me, so
| I probably don't know what I'm talking about.
|
| [1] https://dev.to/wojtekmaj/enzyme-is-dead-now-what-ekl
| jpnelson wrote:
| This community-built React 18 adapter actually works pretty
| well in my experience. Some failures, but it worked for
| multiple thousands of test files in my use case.
| https://www.npmjs.com/package/@cfaester/enzyme-adapter-react...
|
| That said, making the 19 adapter is a whole new task, and I
| think these tests should be converted to RTL eventually, so the
| approach described in the blog post is still valuable.
| jmartin2683 wrote:
| Sounds like a nightmare to be involved with anything that is
| written in react and requires 15,000 unit tests.
| gjvc wrote:
| web guis are the worst event/interaction model
| denys_potapov wrote:
| It's a 2024 webdev summary, nothing can be added:
|
| A new React version made the lib obsolete; we used an LLM to
| fix it (1/5 success rate).
| azangru wrote:
| The lib relied heavily on React internals for testing, rather
| than just on components' public API. That this approach was
| going to be unsustainable was already obvious around 2020.
| The question is, after you've invested a lot of work in a bad
| practice, how do you move to a better practice with the least
| amount of pain. Another, more philosophical, question is how a
| bad practice gains so much traction in the developer
| community.
| viralpraxis wrote:
| Can someone elaborate on whether the term "AST" is used
| correctly in the article?
|
| I've been playing with a mutation-injection framework for my
| master's thesis for some time. I had to use LibCST to preserve
| syntax information which is usually lost during AST
| serialization/deserialization (like whitespace, indentation,
| and so on). I thought the difference between abstract and
| concrete syntax trees is that a CST is guaranteed not to lose
| any information, so it can be used for specific tasks where
| ASTs are useless. So, did they actually use a CST-based
| approach?
| arnorhs wrote:
| Usually, the AST can be converted back to code and then
| formatted using a specific formatter.
|
| I'm sure Slack has a particular code formatter they use.
|
| Most of the time when working with an AST you don't think
| about whitespace, except when writing out the result.
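| Something along these lines (a sketch, assuming recast for the
| parse/print round trip and prettier as the formatter):
|
    import { parse, print } from 'recast';
    import { format } from 'prettier';

    async function roundTrip(source: string): Promise<string> {
      const ast = parse(source); // recast keeps comments and trivia
      // ...AST transformations would go here...
      const code = print(ast).code;              // AST back to code
      return format(code, { parser: 'babel' });  // normalize whitespace
    }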
| Karliss wrote:
| In the real world, things don't fit neatly into the boxes
| established by textbook definitions. The borders between
| various concepts are fuzzy, and in production implementations,
| for practical reasons like performance, code simplicity,
| better error reporting, and the exact use case, the different
| stages of parsing and the parsed representations can be
| expanded, narrowed, or skipped.
|
| A lot of the time you will have a syntax tree. It might have
| preserved some of the concrete syntax details, like
| subexpression ranges for error reporting and IDE functionality
| or somewhat optional nodes, but at the same time it might also
| contain information obtained from semantic analysis (which,
| from a computer science perspective, isn't even a concern of
| the parser), and it might not even be a tree. And all of that
| is potentially produced in a single pass. Is it a concrete
| syntax tree, is it an abstract syntax tree, is it a thing
| after the AST? Just because a data structure didn't throw away
| all concrete syntax details doesn't mean it contains all of
| them. From my experience, in such situations it's more likely
| to be called an AST, as it's closer to that than to a concrete
| syntax tree.
|
| It also depends on how you define the language you are
| interested in (even if it's the same language). The purpose of
| parsing isn't necessarily always code compilation. Formal
| grammars and parsers can be used for all kinds of things, like
| defining a file format or text markup, and also for tasks like
| manipulating source code and checking it for errors. A typical
| example of a detail not included in an AST is parentheses. But
| that doesn't mean an AST can never contain a node for
| parentheses. If they have meaning for the task you are trying
| to achieve, nothing prevents you from assigning them a node
| within the AST. For example, both Clang and GCC will in some
| situations give a warning depending on the presence of
| parentheses, even though they are meaningless based on C++
| syntax. If you define comments as part of the language, then
| they can be kept and manipulated within the AST.
|
| A CST doesn't really guarantee that you won't lose any
| information. Parsers don't operate on bytes; they operate on
| abstract symbols, which might directly correspond to bytes in
| the file, but not always. Again, in real-world systems what
| you actually have is 2-3 languages stacked on top of each
| other. Which of the CSTs are you talking about? Preserving the
| CST for one stage doesn't mean no information was lost in
| previous steps. The C++ standard defines ~6 steps before the
| main C++ language parsing, many of which can be considered
| separate formal languages with their own CST/AST.
|
| 1) Text decoding: bytes -> text. While most text encodings are
| a trivial byte->character substitution, variable-length
| encodings like UTF-8 can be described as (simple) context-free
| grammars. I don't think any programming language toolchain
| does Unicode normalization at this stage, but in theory you
| could have a programming language which does that.
|
| 1.5) Trigraph substitution
|
| 2) text -> preprocessing tokens
|
| 3) preprocessing tokens -> preprocessing AST
|
| 4) as a result of preprocessing directive execution, you
| obtain a new sequence of tokens
|
| 4.5) string literal merging
|
| 5) main parsing
|
| In practice, some of these steps might be merged and not
| executed as separate stages; there are also a few more
| transformations I didn't mention.
|
| Stuff like this makes real-world source-to-source
| transformations messy, as the later grammars operate on
| symbols which only exist in intermediate steps and don't
| always have a simple 1:1 mapping to the input file.
|
| And in some cases you might have a custom algorithm doing a
| transformation which doesn't fit the model of context-free
| grammars at all, so whatever it did isn't part of any formal
| CST for the language (language in terms of formal grammars,
| not a programming language). Python is a good example of this.
| Its indentation-based scopes can't be handled by context-free
| grammars; it relies on a magic tokenizer which generates
| "indent" and "dedent" tokens, so if you follow the formal
| definitions, the CST of the main Python language doesn't
| contain exact information about the original indentation. The
| fact that you can get it from LibCST is stretching the
| definition of a CST/changing the language being parsed. At
| that point, once you add all the extra information, are you
| really building a CST, or are you making an AST for a language
| where every character is significant, because you redefined
| which parts of the program are important?
|
| With all that said, I wouldn't be surprised if the thing Slack
| used was something closer to an AST (with some additional
| syntax details preserved) than a CST (with additional analysis
| added). If you are not building a general-purpose tool for
| making small adjustments to an arbitrary existing codebase
| (while otherwise preserving the original code), it's not
| necessary to preserve every tiny syntax detail as long as
| comments are preserved. I would expect them to be using a
| standardized code formatter anyway, so loss of insignificant
| whitespace shouldn't be a major concern, and the diff will
| likely touch almost every line of code.
|
| Whether "AST" or a "CST" is useless for specific task is in
| many situations less about "AST" vs "CST" but more about design
| choices of specific programming language, parser implementation
| and pushing things beyond the borders of strict formal
| definitions.
| dwringer wrote:
| It feels to me that there may be even more potential in
| flipping this idea around: human coders write tests to exact
| specifications, then an LLM-using coding system evolves code
| until it passes the tests.
| mtndew4brkfst wrote:
| People who are terminally capitalist have been salivating over
| this idea basically since this hype curve first started.
|
| Someone made a solid joke about it as far back as 2016:
| https://www.commitstrip.com/2016/08/25/a-very-comprehensive-...
| willsmith72 wrote:
| Well yeah, TDD. Many companies already work this way, and the
| LLMs are ok at generating the test-passing code.
|
| From my experience though, it's better at writing tests (from
| natural language)
| dwringer wrote:
| My concern with having the LLMs write tests is that it's
| hard to be convinced that they've written _the right_ tests.
| Coupling human TDD with a genetic algorithm of some sort that
| uses LLMs to generate candidate populations of solutions, one
| could be assured that once a solution gets far enough through
| the tests [assuming one ever does], it is guaranteed to have
| the correct behavior (as far as "correct" has been defined in
| the tests).
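| A very rough sketch of that loop (all names hypothetical; a real
| genetic algorithm would also mutate and recombine candidates):
|
    // Generate candidates with an LLM; score them against the
    // human-written test suite; keep the best one found.
    async function evolve(
      spec: string,
      generate: (spec: string) => Promise<string>,  // LLM call
      passCount: (code: string) => Promise<number>, // run the tests
    ): Promise<string> {
      let best = { code: '', passed: -1 };
      for (let i = 0; i < 20; i++) {
        const candidate = await generate(spec);
        const passed = await passCount(candidate);
        if (passed > best.passed) best = { code: candidate, passed };
      }
      return best.code;
    }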
| willsmith72 wrote:
| Yes, it's definitely a concern.
|
| The idea with LLM-written tests is that tests should be
| extremely easy to read. Of course, ideally so should
| production code, but that's not always possible. If a test is
| extremely complicated, it could be a code smell or a sign that
| it should be broken up.
|
| This way it's very easy to verify the LLM's output (weird
| typos or imports would be caught by intellisense anyway).
| tiffanyh wrote:
| Source article:
|
| https://slack.engineering/balancing-old-tricks-with-new-feat...
| gjvc wrote:
| infoq has gone to pure shit
| muglug wrote:
| For people unfamiliar with Enzyme and RTL, this was the basic
| problem:
|
| Each test made assertions about a rendered DOM from a given React
| component.
|
| Enzyme's API allowed you to query a snippet of rendered DOM using
| a traditional selector, e.g. get the text of the DOM node with
| id="foo". RTL's API instead requires you to say something like
| "get the text of the second header element", and prevents you
| from using selectors.
|
| To do the transformation successfully you have to run the tests,
| first to render each snippet, then have _some system_ for taking
| those rendered snippets and the Enzyme code that queries them,
| and converting that Enzyme code to roughly-equivalent RTL calls.
|
| That's what the LLM was tasked with here.
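| To make the difference concrete, a sketch (with a hypothetical
| <Header /> component):
|
    import { mount } from 'enzyme';
    import { render, screen } from '@testing-library/react';
    import '@testing-library/jest-dom';

    // <Header /> is a hypothetical component under test.
    test('greeting text (Enzyme): reach in with a selector', () => {
      const wrapper = mount(<Header />);
      expect(wrapper.find('#greeting').text()).toBe('Hello');
    });

    test('greeting text (RTL): query the way a user would', () => {
      render(<Header />);
      expect(screen.getByRole('heading', { name: 'Hello' }))
        .toBeInTheDocument();
    });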
| resoluteteeth wrote:
| If that's the entire issue, couldn't someone just add support
| for selectors to RTL or something?
| s4i wrote:
| "Just add support for selectors" in a library of which whole
| philosophy is built around "you test the app like the user
| would" (via WAI-ARIA roles [1] and text visible to screen
| readers).
|
| Of course they could've forked the lib but that's definitely
| not a "just" decision to commit to.
|
| [1] https://developer.mozilla.org/en-
| US/docs/Web/Accessibility/A...
| azangru wrote:
| We did this for our codebase (several hundred tests) manually,
| two or three years ago (the problems were already apparent with
| React 17). It helped that we never used Enzyme's shallow
| renderer, because that type of testing was already falling out
| of favor by the late 2010s.
|
| The next frontier is ditching jest and jsdom in favor of testing
| in a real browser. But I am not sure the path for getting there
| is clear yet in the community.
| morgante wrote:
| The Slack engineering blog[0] is more pragmatic, and shows more
| about how the approaches were actually combined.
|
| This is basically our whole business at grit.io and we also take
| a hybrid approach. We've learned a fair amount from building our
| own tooling and delivering thousands of customer migrations.
|
| 1. Pure AI is likely to be inconsistent in surprising ways, and
| it's hard to iterate quickly. Especially on a large codebase, you
| can't interactively re-apply the full transform a bunch.
|
| 2. A significant reason syntactic tools (like jscodeshift) fall
| down is just that most codemod scripts are pretty verbose and
| hard to iterate on. We ended up open-sourcing our own codemod
| engine[1], which has its own warts, but the declarative model
| makes handling exception cases much faster.
|
| 3. No matter what you do, you need to have an interactive
| feedback loop. We do two levels of iteration/feedback: (a)
| automatically run tests and verify/edit transformations based on
| their output, (b) present candidate files for approval / feedback
| and actually integrate feedback provided back into your
| transformation engine.
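| For (a), a minimal sketch of the run-and-retry loop (helper
| names are hypothetical):
|
    import { execSync } from 'node:child_process';

    // Try a transform up to 3 times, feeding test failures back in.
    async function convertWithFeedback(
      file: string,
      attempt: (file: string, lastError?: string) => Promise<string>,
    ): Promise<string | null> {
      let lastError: string | undefined;
      for (let i = 0; i < 3; i++) {
        const candidate = await attempt(file, lastError);
        try {
          execSync(`npx jest ${candidate}`, { stdio: 'pipe' });
          return candidate;                      // tests pass: accept
        } catch (err: any) {
          lastError = String(err.stdout || err); // feed failure back
        }
      }
      return null; // give up: flag the file for manual conversion
    }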
|
| [0] https://slack.engineering/balancing-old-tricks-with-new-
| feat...
|
| [1] https://github.com/getgrit/gritql
| ayewo wrote:
| I think you copy-pasted the wrong URL in your first link.
|
| Should be https://slack.engineering/balancing-old-tricks-with-
| new-feat...
| dang wrote:
| We changed the URL to that from
| https://www.infoq.com/news/2024/06/slack-automatic-test-conv...
| . Thanks!
| Aurornis wrote:
| This is from the actual Slack blog post:
|
| > We examined the conversion rates of approximately 2,300
| individual test cases spread out within 338 files. Among these,
| approximately 500 test cases were successfully converted,
| executed, and passed. This highlights how effective AI can be,
| leading to a significant saving of 22% of developer time. It's
| important to note that this 22% time saving represents only the
| documented cases where the test case passed.
|
| So the blog post says they converted 22% of tests, which they
| claim as saving 22% of developer time, which InfoQ interpreted as
| converting 80% of tests automatically?
|
| Am I missing something? Or is this InfoQ article just completely
| misinterpreting the blog post it's supposed to be reporting on?
|
| The topic itself is interesting, but between all of the
| statistics games and editorializing of the already editorialized
| blog post, it feels like I'm doing heavy work just to figure out
| what's going on.
| nl wrote:
| My reading of this is that the examination covered a manually
| examined subset of the full set.
|
| From the source:
|
| > It's so compelling that we at Slack decided to convert more
| than 15,000 of our frontend unit and integration Enzyme tests
| to RTL, as part of the update to React 18.
|
| and
|
| > Our benchmark for quality was set by the standards achieved
| by the frontend developers based on our quality rubric that
| covers imports, rendering methods, JavaScript/TypeScript logic,
| and Jest assertions. We aimed to match their level of quality.
| The evaluation revealed that 80% of the content within these
| files was accurately converted, while the remaining 20%
| required manual intervention.
|
| There is a diagram that mentions 22% of the subset of manually
| inspected files that were 100% converted. But Slack is manually
| checking all converted test cases anyway so they don't seem to
| consider this the success rate.
|
| https://slack.engineering/balancing-old-tricks-with-new-feat...
| dboreham wrote:
| Having a success criterion of "the test passed" is a huge red
| flag. So the test can be:
|
| print("Passed")
|
| (or some more subtle variation on that) and we succeeded.
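| A JS flavor of the same failure mode (hypothetical converted
| test):
|
    import { render } from '@testing-library/react';

    // Renders, runs, passes: and verifies nothing about behavior.
    test('renders header', () => {
      render(<Header />);      // <Header /> is hypothetical
      expect(true).toBe(true); // always green
    });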
| 29athrowaway wrote:
| Saving 22% of 15,000 tests is 3,300 tests.
|
| While 22% sounds low, saving yourself the effort to rewrite 3,300
| tests is a good achievement.
| willsmith72 wrote:
| Hypothetically yes, but not if you also have to manually
| rewrite them to compare results.
| skywhopper wrote:
| Pretty misleading summary, given that LLMs played only a tiny
| part in the effort, and probably took more time to integrate
| than they saved in what is otherwise a pretty standard
| conversion pipeline. I'm sure it's heavily in the Slack
| engineers' interest to go along with the AI story to please
| the Salesforce bosses who have mandated that AI must be used
| in every task. Just don't fall for the spin here, and don't
| think this will actually save you time on a similar effort.
| trescenzi wrote:
| Slightly tangential, but one of the largest problems I've had
| working with React Testing Library is a huge number of tests
| that pass when they should fail. This might be because my team
| and I are misusing it, but regularly a test will be written,
| seem like it's testing something, and pass; yet if you flip
| the condition or break the component, it doesn't fail as
| expected. I'd really worry that any mass automated, or
| honestly even manual, method of test conversion would result
| in a large percentage of tests which seem to be of value but
| actually just pass without testing anything.
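| One common way this happens (a sketch; the bug is the missing
| `await`):
|
    import { render, screen, waitFor } from '@testing-library/react';
    import '@testing-library/jest-dom';

    test('shows an error message', () => {
      render(<Form />); // hypothetical component
      // BUG: `waitFor` returns a promise. Without `await`, the
      // test ends before the assertion runs, so it stays green
      // even when the message never appears; flipping the
      // condition changes nothing.
      waitFor(() =>
        expect(
          screen.getByText('Something went wrong')
        ).toBeInTheDocument()
      );
    });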
| torginus wrote:
| Just to shamelessly plug one of my old projects, I did something
| like this at a German industrial engineering firm - they wanted
| us to rewrite a huge base of old tests written in TCL into C#.
|
| It was supposed to take 6 months for 12 people.
|
| Using an AST parser, I wrote a program in two weeks that
| converted about half the tests flawlessly, with another third
| needing minor massaging and the rest having to be done by hand
| (I could've done better by handling more corner cases, but I
| kinda gave up once I hit diminishing returns).
|
| Although it helped a bunch that most tests were brain dead
| simple.
|
| Reaction was mixed - the newly appointed manager was kinda
| fuming that his first project's glory was stolen from him by
| an Assi, and the guys under him missed out on half a year of
| leisurely work.
|
| I left a month after that, but what I heard is that management
| decided to pretend that my solution didn't exist, and the devs
| just ended up manually copy-pasting the output of my tool,
| doing a day's planned work in 20 minutes, with the whole thing
| taking 6 months as planned.
___________________________________________________________________
(page generated 2024-06-19 23:01 UTC)