[HN Gopher] AI-powered conversion from Enzyme to React Testing Library
       ___________________________________________________________________
        
       AI-powered conversion from Enzyme to React Testing Library
        
       Author : GavCo
       Score  : 167 points
       Date   : 2024-06-19 09:51 UTC (13 hours ago)
        
 (HTM) web link (slack.engineering)
 (TXT) w3m dump (slack.engineering)
        
       | AmalgatedAmoeba wrote:
       | The conversion is between two testing libraries for React. Not to
        | be too cynical (this sort of work seems to me like a pretty good
        | niche for LLMs), but I don't think I'd be that far off of 80%
       | with just vim macros...
        
         | dgan wrote:
          | You really think you could achieve an 80% success rate with
          | just syntactic transformations, when the article says they
          | only reached a 45% success rate with fine-grained AST
          | transformations?
         | 
         | I am no vim hater, but allow me to cast a large, fat doubt on
         | your comment!
        
           | Dr_Birdbrain wrote:
           | You might be underestimating vim ;)
           | 
           | Key point is that vim macros are interactive. You don't just
           | write a script that runs autonomously, you say "ok, for the
           | next transformation do this macro. Oh wait, except for that,
           | in the next 500 lines do this other thing." You write the
           | macro, then the next macro, adjust on the fly.
        
             | pqb wrote:
             | That sounds interesting! Would you mind sharing some links
             | to the articles or videos that focus on this possibility?
        
               | duffmancd wrote:
               | This Vimcast (http://vimcasts.org/episodes/converting-
               | markdown-to-structur...) recording is an example of a
               | quite complex macro for converting (a specific file's)
               | markdown to HTML. At the beginning of the video you see
               | that they save the macro to the "a" register. You can
               | record macros of similar complexity to each of the other
               | letters of the alphabet, to get some idea of the maximum
                | complexity (though I tend to stick to about 3 or fewer in
               | a single session).
        
             | rlupi wrote:
             | From the article:
             | 
             | > Our initiative began with a monumental task of converting
             | more than 15,000 Enzyme test cases, which translated to
             | more than 10,000 potential engineering hours
             | 
             | That's a lot of editing.
        
             | gray_-_wolf wrote:
              | Out of curiosity, can you drop into an edit session
              | during the macro? It has been some time since I last used
              | vim, so I do not recall, but in emacs you can record a
              | macro along the lines of "do A, do B, drop to an edit
              | session letting the user do whatever, do C, do D". Is
              | that possible with vim macros?
        
               | suplix wrote:
               | Just use --- calculating... --- 2 macros.
        
               | slaymaker1907 wrote:
               | I don't think so since you need to leave edit mode to
               | terminate the macro.
        
           | AmalgatedAmoeba wrote:
            | Fair enough :) It was very much an exaggeration. But I do
            | wonder how far "dumb" text editing would go in this scenario.
           | And, more importantly, whether it wouldn't be faster overall
           | than writing a tool that still requires humans to go through
           | its output and clean/fix it up.
        
         | rottc0dd wrote:
          | I agree. Once I had to write a Groovy conf out of Java
          | library constructors and setters, and vim macros were really
          | good for that.
        
         | stefan_ wrote:
          | Not to mention the possible savings if you just don't switch
          | to whatever latest testing framework your resume-driven
          | developers want. 100% time savings!
        
           | hermanradtke wrote:
           | Enzyme is abandoned and doesn't work on newer versions of
           | React. Many teams are doing this conversion for their React
           | apps.
        
             | stefan_ wrote:
              | Gee, if "many teams" want to spend their time migrating
              | their _unit-test framework_ and _unit tests_ because their
              | _frontend framework_ hit version 18, I suppose that's
              | their prerogative.
              | 
              | Far be it from me to applaud Teams, but it seems Slack's
              | lunch is being eaten by people who are busy building
              | things on the corpse of Skype, not churning through churn
              | incarnate.
        
         | morgante wrote:
         | I think you're significantly underestimating the complexity of
         | automatic transforms. It's not like they didn't try writing
         | codemods first, and vim macros aren't more powerful than
         | codemods.
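          | 
          | For context, this is roughly the shape of a jscodeshift
          | transform -- a minimal sketch of one narrow rewrite (swapping
          | the Enzyme import for an RTL one), not Slack's actual
          | codemod:
          | 
          // transform.js -- run with: jscodeshift -t transform.js src/
          module.exports = function transformer(file, api) {
            const j = api.jscodeshift;
            const root = j(file.source);
            // import { shallow } from 'enzyme'
            //   -> import { render } from '@testing-library/react'
            root
              .find(j.ImportDeclaration, { source: { value: 'enzyme' } })
              .replaceWith(() =>
                j.importDeclaration(
                  [j.importSpecifier(j.identifier('render'))],
                  j.literal('@testing-library/react')
                )
              );
            // Every call site (shallow(), .find(), .simulate(), ...)
            // still needs its own pattern -- this is where the
            // complexity explodes.
            return root.toSource();
          };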
        
       | __jonas wrote:
       | Seems like a reasonable approach. I wonder if it took less time
       | than it would have taken to build some rule-based codemod script
       | that operates on the AST, but I assume it did.
        
         | AdamN wrote:
         | It also takes less context potentially - allowing a more junior
         | engineer or somebody who doesn't know much about the
         | language/library to implement the change.
        
         | morgante wrote:
         | If you read the source article[0], they tried a rule-based
         | approach first and the complexity exploded.
         | 
         | [0] https://slack.engineering/balancing-old-tricks-with-new-
         | feat...
        
           | ec109685 wrote:
            | The rules-based approach made the job of the LLM easier, so
            | it was a worthwhile part of the project.
        
       | semanser wrote:
       | I'm working on a similar project (DepsHub) where LLMs are used to
       | make major library updates as smooth as possible. While it
        | doesn't work in 100% of cases, it really helps to minimize all
        | the noise while keeping your project up to date. I'm not
        | surprised Slack decided to go this way as well.
        
       | larodi wrote:
        | Another proof that this probabilistic stochastic approach works
        | on the prediction/token level, but not on the semantic level,
        | where it needs a discreet system. This is essentially
        | reminiscent of a RAG setup and similar in nature.
        | 
        | Perhaps reiterating my previous sentiment that such application
        | of LLMs together with discreet structures brings/hides much
        | more value than chatbots, which will soon be considered mere
        | console UIs.
        
         | eru wrote:
         | You probably mean discrete, not discreet?
        
           | robertlagrant wrote:
           | Sssh! Keep it under your hat.
        
       | anymouse123456 wrote:
       | Misleading title. Maybe try this one?
       | 
       | "Slack uses ASTs to convert test code from Enzyme to React with
       | 22% success rate"
       | 
        | This article is a poor summary of the actual article on Slack's
        | engineering blog, which is at least linked [0].
       | 
       | [0] https://slack.engineering/balancing-old-tricks-with-new-
       | feat...
       | 
       | [updated]
        
         | doubloon wrote:
          | Reminds me of Hitchhiker's Guide ... they had to figure out the
         | right question to ask.
        
           | dvh wrote:
           | Is it true that they give out free BMWs in Moscow? Yes, it is
           | true! But it's not Moscow but St. Petersburg. And it's not
           | BMWs but Ladas. And they don't give them out, they steal
           | them.
        
             | Joker_vD wrote:
              | So all in all, still mostly accurate news reporting, as
              | these things go.
        
             | dloranc wrote:
              | This is a variation of the famous Radio Yerevan jokes,
              | very popular in former Soviet states.
             | 
             | I live in Poland and I know this version:
             | 
             | Is it true that they give away cars on Red Square?
             | 
             | Radio Yerevan answers: not cars, only bicycles, not on Red
             | Square, but near the Warsaw station, and they don't give
             | them away, they steal them.
        
         | zer00eyz wrote:
         | The slack blog is for engineers. It's PR to hire in talent.
         | 
          | The InfoQ article is for your C-suite types. It's the spin on
          | what buzzwords they should be using with their peers and to
          | make a splash in the market.
          | 
          | NFT, Crypto, Cloud, Microservices, SaaS, Podcasts (the first
          | time, when there wasn't video), Web 2.0, the ad market
          | (DoubleClick, pre-Google), the Dot Com Bubble...
          | 
          | I'm sure I missed a few hype cycles in there.
          | 
          | Both articles show how deep we are in the stink of this one.
          | 
          | I'm tired of swimming in the bullshit. It keeps getting in my
          | mouth.
        
           | ben_jones wrote:
            | Thus the conclusion should be to heavily scrutinize future
            | infoq.com articles, and perhaps future articles by the same
            | author, Eran Stiller.
           | 
           | We shouldn't detach responsibility from the publisher and
           | author.
        
             | nl wrote:
             | Except the article turns out to be accurate (see
             | https://news.ycombinator.com/item?id=40728179)
             | 
             | So I guess the publisher and author should get credit. I'll
             | leave others to discuss the misleading comment...
        
               | phillipcarter wrote:
                | Hah. A lot of tech folks like to trash journalists (it's
                | fine, I get it, there are legitimate reasons) ... but
                | then misread source content that the journalist
                | interpreted better/correctly.
        
         | nl wrote:
         | _Actually_ the Infoq article is more correct than this comment!
         | 
         | The comment: ""Slack uses ASTs to convert test code from Enzyme
         | to React with 22% success rate""
         | 
         | To quote[1], this 22% comes from this part:
         | 
         | > We examined the conversion rates of approximately 2,300
         | individual test cases spread out within 338 files. Among these,
         | approximately 500 test cases were successfully converted,
         | executed, and passed. This highlights how effective AI can be,
         | leading to a significant saving of 22% of developer time. It's
         | important to note that this 22% time saving represents only the
         | documented cases where the test case passed.
         | 
         | So that 22% rate is 22% saving of developer time, measured on a
         | sample. No reasonable reading of that makes it a "22% success
         | rate".
         | 
         | Over the whole set of tests:
         | 
         | > This strategic pivot, and the integration of both AST and AI
         | technologies, helped us achieve the remarkable 80% conversion
         | success rate, based on selected files, demonstrating the
         | complementary nature of these approaches and their combined
         | efficacy in addressing the challenges we faced.
         | 
         | and
         | 
         | > Our benchmark for quality was set by the standards achieved
         | by the frontend developers based on our quality rubric that
         | covers imports, rendering methods, JavaScript/TypeScript logic,
         | and Jest assertions. We aimed to match their level of quality.
         | The evaluation revealed that 80% of the content within these
         | files was accurately converted, while the remaining 20%
         | required manual intervention.
         | 
         | (So I guess the "80% conversion success rate" is this
         | percentage of files?)
         | 
         | The Infoq title "Slack Combines ASTs with Large Language Models
         | to Automatically Convert 80% of 15,000 Unit Tests" certainly
         | more accurately reflects the underlying article than this
         | comment.
         | 
         | Edit: they do have a diagram that talks about 22% of the subset
         | of manually inspected files being 100% complete. This doesn't
         | appear to be what Slack considers their success rate because
         | they manually inspect files anyway.
         | 
         | [1] https://slack.engineering/balancing-old-tricks-with-new-
         | feat...
        
           | moefh wrote:
           | > No reasonable reading of that makes it a "22% success
           | rate".
           | 
           | Well, 500/2300 is 22%, so calling it 22% seems pretty
           | reasonable.
           | 
            | From what I gather from the rest, the remaining 78% of tests
            | (the ones that failed to convert) were "80% accurately
            | converted"; I guess they had some metric for measuring that.
           | 
           | So it looks like it depends on how you interpret
           | "automatically converted 80%". If it's taken to mean "80%
           | could be used without manual intervention", then it's clearly
           | false. If you take it to mean "it required manual
           | intervention on just 20% of the contents to be usable", then
           | it's reasonable.
        
         | dang wrote:
         | We changed the URL to that from
         | https://www.infoq.com/news/2024/06/slack-automatic-test-
         | conv.... Thanks!
        
       | anymouse123456 wrote:
       | The actual efficiency claim (which is also likely incorrect) is
       | inverted from the original article, "We examined the conversion
       | rates of approximately 2,300 individual test cases spread out
       | within 338 files. Among these, approximately 500 test cases were
       | successfully converted, executed, and passed. This highlights how
       | effective AI can be, leading to a significant saving of 22% of
       | developer time."
       | 
       | Reading that leads me to believe that 22% of the conversions
       | succeeded and someone at Slack is making up numbers about
       | developer time.
        
         | taylorfinley wrote:
         | > approximately 500 test cases were successfully converted,
         | executed, and passed.
         | 
         | How many of these passing tests are still actually testing
         | anything? Do they test the tests?
        
         | BigJono wrote:
         | > 500 test cases were successfully converted, executed, and
         | passed.
         | 
         | Wonder what "successfully converted" means? A converted test
         | executing and passing doesn't tell you whether it's still
         | testing the same thing as before.
        
           | Phemist wrote:
           | That suggests a test suite to test the test suite, which
           | again would suggest another test suite to test the test suite
           | testing test suite...
           | 
           | In the end, it is test suites all the way down.
        
             | eru wrote:
             | You could evaluate the quality of your test suite with
             | something other than a test suite.
        
               | netdevnet wrote:
               | such as?
        
               | cdogl wrote:
               | A developer who knows the code and will own the
               | consequences can review and merge it - easy. Just not
               | sure why the LLM needed to get involved in the first
               | place.
        
               | michaelt wrote:
               | Presumably because they've got 500 test cases?
        
               | eru wrote:
               | Well, you need someone to write the code, and someone
               | (else) to review the code.
               | 
               | In this situation you would replace the author with the
               | LLM, but leave the reviewer as human.
               | 
               | It's not as pointless as you make it out to be. You still
               | save one human.
        
               | ZeroGravitas wrote:
              | You can run mutation tests that intentionally seed errors
              | into the codebase (flipping a Boolean "and" to "or", for
              | example) and rerun the tests.
              | 
              | A good test suite will catch these errors. Something that
              | effectively no-ops its way to green will not.
        
             | lvncelot wrote:
             | There actually is a way to test your test suite: mutation
             | tests [1]. Basically, you change the covered codebase
             | (invert an if statement, change a variable, etc.) and
              | expect the tests to then fail. If the mutants actually
              | survive (i.e. the tests still pass), the tests might not
              | be good enough.
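              | 
              | A toy example of the idea (hand-rolled for illustration,
              | not a real mutation tool):
              | 
              // code under test
              const isAdult = (age) => age >= 18;
              // a mutant the tool might generate: >= flipped to >
              const isAdultMutant = (age) => age > 18;
              
              // Weak test: passes for BOTH versions, so the mutant
              // "survives" and the suite is flagged as too loose.
              test('adults are adults', () => {
                expect(isAdult(30)).toBe(true);
              });
              
              // This test "kills" the mutant: it fails once >= becomes >.
              test('exactly 18 counts as adult', () => {
                expect(isAdult(18)).toBe(true);
              });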
             | 
             | [1] https://en.wikipedia.org/wiki/Mutation_testing
        
         | Scea91 wrote:
         | Also given that there are huge differences in complexity
         | between tests, how do we know that the successful 22% are not
         | just trivial one-liner tests?
         | 
          | Thinking about the test suite in my current project, there is
          | a clear Pareto distribution, with the majority of tests being
          | simple or almost trivial.
        
         | desdenova wrote:
         | Someone will still have to review and validate all the tests,
         | which may take more time than rewriting the code.
        
           | robertlagrant wrote:
           | Yep. Honestly, I don't know when I last read an InfoQ article
           | all the way through. Too painful.
        
           | madeofpalk wrote:
           | If you're reviewing the tests (and not just the results)
           | after conversion by automation, wouldn't someone else also
           | review the tests converted by a person rewriting them
           | manually?
        
             | imtringued wrote:
             | This is something people seemingly don't grasp about LLMs.
             | If you want "Human alignment", then you will need humans in
             | the loop either way.
        
           | jacobr1 wrote:
            | For a test that is already passing, the validation is
            | already there. The human is reviewing the PR, but reviewing
            | a diff is much less time-intensive when you can presume it
            | basically works and you are just assessing whether the
            | logic makes sense relative to the prior code.
        
         | AmalgatedAmoeba wrote:
          | Yes, the 80% claim comes from taking 9 tests converted by
          | both the conversion tool and humans and comparing the quality:
         | "80% of the content within these files was accurately
         | converted, while the remaining 20% required manual
         | intervention." Not sure what to make of it since they claim
         | only 16% of files get fully converted.
        
           | lucianbr wrote:
           | Was it the 20% of the code that requires 80% of the time to
           | write?
        
           | mminer237 wrote:
           | I guess they believe the files that didn't get fully
           | converted got like 76% converted on average?
        
       | jmull wrote:
       | > saving considerable developer time of at least 22% of 10,000
       | hours
       | 
        | I wonder how much time or money it would take to just update
        | Enzyme to support React 18 (fork it, or, god forbid, support
        | development of the actual project).
       | 
       | Nah, let's play with LLMs instead, and retask all the frontend
       | teams in the company to rewriting unit tests to a new framework
       | we won't support either.
       | 
       | I guess when you're swimming in pools of money there's no need to
       | do reasonable things.
        
         | __jonas wrote:
          | When I read 'React Testing Library' I thought they had added
          | an official testing library to the React project, which would
          | have been fantastic and a worthwhile migration target for
          | sure. Sad that it's just another third-party one, which might
          | also one day stop supporting newer React versions.
        
         | zer00eyz wrote:
         | >> by supporting development of the actual project
         | 
         | You mean good engineering.
         | 
         | >> play with LLMs instead
         | 
         | You mean good on the resume.
         | 
         | >> I guess when you're swimming in pools of money there's no
         | need to do reasonable things.
         | 
          | We don't. "Reasonable" sailed when FB/Google/AMZN started
          | "giving back" to the community... They leaked out all their
          | greatest examples of Conway's Law and we lapped it up like
          | water at a desert oasis.
          | 
          | The thing is, these technologies have massive downsides if you
          | aren't Google/FB/Amazon... But we're out here busy singing
          | their praises and making sure jr devs are all pre-trained out
          | of college for the FA(not this one)ANG lifestyle.
          | 
          | Think about how much React being public saves Facebook on
          | onboarding a new dev.
        
         | bastawhiz wrote:
         | Enzyme is kind of dead, so it would mean picking up sponsorship
         | and maintainership (indefinitely) rather than a one-off project
         | to convert to the official testing library for the ecosystem.
        
           | Hasu wrote:
            | react-testing-library isn't the "official testing library"
            | for React; it isn't made by the React team, and Testing
            | Library provides testing libraries for other frameworks too.
           | 
           | It's just a change from an outdated, unmaintained testing
           | library to a more 'modern', well-maintained library. There
           | are also some philosophical differences in the testing
           | approach.
        
             | bastawhiz wrote:
             | Monster Energy is the official energy drink of NASCAR but
             | that doesn't mean NASCAR manufactures energy drinks. As
             | best as I can tell, RTL is the only testing framework
             | mentioned in the React docs, so that's pretty "official"
        
           | jmull wrote:
           | Well, it might get a lot less dead with a small fraction of
           | the resources spent on this project.
           | 
           | > indefinitely
           | 
            | You'll note they've switched to another open source
            | framework which has the same potential to fail without
            | support/resources. They've kicked the can down the road,
            | but are now accruing the technical debt that led to this
            | effort exactly the same as before. Since that technical
            | debt will inevitably turn into a real expenditure of
            | resources, they are stuck with expenses indefinitely,
            | however they do it.
           | Though I think it's pretty obvious that one way is a lot
           | cheaper and less disruptive to the business than the other.
           | 
           | (BTW, if they were concerned with indefinite expenses, you
           | might also wonder why they want to continue to build their
           | product on the shifting sands that are react, or pursue a
           | unit testing strategy that is so tightly coupled to specific
           | versions of their UI framework. These are "fuck-the-cost"
           | decisions, both short term and long term.)
        
             | bastawhiz wrote:
             | In fact, enzyme didn't support the previous version of
             | React either, except for the grace of some random guy who
             | wrote a driver to make it work. Airbnb, who built and
             | maintained enzyme, abandoned it. There's (afaik) no way to
             | add React 18 support without major changes to the enzyme
              | core. So this is not only a problem that will plague them
              | indefinitely if they don't switch (that is, dealing with
              | their test framework not supporting a recent version);
              | staying would also mean adopting a project that they
              | don't own and didn't start, just to avoid a one-time cost
              | of rewriting some tests.
             | 
             | > Since that technical debt will inevitably turn into a
             | real expenditure of resources, they are stuck with expenses
             | indefinitely, however they do it.
             | 
             | I simply can't see how becoming the maintainer of a testing
             | framework to rewrite it to add support for the last two
             | versions of the library it no longer works with is a
             | comparable investment to the ongoing cost of maintaining
             | your own unit tests. That's like if Docker became abandoned
             | and didn't update to support modern kernels so you decided
             | it's better to become the maintainer of Docker instead of
             | switching to Podman.
        
               | jmull wrote:
               | It's a unit test framework though, not a suite of
               | containerization software and services. Maintained mostly
               | by one person for years.
        
         | SloopJon wrote:
         | Slack's blog post links to "Enzyme is dead. Now what?"[1], in
         | which Wojciech Maj said, "A couple of tireless evenings later,
         | @wojtekmaj/enzyme-adapter-react-17 was born." Now, he was
         | building on someone else's pull request, and he also said that
         | adapting to React 18 would have required a huge rework. Still,
         | I'm thinking that @slack/enzyme-adapter-react-18 just might
         | have taken less than 10,000 hours.
         | 
          | Then again, the idea of a testing framework that is so tightly
          | coupled that it breaks with every version is foreign to me, so
          | I probably don't know what I'm talking about.
         | 
         | [1] https://dev.to/wojtekmaj/enzyme-is-dead-now-what-ekl
        
         | jpnelson wrote:
          | This community-built React 18 adapter actually works pretty
          | well in my experience. Some failures, but it worked for
          | multiple thousands of test files in my use case.
         | https://www.npmjs.com/package/@cfaester/enzyme-adapter-react...
         | 
         | That said, making the 19 adapter is a whole new task, and I
         | think these tests should be converted to RTL eventually, so the
         | approach described in the blog post is still valuable.
        
       | jmartin2683 wrote:
       | Sounds like a nightmare to be involved with anything that is
       | written in react and requires 15,000 unit tests.
        
         | gjvc wrote:
         | web guis are the worst event/interaction model
        
       | denys_potapov wrote:
       | It's a 2024 webdev summary, nothing can be added:
       | 
       | New React version made the lib obsolete, we used LLM to fix it
       | (1/5 success rate)
        
         | azangru wrote:
         | A lib was heavily relying on React internals for testing,
          | rather than just on components' public API. That this approach
         | was going to be unsustainable was already obvious around 2020.
         | The question is, after you've invested a lot of work in a bad
         | practice, how to move to a better practice with the least
         | amount of pain. Another, more philosophical, question is how
         | does a bad practice gain so much traction in the developer
         | community.
        
       | viralpraxis wrote:
        | Can someone elaborate on whether the term "AST" is used
        | correctly in the article?
        | 
        | I've been playing with a mutation-injection framework for my
        | master's thesis for some time. I had to use LibCST to preserve
        | syntax information which is usually lost during AST
        | serialization/deserialization (like whitespace, indentation and
        | so on). I thought that the difference between abstract and
        | concrete syntax trees is that a CST is guaranteed not to lose
        | any information, so it can be used for specific tasks where
        | ASTs are useless. So, did they actually use a CST-based
        | approach?
        
         | arnorhs wrote:
          | Usually, the AST can be converted to code, then formatted
          | using a specific formatter.
          | 
          | I'm sure Slack has a particular code formatter they use.
          | 
          | Most of the time when working with an AST you don't think
          | about whitespace, except when writing out the result.
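          | 
          | E.g. a tool like recast takes a related tack (a sketch, not
          | necessarily what Slack used): it reprints only the nodes you
          | change and round-trips the rest of the file verbatim,
          | formatting included:
          | 
          const recast = require('recast');
          
          const src = "const x = wrapper.find('#foo'); // spacing kept\n";
          const ast = recast.parse(src);
          // ... mutate the AST here using ast-types builders ...
          console.log(recast.print(ast).code);
          // untouched nodes print back exactly as they were written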
        
         | Karliss wrote:
          | In the real world, things don't fit neatly into the boxes
          | established by textbook definitions. The borders between
          | various concepts are fuzzy, and in production
          | implementations, for practical purposes like performance,
          | code simplicity, better error reporting and the exact use
          | case, different stages of parsing and the parsed
          | representations can be expanded, narrowed or skipped.
         | 
          | A lot of the time you will have a syntax tree. It might have
          | preserved some of the concrete syntax details, like
          | subexpression ranges for error reporting and IDE functionality
          | or somewhat optional nodes, but at the same time it might also
          | contain information obtained from semantic analysis (which,
          | from a computer science perspective, isn't even a concern of
          | the parser); it might not even be a tree. And all of that is
          | potentially produced in a single pass. Is it a concrete
          | syntax tree, is it an abstract syntax tree, is it a thing
          | after the AST? Just because a data structure didn't throw
          | away all concrete syntax details doesn't mean it contains all
          | of them. From my experience, in such situations it's more
          | likely to be called an AST, as it's closer to that than to a
          | concrete syntax tree.
         | 
          | It also depends on how you define the language that you are
          | interested in (even if it's the same language). The purpose of
          | parsing isn't necessarily always code compilation. Formal
          | grammars and parsers can be used for all kinds of things,
          | like defining a file format or text markup, and also for
          | tasks like manipulation of source code and checking it for
          | errors. A typical example of a detail not included in an AST
          | is parentheses. But that doesn't mean an AST can never
          | contain a node for parentheses. If they have meaning for the
          | task you are trying to achieve, nothing prevents you from
          | assigning them a node within the AST. For example, both Clang
          | and GCC will in some situations give a warning depending on
          | the presence of parentheses, even though they are meaningless
          | based on C++ syntax. If you define comments as part of the
          | language, then they can be kept and manipulated within the
          | AST.
         | 
          | A CST doesn't really guarantee that you won't lose any
          | information. Parsers don't operate on bytes; they operate on
          | abstract symbols, which might directly correspond to bytes in
          | the file, but not always. Again, in real-world systems what
          | you actually have is 2-3 languages stacked on top of each
          | other. Which of the CSTs are you talking about? Preserving
          | the CST for one stage doesn't mean no information was lost in
          | previous steps. The C++ standard defines ~6 steps before the
          | main C++ language parsing, many of which can be considered
          | separate formal languages with their own CSTs/ASTs.
         | 
          | 1) Text decoding, bytes -> text. While most text encodings
          | are trivial byte -> character substitutions, variable-length
          | encodings like UTF-8 can be described as (simple)
          | context-free grammars. I don't think any programming language
          | toolchain does Unicode normalization at this stage, but in
          | theory you could have a programming language which does that.
         | 
         | 1.5) Trigraph substitution
         | 
         | 2) text -> preprocessing tokens
         | 
         | 3) preprocessing tokens -> preprocessing AST
         | 
          | 4) as a result of preprocessing directive execution, you
          | obtain a new sequence of tokens
         | 
         | 4.5) string literal merging
         | 
         | 5) main parsing
         | 
          | In practice some of these steps might be merged and not
          | executed as separate stages, and there are also a few more
          | transformations I didn't mention.
          | 
          | Stuff like this makes real-world source-to-source
          | transformations messy, as the later grammars operate on
          | symbols which only exist in intermediate steps and don't
          | always have a simple 1:1 mapping to the input file.
         | 
          | And in some cases you might have some custom algorithm doing
          | a transformation which doesn't fit the model of context-free
          | grammars at all, thus whatever it did isn't part of any
          | formal CST for the language (language in terms of formal
          | grammars, not a programming language). Python is a good
          | example of this. The indentation-based scopes can't be
          | handled by context-free grammars; Python relies on a magic
          | tokenizer which generates "indent" and "dedent" tokens, so if
          | you follow formal definitions, the CST of the main Python
          | language doesn't contain exact information about the original
          | indentation. The fact that you can get it from LibCST is
          | stretching the definition of a CST/changing the language
          | being parsed. At that point, once you add all the extra
          | information, are you really building a CST, or are you making
          | an AST for a language where every character is significant,
          | because you redefined which parts of the program are
          | important?
         | 
          | With all that said, I wouldn't be surprised if the thing
          | Slack did used something closer to an AST (with some
          | additional syntax details preserved) than a CST (with
          | additional analysis done). If you are not building a
          | general-purpose tool for making small adjustments to an
          | arbitrary existing codebase (otherwise preserving the
          | original code), it's not necessary to preserve every tiny
          | syntax detail as long as comments are preserved. I would
          | expect them to be using a standardized code formatter anyway,
          | so loss of insignificant whitespace shouldn't be a major
          | concern, and the diff will likely touch almost every line of
          | code.
         | 
        | Whether an "AST" or a "CST" is useless for a specific task is
        | in many situations less about "AST" vs "CST" and more about the
        | design choices of the specific programming language and parser
        | implementation, and about pushing things beyond the borders of
        | strict formal definitions.
        
       | dwringer wrote:
        | It feels to me that there may be even more potential in
        | flipping this idea around: human coders write tests to exact
        | specifications, then an LLM-using coding system evolves code
        | until it passes the tests.
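        | 
        | Something like this loop, hand-waving heavily (every name here
        | is hypothetical; no particular LLM API is implied):
        | 
        async function evolveUntilGreen(spec, llm, runTests) {
          // llm(prompt) -> code string; runTests(code) -> { ok, log }
          let code = await llm(`Write code satisfying:\n${spec}`);
          for (let tries = 0; tries < 10; tries++) {
            const res = await runTests(code); // human tests = oracle
            if (res.ok) return code;          // stop once green
            code = await llm(
              `Fix this code:\n${code}\nFailing output:\n${res.log}`
            );
          }
          throw new Error('no passing candidate found');
        }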
        
         | mtndew4brkfst wrote:
         | People who are terminally capitalist have been salivating over
         | this idea basically since this hype curve first started.
         | 
         | Someone made a solid joke about it as far back as 2016:
         | https://www.commitstrip.com/2016/08/25/a-very-comprehensive-...
        
         | willsmith72 wrote:
         | Well yeah, TDD. Many companies already work this way, and the
         | LLMs are ok at generating the test-passing code.
         | 
         | From my experience though, it's better at writing tests (from
         | natural language)
        
           | dwringer wrote:
            | My concern with having the LLMs write tests is that it's
            | hard to be convinced that they've written _the right_
            | tests. Coupling human TDD with a genetic algorithm of some
            | sort that uses LLMs to generate candidate populations of
            | solutions, one could be assured that once a solution gets
            | far enough through the tests [assuming one ever does], it
            | is guaranteed to have the correct behavior (as far as
            | "correct" has been defined in the tests).
        
             | willsmith72 wrote:
             | yes, definitely it's a concern.
             | 
             | the idea with llm tests first is tests should be extremely
             | easy to read. of course ideally so should production code,
             | but it's not always possible. if a test is extremely
             | complicated, it could be a code smell or a sign that it
             | should be broken up.
             | 
             | this way it's very easy to verify the llm's output (weird
             | typos or imports would be caught by intellisense anyway)
        
       | tiffanyh wrote:
       | Source article:
       | 
       | https://slack.engineering/balancing-old-tricks-with-new-feat...
        
       | gjvc wrote:
       | infoq has gone to pure shit
        
       | muglug wrote:
       | For people unfamiliar with Enzyme and RTL, this was the basic
       | problem:
       | 
       | Each test made assertions about a rendered DOM from a given React
       | component.
       | 
        | Enzyme's API allowed you to query a snippet of rendered DOM
        | using a traditional selector, e.g. get the text of the DOM node
        | with id="foo". RTL's API required you to say something like
        | "get the text of the second header element", and prevented you
        | from using selectors.
        | 
        | To do the transformation successfully you have to run the
        | tests, first to render each snippet, then have _some system_
        | for taking those rendered snippets and the Enzyme code that
        | queries them, and converting the Enzyme code to roughly
        | equivalent RTL calls.
       | 
       | That's what the LLM was tasked with here.
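        | 
        | Roughly, the same assertion in both styles (illustrative only;
        | `Profile` is a made-up component, and the RTL matcher assumes
        | the jest-dom matchers are installed):
        | 
        import { mount } from 'enzyme';
        import { render, screen } from '@testing-library/react';
        
        // Enzyme: query the rendered output by CSS selector
        const wrapper = mount(<Profile />);
        expect(wrapper.find('#username').text()).toBe('jane');
        
        // RTL: query the DOM the way a user/screen reader would
        render(<Profile />);
        expect(
          screen.getByRole('heading', { name: 'jane' })
        ).toBeInTheDocument();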
        
         | resoluteteeth wrote:
          | If that's the entire issue, couldn't someone just add support
          | for selectors to RTL or something?
        
           | s4i wrote:
            | "Just add support for selectors" in a library whose whole
            | philosophy is built around "you test the app like the user
            | would" (via WAI-ARIA roles [1] and text visible to screen
            | readers).
            | 
            | Of course they could've forked the lib, but that's
            | definitely not a "just" decision to commit to.
           | 
           | [1] https://developer.mozilla.org/en-
           | US/docs/Web/Accessibility/A...
        
       | azangru wrote:
       | We did this for our codebase (several hundred tests) manually,
       | two or three years ago (the problems were already apparent with
       | React 17). It helped that we never used Enzyme's shallow
        | renderer, because that type of testing was already falling out
        | of favor by the late 2010s.
        | 
        | The next frontier is ditching jest and jsdom in favor of
        | testing in a real browser. But I am not sure the path for
        | getting there is clear yet in the community.
        
       | morgante wrote:
       | The Slack engineering blog[0] is more pragmatic, and shows more
       | about how the approaches were actually combined.
       | 
       | This is basically our whole business at grit.io and we also take
       | a hybrid approach. We've learned a fair amount from building our
       | own tooling and delivering thousands of customer migrations.
       | 
       | 1. Pure AI is likely to be inconsistent in surprising ways, and
       | it's hard to iterate quickly. Especially on a large codebase, you
       | can't interactively re-apply the full transform a bunch.
       | 
        | 2. A significant reason syntactic tools (like jscodeshift) fall
        | down is just that most codemod scripts are pretty verbose and
        | hard to iterate on. We ended up open-sourcing our own codemod
        | engine[1], which has its own warts, but the declarative model
        | makes handling exceptional cases much faster.
       | 
       | 3. No matter what you do, you need to have an interactive
       | feedback loop. We do two levels of iteration/feedback: (a)
       | automatically run tests and verify/edit transformations based on
       | their output, (b) present candidate files for approval / feedback
       | and actually integrate feedback provided back into your
       | transformation engine.
       | 
       | [0] https://slack.engineering/balancing-old-tricks-with-new-
       | feat...
       | 
       | [1] https://github.com/getgrit/gritql
        
         | ayewo wrote:
         | I think you copy-pasted the wrong URL in your first link.
         | 
         | Should be https://slack.engineering/balancing-old-tricks-with-
         | new-feat...
        
         | dang wrote:
         | We changed the URL to that from
         | https://www.infoq.com/news/2024/06/slack-automatic-test-conv...
         | . Thanks!
        
       | Aurornis wrote:
       | This is from the actual Slack blog post:
       | 
       | > We examined the conversion rates of approximately 2,300
       | individual test cases spread out within 338 files. Among these,
       | approximately 500 test cases were successfully converted,
       | executed, and passed. This highlights how effective AI can be,
       | leading to a significant saving of 22% of developer time. It's
       | important to note that this 22% time saving represents only the
       | documented cases where the test case passed.
       | 
       | So the blog post says they converted 22% of tests, which they
       | claim as saving 22% of developer time, which InfoQ interpreted as
       | converting 80% of tests automatically?
       | 
       | Am I missing something? Or is this InfoQ article just completely
       | misinterpreting the blog post it's supposed to be reporting on?
       | 
       | The topic itself is interesting, but between all of the
       | statistics games and editorializing of the already editorialized
       | blog post, it feels like I'm doing heavy work just to figure out
       | what's going on.
        
         | nl wrote:
          | My reading of this is that the examined 2,300 test cases were
          | a manually checked subset of the full 15,000.
         | 
         | From the source:
         | 
         | > It's so compelling that we at Slack decided to convert more
         | than 15,000 of our frontend unit and integration Enzyme tests
         | to RTL, as part of the update to React 18.
         | 
         | and
         | 
         | > Our benchmark for quality was set by the standards achieved
         | by the frontend developers based on our quality rubric that
         | covers imports, rendering methods, JavaScript/TypeScript logic,
         | and Jest assertions. We aimed to match their level of quality.
         | The evaluation revealed that 80% of the content within these
         | files was accurately converted, while the remaining 20%
         | required manual intervention.
         | 
          | There is a diagram that mentions that 22% of the subset of
          | manually inspected files were 100% converted. But Slack is
          | manually checking all converted test cases anyway, so they
          | don't seem to consider this the success rate.
         | 
         | https://slack.engineering/balancing-old-tricks-with-new-feat...
        
         | dboreham wrote:
          | Having a success criterion of "the test passed" is a huge red
          | flag. So the test can be:
         | 
         | print("Passed")
         | 
         | (or some more subtle variation on that) and we succeeded.
        
       | 29athrowaway wrote:
       | Saving 22% of 15,000 tests is 3,300 tests.
       | 
       | While 22% sounds low, saving yourself the effort to rewrite 3,300
       | tests is a good achievement.
        
         | willsmith72 wrote:
         | hypothetically yes, but not if you also have to manually
         | rewrite them to compare results
        
       | skywhopper wrote:
        | Pretty misleading summary, given that LLMs played only a tiny
        | part in the effort, and probably took more time to integrate
        | than they saved in what is otherwise a pretty standard
        | conversion pipeline. I'm sure it's heavily in the Slack
        | engineers' interest to go along with the AI story to please the
        | Salesforce bosses who have mandated that AI must be used in
        | every task. Just don't fall for the spin here and think this
        | will actually save you time on a similar effort.
        
       | trescenzi wrote:
        | Slightly tangential, but one of the largest problems I've had
        | working with React Testing Library is a huge number of tests
        | that pass when they should fail. This might be because my team
        | and I are misusing it, but regularly a test will be written,
        | seem like it's testing something, and pass; yet if you flip the
        | condition or break the component, it doesn't fail as expected.
        | I'd really worry that any mass automated, or honestly manual,
        | method of test conversion would result in a large percentage of
        | tests which seem to be of value but actually just pass without
        | testing anything.
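        | 
        | One footgun that produces exactly this (an illustrative
        | sketch): queryBy* returns null for a missing element, and
        | expect(null).toBeDefined() still passes, because null is not
        | undefined:
        | 
        import { render, screen } from '@testing-library/react';
        
        test('looks like a test, can never fail', () => {
          render(<div />); // 'Saved!' is never rendered
          // passes anyway: null !== undefined
          expect(screen.queryByText('Saved!')).toBeDefined();
        });
        
        // Safer: getBy* throws when the element is missing, e.g.
        // expect(screen.getByText('Saved!')).toBeInTheDocument();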
        
       | torginus wrote:
       | Just to shamelessly plug one of my old projects, I did something
       | like this at a German industrial engineering firm - they wanted
       | us to rewrite a huge base of old tests written in TCL into C#.
       | 
       | It was supposed to take 6 months for 12 people.
       | 
        | Using an AST parser, I wrote a program in two weeks that
        | converted about half the tests flawlessly, with about another
        | third needing minor massaging, and the rest having to be done
        | by hand (I could've done better by handling more corner cases,
        | but I kinda gave up once I hit diminishing returns).
       | 
       | Although it helped a bunch that most tests were brain dead
       | simple.
       | 
        | The reaction was mixed - the newly appointed manager was kinda
        | fuming that his first project's glory was stolen from him by an
        | Assi, and the guys under him missed out on half a year of
        | leisurely work.
       | 
        | I left a month after that, but from what I heard, they decided
        | to pretend that my solution didn't exist at the management
        | level, and the devs just ended up manually copy-pasting the
        | output of my tool, doing a day's planned work in 20 minutes,
        | with the whole thing taking 6 months as planned.
        
       ___________________________________________________________________
       (page generated 2024-06-19 23:01 UTC)