[HN Gopher] Doctests in R
       ___________________________________________________________________
        
       Doctests in R
        
       Author : dash2
       Score  : 37 points
       Date   : 2022-11-26 22:57 UTC (1 days ago)
        
 (HTM) web link (hughjonesd.github.io)
        
       | tialaramex wrote:
       | I don't think "doctests built in" means the same thing when
       | Python has a module you must choose to import and add to your
       | test infrastructure, versus Rust just tests your documented
       | examples without any special steps - unless you tell it not to
       | (either that a specific example is not suitable for testing, or
       | just categorically don't run doctests).
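       | 
       | A minimal Python sketch of that opt-in step. The function
       | `double` and the helper `check_examples` are made up for
       | illustration; only the `doctest` calls are standard library.
       | The docstring example is inert until something runs it:

```python
import doctest


def double(x):
    """Return twice the input.

    >>> double(21)
    42
    """
    return 2 * x


def check_examples(func):
    """Run the docstring examples of one function; return (failed, attempted)."""
    # Nothing checks the example in `double` until doctest is invoked explicitly.
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    failed = attempted = 0
    for test in finder.find(func, name=func.__name__, globs={func.__name__: func}):
        result = runner.run(test)
        failed += result.failed
        attempted += result.attempted
    return failed, attempted


if __name__ == "__main__":
    print(check_examples(double))
```

       | In Rust the equivalent wiring is unnecessary, since documented
       | examples are picked up by the normal test run.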
       | 
       | This idea (doctests) is one of those crucial "You can, you
       | should, but you probably don't" things where the point is to make
       | doing The Right Thing(tm) so easy that you actually do it rather
       | than just nodding when people say you _should_ do it. You clearly
       | should check that your documented APIs match reality, that
       | examples work, but in many languages that's not easy enough to
       | do out of the box, so either a project invests time and effort to
       | do this or they go without and most projects will go without.
       | 
       | Over a year ago I wrote some MS Teams integration code in C#.
       | Microsoft publishes documentation with examples for how to use
       | the APIs for this. The documentation is wrong. There are open
       | bugs on long abandoned (as is usual for Microsoft) repositories,
       | nobody cares. But chances are back when they _shipped_ the
       | documentation if that had failed, somebody would have fixed it,
       | maybe it'd take a few minutes or even an hour, but it'd save a
       | nasty experience for, let's say conservatively, thousands of
       | developers. Instead it's just a much upvoted Stack Overflow
       | answer with a workaround.
        
         | masklinn wrote:
         | > I don't think "doctests built in" means the same thing when
         | Python has a module you must choose to import and add to your
         | test infrastructure
         | 
         | If you're using pytest, then it's just a flag away (well two if
         | you're using both docstring-doctests and document-doctests, but
         | the latter seems unlikely).
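         | 
         | For reference, the pytest flags in question are
         | `--doctest-modules` (docstrings) and `--doctest-glob`
         | (standalone documents). As a rough standard-library sketch
         | of the second kind, assuming a made-up document text and a
         | hypothetical helper `check_document`:

```python
import doctest

# Hypothetical prose document with an embedded interactive example.
README_TEXT = """
Adding numbers
--------------

Addition works as you would expect:

>>> 1 + 2
3
"""


def check_document(text, name="README"):
    """Run the examples embedded in a prose string; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    # Everything that is not a ">>>" example or its expected output is ignored.
    test = parser.get_doctest(text, globs={}, name=name, filename=None, lineno=0)
    runner = doctest.DocTestRunner(verbose=False)
    result = runner.run(test)
    return result.failed, result.attempted


if __name__ == "__main__":
    print(check_document(README_TEXT))
```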
         | 
         | But that it's opt-in makes sense either way, as doctest support
         | was added later on, whereas in Rust it's been there all along.
        
       | vharuck wrote:
       | Not a fan.
       | 
       | - Example sections will be cluttered with unit tests.
       | 
       | - Doc tests asserting warnings or errors will produce examples of
       | bad code. This might make sense for the example `safe_mean`
       | function, where its only purpose is wigging out for improper
       | input. But most functions should just show how to use them.
       | 
       | - Test scripts are still useful for setting up loops, creating
       | helper functions, or other stuff. But then test code will be
       | split between the roxygen comments and those test scripts.
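       | 
       | To illustrate the second point in Python terms, here is a
       | sketch of a docstring that tests an error. This `safe_mean`
       | is a hypothetical Python analogue, not the function from the
       | article. Note that the failing call becomes part of the
       | rendered documentation:

```python
import doctest


def safe_mean(values):
    """Return the arithmetic mean, refusing empty input.

    Ordinary usage, which is what most readers want to see:

    >>> safe_mean([1, 2, 3])
    2.0

    Error case, which also ends up in the documentation:

    >>> safe_mean([])
    Traceback (most recent call last):
        ...
    ValueError: cannot take the mean of an empty sequence
    """
    if len(values) == 0:
        raise ValueError("cannot take the mean of an empty sequence")
    return sum(values) / len(values)


if __name__ == "__main__":
    # Prints a report for each example in the docstring.
    doctest.run_docstring_examples(safe_mean, {"safe_mean": safe_mean}, verbose=True)
```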
       | 
       | I use doc tests in python scripts, because they're quick sanity
       | checks that fit in the same file. I don't use them in packages.
       | If R had doc tests, I'd rather use them in single-file scripts.
       | Maybe a function that acts like `source` but also generates and
       | inserts the tests.
        
         | masklinn wrote:
         | > - Example sections will be cluttered with unit tests.
         | 
         | The fundamental purpose of doctests is not to write unit tests,
         | but to ensure your examples are valid. It's easy to write
         | examples which don't work in docstrings.
         | 
         | Running a doctest system on your documentation doesn't preclude
         | having actual tests, quite the opposite. Edge cases or
         | complicated scenarios often don't make for great examples, but
         | are usually valuable tests.
         | 
         | For instance in Rust most methods of Vec have an example, which
         | is doctested, and yet Vec still has an extensive suite of unit
          | tests:
          | https://github.com/rust-lang/rust/blob/master/library/alloc/...
         | 
         | Technically you could use doctests as a literate-ish test
         | framework (assuming that's even supported, which it may not
         | be), but the oddball environment tends to make that not great,
         | and the "literate" part is not very useful when unit testing.
         | It's way more valuable to ensure docstrings and standalone
         | documentation are valid.
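         | 
         | A small Python sketch of that split, using a made-up
         | `clamp` function: the docstring carries one readable
         | example, while the edge cases live in an ordinary unit
         | test class (`ClampEdgeCases` and `run_edge_cases` are
         | hypothetical names).

```python
import unittest


def clamp(value, low, high):
    """Limit value to the closed interval [low, high].

    >>> clamp(15, 0, 10)
    10
    """
    if low > high:
        raise ValueError("low must not exceed high")
    return max(low, min(value, high))


class ClampEdgeCases(unittest.TestCase):
    # Cases that matter for correctness but would clutter the docs.
    def test_bounds_are_inclusive(self):
        self.assertEqual(clamp(0, 0, 10), 0)
        self.assertEqual(clamp(10, 0, 10), 10)

    def test_degenerate_interval(self):
        self.assertEqual(clamp(5, 3, 3), 3)

    def test_inverted_interval_is_rejected(self):
        with self.assertRaises(ValueError):
            clamp(5, 10, 0)


def run_edge_cases():
    """Run the unit tests above and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ClampEdgeCases)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```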
        
           | apwheele wrote:
           | Yeah, and the R ecosystem has this built in (checking that
           | the examples run, not that their output is correct). One of
           | the checks in `R CMD check` is whether the examples in your
           | help docs generate runtime errors.
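           | 
           | A rough Python analogue of "run the examples, don't
           | compare the output". This is not how `R CMD check` is
           | implemented; `examples_run_cleanly` is a hypothetical
           | helper built on the standard `doctest` parser.

```python
import doctest


def examples_run_cleanly(docstring, globs=None):
    """Execute every example in a docstring, ignoring the expected output.

    Returns True if no example raised an exception, False otherwise.
    """
    namespace = dict(globs or {})
    for example in doctest.DocTestParser().get_examples(docstring):
        try:
            # "single" mode mimics the interactive prompt, as doctest itself does.
            exec(compile(example.source, "<example>", "single"), namespace)
        except Exception:
            return False
    return True
```

           | With this kind of check an example whose documented
           | output is wrong still passes, as long as it does not
           | raise.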
           | 
           | I use roxygen example help file generation as well for R
           | packages, but have mixed feelings relative to python
           | documentation.
        
         | crispyambulance wrote:
         | I am inclined to agree. Unit tests and documentation are two
         | SEPARATE things with different intentions. IMHO, mixing these
         | together harms both.
         | 
         | Unit tests are primarily intended for the developers of the
         | library. They do help users, sometimes, when you're trying to
         | work out some fundamental misconception about what the library
         | DOES, but generally speaking the granularity of unit tests is
         | too fine unless you're REALLY digging in.
         | 
         | Much better, I think, to spend effort on writing clear
         | documentation. R has a problem with that. Docs typically have
         | overwhelmingly terse detail followed by anemic examples. Couple
         | this with the fact that R users tend to always be in the
         | middle of something urgent and completely unrelated to writing
         | libraries (like, analyzing data and making decisions based on
         | that data) and you get a recipe for frustration.
         | 
         | I find myself referring over and over again to the tidyverse
         | "cheatsheets" [https://posit.co/resources/cheatsheets/]. These
         | show, explicitly and clearly, what the things actually do. I
         | wish someone put that kind of, often graphical, content into
         | the docs for all functions.
        
           | StarlaAtNight wrote:
           | I SERIOUSLY relate with you on the "anemic examples" part.
           | It's one of my biggest frustrations with R.
           | 
           | And granted, this is a complaint for base R and other older
           | spaces in the R world...the tidyverse packages (and modern
           | ones inspired by it) tend to have pretty great examples with
           | lots of iron
        
       | cardosof wrote:
       | Not really related to the post but since R is so rare here in
       | Hacker News, I will ask anyway: is R still worth using in
       | 2022-23? Even RStudio gave up its R brand to focus on Python.
        
         | nerdponx wrote:
         | Use it if you find it useful. It still has a much better and
         | more vibrant ecosystem for statistics, including Bayesian
         | statistics and certain kinds of time series analysis.
         | Data.table is also a serious "power tool", although other non-
         | Pandas data frame libraries like Polars might be dethroning it.
         | Also GGPlot is still awesome, even if you can now get it in
         | Python with Plotnine.
        
         | BrandonS113 wrote:
         | R has much, much better statistical packages than Python. If it
         | is statistics, you can probably find a package in R to do it,
         | which is not the case with Python. And the programming language
         | is much better for statistics than numpy/pandas if a package is
         | not sufficient. I use both, and for statistics have no choice
         | but to use R. For data, I use Python.
        
         | vhhn wrote:
         | There are still several areas where R beats Python: tabular
         | data crunching, data analysis (plotting, stats), finance
         | (econometrics etc...) but it's less and less obvious.
        
         | throwaway_2341 wrote:
         | > Even RStudio gave up its R brand to focus on Python
         | 
         | Wouldn't R still be the primary language in RStudio, with
         | Python being made available as necessary? Or is the idea that
         | RStudio will turn into a proper Python IDE? Curious what makes
         | you say that RStudio is putting its 'focus' on Python.
        
           | cardosof wrote:
           | They changed their name to Posit so yeah, that's a conscious
           | move away from R.
        
         | jstx1 wrote:
         | If you work with mostly tabular data, never deploy anything and
         | don't need any deep learning, then it's fine.
        
         | goosedragons wrote:
         | I think so. It's still better at the things it was always
         | better at, data analysis. I could be biased since it's my main
         | language though.
        
         | kickout wrote:
         | Yes it's worth using IMO. Plotting and grokking is better than
         | python IMO.
        
       | closed wrote:
       | One interesting thing about R examples is their outputs tend to
       | be bigger. I think this is in direct contrast to python
       | docstrings, where outputs are very concise--because you manually
       | include the output for doctest.
       | 
       | I wonder if a challenge for doctests in R is they often have to
       | test larger, more realistic outputs?
       | 
       | For example, in dplyr's mutate doc, one example is this:
       | 
       |     starwars %>%
       |       select(name, mass) %>%
       |       mutate(
       |         mass2 = mass * 2,
       |         mass2_squared = mass2 * mass2
       |       )
       | 
       | This example's output is a dataframe with 4 columns and will
       | display the first 5 rows.
       | 
       | On the other hand in siuba (a port of dplyr to python), I often
       | have to truncate the example output, because it's hard coded in
       | the docstring:
       | 
       |     (cars
       |       >> mutate(
       |         cyl2 = _.cyl * 2,
       |         cyl4 = _.cyl2 * 2
       |       )
       |       >> head(2)
       |     )
       |        cyl   mpg   hp  cyl2  cyl4
       |     0    6  21.0  110    12    24
       |     1    6  21.0  110    12    24
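       | 
       | One way to keep hard-coded output short in a Python
       | docstring is doctest's `+ELLIPSIS` directive. A minimal
       | sketch with a made-up `squares` function (not siuba code):

```python
import doctest


def squares(n):
    """Return the first n square numbers.

    The full list has 100 entries, so the expected output is abbreviated:

    >>> squares(100)  # doctest: +ELLIPSIS
    [0, 1, 4, 9, ..., 9801]
    """
    return [i * i for i in range(n)]


if __name__ == "__main__":
    # Prints a report for the single example in the docstring.
    doctest.run_docstring_examples(squares, {"squares": squares}, verbose=True)
```

       | The `...` matches any run of text, so the example still
       | checks the start and end of the output without listing
       | every value.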
       | 
       | It's nice you can see the full example in the docstring in
       | python, but also very handy seeing complex examples on R doc
       | pages:
       | 
       | https://dplyr.tidyverse.org/reference/mutate.html#ref-exampl...
        
       | civilized wrote:
       | Couldn't people just add expect_* tests to their examples? What's
       | the benefit of adding all this new notation and magic?
       | 
       | Disclaimer: I'm an R programmer but not deeply familiar with
       | authoring packages.
        
         | vharuck wrote:
         | The idea in TFA is to keep a function's definition,
         | documentation, and unit tests next to each other in a single
         | file.
         | 
         | >Couldn't people just add expect_* tests to their examples?
         | 
         | Users can run examples with the `example` function. So if you
         | use the `testthat` package in examples, then you should add it
         | to your package's imports. Which means more to load with the
         | package, but only for a small benefit that's rarely used.
         | 
         | Also, raising warnings or errors in examples and not catching
         | them is a no-no. The CRAN package repository will not accept a
         | package like that.
         | 
         |  _Edit: I originally wrote that this wouldn't create any
         | examples in the final manual pages, but I was wrong._
        
           | civilized wrote:
           | Ok, I think I might get it now? The tests are written down in
           | the example, but are only run by the package developer and
           | the results are hidden from the user? That seems like a good
           | thing. The user wants to see the example but doesn't care
           | about whether your test passed.
           | 
           | The magic makes sense now too. You already need a roxygen2
           | header to set up the auto-generated tests, so why not call it
           | @expect and then write equal instead of expect_equal, so as
           | not to repeat yourself?
        
             | masklinn wrote:
             | > Ok, I think I might get it now? The tests are written
             | down in the example, but are only run by the package
             | developer and the results are hidden from the user? That
             | seems like a good thing. The user wants to see the example
             | but doesn't care about whether your test passed.
             | 
             | But surely the user wants to see what the result of the
             | call is, if it's relevant? That's why rust examples (which
             | are also doctests) include the corresponding assertions.
             | 
             | You _can_ hide them, but usually you don't, because e.g.
             | showing what the result of `str::len` is is the point of
             | having an example:
             | https://doc.rust-lang.org/std/primitive.str.html#method.len
             | 
             | Unless roxygen or Rd independently runs the code and embeds
             | the output independent of the doctests succeeding or
             | failing?
        
               | civilized wrote:
               | Right, the user wants to see the result, but doesn't care
               | about the developer's test that the result is the
               | expected result.
        
       | nonrandomstring wrote:
       | With tests, sometimes you want to embed test data, as a here-
       | document so that tests don't get separated from minimal datasets
       | needed. In perl it was customary to use <<"EOF";...EOF and """
       | triple quotes """ serve similar utility in Python. What's the
       | deal in R? Just make a vector in the test?
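       | 
       | For the Python case, a sketch of what that looks like: a
       | small CSV kept in a triple-quoted string next to the code
       | that uses it. The data values and the helper names
       | `load_rows` and `mean_mpg` are made up for illustration.

```python
import csv
import io

# Small dataset embedded next to the test that uses it (made-up values).
CARS_CSV = """\
model,cyl,mpg
alpha,6,21.0
beta,4,22.8
gamma,8,18.7
"""


def load_rows(text):
    """Parse embedded CSV text into a list of dicts with numeric columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append(
            {"model": row["model"], "cyl": int(row["cyl"]), "mpg": float(row["mpg"])}
        )
    return rows


def mean_mpg(rows):
    """Average the mpg column of the parsed rows."""
    return sum(r["mpg"] for r in rows) / len(rows)
```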
        
         | civilized wrote:
         | The strategies I usually see are (1) use a built-in dataset (2)
         | make the data at the beginning of the example.
        
       ___________________________________________________________________
       (page generated 2022-11-27 23:01 UTC)