[HN Gopher] Papermill: Parameterizing, executing, and analyzing ...
___________________________________________________________________
Papermill: Parameterizing, executing, and analyzing Jupyter
Notebooks
Author : mooreds
Score : 79 points
Date : 2024-09-18 13:34 UTC (9 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| edublancas wrote:
 | Papermill is great but has quite a few limitations because it
 | spins up a new process to run the notebook:
|
| - You cannot extract live variables (needed for testing)
|
| - Cannot use pdb for debugging
|
| - Cannot profile memory usage
|
| You can do all of that with ploomber-engine
| (https://github.com/ploomber/ploomber-engine).
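 |
 | (For reference, the whole papermill flow is a single call that
 | spins up that fresh process; file names here are hypothetical:)
 |
 |       import papermill as pm
 |
 |       # Runs the notebook in a fresh kernel process, injects the
 |       # parameters into the cell tagged "parameters", and saves an
 |       # executed copy with all outputs.
 |       pm.execute_notebook(
 |           "analysis.ipynb",             # input notebook
 |           "output/analysis-run.ipynb",  # executed copy
 |           parameters={"date": "2024-09-18", "region": "EU"},
 |       )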
|
| Disclaimer: I'm the author of this package
| ziddoap wrote:
| Not disclosed in this comment is that edublancas is
|
| > _Ploomber (YC W22) co-founder._
| Kalanos wrote:
 | who is a great technologist with a lot of hands-on
 | experience. If it made sense to leverage papermill, he would
 | have done so and focused on something else.
| ziddoap wrote:
| What does any of this have to do with disclosure?
| Kalanos wrote:
 | Calling attention to disclosure suggests bias. I'm
 | obviously saying that I trust him not to be biased.
| throwpoaster wrote:
 | IIRC, a few years back I was able to do all of these things
| with the Papermill IPython runtime.
|
| Papermill is great, but yes: lots of room to hack on it and
| make it better.
| edublancas wrote:
 | Has papermill deprecated the IPython runtime? I used
 | papermill extensively in the past and I never saw that in
 | their docs.
| throwpoaster wrote:
| It's been a while but you do it with a custom kernel and
| maybe some entry point tweaks. IIRC.
| mcpar-land wrote:
| What is the benefit of parameterizing a jupyter notebook over
| just writing python that's not in a jupyter notebook? I like
| jupyter notebooks for rapid prototyping but once I want to dial
| some logic in, I switch to just writing a .py file.
| zhoujing204 wrote:
| It might be a pretty useful tool for education. College courses
| related to Python and AI on Coursera have heavily used Jupyter
| Notebook for assignments and labs.
| jsemrau wrote:
 | I used papermill a while ago to automate a long-running python-
 | based data aggregation task. Airflow would log in remotely to
 | the server, kick off papermill, and track its progress.
 | Initially I wanted to use pure python, but the connection
 | dropped frequently, which prevented me from tracking progress,
 | and jupyter also enabled quick debugging when something went
 | wrong.
|
| Not one of my proudest moments, but it got the job done.
| __MatrixMan__ wrote:
| I think there are places where the figure-it-out-in-a-notebook
| part is one person's job, and then including it in a pipeline
| is another person's job.
|
| If they can call the notebook like a function, the second
| person's job becomes much easier.
| crabbone wrote:
 | I've been that person, and no, it doesn't. It makes my life
 | suck if I have to include a notebook, instead of an actual
 | program, in a larger program. Notebooks don't compose well,
 | they are too dependent on the specifics of the environment in
 | which they were launched, and they carry excessive, machine-
 | generated source code that's hard for humans to work with.
|
| As a stop-gap solution, for cases like a single presentation
| / proof-of-concept that doesn't need to live on and be reused
| -- it would work. Anything that doesn't match this
| description will accumulate technical debt very quickly.
| __MatrixMan__ wrote:
| I sort of suspected that adding parameters was not the end
| of the story. My experience with this was just "make it
| work with papermill", so the notebooks I tested with were
| nice and self contained.
|
| Although it does seem like packaging dependencies and
| handling parameters are separate problems, so I'm not sure
| if papermill is to be blamed for the fact that most
| notebooks are not ready to be handled like a black box,
| even after they're parameter-ready. Something like jupyenv
| is needed also.
| crabbone wrote:
 | Jupyter is not the end of the story here. There are
 | plenty of "extensions". These extensions generally fall
 | into two categories: kernels and magics.
 |
 | It's not very common for Jupyter magics to be added ad hoc
 | by users, but they typically create a huge dependency on
 | the environment, so no jupyenv is going to help (e.g. all
 | the workload-manager related magics that launch jobs in
 | Slurm / OpenPBS).
|
| Kernels... well, they can do all sorts of things...
| beyond your wildest dreams and imagination. And, unlike
| magic, they are readily available for the end-user to
| mess with. And, of course, there are a bunch of pre-
| packaged ones, supplied by all sorts of vendors who want,
| in this way, to promote their tech. Say, stuff like
| running Jupyter over Kubernetes with Ceph volumes exposed
| to the notebook. There's no easy way of making this into
| a "module" / "black box" that can be combined with some
| other Python code. It needs a ton of infra code to
| support this, if it's meant to be somewhat stand-alone.
| jdiez17 wrote:
| There are a lot of people who are not expert Python
| programmers, but know enough to pull data from various sources
| and make plots. Jupyter{Notebook,Lab} is great for that.
|
| As you say, from a programmer's point of view the logical thing
| to do is to convert the notebook to a Python module. But that's
| an extra step that may not be necessary in some cases.
|
| FWIW I used papermill in my Master's thesis to analyze a whole
| bunch of calibration data from IMUs. This gave me a nicely
| readable document with the test report, conclusions etc. for
| each device pretty easily.
| crystal_revenge wrote:
| I agree. I was at a company where some DS was really excited
| about Papermill, and I was trying to explain that this is an
| excellent time to stop working in a notebook and start writing
| reusable code.
|
| I was aghast to learn that this person had _never_ written non-
| notebook based code.
|
 | Code notebooks are great as _notebooks_, but should in no way
| replace libraries and well structured Python projects.
| Papermill to me is a huge anti-pattern and a sign that your
| team is using notebooks wrong.
| jdiez17 wrote:
| So you think it was a good move to scoff at someone for using
| a computer for their work in a way that is different from
| your preferences?
| crystal_revenge wrote:
| Notebooks are great as notebooks, but it's very well
| established, even in the DS community, that they are a
| terrible way to write maintainable, sharable, scalable
| code.
|
| It's not about preference, it's objectively a terrible idea
| to build complex workflows with notebooks.
|
| The "scoff" was in my head, the action that came out of my
| mouth was to help them understand how to create reusable
| Python modules to help them organize their code.
|
 | The answer is to help these teams build an understanding of
 | how to properly translate their notebook work into reusable
 | packages. There is really no need for data scientists to
 | follow terrible practices, and I've worked on plenty of
 | teams that have successfully onboarded DS as functioning
 | software engineers. You just need a process and a culture in
 | which notebooks are not the last stage of a project.
| fifilura wrote:
| The thing with data pipelines is they have a linear
| execution. You start from the top and work your way down.
|
| Notebooks do that, and even leave a trace while doing it.
| Table outputs, plots, etc.
|
 | It is not like a python backend that listens to events
 | and handles them as they come, sometimes even in parallel.
|
| For data flow, the code has an inherent direction.
| crystal_revenge wrote:
| > Notebooks do that, and even leave a trace while doing
| it.
|
| Perhaps the largest critique against notebooks is that
 | they _don't_ enforce a linear execution of cells. Every
| data scientist I know has been bitten by this at least
| once (not realizing they're in a stale cell that should
| have been updated).
|
| Sure you could solve this by automating the entire
| notebook ensuring top-down execution order but then why
| in the world are you using a notebook like this? There is
| no case I can think of where this would be remotely
| better than just pulling out the code into shared
| libraries.
|
| I've worked on a wide range of data science teams in my
| career and _by far_ the most productive ones are the ones
| that have large shared libraries and have a process in
| place for getting code out of notebooks and into a proper
| production pipeline.
|
 | Normally I'm the person _defending_ notebooks, since
 | there's a growing number of people who outright don't want
 | to see them used ever. But they do have their place, as
 | notebooks. I can't believe I'm getting downvoted for
 | suggesting one shouldn't build complex workflows using
 | _notebooks_.
| mooreds wrote:
| It's the same tradeoff of turning an excel spreadsheet into a
| proper program.
|
| If you do so, you gain:
|
| * the rigor of the SDLC
|
| * reusability by other developers
|
| * more flexible deployment
|
| But you lose the ability for a non-programmer to make
| significant changes. Every change needs to go through the
| programmer now.
|
| That is fine if the code is worth it, but not every bit of code
| is.
| fifilura wrote:
 | It also implies that an engineer has a better understanding
 | of what is supposed to be done and can discover all the
 | error modes.
|
| In my experience, most of the time the problem is in the
| input and interpretation of the data. Not fixable by a unit
| test.
| gnulinux wrote:
| It's a literate programming tool. If you find literate
 | programming useful (such as Donald Knuth's TeX), then you can
| write a Jupyter notebook, add text, add latex, titles,
| paragraphs, explanations, stories and attach code too. Then,
| you can just run it. I know that this sounds pretty rare but
| this is _mostly_ how I write code (not in Jupyter notebook, I
| use Markdown instead and write code in a combination of
| Obsidian and Emacs). To me, code is just writing, there is no
| difference between prose, poetry, musical notation, or computer
 | programming. They're just different languages that mean
| something to human beings and I think they're done best when
| they're treated like writing.
| crabbone wrote:
| I have to disagree... Literate programming is still
| programming: it produces programs (but with an extra effort
| of writing documentation up-front).
|
| Jupyter is a tool to do some exploratory interactive
| programming. Most notebooks I've seen in my life (probably
| thousands at this point) are worthless as complete programs.
 | They are more akin to shell sessions, which, for the most
 | part, I wouldn't care to store for later.
|
| Of course, Jupyter notebooks aren't the same as shell
| sessions, and there's value in being able to re-run a
 | notebook, but they are so bad at being programs that there's
 | probably a number N in the low two digits where, if you expect
 | to have to run a notebook more than N times, you are better
 | off writing an actual program instead.
| abdullahkhalids wrote:
| > Don't get discouraged because there's a lot of mechanical
| work to writing. There is, and you can't get out of it. I
| rewrote A Farewell to Arms at least fifty times. You've got
| to work it over. The first draft of anything is shit.
| Ernest Hemingway
|
| This is how all intellectual work proceeds. Most of the
| stuff you write is crap. After many iterations you produce
| one that is good enough for others. Should we take away the
| typewriter from the novel writers too, along with Jupyter
| notebooks from scientists, because most typed pages are
| crap?
| crabbone wrote:
 | I think you completely missed the point... I compared
 | Jupyter notebooks to shell sessions: that doesn't make them
 | bad (they are, but for a different reason). I don't think
 | that shell sessions are bad. The point I'm making is that
 | Jupyter notebooks aren't suitable for being independent
 | modules inside a larger program (and neither are shell
 | sessions). The alternative is obvious: just write the
 | program.
|
 | Can you possibly make a Jupyter notebook act like a module
 | in a program? -- with a lot of effort and determination,
 | yes. Should you be doing this, especially since the
 | alternative is very accessible and produces far superior
 | results? -- Of course not.
|
| Using your metaphor, I'm not arguing for taking the
| typewriter away from the not-so-good writers. I'm arguing
| that maybe they can use a computer with a word processor,
| so that they don't waste so much paper.
| gnulinux wrote:
| Literate programming is not just "documentation + code" any
| more than a textbook you read about Calculus is
| "documentation + CalculusCode" or a novel is "documentation
| + plot". It goes way beyond that, using literate
| programming you can attach an arbitrary text that
| accompanies the code such that _fragments_ of your code is
| simply one part of the whole text. Literate programming is
| not just commenting (or supercommenting), if it were, you
| could use comments, it 's a practice of simply attaching
| fragments of code in a separate text such that you can then
| later utilize that separate text the same way you utilize
| code. When you write a literate program, your end goal is
| the text and the program, not just the program. You can
| write a literate program, and publish it _as is_ as a
| textbook, poem, blog post, documentation, website, fiction,
 | musical notation, etc. Unless you think that _all human
 | writing is documentation_, literate programming is not
 | _just_ documentation.
| zelphirkalt wrote:
 | Does it support more of literate programming than the small
 | set of features that a normal Jupyter notebook supports?
|
| I always wish they would take a hint from Emacs org mode and
| make notebooks more useful for development.
| gnulinux wrote:
 | No, it supports less, actually. Obsidian is only a markdown
 | editor; it does allow you to edit code fragments as code
 | (so there is basic code highlighting, auto-tabbing, etc.),
 | but that's it. I personally find this a lot easier in some
 | cases. I find that if the code is so complicated that you
 | need anything more than just "seeing" it, you probably need
 | to break it down further into its atomic elements. For
 | certain kinds of development, I do find myself needing to
| be in "programming groove" then I use Emacs. But other
| times, I accompany the code with a story and/or technical
| description so it feels like the end goal is to write the
 | document, and not the code. Executable code is just an
 | artifact that comes with it. It's definitely a niche
 | application as far as the industry goes.
| swalsh wrote:
| My experience is more with Databricks, and their workflow
| system... but the concept is exactly the same.
|
 | It lets data scientists work in the environment they work best
 | in, and it makes it easier to productionize work. If you
 | separate them, then there's a translation process to move the
 | code into whatever the production format is, which means extra
 | testing and extra development.
| z3c0 wrote:
| Parameterizing notebooks is a feature common to modern data
| platforms, and most of its usefulness comes from saving the
 | output. That makes it easier to debug ML pipelines and such,
 | because the code, documentation, and last output are all in one
 | place. However, I don't see any mention of what happens to the
 | outputs with this tool.
| singhrac wrote:
| We use papermill extensively, and our team is all good
 | programmers. The difference is plots. It is a lot easier to
 | write a notebook (or modify our existing template) to create a
 | plot of X vs Y than it is to build and test a script that
 | outputs, e.g., a PDF.
|
| For example, if your notebook runs into a bug, you can just run
| all the cells and then examine the locals after it breaks. This
| is extremely common when working with data (e.g. "data is
| missing on date X for column Y... why?").
|
| I think most of the "real" use cases for notebooks is data
| analysis of various kinds, which is why a lot of people dismiss
| them. I wrote a blog post about this a while ago:
| https://rachitsingh.com/collaborating-jupyter/
| reeboo wrote:
| As an MLE who comes from backend web dev, I have flip-flopped
| on notebooks. I initially felt that everything should be in a
| python script. But I see the utility in notebooks now.
|
| For notebooks in an ML pipeline, I find that data issues are
| usually where things fail. Being able to run code "up to" a
| certain cell and create plots is invaluable. Creating reports
| by creating a data frame and displaying it as a cell is also
| super-handy.
|
| You say, "dial some logic in", which is begging the wrong
| question (in my experience, at least). The logic in ML is
| usually very strait forward. It's about the data coming into
| your process and how your models are interacting with it.
| jamesblonde wrote:
| I agree completely with this. Papermill output is a notebook
| - that is the log file. You can double click on it, it opens
| in 1-2 seconds and you can see visually how far your notebook
| progressed and any plots you added for debugging.
| kremi wrote:
| Some of the replies here are pretty good, I basically agree
| with "if it works for your data scientists then why not".
|
 | I'm actually a software developer with 10 years of experience
 | who also happens to do data science, and I've found myself in
 | situations where I parametrized a notebook to run in
 | production. So it's not that I can't turn it into plain
 | Python. The main reasons are:
|
| 1. I prototype in a notebook. Translating to python code
| requires extra work. In this case there's no extra dev
| involved, it's just me. Still it's extra work.
|
| 2. You can isolate the code out of the notebook and in theory
| you've just turned your notebook into plain py. You could even
 | log every cell output to your standard logging system. But you
 | lose the context of every log. Some cells might output graphs. The
| notebook just gives you a fast and complete picture that might
| be tedious to put together otherwise.
|
| 3. The saved notebook also acts as versioning. In DS work you
| could end up with lots of parameters or small variations of the
| same thing. In the end what has little variations I put in
| plain python code. What's more experimental and subject to
| change I put in the notebook. In certain cases it's easier than
| going through commit logs.
|
| 4. I've never done this but a notebook is just json so in
| theory you could further process the output with prestodb or
| similar.
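 |
 | (For point 4, a rough sketch of walking a saved notebook's JSON
 | with nbformat -- the file name here is hypothetical:)
 |
 |       import nbformat
 |
 |       # A saved notebook is just JSON: every cell carries its
 |       # source, and code cells carry a list of outputs.
 |       nb = nbformat.read("run-2024-09-18.ipynb", as_version=4)
 |       for cell in nb.cells:
 |           if cell.cell_type == "code":
 |               for out in cell.outputs:
 |                   # e.g. "stream", "execute_result", "display_data"
 |                   print(out.output_type)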
| notpushkin wrote:
| > Do you want to run a notebook and depending on its results,
| choose a particular notebook to run next?
|
| Hell no. I want to rewrite all that as a proper script or Python
| module.
| p4ul wrote:
| Indeed! I feel like we as a community have taken a wrong turn
| with our use of notebooks. I think they have benefits in some
| specific use cases (e.g., teaching, demos, etc.), but
| otherwise, I think they mostly encourage bad practices for
| software development.
| morkalork wrote:
| I once built an unholy combination of papermill and nbconvert to
| mass produce monthly reports using a "template" notebook. All the
| code was imported from a .py file so the template just took a
| client ID as input and called out to render_xyz(...) in each
| section. It was nice because it produced a bunch of self-
| contained static files and wrote them to a network drive. It was
| definitely _a_ solution to the problem.
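 |
 | Roughly, the shape of it was something like this (the client list
 | and file names below are hypothetical, and the real thing had more
 | plumbing):
 |
 |       import subprocess
 |       import papermill as pm
 |
 |       CLIENT_IDS = ["acme", "globex"]  # hypothetical client list
 |
 |       for client_id in CLIENT_IDS:
 |           executed = f"reports/{client_id}-monthly.ipynb"
 |           # Execute the template with the client ID injected as a
 |           # papermill parameter.
 |           pm.execute_notebook(
 |               "monthly_report_template.ipynb",
 |               executed,
 |               parameters={"client_id": client_id},
 |           )
 |           # Render the executed notebook to a self-contained HTML
 |           # file (code cells hidden).
 |           subprocess.run(
 |               ["jupyter", "nbconvert", "--to", "html",
 |                "--no-input", executed],
 |               check=True,
 |           )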
| pplonski86 wrote:
 | I have a few ML pipelines that simply use nbconvert to execute
 | notebooks. Regarding the python script vs notebook debate, I
 | think it all depends on your use case. I like that I can display
 | plots in notebooks without any additional work.
| miohtama wrote:
 | I looked at Papermill back in the day, but found it easier to
 | call nbclient and nbconvert directly
|
| https://github.com/tradingstrategy-ai/trade-executor/blob/ma...
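 |
 | The nbclient route is only a few lines (a sketch; paths here are
 | hypothetical):
 |
 |       import nbformat
 |       from nbclient import NotebookClient
 |
 |       # Load the notebook, execute it against a fresh kernel, and
 |       # write back the executed copy with its outputs.
 |       nb = nbformat.read("strategy.ipynb", as_version=4)
 |       NotebookClient(nb, timeout=600, kernel_name="python3").execute()
 |       nbformat.write(nb, "strategy-executed.ipynb")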
| v3ss0n wrote:
 | How hard is it to copy, paste and run the notebook code within
 | a proper http server?
| sa-code wrote:
| There is some utility in seeing the code and the output right
| below it.
| Kalanos wrote:
 | Is this still being developed? The last commit to the main
 | library was 5 months ago and it's tied to exceptions/tests.
| barrrrald wrote:
 | Seems like it's mostly died off; most people I know have moved
 | to hosted solutions like Hex or Colab
| reeboo wrote:
| It's a thin wrapper around notebooks. Does it really need more
| features? Not saying that it couldn't, but it is feature
| complete for what its job is.
| Kalanos wrote:
 | Things break due to shifting dependencies.
 |
 | Also, if it isn't maintained by the company that made it,
 | then it is a good sign that they are no longer using it. It
 | suggests that there is a better solution elsewhere.
| iamleppert wrote:
| Jupyter notebooks are missing strict types, a linter and unit
| tests. When can those features be added?
| big-chungus4 wrote:
| vscode jupyter uses the same extensions as vscode, so you can
 | get a linter and strict type checking. Not sure about tests
| though
| ogrisel wrote:
| With papermill you can parametrize a notebook and run it on
| different inputs to check that it is not raising uncaught
 | exceptions. This can be wrapped to be part of a pytest test
 | suite, possibly via some ad-hoc pytest fixture or plugin.
|
| If the notebooks themselves contain assertions to check that
| expectations on the outputs are met, then you have an automated
| way to check that the notebooks behave the way you want on some
| test inputs. For long notebooks, this is more like
| integration/functional tests rather than unit tests, but I
| think this is already an improvement over manually run
| notebooks.
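 |
 | A minimal sketch of such a test (notebook path and parameters
 | are hypothetical):
 |
 |       import papermill as pm
 |       import pytest
 |
 |       @pytest.mark.parametrize("start_date", ["2024-01-01",
 |                                               "2024-06-01"])
 |       def test_report_notebook_runs(tmp_path, start_date):
 |           # Fails if any cell raises an uncaught exception;
 |           # assertions inside the notebook act as output checks.
 |           pm.execute_notebook(
 |               "notebooks/report.ipynb",
 |               str(tmp_path / "report-out.ipynb"),
 |               parameters={"start_date": start_date},
 |           )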
|
 | Not sure about strict types: you mean running mypy on a
| notebook? Maybe this can be helpful:
|
| - https://pypi.org/project/nb-mypy/
|
| About linters, you can install `jupyterlab-lsp` and `python-
| lsp-ruff` together for instance.
| jamesblonde wrote:
 | For MLOps platforms, Papermill is one of the reasons why we no
 | longer include experiment tracking out of the box in
 | Hopsworks. You can easily see the results of training runs as
| notebooks - including loss curves, etc. Any models that completed
| get registered in the model registry along with plots, a model
| card, and model evaluation/validation metrics.
| __mharrison__ wrote:
| I teach a lot using Jupyter. It is certainly possible to use SWE
| worst practices in Jupyter easily.
|
| I am often in front of folks who "aren't computer programmers"
| but need to use Python tools to be successful. One of my covert
| goals is to teach SWE best practices inside of notebooks. It
| requires a little more typing but eases the use of notebooks,
| refactoring, testing, moving to scripts, and using tooling like
| Papermill.
| akshayka wrote:
| Have you considered using marimo notebooks?
|
| https://github.com/marimo-team/marimo
|
| marimo notebooks are stored as pure Python (executable as
| scripts, versionable with git), and they largely eliminate the
| hidden state problem that affects Jupyter notebooks -- delete a
| variable and it's automatically removed from program memory,
| run a cell and all other cells that use its variables are
| marked as stale.
|
| marimo notebooks are also readily parametrized with CLI
| arguments, so you can do: python notebook.py -- -foo 1 -bar 2
| ...
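 |
 | For example, a cell inside the notebook can read those back (a
 | sketch assuming marimo's mo.cli_args() helper and its dict-like
 | .get()):
 |
 |       import marimo as mo
 |
 |       # Arguments passed after "--" are exposed to the notebook;
 |       # defaults keep the cell usable when run interactively.
 |       args = mo.cli_args()
 |       foo = args.get("foo", 0)
 |       bar = args.get("bar", 2)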
|
| Disclosure: I'm a marimo developer.
| ThouYS wrote:
| thanks mate, exactly what I've been looking for
| cycomanic wrote:
| There is also jupytext which converts Jupyter notebooks on
| the fly to a number of different formats (Markdown,
| python,...). It's at the core of the Jupyterbook project IIRC
| and IMO the best method to use Jupyter with git.
| __mharrison__ wrote:
| I use Jupytext (and my own conversion utilities) all the
| time. I write my books inside of Jupyter these days.
| cmcconomy wrote:
| In the past, I've used this to generate HTML outputs to reflect a
| series of calculations and visualizations so that we can share
| with clients.
___________________________________________________________________
(page generated 2024-09-18 23:01 UTC)