[HN Gopher] XetCache: Improve Jupyter notebook reruns by caching...
___________________________________________________________________
XetCache: Improve Jupyter notebook reruns by caching cells
Author : skadamat
Score : 73 points
Date : 2023-12-19 15:18 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| morkalork wrote:
| Tempting but also sounds like a nightmare for writing notebooks
| with reproducible results?
| skadamat wrote:
| I use XetCache a bunch daily, and I've honestly found the
| _opposite_. Here are a few situations:
|
| - My notebook queries a database for some data -> I run some
| downstream calculations. Re-querying returns slightly different
| data, but using cached outputs means I can replicate the work.
|
| - In another notebook, I was trying to replicate my coworker's
| work, where he queried OpenAI a few thousand times. The results
| may be different when I run it, since OpenAI changed something.
|
| When notebooks are used well, it's in the context of internal
| data / ML project collaboration. So being able to flash-freeze
| a cell for your collaborators helps you juggle and pass around
| notebooks more easily with confidence.
|
| Shipping notebooks directly to production is a whole other set
| of challenges that I have my own, separate opinions on, but
| that's a different beast entirely.
| ylow wrote:
| Author here.
|
| The goal in a way is _better_ reproducibility. The memo hashes
| the contents of the inputs, so if your cell is deterministic
| (and you cover all the inputs), the memo should give you the
| right answers. Of course if you want to rerun everything you
| can always just delete the memos, but properly used it should
| be a strict performance improvement.
|
| Of course there are many improvements that can be made
| (tracking if dependent functions have changed, etc.) and of
| course there are inherent limitations. This is very much "work
| in progress", but it is quite useful right now!
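|
| A simplified sketch of the underlying idea (content-addressed
| memoization; illustrative code, not XetCache's actual
| implementation):
|
|     import hashlib, os, pickle
|
|     def memo(cache_dir, func, *inputs):
|         # Key = hash of the serialized inputs. pickle is
|         # illustrative; any stable serialization works.
|         key = hashlib.sha256(pickle.dumps(inputs)).hexdigest()
|         os.makedirs(cache_dir, exist_ok=True)
|         path = os.path.join(cache_dir, key)
|         if os.path.exists(path):   # memo hit: skip the compute
|             with open(path, "rb") as f:
|                 return pickle.load(f)
|         result = func(*inputs)     # memo miss: compute and store
|         with open(path, "wb") as f:
|             pickle.dump(result, f)
|         return result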
| mirashii wrote:
| > The goal in a way is better reproducibility. The memo
| hashes the contents of the inputs, so if your cell is
| deterministic (and you cover all the inputs), the memo should
| give you the right answers.
|
| That's an enormously large if, and without any way to detect
| or manage when that's not the case, it's hard to see how this
| could aid reproducibility. This adds a new failure mode for
| reproducibility where an important side-effect is not rerun,
| altering the results.
| hedgehog wrote:
| Think of this like incremental builds via make etc. You
| don't necessarily trust the output 100%, but it speeds up
| work so much that it's worth it; you can always do a clean
| build at key points where you want more confidence.
| ltbarcly3 wrote:
| That only works for things entirely encapsulated by the value
| of the cell, which isn't a useful amount of stuff. This will
| only help with primitive toy examples where you are slinging
| numpy arrays or other easily hashed data with no
| encapsulation, indeterminacy, or order dependence.
|
| Sorry, if it were possible to just generically cache
| computations it would be done everywhere for everything. This
| is just going to help with toy examples.
|
| The only way this leads to "better reproducibility" is if it
| fails to recompute things that it should have recomputed. If
| the computation was actually deterministic from its inputs,
| the best and literally the only possible thing this can do is
| exactly what recomputing it would do, but faster. Frankly,
| that you said it is more reproducible is enough evidence for
| me that you are doing something fundamentally broken.
| juujian wrote:
| There is a similar feature built into R Markdown that I have
| used a couple of times. That, and eval=FALSE, which stops you
| from accidentally rerunning large queries, is actually really
| helpful for figuring out complex computations on large
| datasets.
| krastanov wrote:
| This is very neat, but these days I am truly enamoured with
| "reactive" notebooks. They are rather orthogonal to the concern
| here. Reactive notebooks automatically build a dependency tree
| between cells. When you edit a cell, all cells depending on it
| get re-evaluated.
|
| It is extremely valuable for reproducibility.
|
| My favorite example is Pluto Notebooks in Julia (very Julia-
| specific), but I have seen (maybe less polished) similar tools in
| other languages too: https://plutojl.org/
|
| On the other hand, when it comes to caching, Mandala (in
| Python) brings the best of both caching and reactivity. It is
| a truly amazing memoization tool for computational graphs,
| much more sophisticated and more powerful than alternatives:
| https://github.com/amakelov/mandala
| ylow wrote:
| Author here. Mandala looks really cool. Thanks for the
| recommendation!
| pgbovine wrote:
| very cool idea! i was also very interested in this problem
| during grad school ... prototyped an approach by hacking
| CPython, but the code (python 2.6? from 2010 era) has long
| bitrotted: https://pg.ucsd.edu/publications/IncPy-
| memoization-in-Python...
| pgbovine wrote:
| also, while i have your attention here, since you wrote that
| related post on (not) vector db's ... what would you
| recommend for a newbie to get started with RAG? let's say i
| have a large collection of text files on my computer that i
| want to use for RAG. the options out there seem bewildering.
| is there something simple akin to Ollama for RAG?
| zzleeper wrote:
| If you want to get something done quickly, try llama index.
|
| If you want to learn/hack, pick an easy vectordb, get an
| OpenAI API account, and do a quick attempt.
|
| Then you can switch to a local LLM and embedder, which helps
| a bit in learning what the pain points are.
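|
| A minimal brute-force sketch of that first attempt
| (hypothetical example: OpenAI Python SDK v1, illustrative
| model names, numpy instead of a vector DB):
|
|     import numpy as np
|     from openai import OpenAI
|
|     client = OpenAI()  # assumes OPENAI_API_KEY is set
|
|     docs = ["contents of file one...", "contents of file two..."]
|
|     def embed(texts):
|         resp = client.embeddings.create(
|             model="text-embedding-ada-002", input=texts)
|         return np.array([d.embedding for d in resp.data])
|
|     doc_vecs = embed(docs)
|
|     def ask(question, k=2):
|         qv = embed([question])[0]
|         # cosine similarity against every document
|         sims = doc_vecs @ qv / (
|             np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(qv))
|         context = "\n\n".join(docs[i] for i in np.argsort(sims)[-k:])
|         chat = client.chat.completions.create(
|             model="gpt-3.5-turbo",
|             messages=[{"role": "user", "content":
|                        f"Context:\n{context}\n\nQ: {question}"}])
|         return chat.choices[0].message.content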
| jihadjihad wrote:
| There was an interesting post yesterday that captured some of
| these ideas:
|
| https://news.ycombinator.com/item?id=38681115
| westurner wrote:
| Dockerfile and Containerfile also cache outputs as layers.
|
| `podman build --layers` is the default:
| https://docs.podman.io/en/latest/markdown/podman-build.1.htm...
|
| containers/common/docs/Containerfile.5.md:
| https://github.com/containers/common/blob/main/docs/Containe...
| westurner wrote:
It may be better to just start by managing caching in code.
|
In the standard library, the @functools.cache and
@functools.lru_cache function decorators and the
@functools.cached_property method decorator memoize in RAM only
(only lru_cache is bounded, with least-recently-used eviction).
https://docs.python.org/3/library/functools.html
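
For example:

    import functools

    @functools.lru_cache(maxsize=128)  # LRU-evicting, in RAM
    def fib(n):
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    fib(100)  # fast: intermediate results are memoized; the
              # cache is lost when the process exits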
|
| Dask docs > "Automatic Opportunistic Caching":
| https://docs.dask.org/en/stable/caching.html#automatic-oppor... ;
| dask.cache.Cache(size_bytes:int) ... "Cache tasks, not
| expressions"
|
Pickles are not a safe way to deserialize untrusted data; there
is no data-only pickling protocol.
|
| So caching arbitrary cell objects (or e.g. stack frames) to disk,
| as pickles at least, creates risk of code injection if the
| serialized data contains executable code objects.
|
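A minimal demonstration of the pickle code-execution risk:

    import pickle

    class Evil:
        def __reduce__(self):
            # unpickling executes this call
            return (print, ("code ran at load time",))

    payload = pickle.dumps(Evil())
    pickle.loads(payload)  # prints; could be os.system instead
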
| Similarly, the file permissions on e.g. rr traces.
| https://en.wikipedia.org/wiki/Rr_(debugging)
|
Dataclasses in the standard library help with object
serialization, but not with caching cell outputs containing
code objects.
|
Apache Arrow and Parquet also require a schema for efficient
serialization.
|
LRU: Least-Recently Used

MRU: Most-Recently Used
|
Out-of-order execution in notebooks can waste cycles of both
human time and CPU time. If the prompt numbers aren't
sequential, what you ran in a notebook is not necessarily what
others will get when running that computation graph in order;
so it's best practice to "Restart and Run All" to test the
actual control flow before committing or pushing.
|
| There are ways to run and test notebooks _in order_ on or before
| git commit (and compare their output with the output from the
| previous run) like software with tests.
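
For example, nbconvert can execute a notebook top to bottom
non-interactively (one way to do it; papermill is another):

    jupyter nbconvert --to notebook --execute \
        --output executed.ipynb mynotebook.ipynb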
| ylow wrote:
| The issue comes when cells take many minutes or even hours to
| run (intentionally or not). The ideal is indeed sequential, and
| this helps me a lot with maintaining the sequential ordering as
| it simplifies and speeds up the "restart and run all" process.
| westurner wrote:
| Oh I understand the issue.
|
| E.g. dask-labextension does not implicitly do dask.cache for
| you.
|
How are the objects serialized? Are code objects serialized to
files on disk, and if so, what is the permission umask on those
files, and in what directory (/var/cache) should they be
selinux-labeled? When the notebook runs code from the cache
instead of from the source, whoever can write to those cache
files controls the execution flow of the notebook (which is
already unreproducibly out-of-order without this added
consideration).
| IanCal wrote:
| I agree about being explicit with this, though I will warn
| about using the python functools caching, because it changes
| the logic of your program. Because it doesn't copy data, any
| return that's mutable is risky. It's also not _obvious_ this
happens, and even less obvious if you're not keenly aware of
| this kind of issue being likely.
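
Concretely, a small sketch of the footgun:

    import functools

    @functools.cache
    def get_defaults():
        return ["a", "b"]   # mutable list, cached by reference

    opts = get_defaults()
    opts.append("c")        # mutates the cached object itself
    print(get_defaults())   # ['a', 'b', 'c'], not ['a', 'b']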
| ltbarcly3 wrote:
The inputs and outputs of anything that is slow enough to care
about caching are so large in an ML context that caching is
unreasonable. The one exception would be people recomputing
entire notebooks over and over, but again, caching in some
generic way that can't tell if the underlying data changed or
not is going to break all kinds of stuff.
|
Don't use notebooks for real work, guys. I know you have a PhD,
but that's not an excuse to refuse to learn software
engineering. We have lots of ways to define data dependencies
and conditionally rebuild just the parts that need to be
rebuilt. Look at any build system.
| skadamat wrote:
| Notebooks aren't necessarily for production but are a really
| great way to explore data & models and move at the speed of
| thought. It's for prototyping, collaboration, and feedback.
|
| I sense that most of the frustration teams have with notebooks
| is when they try to ship notebooks, which is likely a mistake
| (in many but not all cases) as you're pointing out!
| amakelov wrote:
| I see two concerns here:
|
| - inputs/outputs being high volume: the inputs/outputs that are
| large are often also things that don't change over the course
| of a project (e.g. a dataset or a model). So you don't really
| need to cache the object itself, just a (typically short
| string) immutable reference to it. As long as the object can be
| looked up at runtime, everything's fine;
|
- detecting changes in data: content hashing is the general way
in which you can tell if a result changed; using `joblib.dump`
and then hashing the resulting bytes provides a good starting
implementation (sketched at the end of this comment), though
certainly there are some corner cases to be aware of.
|
| Both of these approaches are available/used in mandala
| (https://github.com/amakelov/mandala; disclosure: I'm the
| author), which uses content hashing to tell when data (or even
| code/code dependencies) have changed, and gives you a generic
| caching decorator for functions which can then look up large
| objects by reference; this is the way I used it for e.g. my
| mechanistic interpretability work, which is often of the form
| one big model + lots of analyses producing tiny artifacts based
| on it.
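
A starting-point sketch of that content-hashing approach
(illustrative; joblib.hash also exists and natively handles
numpy arrays):

    import hashlib, io
    import joblib
    import numpy as np

    def content_hash(obj):
        buf = io.BytesIO()
        joblib.dump(obj, buf)  # fast for numpy-heavy objects
        return hashlib.sha256(buf.getvalue()).hexdigest()

    x = np.arange(10)
    h1 = content_hash(x)
    x[0] = 99  # any change to the data changes the hash
    assert content_hash(x) != h1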
| david_draco wrote:
How does this differ from the very neat joblib.Memory?
https://joblib.readthedocs.io/en/latest/generated/joblib.Mem...
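
For readers unfamiliar with it, joblib.Memory is a disk-backed
memoizer along similar lines:

    from joblib import Memory

    memory = Memory("./joblib_cache", verbose=0)

    @memory.cache  # results persisted to disk, keyed by args
    def expensive(x):
        return x ** 2

    expensive(3)  # computed once; later calls load from disk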
| bravura wrote:
| Man I want to love joblib, but I can't.
|
One of the smartest ML researchers I know swears by it.
|
| But for whatever reason, in my workflow of prototyping new ML
| approaches, optimizing and unoptimizing different preprocessing
| steps, and sometimes migrating data across the cloud, I always
| seem to start with joblib and then shed it pretty quickly in
| favor of large JSONL.gz and sqlite3 etc checkpoints that I
| create after key steps.
| simon_acca wrote:
| Some prior art: https://github.com/nextjournal/clerk
|
| > Clerk notebooks always evaluate from top to bottom. Clerk
| builds a dependency graph of Clojure vars and only recomputes the
| needed changes to keep the feedback loop fast.
| amakelov wrote:
| This is neat and self-contained! But as someone running
| experiments with a high degree of interactivity, I often have an
| orthogonal requirement: add more computations to the _same_ cell
| without recomputing previous computations done in the cell (or in
| other cells).
|
| For a concrete example, often in an ML project you want to study
| how several quantities vary across several parameters. A
| straightforward workflow for this is: write some nested loops,
| collect results in python dictionaries, finally put everything
| together in a dataframe and compare (by plotting or otherwise).
|
| However, after looking at the results, maybe you spot some trend
| and wonder if it will continue if you tweak one of the parameters
| by using a new value for it; of course, you also want to look at
| the previous values and bring everything together in the same
| plot(s). You now have a problem: either re-run the cell (thus
| losing previous work, which is annoying even if you have to wait
| 1 minute - you know it's a wasted minute!), or write the new
| computation in a new cell, possibly with a lot of redundancy
| (which over time makes the notebook hard to navigate and keep
| consistent).
|
| So, this and other considerations eventually convinced me that
the _function_ is more natural than the cell as an
interface/boundary at which caching should be implemented, at
least for my
| use cases (coming from ML research). I wrote a framework based on
| this idea, with lots of other features (some quite
| experimental/unusual) to turn this into a feasible experiment
| management tool - check it out at
| https://github.com/amakelov/mandala
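
As a generic illustration of why the function boundary helps
here (hand-rolled functools caching, not mandala's actual API):

    import functools

    @functools.cache
    def run_experiment(param):  # imagine this takes minutes
        return param ** 2

    # First sweep computes three results.
    results = {p: run_experiment(p) for p in (1, 2, 3)}

    # Extending the sweep in the same cell only computes p=4;
    # the earlier three come back from the cache instantly.
    results = {p: run_experiment(p) for p in (1, 2, 3, 4)}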
|
| P.S.: I notice you use `pickle` for the hashing - `joblib.dump`
| is faster with objects containing numpy arrays, which covers a
| lot of useful ML things
___________________________________________________________________
(page generated 2023-12-19 23:00 UTC)