https://github.com/amakelov/mandala

# mandala

Install | Quickstart | Tutorials | Docs | Blogs | FAQs

**Automatically save, query & version Python computations.**

mandala eliminates the effort and code overhead of ML experiment tracking (and beyond) with two generic tools:

1. The `@op` decorator:
    - captures the inputs, outputs and code (+ dependencies) of Python function calls
    - automatically reuses past results & never computes the same call twice
    - is designed to be composed into end-to-end persisted programs, enabling efficient iterative development in plain Python, without thinking about the storage backend

2. The `ComputationFrame` data structure:
    - automatically organizes executions of imperative code into a high-level computation graph of variables and operations, detecting patterns like feedback loops, branching/merging and aggregation/indexing
    - queries relationships between variables by extracting a dataframe where the columns are variables and operations in the graph, and each row contains the values/calls of a (possibly partial) execution of the graph
    - automates exploration and high-level operations over heterogeneous "webs" of `@op` calls
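For orientation, here is a minimal sketch of both tools together, loosely following the Quickstart; the exact import paths and method names are assumptions and may differ from the current API:

```python
from mandala.imports import Storage, op  # import path assumed; see the Quickstart

@op  # calls to this function are now captured, memoized and versioned
def increment(x: int) -> int:
    return x + 1

storage = Storage()  # in-memory by default; persistence options exist (see docs)

with storage:           # calls inside the block are recorded in `storage`
    y = increment(41)   # executed and saved
    z = increment(41)   # same inputs: the past result is reused, not recomputed

# organize the saved calls into a ComputationFrame and extract a dataframe
cf = storage.cf(increment)
print(cf.df())  # columns are variables/operations; rows are executions
```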
## Video demo

A quick demo of running computations in mandala while simultaneously updating a view of the corresponding `ComputationFrame` and the dataframe extracted from it (the code can be found here): output.mp4

## Install

```
pip install git+https://github.com/amakelov/mandala
```

## Tutorials

- Quickstart: Open In Colab
- `ComputationFrame`s notebook: Open In Colab
- Toy ML project: Open In Colab

## Blogs & papers

- Tidy Computations: introduces the `ComputationFrame` data structure and its applications
- Practical Dependency Tracking for Python Function Calls: describes the motivations and design behind mandala's dependency tracking system
- The mandala paper, to appear in the SciPy 2024 proceedings.

## FAQs

### How is this different from other experiment tracking frameworks?

Compared to popular tools like W&B, MLFlow or Comet, mandala:

- is integrated with the actual Python code execution at a more granular level:
    - the function call is the synchronized unit of persistence, versioning and querying, as opposed to an entire script or notebook, leading to more efficient reuse and incremental development;
    - going even further, Python collections (e.g. `list`, `dict`) can be made transparent to the storage system, so that individual elements are stored and tracked separately and can be reused across collections and calls;
    - since it is memoization-based rather than logging-based, you don't have to think about how to name any of the things you log.
- provides the `ComputationFrame` data structure, a powerful & simple way to represent, query and manipulate complex saved computations.
- automatically resolves the version of every `@op` call from the current state of the codebase and the inputs to the call.

### How is the `@op` cache invalidated?

- Given the inputs for a call to an `@op`, e.g. `f`, mandala searches for a past call to `f` on inputs with the same contents (as determined by a hash function) where the dependencies accessed by this call (including `f` itself) have versions compatible with their current state.
- Compatibility between versions of a function is decided by the user: you have the freedom to mark certain changes as compatible with past results, though see the limitations section about marking changes as compatible.
- Internally, mandala uses slightly modified joblib hashing to compute a content hash for Python objects. This is practical for many use cases, but not perfect, as discussed in the limitations section.

### Can I change the code of `@op`s, and what happens if I do?

A frequent use case: you have some `@op` you've been using, then want to extend its functionality in a way that doesn't invalidate the past results. The recommended way is to add a new argument `a` and provide a default value for it wrapped with `NewArgDefault(x)`. When a value equal to `x` is passed for this argument, the storage falls back on calls made before the argument was introduced.
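A hedged sketch of that pattern (the import path for `NewArgDefault` is an assumption; check the docs for the exact one):

```python
from mandala.imports import op, NewArgDefault  # import path assumed

@op
def train(data, n_epochs: int = 10, seed=NewArgDefault(0)):
    # `seed` was added after `train` had already been memoized.
    # Passing a value equal to 0 makes the storage fall back on calls
    # saved before `seed` existed; any other value is a genuinely new call.
    ...
```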
### Is it production-ready?

- mandala is in alpha, and the API is subject to change.
- Moreover, there are known performance bottlenecks that may make working with storages of 10k+ calls slow.

### How self-contained is it?

- mandala's core is a few kLoC and only depends on pandas and joblib.
- For visualization of `ComputationFrame`s, you should have `dot` installed at the system level, and/or the Python graphviz library installed.

## Limitations

- When using versioning, if you mark a change as compatible with past results, be careful if the change introduced new dependencies that are not tracked by mandala. Changes to such "invisible" dependencies may go unnoticed by the storage system, leading you to believe that certain results are up to date when they are not.
- See the "gotchas" notebook for some limitations of the hashing used to invalidate the cache: Open In Colab

## Roadmap for future features

### Overall

- [x] support for named outputs in `@op`s
- [ ] support for renaming `@op`s and their inputs/outputs

### Memoization

- [ ] add custom serialization for chosen objects
- [ ] figure out a solution that ignores small numerical error in content hashing
- [ ] improve the documentation on collections
- [ ] support parallelization of `@op` execution via e.g. dask or ray
- [ ] support inputs/outputs excluded from the storage

### Computation frames

- [x] add support for cycles in the computation graph
- [ ] improve heuristics for the `expand_...` methods
- [ ] add tools for restricting a CF to specific subsets of variable values via predicates
- [ ] improve support & examples for using collections
- [ ] add support for merging or splitting nodes in the CF and similar simplifications

### Versioning

- [ ] support restricting CFs by function versions
- [ ] support ways to manually add dependencies to versions, in order to avoid the "invisible dependency" problem

### Performance

- [ ] improve performance of the in-memory cache
- [ ] improve performance of `ComputationFrame` operations

## Galaxybrained vision

Aspirationally, mandala is about much more than ML experiment tracking. The main goal is to make persistence logic & best practices a natural extension of Python. Once this is achieved, the purely "computational" code you must write anyway doubles as a storage interface. It's hard to think of a simpler and more reliable way to manage computational artifacts.

### A first-principles approach to managing computational artifacts

What we want from our storage are ways to:

- refer to artifacts with short, unambiguous descriptions: "here's [big messy Python object] I computed, which to me means [human-readable description]"
- save artifacts: "save [big messy Python object]"
- refer to artifacts and load them at a later time: "give me [human-readable description] that I computed before"
- know when you've already computed something: "have I computed [human-readable description]?"
- query results in more complicated ways: "give me all the things that satisfy [higher-level human-readable description]", which in practice means some predicate over combinations of artifacts
- get a report of how artifacts were generated: "what code went into [human-readable description]?"

The key observation is that execution traces can already answer ~all of these questions.
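As one hypothetical illustration of how plain code doubles as a storage interface (reusing the `storage` and `increment` names from the sketch near the top), re-running a memoized composition is itself the query that finds its results:

```python
with storage:
    # this call was saved earlier, so retracing it is a pure lookup:
    # it answers "have I computed increment(41)?" and binds the saved
    # result to `y` for further use, without recomputing anything
    y = increment(41)
```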
## Related work

mandala combines ideas from, and shares similarities with, many technologies. Here are some useful points of comparison:

- Memoization:
    - Standard Python memoization solutions are `joblib.Memory` and `functools.lru_cache`. mandala uses joblib serialization and hashing under the hood.
    - incpy is a project that integrates memoization with the Python interpreter itself.
    - funsies is a memoization-based distributed workflow executor that uses a notion of hashing analogous to mandala's to keep track of which computations have already been done. It works on the level of scripts (not functions), and lacks queriability and versioning.
    - koji is a design for an incremental computation data processing framework that unifies over different resource types (files or services). It also uses an analogous notion of hashing to keep track of computations.
- Computation frames:
    - Computation frames are special cases of relational databases: each function node in the computation graph has a table of calls, where the columns are all the input/output edge labels connected to the function. Similarly, each variable node is a single-column table of all the `Ref`s in the variable. Foreign-key constraints relate the functions' columns to the variables, and various joins over the tables express various notions of the joint computational history of variables (see the pandas sketch after this list).
    - Computation frames are also related to graph databases, in the sense that some of the relevant queries over computation frames, e.g. ones having to do with reachability along `@op`s, are special cases of queries over graph databases. The internal representation of the `Storage` is also closer to a graph database than a relational one.
    - Computation frames are also related to some ideas from applied category theory, such as using functors from a finite category to the category of sets (copresheaves) as a blueprint for a "universal" in-memory data structure that is (again) equivalent to a relational database; see e.g. this paper, which describes this categorical construction.
- Versioning:
    - The revision history of each function in the codebase is organized in a "mini-git repository" that shares only the most basic features with git: it is a content-addressable tree, where each edge tracks a diff from the content at one endpoint to that at the other. Additional metadata indicates equivalence classes of semantically equivalent contents.
    - Semantic versioning is another popular code versioning system. mandala is similar to semver in that it allows you to make backward-compatible changes to the interface and logic of dependencies. It differs in that versions are still labeled by content, instead of by "non-canonical" numbers.
    - The unison programming language represents functions by the hash of their content (of their syntax tree, to be exact).
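To make the relational-database analogy concrete, here is a toy pandas rendition; the tables and ref ids are made up for illustration and do not reflect mandala's internal schema:

```python
import pandas as pd

# a variable node: a single-column table of Refs (here, string ids)
xs = pd.DataFrame({"x": ["ref_1", "ref_2"]})
ys = pd.DataFrame({"y": ["ref_3", "ref_4"]})

# a function node: a table of calls whose columns are the input/output
# edge labels, acting as foreign keys into the variable tables
increment_calls = pd.DataFrame({
    "x": ["ref_1", "ref_2"],  # inputs drawn from variable `xs`
    "y": ["ref_3", "ref_4"],  # outputs stored in variable `ys`
})

# joining along the foreign keys recovers the joint computational
# history of the two variables: each row is one execution of the graph
history = xs.merge(increment_calls, on="x").merge(ys, on="y")
print(history)
```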