https://github.com/amakelov/mandala

# mandala

Install | Quickstart | Tutorials | Docs | Blogs | FAQs

**Automatically save, query & version Python computations.**

mandala eliminates the effort and code overhead of ML experiment tracking (and beyond) with two generic tools:

1. The `@op` decorator:
    - captures the inputs, outputs and code (+ dependencies) of Python function calls
    - automatically reuses past results & never computes the same call twice
    - is designed to be composed into end-to-end persisted programs, enabling efficient iterative development in plain Python, without thinking about the storage backend

2. The `ComputationFrame` data structure:
    - automatically organizes executions of imperative code into a high-level computation graph of variables and operations, detecting patterns like feedback loops, branching/merging and aggregation/indexing
    - queries relationships between variables by extracting a dataframe where the columns are variables and operations in the graph, and each row contains the values/calls of a (possibly partial) execution of the graph
    - automates exploration and high-level operations over heterogeneous "webs" of `@op` calls
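For orientation, here is a minimal sketch of both tools together, loosely following the Quickstart; the exact import paths and method names are assumptions and may differ from the current API:

```python
from mandala.imports import Storage, op  # import path assumed; see the Quickstart

@op  # calls to this function are now captured, memoized and versioned
def increment(x: int) -> int:
    return x + 1

storage = Storage()  # in-memory by default; persistence options exist (see docs)

with storage:           # calls inside the block are recorded in `storage`
    y = increment(41)   # executed and saved
    z = increment(41)   # same inputs: the past result is reused, not recomputed

# organize the saved calls into a ComputationFrame and extract a dataframe
cf = storage.cf(increment)
print(cf.df())  # columns are variables/operations; rows are executions
```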
## Video demo

A quick demo of running computations in mandala while simultaneously updating a view of the corresponding `ComputationFrame` and the dataframe extracted from it (the code can be found here): output.mp4

## Install

```
pip install git+https://github.com/amakelov/mandala
```

## Tutorials

- Quickstart: Open In Colab
- `ComputationFrame`s notebook: Open In Colab
- Toy ML project: Open In Colab

## Blogs & papers

- Tidy Computations: introduces the `ComputationFrame` data structure and its applications
- Practical Dependency Tracking for Python Function Calls: describes the motivations and design behind mandala's dependency tracking system
- The mandala paper, to appear in the SciPy 2024 proceedings.

## FAQs

### How is this different from other experiment tracking frameworks?

Compared to popular tools like W&B, MLFlow or Comet, mandala:

- is integrated with the actual Python code execution at a more granular level:
    - the function call is the synchronized unit of persistence, versioning and querying, as opposed to an entire script or notebook, leading to more efficient reuse and incremental development;
    - going even further, Python collections (e.g. `list`, `dict`) can be made transparent to the storage system, so that individual elements are stored and tracked separately and can be reused across collections and calls;
    - since it is memoization-based rather than logging-based, you don't have to think about how to name any of the things you log.
- provides the `ComputationFrame` data structure, a powerful & simple way to represent, query and manipulate complex saved computations.
- automatically resolves the version of every `@op` call from the current state of the codebase and the inputs to the call.

### How is the `@op` cache invalidated?

- Given the inputs for a call to an `@op`, e.g. `f`, mandala searches for a past call to `f` on inputs with the same contents (as determined by a hash function) where the dependencies accessed by this call (including `f` itself) have versions compatible with their current state.
- Compatibility between versions of a function is decided by the user: you have the freedom to mark certain changes as compatible with past results, though see the limitations section about marking changes as compatible.
- Internally, mandala uses slightly modified joblib hashing to compute a content hash for Python objects. This is practical for many use cases, but not perfect, as discussed in the limitations section.

### Can I change the code of `@op`s, and what happens if I do?

A frequent use case: you have some `@op` you've been using, then want to extend its functionality in a way that doesn't invalidate the past results. The recommended way is to add a new argument `a` and provide a default value for it wrapped with `NewArgDefault(x)`. When a value equal to `x` is passed for this argument, the storage falls back on calls made before the argument was introduced.
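A hedged sketch of that pattern (the import path for `NewArgDefault` is an assumption; check the docs for the exact one):

```python
from mandala.imports import op, NewArgDefault  # import path assumed

@op
def train(data, n_epochs: int = 10, seed=NewArgDefault(0)):
    # `seed` was added after `train` had already been memoized.
    # Passing a value equal to 0 makes the storage fall back on calls
    # saved before `seed` existed; any other value is a genuinely new call.
    ...
```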
### Is it production-ready?

- mandala is in alpha, and the API is subject to change.
- Moreover, there are known performance bottlenecks that may make working with storages of 10k+ calls slow.

### How self-contained is it?

- mandala's core is a few kLoC and only depends on pandas and joblib.
- For visualization of `ComputationFrame`s, you should have `dot` installed at the system level, and/or the Python graphviz library installed.

## Limitations

- When using versioning, if you mark a change as compatible with past results, be careful if the change introduced new dependencies that are not tracked by mandala. Changes to such "invisible" dependencies may go unnoticed by the storage system, leading you to believe that certain results are up to date when they are not.
- See the "gotchas" notebook for some limitations of the hashing used to invalidate the cache: Open In Colab

## Roadmap for future features

### Overall

- [x] support for named outputs in `@op`s
- [ ] support for renaming `@op`s and their inputs/outputs

### Memoization

- [ ] add custom serialization for chosen objects
- [ ] figure out a solution that ignores small numerical error in content hashing
- [ ] improve the documentation on collections
- [ ] support parallelization of `@op` execution via e.g. dask or ray
- [ ] support inputs/outputs excluded from the storage

### Computation frames

- [x] add support for cycles in the computation graph
- [ ] improve heuristics for the `expand_...` methods
- [ ] add tools for restricting a CF to specific subsets of variable values via predicates
- [ ] improve support & examples for using collections
- [ ] add support for merging or splitting nodes in the CF and similar simplifications

### Versioning

- [ ] support restricting CFs by function versions
- [ ] support ways to manually add dependencies to versions, in order to avoid the "invisible dependency" problem

### Performance

- [ ] improve performance of the in-memory cache
- [ ] improve performance of `ComputationFrame` operations

## Galaxybrained vision

Aspirationally, mandala is about much more than ML experiment tracking. The main goal is to make persistence logic & best practices a natural extension of Python. Once this is achieved, the purely "computational" code you must write anyway doubles as a storage interface. It's hard to think of a simpler and more reliable way to manage computational artifacts.

### A first-principles approach to managing computational artifacts

What we want from our storage are ways to:

- refer to artifacts with short, unambiguous descriptions: "here's [big messy Python object] I computed, which to me means [human-readable description]"
- save artifacts: "save [big messy Python object]"
- refer to artifacts and load them at a later time: "give me [human-readable description] that I computed before"
- know when you've already computed something: "have I computed [human-readable description]?"
- query results in more complicated ways: "give me all the things that satisfy [higher-level human-readable description]", which in practice means some predicate over combinations of artifacts
- get a report of how artifacts were generated: "what code went into [human-readable description]?"

The key observation is that execution traces can already answer ~all of these questions.
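As one hypothetical illustration of how plain code doubles as a storage interface (reusing the `storage` and `increment` names from the sketch near the top), re-running a memoized composition is itself the query that finds its results:

```python
with storage:
    # this call was saved earlier, so retracing it is a pure lookup:
    # it answers "have I computed increment(41)?" and binds the saved
    # result to `y` for further use, without recomputing anything
    y = increment(41)
```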
## Related work

mandala combines ideas from, and shares similarities with, many technologies. Here are some useful points of comparison:

- Memoization:
    - Standard Python memoization solutions are `joblib.Memory` and `functools.lru_cache`. mandala uses joblib serialization and hashing under the hood.
    - incpy is a project that integrates memoization with the Python interpreter itself.
    - funsies is a memoization-based distributed workflow executor that uses a notion of hashing analogous to mandala's to keep track of which computations have already been done. It works on the level of scripts (not functions), and lacks queriability and versioning.
    - koji is a design for an incremental computation data processing framework that unifies over different resource types (files or services). It also uses an analogous notion of hashing to keep track of computations.
- Computation frames:
    - Computation frames are special cases of relational databases: each function node in the computation graph has a table of calls, where the columns are all the input/output edge labels connected to the function. Similarly, each variable node is a single-column table of all the `Ref`s in the variable. Foreign-key constraints relate the functions' columns to the variables, and various joins over the tables express various notions of the joint computational history of variables (see the pandas sketch after this list).
    - Computation frames are also related to graph databases, in the sense that some of the relevant queries over computation frames, e.g. ones having to do with reachability along `@op`s, are special cases of queries over graph databases. The internal representation of the `Storage` is also closer to a graph database than a relational one.
    - Computation frames are also related to some ideas from applied category theory, such as using functors from a finite category to the category of sets (copresheaves) as a blueprint for a "universal" in-memory data structure that is (again) equivalent to a relational database; see e.g. this paper, which describes this categorical construction.
- Versioning:
    - The revision history of each function in the codebase is organized in a "mini-git repository" that shares only the most basic features with git: it is a content-addressable tree, where each edge tracks a diff from the content at one endpoint to that at the other. Additional metadata indicates equivalence classes of semantically equivalent contents.
    - Semantic versioning is another popular code versioning system. mandala is similar to semver in that it allows you to make backward-compatible changes to the interface and logic of dependencies. It differs in that versions are still labeled by content, instead of by "non-canonical" numbers.
    - The unison programming language represents functions by the hash of their content (of their syntax tree, to be exact).
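To make the relational-database analogy concrete, here is a toy pandas rendition; the tables and ref ids are made up for illustration and do not reflect mandala's internal schema:

```python
import pandas as pd

# a variable node: a single-column table of Refs (here, string ids)
xs = pd.DataFrame({"x": ["ref_1", "ref_2"]})
ys = pd.DataFrame({"y": ["ref_3", "ref_4"]})

# a function node: a table of calls whose columns are the input/output
# edge labels, acting as foreign keys into the variable tables
increment_calls = pd.DataFrame({
    "x": ["ref_1", "ref_2"],  # inputs drawn from variable `xs`
    "y": ["ref_3", "ref_4"],  # outputs stored in variable `ys`
})

# joining along the foreign keys recovers the joint computational
# history of the two variables: each row is one execution of the graph
history = xs.merge(increment_calls, on="x").merge(ys, on="y")
print(history)
```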