[HN Gopher] Data Version Control
___________________________________________________________________
Data Version Control
Author : shcheklein
Score : 79 points
Date : 2024-10-19 16:56 UTC (6 hours ago)
(HTM) web link (dvc.org)
(TXT) w3m dump (dvc.org)
| jerednel wrote:
| It's not super clear to me how this interacts with data. If I
| am using ADLS to store delta tables, and I cannot pull prod to
| my local, can I still use this? Is there any point if I can
| just look at the delta log to switch between past versions?
| riedel wrote:
| DVC is (at least as I use it) pretty much just git LFS with
| multiple backends (I guess actually a simpler git annex). It
| also has some rather MLOps-specific features. It's handy if
| you do versioned model training with changing data on S3.
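|
| A rough sketch of that workflow (bucket and file names here
| are hypothetical):
|
|     git init && dvc init
|     # point DVC at an S3 bucket as the default remote
|     dvc remote add -d storage s3://my-bucket/dvc-store
|     # track a large file; git only sees a small .dvc pointer
|     dvc add data/train.csv
|     git add data/train.csv.dvc data/.gitignore
|     git commit -m "track training data with DVC"
|     dvc push   # uploads the actual blob to S3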
| starkparker wrote:
| I've used it for storing rasters alongside georeferencing
| data in small GIS projects, as an alternative to git LFS. It
| not only works like git but can integrate with git repos
| through commit and push/pull hooks, storing DVC pointers and
| managing .gitignore files while retaining directory structure
| of the DVC-managed files. It's neat, even if the initial
| learning curve was a little steep.
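|
| If I remember right, the hook integration is a single command:
|
|     # installs git hooks so that e.g. `git checkout` also runs
|     # `dvc checkout`, and `git push` also runs `dvc push`
|     dvc install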
|
| We used Google Drive as a storage backend and had to grow out
| of it to a WebDAV backend, and it was nearly trivial to swap
| them out and migrate.
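|
| The swap was roughly this (the WebDAV URL is made up):
|
|     # register the new backend and make it the default remote
|     dvc remote add -d webdav webdavs://example.com/dvc
|     # re-upload the cached data to the new backend
|     dvc push -r webdav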
| haensi wrote:
| There's another thread from October 2022 on that topic.
|
| https://news.ycombinator.com/item?id=33047634
|
| What makes DVC especially useful for MLOps? Aren't MLFlow or
| W&B already solving that, in a way that's open source (the
| former) or massively increases speed and scale (the latter)?
|
| Disclaimer: I work at W&B.
| riedel wrote:
| DVC is much more basic (feels more Unix-style) and integrates
| really well with simple CI/CD scripting on top of git
| versioning, without the need to set up any additional
| servers.
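|
| A minimal CI job body might look like this (remote setup and
| stage definitions are assumed to already be in the repo):
|
|     pip install "dvc[s3]"
|     dvc pull    # fetch the data version pinned by this commit
|     dvc repro   # re-run only stages whose dependencies changed
|     dvc push    # upload any regenerated outputs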
|
| And it is not either/or. People actually combine MLFlow and
| DVC [0].
|
| [0] https://data-ai.theodo.com/blog-technique/dvc-pipeline-
| runs-...
| matrss wrote:
| Speaking of git-annex, there is another project called
| DataLad (https://www.datalad.org/), which has some overlap
| with DVC. It uses git-annex under the hood and is domain-
| agnostic, compared to the ML focus that DVC has.
| shicholas wrote:
| What are the benefits of DVC over Apache Iceberg? If anyone used
| both, I'd be curious about your take. Thanks!
| andrew_lettuce wrote:
| I don't see any real benefits; it feels like using the tool
| you already know even though it's not quite right. Iceberg is
| maybe geared towards slower-changing models than this
| approach?
| foobarbecue wrote:
| username checks out
| bramathon wrote:
| I've used DVC for most of my projects for the past five years.
| The good thing is that it works a lot like git. If your
| scientists understand branches, commits and diffs, they should be
| able to understand DVC. The bad thing is that it works like git.
| Scientists often do not, in fact, understand or use branches,
| commits and diffs. The best thing is that it essentially forces
| you to follow Ten Simple Rules for Reproducible Computational
| Research [1]. Reproducibility has been a huge challenge on teams
| I've worked on.
|
| [1]
| https://journals.plos.org/ploscompbiol/article?id=10.1371/jo...
| dmpetrov wrote:
| hi there! Maintainer and author here. Excited to see DVC on the
| front page!
|
| Happy to answer any questions about DVC and our sister project
| DataChain https://github.com/iterative/datachain which does
| data versioning with slightly different assumptions: no file
| copies and built-in data transformations.
| ajoseps wrote:
| if the data files are all just text files, what are the
| differences between DVC and using plain git?
| dmpetrov wrote:
| In this case, you need DVC if:
|
| 1. Files are too large for Git and Git LFS.
|
| 2. You prefer using S3/GCS/Azure as storage.
|
| 3. You need to track transformations/pipelines on the files -
| clean up a text file, train a model, etc. (sketch below).
|
| Otherwise, vanilla Git may be sufficient.
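|
| For point 3, a stage can be declared like this (script and
| file names are made up):
|
|     # writes a `prepare` stage into dvc.yaml
|     dvc stage add -n prepare \
|         -d clean.py -d data/raw.txt \
|         -o data/clean.txt \
|         python clean.py data/raw.txt data/clean.txt
|     dvc repro   # runs the stage and records output hashes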
| miki123211 wrote:
| DVC does a lot more than git.
|
| It essentially makes sure that your results can be reproducibly
| generated from your original data. If any script or data
| file is changed, the parts of your pipeline that depend on
| it, possibly recursively, get re-run and the relevant results
| get updated automatically.
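|
| Concretely (assuming stages are already defined in dvc.yaml):
|
|     # after editing a script, list the stages that are stale
|     dvc status
|     # re-run exactly those stages, recursively
|     dvc repro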
|
| There's no chance of e.g. changing the structure of your
| original dataset slightly, forgetting to regenerate one of
| the intermediate models by accident, not noticing that the
| script to regenerate it doesn't work any more due to the new
| dataset structure, and then getting reminded a year later
| when moving to a new computer and trying to regen everything
| from scratch.
|
| It's a lot like Unix make, but with the ability to keep track
| of different git branches and the data / intermediates they
| need, which saves you from needing to regen everything every
| time you make a new checkout, lets you easily exchange large
| datasets with teammates etc.
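|
| The branch-switch workflow is roughly (branch name made up):
|
|     git checkout experiment   # swaps code and .dvc pointers
|     dvc checkout              # syncs data files from the cache
|     dvc repro                 # re-runs only what changed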
|
| In theory, you could store everything in git, but then every
| time you made a small change to your scripts that e.g.
| changed the way some model works and slightly adjusted a
| score for each of ten million rows, your diff would be 10m
| LOC, and all versions of that dataset would be stored in your
| repo, forever, making it unbelievably large.
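|
| For contrast, all git has to store with DVC is a tiny pointer
| file, roughly like this (hash and size are placeholder
| values):
|
|     $ cat scores.csv.dvc
|     outs:
|     - md5: d8e8fca2dc0f896fd7cb4cb0031ba249
|       size: 104857600
|       path: scores.csv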
| causal wrote:
| Is this useful for large binaries?
___________________________________________________________________
(page generated 2024-10-19 23:00 UTC)