[HN Gopher] Data Version Control
       ___________________________________________________________________
        
       Data Version Control
        
       Author : shcheklein
       Score  : 79 points
       Date   : 2024-10-19 16:56 UTC (6 hours ago)
        
 (HTM) web link (dvc.org)
 (TXT) w3m dump (dvc.org)
        
       | jerednel wrote:
        | It's not super clear to me how this interacts with data. If I
        | am using ADLS to store delta tables and I cannot pull prod to
        | my local machine, can I still use this? Is there a point if I
        | can just look at the delta log to switch between past
        | versions?
        
         | riedel wrote:
          | DVC is (at least as I use it) pretty much just git LFS with
          | multiple backends (I guess it's actually a simpler git-
          | annex). It also has some rather MLOps-specific features.
          | It's handy if you do versioned model training with changing
          | data on S3.
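          | 
          | A minimal sketch of what that looks like from Python (the
          | repo URL, path and rev below are made up; dvc.api.read is
          | DVC's actual read API):
          | 
          |   import dvc.api
          | 
          |   # read one specific version of a dataset that lives
          |   # in an S3-backed DVC remote, addressed by a git rev
          |   text = dvc.api.read(
          |       "data/train.csv",   # hypothetical tracked path
          |       repo="https://github.com/org/repo",
          |       rev="v1.2",         # tag, branch or commit sha
          |   )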
        
           | starkparker wrote:
           | I've used it for storing rasters alongside georeferencing
           | data in small GIS projects, as an alternative to git LFS. It
           | not only works like git but can integrate with git repos
           | through commit and push/pull hooks, storing DVC pointers and
            | managing .gitignore files while retaining the directory
            | structure of the DVC-managed files. It's neat, even if
            | the initial learning curve was a little steep.
           | 
           | We used Google Drive as a storage backend and had to grow out
           | of it to a WebDAV backend, and it was nearly trivial to swap
           | them out and migrate.
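            | 
            | Roughly how the same pointer resolves once a different
            | remote is configured (the path and remote name here are
            | made up; dvc.api.get_url is a real API):
            | 
            |   import dvc.api
            | 
            |   # the .dvc pointer file stays the same; the URL it
            |   # resolves to depends on the configured remote
            |   url = dvc.api.get_url(
            |       "rasters/site01.tif",   # hypothetical path
            |       repo=".",
            |       remote="webdav-store",  # hypothetical remote
            |   )
            |   print(url)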
        
           | haensi wrote:
           | There's another thread from October 2022 on that topic.
           | 
           | https://news.ycombinator.com/item?id=33047634
           | 
            | What makes DVC especially useful for MLOps? Aren't MLFlow
            | or W&B solving that already, in a way that's open source
            | (the former) or that increases speed and scale massively
            | (the latter)?
           | 
           | Disclaimer: I work at W&B.
        
             | riedel wrote:
              | DVC is much more basic (it feels more Unix-style) and
              | integrates really well with simple CI/CD scripting and
              | git versioning, without the need to set up any
              | additional servers.
              | 
              | And it's not either/or: people actually combine MLFlow
              | and DVC [0].
             | 
             | [0] https://data-ai.theodo.com/blog-technique/dvc-pipeline-
             | runs-...
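              | 
              | The combination can be as simple as recording the DVC
              | data version inside the MLflow run. A rough sketch
              | (paths and values are made up; both calls are real
              | APIs, run from inside a DVC repo):
              | 
              |   import dvc.api
              |   import mlflow
              | 
              |   with mlflow.start_run():
              |       # record which data version this run used
              |       mlflow.log_param(
              |           "train_data_url",
              |           dvc.api.get_url("data/train.csv"),
              |       )
              |       # ... train, then log metrics as usual ...
              |       mlflow.log_metric("val_acc", 0.93)  # dummy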
        
           | matrss wrote:
           | Speaking of git-annex, there is another project called
           | DataLad (https://www.datalad.org/), which has some overlap
           | with DVC. It uses git-annex under the hood and is domain-
           | agnostic, compared to the ML focus that DVC has.
        
       | shicholas wrote:
       | What are the benefits of DVC over Apache Iceberg? If anyone used
       | both, I'd be curious about your take. Thanks!
        
         | andrew_lettuce wrote:
         | I don't see any real benefits, as it feels like using the tool
         | you already know even though it's not quite right. Iceberg is
         | maybe geared towards slower changing models than this approach?
        
           | foobarbecue wrote:
           | username checks out
        
       | bramathon wrote:
       | I've used DVC for most of my projects for the past five years.
        | The good thing is that it works a lot like git. If your
       | scientists understand branches, commits and diffs, they should be
       | able to understand DVC. The bad thing is that it works like git.
       | Scientists often do not, in fact, understand or use branches,
       | commits and diffs. The best thing is that it essentially forces
       | you to follow Ten Simple Rules for Reproducible Computational
       | Research [1]. Reproducibility has been a huge challenge on teams
       | I've worked on.
       | 
       | [1]
       | https://journals.plos.org/ploscompbiol/article?id=10.1371/jo...
        
       | dmpetrov wrote:
       | hi there! Maintainer and author here. Excited to see DVC on the
       | front page!
       | 
       | Happy to answer any questions about DVC and our sister project
       | DataChain https://github.com/iterative/datachain that does data
        | versioning with slightly different assumptions: no file
        | copies and built-in data transformations.
        
         | ajoseps wrote:
         | if the data files are all just text files, what are the
         | differences between DVC and using plain git?
        
           | dmpetrov wrote:
            | In this case, you need DVC if:
            | 
            | 1. Files are too large for Git and Git LFS.
            | 
            | 2. You prefer using S3/GCS/Azure as storage (sketch
            | below).
            | 
            | 3. You need to track transformations/pipelines on the
            | files - cleaning up text files, training models, etc.
            | 
            | Otherwise, vanilla Git may be sufficient.
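            | 
            | For point 2, a rough sketch of reading a file that lives
            | only in cloud storage (the path and repo URL are
            | hypothetical; dvc.api.open is a real API):
            | 
            |   import dvc.api
            | 
            |   # stream a DVC-tracked file straight from the remote,
            |   # without copying it into the local workspace
            |   with dvc.api.open(
            |       "data/big.txt",
            |       repo="https://github.com/org/repo",
            |       rev="main",
            |   ) as f:
            |       n_lines = sum(1 for _ in f)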
        
           | miki123211 wrote:
           | DVC does a lot more than git.
           | 
           | It essentially makes sure that your results can reproducibly
           | be generated from your original data. If any script or data
           | file is changed, the parts of your pipeline that depend on
           | it, possibly recursively, get re-run and the relevant results
           | get updated automatically.
           | 
           | There's no chance of e.g. changing the structure of your
           | original dataset slightly, forgetting to regenerate one of
           | the intermediate models by accident, not noticing that the
           | script to regenerate it doesn't work any more due to the new
           | dataset structure, and then getting reminded a year later
           | when moving to a new computer and trying to regen everything
           | from scratch.
           | 
           | It's a lot like Unix make, but with the ability to keep track
           | of different git branches and the data / intermediates they
           | need, which saves you from needing to regen everything every
           | time you make a new checkout, lets you easily exchange large
           | datasets with teammates etc.
           | 
           | In theory, you could store everything in git, but then every
           | time you made a small change to your scripts that e.g.
           | changed the way some model works and slightly adjusted a
           | score for each of ten million rows, your diff would be 10m
           | LOC, and all versions of that dataset would be stored in your
           | repo, forever, making it unbelievably large.
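            | 
            | A toy version of that make-like logic, just to show the
            | idea (a conceptual sketch, not how DVC is implemented;
            | DVC keeps the real hashes in dvc.lock):
            | 
            |   import hashlib, json, subprocess
            |   from pathlib import Path
            | 
            |   def file_md5(p):
            |       data = Path(p).read_bytes()
            |       return hashlib.md5(data).hexdigest()
            | 
            |   def maybe_rerun(cmd, deps, state="state.json"):
            |       # re-run a stage only if a dependency changed
            |       old = {}
            |       if Path(state).exists():
            |           old = json.loads(Path(state).read_text())
            |       new = {d: file_md5(d) for d in deps}
            |       if new != old.get(cmd):
            |           subprocess.run(cmd, shell=True, check=True)
            |           old[cmd] = new
            |           Path(state).write_text(json.dumps(old))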
        
       | causal wrote:
        | Is this useful for large binaries?
        
       ___________________________________________________________________
       (page generated 2024-10-19 23:00 UTC)