[HN Gopher] Data Version Control
___________________________________________________________________
Data Version Control
Author : HerrMonnezza
Score : 131 points
Date : 2022-10-01 16:09 UTC (1 days ago)
(HTM) web link (dvc.org)
(TXT) w3m dump (dvc.org)
| throwawaybutwhy wrote:
| The package phones home. One has to set an env var or fix several
| lines of code to prevent that.
| shcheklein wrote:
| Hey, yes, we've decided to keep it opt-out for now and it
| collects fully anonymized basic statistics. Here is the full
| policy: https://dvc.org/doc/user-guide/analytics .
|
    | It should be easy to opt out through `dvc config
    | core.analytics false` or the env variable `DVC_ANALYTICS=False`.
|
    | Could you please clarify what you mean by `several lines of
    | code`? We were trying to make it very open and visible what we
    | collect (it prints a large message when it starts) and to make
    | it easy to disable.
| prepend wrote:
      | This seems pretty anti-user, since most users prefer opt-in.
      | It seems pretty shady to keep in behavior that users don't
      | like and that potentially harms them (you only think it's
      | fully anonymized).
      |
      | That's your prerogative as it's your project, but it makes me
      | wonder what else you're doing that's against users' best
      | interests and in your own.
| shcheklein wrote:
        | We are fully aware that it raises concerns. Trust me, it
        | hurts my feelings as well. E.g. on the websites (dvc.org,
        | cml.dev, etc) we don't use any cookies, GA, etc.
        |
        | We've tried to make it as open as possible - the code is
        | available (it's open source), we write openly about this at
        | the very start, we have a policy online, and we made it easy
        | to opt out. If you have other ideas on how to make it even
        | more friendly or more visible, please let us know.
|
| Still, we've preferred so far to keep it opt-out since it's
| crucial for us to see major product trends (which features
| are being used more, product growth MoM etc). Opt-in at
| this stage realistically won't give us this information.
| prepend wrote:
| Yet there are many successful projects that don't collect
| this information. So it's not crucial for them but is
| crucial for you.
|
          | I think the challenge I have is that since you're
          | collecting IP addresses, there will be an opportunity for
          | abuse. And there seems to be some rule that any data that
          | can be misused will eventually be misused.
|
| Since you're not willing to make it opt-in, I think
| perhaps the only other way would be to support an
| automated distro that doesn't include it so users are at
| least able to easily choose a version.
|
          | I admire you for responding to this thread, and to me, as
          | it's definitely not easy. I just feel like one of the main
          | benefits of open source is its alignment with users'
          | interests, so it's discouraging when an open source
          | project chooses code that users don't want.
| shcheklein wrote:
| Right, many projects use opt-in, there are many that have
| opt-out though:
|
| https://docs.brew.sh/Analytics
| https://docs.npmjs.com/policies/privacy#how-does-npm-
| collect... VS Code, etc
|
            | > I think the challenge I have is that since you're
            | collecting IP addresses, there will be an opportunity
            | for abuse.
            |
            | Yes! And we are migrating to a new package /
            | infrastructure because of this -
            | https://github.com/iterative/telemetry-python (DVC's
            | sister tool MLEM is already on it; it's not sending
            | (saving) IP addresses, nor using GA or any other third-
            | party tools; data is saved into BigQuery and eventually
            | we'll make it publicly accessible -
            | https://mlem.ai/doc/user-guide/analytics - to be fully
            | GDPR compliant). It's a legacy system that DVC had in
            | place. There was no intention to use those IP addresses
            | in any way.
|
| > I think perhaps the only other way would be to support
| an automated distro that doesn't include it so users are
| at least able to easily choose a version.
|
            | Thanks. To some extent a brew-like policy (not sending
            | anything significant before there is a chance to disable
            | it, and showing a clear, explicit message) should
            | mitigate this, but I'll check whether it works this way
            | now and whether it can be improved.
| [deleted]
| sva_ wrote:
  | I wondered how they'd make money.
|
| https://www.crunchbase.com/organization/iterative-ai/company...
| nerdponx wrote:
| I think their plan was/is to make money on corporate licenses
| and support, as well as SaaS/cloud products.
| machinekob wrote:
    | They won't; they can only make the investors' money back by
    | selling the company to Amazon/Microsoft/Google, but in this
    | economy that won't happen.
| adhocmobility wrote:
| If you just want a git for large data files, and your files don't
| get updated too often (e.g. an ML model deployed in production
| which gets updated every month) then git-lfs is a nice solution.
| Bitbucket and Github both have support for it.
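  |
  | For anyone who hasn't set it up before, the basic flow is
  | roughly (file pattern and names made up):
      git lfs install                 # one-time: installs the hooks
      git lfs track "*.onnx"          # matching files become pointers
      git add .gitattributes model.onnx
      git commit -m "Track model with git-lfs"
      git push                        # uploads blobs to the lfs server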
| kortex wrote:
| I've used both extensively. Git-lfs has always been a
| nightmare. Because each tracked large file can be in one of two
| states - binary, or "pointer" - it's super easy for the folder
| to get all fouled up. It would be unable to "clean" or
| "smudge", since either would cause some conflict. If you
| accidentally pushed in the wrong state, you could "infect" the
| remote and be really hosed. I had this happen numerous times
| over about 2 years of using lfs, and each time the only
| solution was some aggressive rewriting of history.
|
| That, combined with the nature of re-using the same filename
| for the metadata files, meant that it was common for folks to
| commit the binary and push it. Again, lots of history rewriting
| to get git sizes back down.
|
| Maybe there exist solutions to my problems but I had spent
| hours wrestling with it trying to fix these bad states, and it
| caused me much distress.
|
| Also configuring the backing store was generally more painful,
| especially if you needed >2GB.
|
    | DVC was easy to use from the first moment. The separate meta
    | files mean that it can't get into mixed clean/smudge states. If
    | you aren't in a cloud workflow already, configuring the backing
    | store is a bit tricky, but even without AWS I made it work.
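    |
    | Roughly what I mean by "easy" (bucket name made up):
        dvc init
        dvc add data/images              # writes data/images.dvc
        git add data/images.dvc data/.gitignore
        git commit -m "Track dataset with DVC"
        dvc remote add -d storage s3://my-bucket/dvc-cache
        dvc push                         # uploads the cached files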
| kernelsanderz wrote:
    | I do feel like git-lfs is a good solution, but once you have
    | 10s or 100s of GB of files (e.g. a computer vision project),
    | hosted git-lfs gets pretty pricey.
|
    | Ideally I'd love to use git-lfs on top of S3, directly. I've
    | looked into git-annex and various git-lfs proxies, but I'm not
    | sure they're maintained well enough to trust them with
    | long-term data storage.
|
| Huggingface datasets are built on git-lfs and it works really
| well for them for storage of large datasets. Ideally I'd love
| for AWS to offer this as a hosted thin layer on top of S3, or
| for some well funded or supported community effort to do the
| same, and in a performant way.
|
| If you know of any such solution, please let me know!
| simonw wrote:
| It seems to be the solution Hugging Face have picked too.
| LaserToy wrote:
  | Can it be used for large and fast-changing datasets?
  |
  | Example: 100 TB, writes every 10 mins.
  |
  | Or: 1 TB, parquet, 40% is rewritten daily.
| nerdponx wrote:
| DVC is expressly for tracking artifacts that are files on disk,
| and only by comparing their MD5 hashes. So it can definitely
| track the parquet files, but you are not going to get row or
| field diffs or anything like that.
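    |
    | E.g. (filename made up):
        dvc add data.parquet   # hashes the file, writes data.parquet.dvc
        dvc status             # re-checks hashes, reports changes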
|
| Maybe Pachyderm or Dolt would be better tools here.
| bumblebritches5 wrote:
| AlotOfReading wrote:
| Why would you use MD5 in anything written in the last 5
| years? The SHA family is faster on modern hardware and there
| aren't trivial collisions floating around out there.
| kortex wrote:
        | It was definitely a bad choice. I wasn't there, so I can
        | only speculate. My guess is that it's because MD5 is sort
        | of ubiquitous and thus low-hanging fruit and the devs
        | didn't know better, or, the related corollary, it's what S3
        | uses for ETags, so it probably seemed logical. Either way,
        | it seems like someone did it without knowing better, no one
        | agrees on a fix or on whether a change is even necessary,
        | and thus it's stuck for now.
|
| There's an ongoing discussion about replacing/configuring
| the hash function, and it looks like there might be some
| movement toward replacing the hash and other speedups in
| 3.0
|
| https://github.com/iterative/dvc/issues/3069
|
        | > We not only want to switch to a different algorithm in
        | 3.0, but to also provide better
        | performance/ui/architecture/ecosystem for data management,
        | and all of that while not ceasing releases with new
        | features (experiments, dvc machine, plots, etc) and bug
        | fixes for 2.0, so we've been gradually rebuilding that and
        | will likely be ready for 3.0 in the upcoming months. -
        | https://github.com/iterative/dvc/issues/3069#issuecomment-93...
| nerdponx wrote:
| Don't quote me on the specific hash algorithm, maybe it's
| SHA. Point is that it's just comparing modification times
| and hashes.
| snthpy wrote:
| What about Apache Iceberg for those?
| tomthe wrote:
| Can anyone compare this to DataLad [1], which someone introduced
| to me as "git for data"?
|
| [https://www.datalad.org/]
| remram wrote:
    | It doesn't use git-annex like DataLad does. That alone is a
    | huge benefit, given the state of that tool.
| imiric wrote:
| I'm curious, what's the problem with git-annex?
|
| I've considered using it before as an alternative to Git LFS.
| niccl wrote:
| things that I don't like about it:
|
| * git diff doesn't work in any sensible way
|
| * if you forget and do `git add` instead of `git annex
| add`, everything is fine, but you've now spoilt the nice
| thing that git annex does of de-duping files. (git annex
| only stores one copy of identical files)
|
| * for our use case (which I'm sure is the wrong way of
| doing things) it's possible to overwrite the single copy of
| a file that git annex stores, which rather spoils the point
| of the thing. I do think it's down to the way we use it,
| though, so not specifically a git annex problem
|
| The _great_ thing about git annex is it can be self-hosted.
| For various reasons we can't put our source data in one of
| the systems that uses git-lfs.
|
| We've got about 800 GB of data in git annex and I've been
| happy with it despite the limitations.
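        |
        | The day-to-day commands are simple enough (remote and file
        | names made up):
            git annex init
            git annex add raw/scan-001.dat  # checks it into the annex
            git commit -m "Add scan"
            git annex copy --to=backup      # send content to our remote
            git annex get raw/scan-001.dat  # fetch it in another clone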
| hpfr wrote:
| If you configure annex.largefiles, git add should work
| with the annex. I start with something like
            git annex config --set annex.largefiles 'largerthan=1kb
            and (not (mimeencoding=us-ascii or mimeencoding=utf-8))'
|
| > By default, git-annex add adds all files to the annex
| (except dotfiles), and git add adds files to git (unless
| they were added to the annex previously). When
| annex.largefiles is configured, both git annex add and
| git add will add matching large files to the annex, and
| the other files to git. --https://git-
| annex.branchable.com/git-annex/
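          |
          | So with that config, both commands do the right thing
          | (filenames made up):
            git add model.bin   # >1kb and binary, goes to the annex
            git add notes.txt   # plain text file, stays in git itself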
|
| Note that git add will add large files unlocked, though,
| since (as far as I understand) it's assumed you're still
| modifying them for safety:
|
| > If you use git add to add a file to the annex, it will
| be added in unlocked form from the beginning. This allows
| workflows where a file starts out unlocked, is modified
| as necessary, and is locked once it reaches its final
| version. --https://git-annex.branchable.com/git-annex-
| unlock/
| remram wrote:
          | Yes, it definitely serves a valid use case; I feel like
          | someone should try to bring some competition there. A
          | modern equivalent with fewer gotchas, maybe in Rust/Go,
          | maybe using a FUSE mount and content-defined chunking
          | (borg/restic/...-style), would be amazing.
| kernelsanderz wrote:
| I'd love to see a well-supported git-lfs compatible
| client/proxy (so you could more easily move backends)
| that could run on top of S3/object storage. Yes, and
| written in a modern language like golang/rust for
| performance / parallelism. There's some node.js and
| various other git-lfs proxies out there, but not well
| enough maintained that I could count on them being around
| and working in another 5 years. git-annex at least has
| been around for a while, even though it has its issues.
|
| Huggingface uses git-lfs for large datasets with good
| success. git-lfs on GitHub gets very pricey at higher
| volumes of data. Would love the affordability of object
| storage, just with a better git blob storage interface,
| that will be around in the future.
|
            | Most of these systems do their own hash calculations
            | and are not interchangeable with each other. I feel
            | like git-lfs has the momentum in data science at the
            | moment, but it needs some better options for people who
            | want a low-cost storage option that they can control.
|
            | Huggingface is great, but it's one more service to
            | onboard if you're in an enterprise. And data
            | privacy/retention/governance means that many people
            | would like their data to reside on their own
            | infrastructure.
|
| If AWS were to give us a low cost git-lfs hosted service
| on top of S3 it would be very popular.
|
| If anyone knows of some good alternatives, please let us
| know!
| kernelsanderz wrote:
| Did some more research to see if anything had changed in
| this space. I found two interesting projects (haven't
| used them myself yet though):
|
| One in C# (with support for auth)
|
| https://github.com/alanedwardes/Estranged.Lfs
|
| One in Rust (but no Auth, have to run reverse proxy)
|
| https://github.com/jasonwhite/rudolfs
|
| Both seem interesting. Anyone use these?
| remram wrote:
| It lives in this weird wiki that seems to be read-only most
| of the time. I don't think it's alive. Its use of hard
| links also causes too many problems, of the silent
| corruption variety.
| hpfr wrote:
| Ikiwiki's definitely a bit weird, but I've been
| experimenting with git-annex recently and it worked fine
| every time I commented. Seems like it's chugging along:
| https://git-annex.branchable.com/recentchanges/
|
| When does it use hard links? As far as I remember it used
| symlinks unless you used something like annex.hardlink
| (described in the man page: https://git-
| annex.branchable.com/git-annex/)
| benhurmarcel wrote:
| And what about Dolt?
|
| https://docs.dolthub.com/introduction/what-is-dolt
| shcheklein wrote:
      | Dolt is for tabular data. It's like SQLite but with branching
      | and versioning at the DB level. DVC is file-based. It saves
      | large files, directories, etc. to one of the supported
      | storages - S3, GCP, Azure, etc. It's more like git-lfs in
      | that sense.
      |
      | Another difference is that for DVC (surprisingly) data
      | versioning itself is just one of the fundamental layers
      | needed to provide holistic ML experiment tracking and
      | versioning. So DVC has a layer to describe an ML project, run
      | it, and capture and version its inputs/outputs. In that sense
      | DVC becomes a more opinionated / higher-level tool, if that
      | makes sense.
| bs7280 wrote:
| What value does this provide that I can't get by versioning my
| data in partitioned parquet files on s3?
| shcheklein wrote:
    | I think parquet won't help with images, video, or ML models.
    |
    | Also, one thing is to physically provide a way to version data
    | (e.g. partitioned parquet files, cloud versioning, etc.), but
    | another is to have a mechanism for saving / codifying the
    | dataset version into the project. E.g. to answer the question
    | of which version of the data a model was built with, you would
    | need to save some identifier / hash / list of the files that
    | were used. DVC takes care of that part as well.
    |
    | (It also has mechanics for caching data that you download,
    | make-like pipelines, etc.)
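    |
    | Concretely (names made up): after `dvc add data/train` the
    | dataset's hash lives in `data/train.dvc` and is committed
    | together with the code, so later you can do
        git checkout <model-release-tag>
        dvc checkout     # restores the data that tag was built with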
| smeagull wrote:
| I don't think this tool can encompass everything you need in
| managing ML models and data sets, even if you limit it to
| versioning data.
|
| I'd need such a tool to manage features, checkpoints and labels.
| This doesn't do any of that. Nor does it really handle merging
| multiple versions of data.
|
  | And I'd really like the code to be handled separately from the
  | data; Git is not the place to do this. The choice of pairing
  | code and data should happen at a higher level, and be tracked
  | along with the results - that's not going in a repo - MLflow or
  | TensorBoard handles it better.
| lizen_one wrote:
  | DVC had the following problems when I tested it (half a year
  | ago):
  |
  | It gets super slow (waiting minutes) when a few thousand files
  | are tracked. Thousands of files have to be tracked if you have,
  | e.g., a 10GB file per day and region plus the artifacts generated
  | from it.
|
  | You are encouraged (it can only track artifacts) if you model
  | your pipeline in DVC (think of it like make). However, it cannot
  | run tasks in parallel. So it takes a lot of time to run a
  | pipeline while you are on a beefy machine and only one core is
  | used. Obviously, you also cannot use other tools (e.g. snakemake)
  | to distribute/parallelize across multiple machines. Running one
  | (part of a) stage also has some overhead, because it does
  | commits/checks before/after running the task's executable.
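  |
  | For context, a stage is declared and run roughly like this
  | (script and paths made up):
      dvc stage add -n prepare -d prepare.py -d data/raw \
          -o data/clean python prepare.py
      dvc repro    # runs stages whose dependencies changed, serially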
|
  | Sometimes you get merge conflicts if you manually run one part of
  | a (partially parametrized) stage on one machine and the other
  | part on another machine. These are cumbersome to fix.
|
  | Currently, I think they are more focused on ML features like
  | experiment tracking (I prefer other, more mature tools here) than
  | on performance and data safety.
|
  | There is an alternative implementation from a single developer (I
  | cannot find it right now) that fixes some problems. However, I do
  | not use it because it probably will not have the same development
  | progress and testing as DVC.
|
  | This sounds negative, but I think it is currently one of the best
  | tools in this space.
| DougBTX wrote:
| What's best if parallel step processing is required?
| jdoliner wrote:
    | DVC is great for use cases that don't get to this scale or have
    | these needs. And the issues here are non-trivial to solve. I've
    | spent a lot of time figuring out how to solve them in Pachyderm,
    | which is good for use cases where you do need higher levels of
    | scale or might run into merge conflicts with DVC. There are
    | trade-offs though. DVC is definitely easier for a single
    | developer / data scientist to get up and running with.
| nerdponx wrote:
| I think it's worth noting that DVC can be used to track
| artifacts that have been generated by other tools. For
| example, you could use MLFlow to run several model
| experiments, but at the end track the artifacts with DVC.
| Personally I think that this is the best way to use it.
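      |
      | E.g. (paths made up): run the experiments however you like,
      | then pin the artifact you care about:
          dvc add artifacts/best_model.pkl
          git add artifacts/best_model.pkl.dvc artifacts/.gitignore
          git commit -m "Pin selected model"
          dvc push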
|
      | However, I agree that in general it's best for smaller
      | projects and use cases. For example, it still shares the
      | primary deficiency of Make in that it can only track files on
      | the file system, and not things like whether a database table
      | has been created (unless you 'touch' your own sentinel files).
| bagavi wrote:
    | The alternative tool you are referring to is `Dud`, I believe.
    |
    | DVC is the best tool I found, in spite of being dead slow and
    | complex (trying to do many things).
|
| What alternatives would you recommend?
| remram wrote:
| > You are encouraged if you model your pipeline in DVC.
|
| Encouraged to do what?
|
| You might want to slow down on the use of parentheses, we are
| both getting lost in them.
| nerdponx wrote:
      | I assume they meant to say "you are encouraged to use DVC to
      | run your model and experiment pipeline". They want to
      | encourage you to do this because they are trying to build a
      | business around being a data science ops ecosystem. But the
      | truth is that DVC is not a great tool for running
      | "experiments" searching over a parameter space. It could be
      | improved in that regard, but that's just not what I use it
      | for, nor what I recommend it to other people for.
|
      | However, it's fantastic for _tracking_ artifacts throughout a
      | project that have been generated by other means, for keeping
      | those artifacts tightly in sync with Git, and for making it
      | easy to share those artifacts without forcing people to
      | re-run expensive pipelines.
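      |
      | The sharing part is basically just:
          git pull       # picks up the updated .dvc metafiles
          dvc pull       # fetches the matching artifacts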
| shcheklein wrote:
| > But the truth is that DVC is not a great tool for running
| "experiments" searching over a parameter space.
|
        | Would love your feedback on what's missing there! We've
        | been improving it lately - e.g.:
|
| - Hydra support https://dvc.org/doc/user-guide/experiment-
| management/hydra
|
| - VS Code extension - https://marketplace.visualstudio.com/
| items?itemName=Iterativ...
| kernelsanderz wrote:
          | Last I checked, it wasn't easy to use something like
          | Optuna to do hyperparameter tuning with Hydra/DVC.
          |
          | Ideally I'd like the tool I use for data versioning
          | (DVC/git-lfs/git-annex) to be orthogonal to the one I use
          | for hyperparameter sweeping (DVC/Optuna/SageMaker
          | experiments), orthogonal to the one I use for
          | configuration management (DVC/Hydra/plain YAML), and
          | orthogonal to the one I use for experiment DAG management
          | (DVC/Makefile).
|
          | Optuna is becoming very popular in the data-science/deep
          | learning ecosystem at the moment. It would be great to
          | see more composable tools, rather than having to go
          | all-in on a given ecosystem.
|
          | Love the work that DVC is doing to tackle these difficult
          | problems, though!
___________________________________________________________________
(page generated 2022-10-02 23:00 UTC)