[HN Gopher] Ask HN: How do your ML teams version datasets and mo...
___________________________________________________________________
Ask HN: How do your ML teams version datasets and models?
Git worked until we hit a few gigabytes. S3 scales well, but its
version control, documentation, and change management aren't great
(we just used lots of "v1" or "vsep28_2023" names). The team found
DVC clunky (now I need git AND s3 AND dvc). What best practices
and patterns have you seen work, or implemented yourself?
Author : skadamat
Score : 43 points
Date : 2023-09-28 19:41 UTC (3 hours ago)
| warkdarrior wrote:
| I use five version tags, after that I just rename the dataset.
|
| v1
|
| v2
|
| v2_<iso-date>
|
| v3_final
|
| FINAL_final
| snovv_crash wrote:
| CSV file in git with paths to all of the files, all the training
| settings, and the path to the training artifacts (snapshots, loss
| stats etc). The training artifacts get filled in by CI when you
| commit. Files can be anywhere, for us it was a NAS due to PII in
| the data we were training on so "someone else's computer" AKA
| cloud wasn't an option.
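| A minimal sketch of the manifest idea above, using only the
| standard library. The column names and paths are hypothetical;
| the actual schema would be team-specific, with the artifact
| column filled in by CI after training:

```python
import csv
import io

# Hypothetical manifest columns; the real schema is team-specific.
FIELDS = ["data_path", "learning_rate", "epochs", "artifact_path"]

def write_manifest(rows):
    """Serialize manifest rows to CSV text suitable for committing to git."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def read_manifest(text):
    """Parse a committed manifest back into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

rows = [{
    "data_path": "/nas/datasets/train_2023_09.parquet",  # hypothetical path
    "learning_rate": "3e-4",
    "epochs": "10",
    "artifact_path": "",  # left blank; CI fills this in after training
}]
text = write_manifest(rows)
assert read_manifest(text)[0]["data_path"] == "/nas/datasets/train_2023_09.parquet"
```

| Because the manifest is plain text in git, every dataset or
| hyperparameter change shows up as an ordinary reviewable diff.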
| hhh wrote:
| Why would having PII rule out cloud?
| michaelt wrote:
| Most cloud providers are "secure" in the sense that they lock
| up your data and leave the key in the door so you can access
| it easily. A salesman will swear, hand on heart, that they'd
| never abuse this. An auditor has also certified that they
| meet the highest standards of checkbox clearing.
|
| This is enough to meet the legal requirements, as I
| understand things.
|
| Some people are not credulous enough to take the salesman at
| his word.
| wendyshu wrote:
| https://dvc.org/
|
| https://github.com/dolthub/dolt
|
| https://www.pachyderm.com/
| herodoturtle wrote:
| We use MLflow's model registry:
|
| https://mlflow.org/docs/latest/model-registry.html
| gschoeni wrote:
| We have been working on an open source tool called "Oxen" that
| aims to tackle this problem! Would love for you to kick the tires
| and see if it works for your use case. We have a free version of
| the CLI, python library, and server on github, and a free hosted
| version you can kick around at Oxen.ai.
|
| Website: https://oxen.ai
|
| Dev Docs: https://docs.oxen.ai
|
| GitHub: https://github.com/Oxen-AI/oxen-release
|
| Feel free to reach out on the repo issues if you run into
| anything!
| cuteboy19 wrote:
| Haphazardly, with commit# + timestamp of training
| vinni2 wrote:
| https://dvc.org/ https://huggingface.co/
| kvnhn wrote:
| I've used DVC in the past and generally liked its approach. That
| said, I wholeheartedly agree that it's clunky. It does a lot of
| things implicitly, which can make it hard to reason about. It was
| also extremely slow for medium-sized datasets (low 10s of GBs).
|
| In response, I created a command-line tool that addresses these
| issues[0]. To reduce the comparison to an analogy: Dud : DVC ::
| Flask : Django. I have a longer comparison in the README[1].
|
| [0]: https://github.com/kevin-hanselman/dud
|
| [1]: https://github.com/kevin-
| hanselman/dud/blob/main/README.md#m...
| prashp wrote:
| Git LFS
| smfjaw wrote:
| MLflow solves most of these issues for models. I haven't used it
| for data versioning, but it covers most of the model versioning
| and deployment management needs I can think of.
| AJRF wrote:
| MLFlow
| zxexz wrote:
| I think a decent solution is coming up with a system for storing
| the models, datasets, checkpoints, etc. in S3, and storing the
| metadata, references, etc. in a well-structured Postgres table
| (schema versioning, audit logs, etc. with snapshots). Also, embed
| the metadata in the model/dataset itself, so that you could
| always reconstruct the database from the artifacts (Arrow and
| Parquet files let you embed arbitrary metadata at both the file
| level and the field level).
|
| But perhaps the best solution is to just use something like
| MLflow or WandB that handles this for you, if you use the API
| correctly!
| plonk wrote:
| Models that actually get deployed get a random GUID. Our docs
| tell us which is which (release date, intended use, etc.)
|
| Models are then stored in an S3 bucket. But since the IDs are
| unique, they can be exchanged and cached and copied with next to
| no risk of confusion.
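| A tiny sketch of the GUID-per-deployment scheme above, using the
| standard library. The in-memory `registry` dict is a stand-in
| for the docs that map each ID to its release date and intended
| use:

```python
import uuid

# Hypothetical stand-in for the docs mapping ID -> model details.
registry = {}

def register_model(release_date, intended_use):
    """Assign a random, globally unique ID to a deployed model."""
    model_id = str(uuid.uuid4())
    registry[model_id] = {
        "release_date": release_date,
        "intended_use": intended_use,
    }
    return model_id

mid = register_model("2023-09-28", "ranking")
assert mid in registry
assert len(mid) == 36  # canonical UUID string length
```

| Because collisions are vanishingly unlikely, the same ID can
| safely name the object in S3, in caches, and in copies.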
| axpy906 wrote:
| Is the bucket versioned?
| janalsncm wrote:
| We have a task name, major version, description and commit hash.
| So the model name will be something like
| my_task_v852_pairwise_refactor_0123ab. Ugly but it works.
|
| Don't store your data in git, store your training code there and
| your data in s3. And you can add metadata to the bucket so you
| know what's in there/how it was generated.
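| The naming convention above could be sketched as a small helper.
| The function name, field order, and 6-character hash prefix are
| assumptions that happen to reproduce the example name:

```python
def model_name(task, major_version, description, commit_hash):
    """Build an identifier like my_task_v852_pairwise_refactor_0123ab.

    Hypothetical helper mirroring the task/version/description/commit
    convention from the comment; the exact format is team-specific.
    """
    short_hash = commit_hash[:6]  # assumed abbreviation length
    return f"{task}_v{major_version}_{description}_{short_hash}"

name = model_name("my_task", 852, "pairwise_refactor", "0123abcdef")
assert name == "my_task_v852_pairwise_refactor_0123ab"
```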
| speedgoose wrote:
| Have you used git or git-lfs to store the large files?
___________________________________________________________________
(page generated 2023-09-28 23:01 UTC)