[HN Gopher] Ask HN: How do your ML teams version datasets and mo...
       ___________________________________________________________________
        
       Ask HN: How do your ML teams version datasets and models?
        
       Git worked until we hit a few gigabytes. S3 scales super well,
       but version control, documentation, and change management
       aren't great (we just accumulated lots of "v1" or
       "vsep28_2023" names). The team found DVC very clunky (now I
       need git AND s3 AND dvc). What best practices and patterns
       have you seen work, or implemented yourself?
        
       Author : skadamat
       Score  : 43 points
       Date   : 2023-09-28 19:41 UTC (3 hours ago)
        
       | warkdarrior wrote:
        | I use five version tags; after that I just rename the dataset.
       | 
       | v1
       | 
       | v2
       | 
       | v2_<iso-date>
       | 
       | v3_final
       | 
       | FINAL_final
        
       | snovv_crash wrote:
        | A CSV file in git with paths to all of the files, all the
        | training settings, and the path to the training artifacts
        | (snapshots, loss stats, etc.). The training artifacts get
        | filled in by CI when you commit. The files can live
        | anywhere; for us it was a NAS, because the data we were
        | training on contained PII, so "someone else's computer" AKA
        | the cloud wasn't an option.
        
         | hhh wrote:
         | Why would having PII rule out cloud?
        
           | michaelt wrote:
            | Most cloud providers are "secure" in the sense that they
            | lock up your data and leave the key in the door so you
            | can access it easily. A salesman will swear, hand on
            | heart, that they'd never abuse this. An auditor has also
            | certified that they meet the highest standard of
            | checkbox-clearing.
           | 
           | This is enough to meet the legal requirements, as I
           | understand things.
           | 
           | Some people are not credulous enough to take the salesman at
           | his word.
        
       | wendyshu wrote:
       | https://dvc.org/
       | 
       | https://github.com/dolthub/dolt
       | 
       | https://www.pachyderm.com/
        
       | herodoturtle wrote:
       | We use MLflow's model registry:
       | 
       | https://mlflow.org/docs/latest/model-registry.html
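        | 
        | (A minimal registration sketch; the toy model and the
        | "churn-model" name are hypothetical:)
        | 
        |     import mlflow
        |     from sklearn.linear_model import LogisticRegression
        | 
        |     model = LogisticRegression().fit([[0], [1]], [0, 1])
        | 
        |     with mlflow.start_run() as run:
        |         # Log the model as a run artifact...
        |         mlflow.sklearn.log_model(model, artifact_path="model")
        |         # ...then register it; the registry assigns an
        |         # incrementing version under the given name.
        |         mlflow.register_model(
        |             f"runs:/{run.info.run_id}/model", "churn-model"
        |         )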
        
       | gschoeni wrote:
        | We have been working on an open source tool called "Oxen"
        | that aims to tackle this problem! Would love for you to kick
        | the tires and see if it works for your use case. We have a
        | free version of the CLI, python library, and server on
        | GitHub, and a free hosted version you can try at Oxen.ai.
       | 
       | Website: https://oxen.ai
       | 
       | Dev Docs: https://docs.oxen.ai
       | 
       | GitHub: https://github.com/Oxen-AI/oxen-release
       | 
       | Feel free to reach out on the repo issues if you run into
       | anything!
        
       | cuteboy19 wrote:
       | Haphazardly, with commit# + timestamp of training
        
       | vinni2 wrote:
        | https://dvc.org/
        | 
        | https://huggingface.co/
        
       | kvnhn wrote:
       | I've used DVC in the past and generally liked its approach. That
       | said, I wholeheartedly agree that it's clunky. It does a lot of
       | things implicitly, which can make it hard to reason about. It was
       | also extremely slow for medium-sized datasets (low 10s of GBs).
       | 
       | In response, I created a command-line tool that addresses these
       | issues[0]. To reduce the comparison to an analogy: Dud : DVC ::
       | Flask : Django. I have a longer comparison in the README[1].
       | 
       | [0]: https://github.com/kevin-hanselman/dud
       | 
        | [1]: https://github.com/kevin-hanselman/dud/blob/main/README.md#m...
        
       | prashp wrote:
       | Git LFS
        
       | smfjaw wrote:
        | MLflow solves most of these issues for models. I haven't
        | used it for data versioning, but it handles most of the
        | model versioning and deployment management needs I can think
        | of.
        
       | AJRF wrote:
        | MLflow
        
       | zxexz wrote:
        | I think a decent solution is coming up with a system for
        | storing the models, datasets, checkpoints, etc. in S3, and
        | storing the metadata, references, etc. in a well-structured
        | Postgres table (schema versioning, audit logs, etc., with
        | snapshots). Also, embed the metadata in the model/dataset
        | itself, so that you could always reconstruct the database
        | from the artifacts (Arrow and Parquet files let you embed
        | arbitrary metadata at both the file level and the field
        | level).
        | 
        | But perhaps the best solution is to just use something like
        | MLflow or WandB that handles this for you, if you use the
        | API correctly!
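        | 
        | (A minimal sketch of the file-level embedding with pyarrow;
        | the column and metadata keys are hypothetical:)
        | 
        |     import pyarrow as pa
        |     import pyarrow.parquet as pq
        | 
        |     table = pa.table({"feature": [1, 2, 3]})
        |     # Merge custom keys into the schema metadata so the file
        |     # itself records its provenance.
        |     meta = dict(table.schema.metadata or {})
        |     meta.update({b"dataset_version": b"v3",
        |                  b"source_commit": b"0123ab"})
        |     pq.write_table(table.replace_schema_metadata(meta),
        |                    "data.parquet")
        | 
        |     # Later, rebuild the registry row from the file alone.
        |     print(pq.read_schema("data.parquet").metadata)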
        
       | plonk wrote:
       | Models that actually get deployed get a random GUID. Our docs
       | tell us which is which (release date, intended use, etc.)
       | 
       | Models are then stored in an S3 bucket. But since the IDs are
       | unique, they can be exchanged and cached and copied with next to
       | no risk of confusion.
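        | 
        | (A minimal sketch of that scheme with uuid and boto3; the
        | bucket name and metadata fields are hypothetical:)
        | 
        |     import uuid
        |     import boto3
        | 
        |     model_id = str(uuid.uuid4())  # globally unique, cache-safe
        |     boto3.client("s3").upload_file(
        |         "model.onnx",
        |         "my-models-bucket",
        |         f"models/{model_id}",
        |         # S3 object metadata; the authoritative docs about
        |         # release date and intended use live elsewhere.
        |         ExtraArgs={"Metadata": {"release-date": "2023-09-28",
        |                                 "intended-use": "ranking"}},
        |     )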
        
         | axpy906 wrote:
         | Is the bucket versioned?
        
       | janalsncm wrote:
        | We have a task name, major version, description, and commit
        | hash. So the model name will be something like
        | my_task_v852_pairwise_refactor_0123ab. Ugly but it works.
        | 
        | Don't store your data in git; store your training code there
        | and your data in S3. And you can add metadata to the bucket
        | so you know what's in there and how it was generated.
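        | 
        | (A minimal sketch of composing such a name; the task and
        | description strings are hypothetical:)
        | 
        |     import subprocess
        | 
        |     def model_name(task, major_version, description):
        |         # Short hash of the currently checked-out commit.
        |         commit = subprocess.check_output(
        |             ["git", "rev-parse", "--short", "HEAD"],
        |             text=True,
        |         ).strip()
        |         return f"{task}_v{major_version}_{description}_{commit}"
        | 
        |     print(model_name("my_task", 852, "pairwise_refactor"))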
        
       | speedgoose wrote:
        | Have you used git or git lfs to store the large files?
        
       ___________________________________________________________________
       (page generated 2023-09-28 23:01 UTC)