[HN Gopher] Determined: Deep Learning Training Platform
___________________________________________________________________
Determined: Deep Learning Training Platform
Author : petemir
Score : 48 points
Date : 2023-03-24 08:14 UTC (1 day ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| ipsum2 wrote:
| Owned by HPE:
| https://www.hpe.com/us/en/newsroom/press-release/2021/06/hew...
|
| Looking through the documentation, the API looks brittle. I'll
| stick with slurm for large jobs and run things locally for
| testing/debugging.
| neilc wrote:
| > Looking through the documentation, the API looks brittle.
|
| Thanks for the feedback! Can you elaborate on the parts of the
| API you felt were brittle?
| petemir wrote:
| I'm a PhD student and we currently have a DL server at my lab
| that I manage. Looking for a way to administer workloads and
| environments to create reproducible models for undergraduate
| students and other researchers, I arrived at determined.ai. It
| seemed interesting to share with the HN crowd.
| rsfern wrote:
| That's cool. I was wondering how this compares to ray (which I
| use with my institution's slurm-based clusters). The scheduler
| system that determined.ai has seems a lot more granular, which
| suits the workloads you get with a team of people doing a bunch
| of deep learning model prototyping. Our debug queue has a
| five-minute preempt time, which sometimes adds a lot of friction
| for quick debugging iteration when utilization is maxed out.
| complex1314 wrote:
| I'm in about the same situation as OP. We have a small Power9
| cluster that has been unmaintained and unused for a while, so
| I will set it up from scratch. I've been looking into solutions
| that would be a good fit. For the moment we are just a few
| students/postdocs, so manual scheduling is feasible, but
| eventually we would like to make it available to other students
| at the institution.
|
| My candidates are also:
| - slurm + ray/lightning/etc.
| - determined.ai (maybe together with slurm)
|
| Some advertise a kubernetes setup with kubeflow but I would
| imagine that is a bit too complex for a small cluster.
|
| Anyone else with experience in this? Any other suggestions?
|
| To make the environments as reproducible as possible it would
| be great to also have a setup based on docker containers and
| maybe nix, but not sure if it is feasible on ppc64. Guix and
| Spack have also come up in my searches.
|
| edit: typo
___________________________________________________________________
(page generated 2023-03-25 23:01 UTC)