[HN Gopher] Determined: Deep Learning Training Platform
       ___________________________________________________________________
        
       Determined: Deep Learning Training Platform
        
       Author : petemir
       Score  : 48 points
       Date   : 2023-03-24 08:14 UTC (1 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | ipsum2 wrote:
       | Owned by HPE: https://www.hpe.com/us/en/newsroom/press-
       | release/2021/06/hew...
       | 
       | Looking through the documentation, the API looks brittle. I'll
       | stick with slurm for large jobs and run things locally for
       | testing/debugging.
        
         | neilc wrote:
         | > Looking through the documentation, the API looks brittle.
         | 
         | Thanks for the feedback! Can you elaborate on the parts of the
         | API you felt were brittle?
        
       | petemir wrote:
       | I'm a PhD student, and I manage the DL server we currently have
       | at my lab. While looking for a way to administer loads and
       | environments so that undergraduate students and other
       | researchers can create reproducible models, I arrived at
       | determined.ai. It seemed interesting enough to share with the
       | HN crowd.
        
         | rsfern wrote:
         | That's cool. I was wondering how this compares to ray (which I
         | use with my institution's slurm-based clusters). The scheduler
         | that determined.ai has seems a lot more granular, which suits
         | the workloads you get with a team of people doing a bunch of
         | deep learning model prototyping. Our debug queue has a
         | five-minute preempt time, which sometimes adds a lot of
         | friction for quick debugging iterations when utilization is
         | maxed out.
        
         | complex1314 wrote:
         | I'm in about the same situation as OP. We have a small cluster
         | of Power9 machines that has been unmaintained and unused for a
         | while, so I will set it up from scratch. I've been looking into
         | solutions that would be a good fit. For the moment we are just
         | a few students and postdocs, so manual scheduling is feasible,
         | but eventually we would like to make it available to other
         | students at the institution.
         | 
         | My candidates are also:
         | 
         | - slurm + ray/lightning/etc.
         | - determined.ai (maybe together with slurm)
         | 
         | Some advertise a kubernetes setup with kubeflow but I would
         | imagine that is a bit too complex for a small cluster.
         | 
         | Anyone else with experience in this? Any other suggestions?
         | 
         | To make the environments as reproducible as possible, it would
         | be great to also have a setup based on docker containers and
         | maybe nix, but I'm not sure that is feasible on ppc64. Guix and
         | Spack have also come up in my searches.
         | 
         | edit: typo
        
       ___________________________________________________________________
       (page generated 2023-03-25 23:01 UTC)