[HN Gopher] Ask HN: How does deploying a fine-tuned model work
       ___________________________________________________________________
        
       Ask HN: How does deploying a fine-tuned model work
        
       If I've managed to build my own model, say a fine-tuned
       version of Llama, and trained it on some GPUs, how do I then
       deploy it and use it in an app? Does it need to be running on
       the GPUs all the time, or can I host the model on a web server
       or something? Sorry if this is an obvious/misinformed
       question; I'm a beginner in this space.
        
       Author : FezzikTheGiant
       Score  : 11 points
       Date   : 2024-04-23 06:48 UTC (16 hours ago)
        
       | tikkun wrote:
        | TL;DR: you'll probably serve it on GPUs.
       | 
        | If it's a small model, you might be able to host it on a
        | regular server with CPU inference (see llama.cpp).
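        | 
        | For example, here's a minimal CPU-inference sketch using
        | the llama-cpp-python bindings for llama.cpp. It assumes
        | your fine-tune has already been converted to GGUF format;
        | the file path and prompt are placeholders:
        | 
        |     from llama_cpp import Llama  # pip install llama-cpp-python
        | 
        |     # Load the GGUF-converted fine-tune from local disk
        |     # (placeholder path).
        |     llm = Llama(model_path="./my-finetune.gguf", n_ctx=2048)
        | 
        |     # Run a single completion on the CPU.
        |     out = llm("Q: What is fine-tuning? A:", max_tokens=64)
        |     print(out["choices"][0]["text"])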
       | 
        | You could also run a big model on CPU, but it will be
        | really slow.
       | 
        | But realistically, you'll probably want to use GPU
        | inference.
       | 
        | That means either running on GPUs all the time (no cold
        | starts) or on serverless GPUs (the downside being that
        | instances have to start up on demand, which might take
        | ~10 seconds).
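        | 
        | For the always-on GPU case, one common pattern (just an
        | illustration, not the only option) is to load the model
        | into an inference engine such as vLLM and put a web API in
        | front of it. A minimal sketch, assuming your fine-tuned
        | weights are in Hugging Face format at a placeholder path:
        | 
        |     from vllm import LLM, SamplingParams  # pip install vllm
        | 
        |     # Load the fine-tuned weights onto the GPU once, at
        |     # process startup (placeholder path).
        |     llm = LLM(model="./my-finetuned-llama")
        |     params = SamplingParams(temperature=0.7, max_tokens=64)
        | 
        |     # Each request then reuses the already-loaded model,
        |     # which is why always-on serving has no cold start.
        |     outputs = llm.generate(["What is fine-tuning?"], params)
        |     print(outputs[0].outputs[0].text)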
        
       ___________________________________________________________________
       (page generated 2024-04-23 23:00 UTC)