[HN Gopher] Ask HN: How does deploying a fine-tuned model work
___________________________________________________________________
Ask HN: How does deploying a fine-tuned model work
If I've managed to build my own model, say a fine-tuned version of
Llama trained on some GPUs, how do I then deploy it and use it in
an app? Does it need to be running on GPUs all the time, or can I
host the model on a web server or something? Sorry if this is an
obvious/misinformed question; I'm a beginner in this space.
Author : FezzikTheGiant
Score : 11 points
Date : 2024-04-23 06:48 UTC (16 hours ago)
| tikkun wrote:
| TL;DR: you'll probably serve it on GPUs.
|
| If it's a small model, you might be able to host it on a regular
| server with CPU inference (see llama.cpp, and the first sketch
| below).
|
| You can also run a big model on CPU, but it will be really slow.
|
| Realistically, though, you'll probably want GPU inference (see
| the second sketch below):
|
| Either running on GPUs all the time (no cold starts) or on
| serverless GPUs (the downside there being that instances have to
| start up on demand, which might take around 10 seconds).
___________________________________________________________________
(page generated 2024-04-23 23:00 UTC)