[HN Gopher] Torchchat: Chat with LLMs Everywhere
___________________________________________________________________
Torchchat: Chat with LLMs Everywhere
Author : constantinum
Score : 226 points
Date : 2024-08-01 03:48 UTC (19 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| gleenn wrote:
| This looks awesome: the instructions are basically a one-liner
| to get a Python chat program running, and it's optimized for a
| lot of hardware you can run locally, like Nvidia GPUs or Apple
| M-series processors. Super cool work bringing this functionality
| to local apps and making it easy to play with a lot of popular
| models. Great work.
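|
| For anyone curious, the quickstart is roughly this (going from
| memory of the README, so treat the exact model names and
| commands as illustrative rather than authoritative):
|
|     git clone https://github.com/pytorch/torchchat.git
|     cd torchchat
|     # install dependencies per the README, then:
|     python3 torchchat.py download llama3.1
|     python3 torchchat.py chat llama3.1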
| daghamm wrote:
| Does pytorch have better acceleration on x64 CPUs nowadays?
|
| Last time I played with LLMs on CPU in PyTorch, you had to
| replace some stuff with libraries from Intel, otherwise
| performance would be really bad.
| gleenn wrote:
| I can't find it again in this doc, but I'm pretty sure it
| supports MKL, which at least is Intel's faster math library.
| Better than a stick in the eye. Also certainly faster than
| plain CPU code, but almost certainly way slower than anything
| with massively parallel matrix hardware.
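|
| If you want to check, this is stock PyTorch API (nothing
| torchchat-specific) for seeing which CPU math backends your
| build ships with:
|
|     import torch
|
|     # True if PyTorch was built against Intel MKL (BLAS/LAPACK)
|     # and oneDNN/MKL-DNN (optimized conv/matmul CPU kernels).
|     print("MKL:    ", torch.backends.mkl.is_available())
|     print("oneDNN: ", torch.backends.mkldnn.is_available())
|     print("threads:", torch.get_num_threads())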
| sva_ wrote:
| x86_64*
| jiratemplates wrote:
| looks great
| fbuilesv wrote:
| I'm not well versed in LLMs; can someone with more experience
| share how this compares to Ollama (https://ollama.com/)? When
| would I use this instead?
| jerrygenser wrote:
| Ollama currently has only one "supported backend", which is
| llama.cpp. It enables downloading and running models on CPU,
| and it might have a more mature server.
|
| Torchchat allows running models on GPU as well.
| darkteflon wrote:
| Ollama runs on GPUs just fine - on Macs, at least.
| Kelteseth wrote:
| Works fine on Windows with an AMD 7600 XT.
| amunozo wrote:
| I use it on Ubuntu and it works fine too.
| ekianjo wrote:
| it runs on GPUs everywhere. On Linux, on Windows...
| Zambyte wrote:
| I have been running Ollama on AMD GPUs (support for which came
| after NVIDIA GPUs) since February. Llama.cpp has supported them
| even longer.
| tarruda wrote:
| How well does it run on AMD GPUs these days compared to
| Nvidia or Apple silicon?
|
| I've been considering buying one of those powerful Ryzen
| mini PCs to use as an LLM server on my LAN, but I've read
| before that the AMD backend (ROCm, IIRC) is kinda buggy.
| SushiHippie wrote:
| I have an RX 7900 XTX and never had AMD-specific issues,
| except that I needed to set some environment variable.
|
| But it seems like integrated GPUs are not supported
|
| https://github.com/ollama/ollama/issues/2637
| JackYoustra wrote:
| Probably if you need any of the more esoteric features that
| PyTorch supports. FlashAttention-2, for example, was supported
| way earlier in PyTorch than in llama.cpp, so if FlashAttention-3
| follows the same path it'll probably make more sense to use this
| when targeting Nvidia GPUs.
| sunshinesfbay wrote:
| It would appear that FlashAttention-3 already exists for
| PyTorch, based on this joint blog post from Nvidia, Together.ai
| and Princeton about enabling it for PyTorch:
| https://pytorch.org/blog/flashattention-3/
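|
| In stock PyTorch you can already opt into the FlashAttention
| kernel through the scaled_dot_product_attention API; a minimal
| sketch (which kernel you actually get depends on your GPU,
| dtypes and PyTorch version):
|
|     import torch
|     from torch.nn.attention import SDPBackend, sdpa_kernel
|
|     q, k, v = (torch.randn(1, 8, 1024, 64, device="cuda",
|                            dtype=torch.float16) for _ in range(3))
|
|     # Restrict SDPA to the FlashAttention backend; PyTorch errors
|     # out if that kernel can't serve these shapes/dtypes/hardware.
|     with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
|         out = torch.nn.functional.scaled_dot_product_attention(
|             q, k, v)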
| JackYoustra wrote:
| Right - my point about "follows the same path" mostly
| revolves around llama.cpp's latency in adopting it.
| Star_Ship_1010 wrote:
| Best answer to this is from Reddit
|
| "how does a smart car compare to a ford f150? its different in
| its intent and intended audience.
|
| Ollama is someone who goes to walmart and buys a $100 huffy
| mountain bike because they heard bikes are cool. Torchchat is
| someone who built a mountain bike out of high quality
| components chosen for a specific task/outcome with the
| understanding of how each component in the platform functions
| and interacts with the others to achieve an end goal."
| https://www.reddit.com/r/LocalLLaMA/comments/1eh6xmq/comment...
|
| A longer answer, with some more details:
|
| If you don't care about which quant you're using and just want
| easy integration with desktop/laptop-based projects, use Ollama.
| If you want to run on mobile, integrate models into your own
| apps or projects natively, don't want to use GGUF, want control
| over quantization, or want to extend a PyTorch-based solution,
| use torchchat.
|
| Right now Ollama (based on llama.cpp) is a faster way to get
| performance on a laptop or desktop, and a number of projects are
| pre-integrated with Ollama thanks to its OpenAI-compatible API.
| It's also more mature, with more fit and polish. That said, the
| commands that make everything easy default to 4-bit quantized
| models, and you have to do extra work to find a GGUF model with
| a higher (or lower) bit quant and load it into Ollama (see the
| sketch below). Also worth noting is that Ollama "containerizes"
| the models on disk, so you can't share them with other projects
| without going through Ollama, which is a hard pass for some
| users and use cases, since duplicating model files on disk
| isn't great.
| https://www.reddit.com/r/LocalLLaMA/comments/1eh6xmq/comment...
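|
| For reference, the "extra work" on the Ollama side is roughly:
| download a GGUF at the quant you want and register it with a
| Modelfile (the file name below is a placeholder):
|
|     # Modelfile
|     FROM ./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
|
|     # then:
|     #   ollama create llama3.1-q8 -f Modelfile
|     #   ollama run llama3.1-q8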
| dagaci wrote:
| If you're running Windows anywhere, then you're better off using
| Ollama, LM Studio, and/or LLamaSharp for coding; these are all
| cross-platform too.
| sunshinesfbay wrote:
| Pretty cool! What are the steps to use these on mobile?
| Stoked about using ollama on my iPhone!
| aklgh wrote:
| A new PyTorch feature. Who knew!
|
| How about making libtorch a first class citizen without crashes
| and memory leaks? What happened to the "one tool, one job"
| philosophy?
|
| As an interesting thought experiment: Should PyTorch be
| integrated into systemd or should systemd be integrated into
| PyTorch? Both seem to absorb everything else like a black hole.
| smhx wrote:
| it's not a new PyTorch feature.
|
| It's just a showcase of existing PyTorch features (including
| libtorch) as an end-to-end example.
|
| On the server side it uses libtorch, and on mobile it uses
| PyTorch's ExecuTorch runtime (which is optimized for edge
| devices).
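|
| The ExecuTorch export path looks roughly like this for a toy
| module (the real torchchat flow adds quantization and delegation
| to backends like XNNPACK, so treat this as a minimal sketch):
|
|     import torch
|     from executorch.exir import to_edge
|
|     class Tiny(torch.nn.Module):
|         def forward(self, x):
|             return torch.nn.functional.relu(x) * 2
|
|     # Export to PyTorch's IR, lower to ExecuTorch's edge dialect,
|     # and serialize a .pte program the on-device runtime can load.
|     exported = torch.export.export(Tiny().eval(), (torch.randn(4),))
|     program = to_edge(exported).to_executorch()
|     with open("tiny.pte", "wb") as f:
|         f.write(program.buffer)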
| BaculumMeumEst wrote:
| Did not know executorch existed! That's so cool! I have it on
| my bucket list to tinker with running LLMs on wearables after
| I'm a little further along in learning, great to see official
| tooling for that!
|
| https://github.com/pytorch/executorch
| sunshinesfbay wrote:
| I think this is not about new PyTorch features, although it
| requires the latest PyTorch and ExecuTorch, which makes me think
| some features in both got extended or optimized for this use
| case?
|
| What makes this cool is that you can use the same model and the
| same library and apply them to server, desktop, laptop, and
| mobile on iOS and Android, with a variety of quantization
| schemes and other features.
|
| Definitely still some rough edges as I'd expect from any
| first software release!
| ipunchghosts wrote:
| I have been using Ollama and am generally not that impressed
| with these models for doing real work. I can't be the only
| person who thinks this.
| diggan wrote:
| Same conclusion here so far. I've tested out various open source
| models, maybe once or twice per month, comparing them against
| GPT-4, and nothing has come close so far. Even closed source
| models seem not to fare very well; so far Claude got the closest
| to GPT-4, but I've yet to find something that could surpass
| GPT-4 for coding help.
|
| Of course, it could be that I've just gotten used to GPT-4 and
| my prompting has been optimized for it, and I try to apply the
| same techniques to other models where those prompts wouldn't
| work as well.
| ekianjo wrote:
| > various open source models
|
| what models did you try? There's a ton of new ones every
| month these days.
| diggan wrote:
| Most recently: Llama-3.1, Codestral, Gemma 2, Mistral NeMo.
| codetrotter wrote:
| Which parameter counts, and which quantization levels?
| wongarsu wrote:
| They won't beat Claude or GPT-4. If you want a model that
| writes code or answers complex questions, use one of those.
| But for many simpler tasks like summarization, sentiment
| analysis, data transformation, text completion, etc., self-
| hosted models are perfectly suited.
|
| And if you work on something where the commercial models are
| trained to refuse answers and lecture the user instead, some
| of the freely available models are much more pleasant to work
| with. With 70B models you even get a decent amount of
| reasoning capability.
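|
| As a concrete example of that kind of narrow task, here's a
| rough sketch of sentiment classification against a self-hosted
| model, assuming an Ollama instance on its default port (any
| local OpenAI-compatible server works much the same way):
|
|     import requests
|
|     prompt = (
|         "Classify the sentiment of this review as positive, "
|         "negative or neutral. Reply with one word only.\n\n"
|         "Review: The battery died after two days."
|     )
|     # Ollama's generate endpoint; stream=False returns one JSON
|     # object with the full completion in the "response" field.
|     r = requests.post(
|         "http://localhost:11434/api/generate",
|         json={"model": "llama3.1", "prompt": prompt,
|               "stream": False},
|         timeout=120,
|     )
|     print(r.json()["response"].strip())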
| bboygravity wrote:
| I wrote an automated form-filling Firefox extension and tested
| it with Llama 3.1 via Ollama. Not perfect, and quite slow, but
| better than any other form fillers I tested.
|
| I also tried hooking it up to Claude, and so far it's flawless
| (I didn't do a lot of testing though).
| Dowwie wrote:
| Can you share what kind of real work you're trying?
| derefr wrote:
| What's your example of "real work"?
|
| Most "well-known-name" open-source ML models are very much
| "base models" -- they are meant to be flexible and generic, so
| that they can be _fine-tuned_ with additional training for
| task-specific purposes.
|
| Mind you, you don't have to do that work yourself. There are
| open-source fine-tunes as well, for all sorts of specific
| purposes, that can be easily found on HuggingFace / found
| linked on applicable subreddits / etc -- but these don't "make
| the news" like the releases of new open-source base models do,
| so they won't be top-of-mind when doing a search for a model to
| solve a task. You have to actively look for them.
|
| Heck, even focusing on the proprietary-model Inference-as-a-
| Service space, it's only really OpenAI that purports to have a
| "general" model that can be set to _every_ task with only
| prompting. All the other proprietary-model Inf-aaS providers
| also sell Fine-Tuning-as-a-Service of their models, because
| they know people will need it.
|
| ---
|
| Also, if you're comparing e.g. ChatGPT-4o (~200b) with a local
| model you can run _on your PC_ (probably 7b, or maybe 13b if
| you have a 4090) then obviously the latter is going to be
| "dumber" -- it's (either literally, or effectively) had 95+% of
| its connections stripped out!
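|
| To put rough numbers on that: weight memory scales linearly
| with parameter count and bits per weight (KV cache and runtime
| overhead come on top), so something like:
|
|     # Back-of-the-envelope VRAM for the weights alone.
|     def weight_gb(params_b: float, bits: float) -> float:
|         return params_b * 1e9 * bits / 8 / 1e9
|
|     for params in (7, 13, 70):
|         for bits in (16, 8, 4):
|             print(f"{params:>3}B @ {bits:>2}-bit ~ "
|                   f"{weight_gb(params, bits):5.1f} GB")
|
| e.g. 13B at 16-bit is ~26 GB (doesn't fit a 24 GB 4090), while
| 13B at 4-bit is ~6.5 GB and fits comfortably.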
|
| For production deployment of an open-source model with "smart
| thinking" requirements (e.g. a customer-support chatbot), the
| best-practice open-source-model approach would be to pay for
| dedicated and/or serverless hosting where the instances have
| direct-attached, dedicated server-class GPUs, which can
| therefore host the _largest_-parameter-size variants of the
| open-source models. Larger-parameter-size open-source models
| fare much better against the proprietary hosted models.
|
| IMHO the models in the "hostable on a PC" parameter-size range
| mainly exist for two use-cases:
|
| * doing local development and testing of LLM-based backend
| systems (Due to the way pruning+quantizing parameters works, a
| smaller spin of a larger model will be _probabilistically_
| similar in behavior to its larger cousin -- giving you the
| "smart" answer some percentage of the time, and a "dumb" answer
| the rest of the time. For iterative development, this is no
| problem -- regenerate responses until it works, and if it never
| does, then you've got the wrong model/prompt.)
|
| * "shrinking" an AI system that _doesn 't_ require so much
| "smart thinking", to decrease its compute requirements and thus
| OpEx. You start with the largest spin of the model; then you
| keep taking it down in size until it stops producing acceptable
| results; and then you take one step back.
|
| The models of this size range _don't_ exist to "prove out" the
| applicability of a model family to a given ML task. You _can_
| do it with them -- especially if there's an existing fine-
| tuned model perfectly suited to the use-case -- but it'll be
| frustrating, because "the absence of evidence is not evidence
| of absence." You won't know whether you've chosen a bad model,
| or your prompt is badly structured, or your prompt is
| impossible for any model, etc.
|
| When proving out a task, test with the largest spin of each
| model you can get your hands on, using e.g. a serverless Inf-
| aaS like Runpod. Once you know the model _family_ can do that
| task to your satisfaction, _then_ pull a local model spin from
| that family for development.
| simonw wrote:
| "There are open-source fine-tunes as well, for all sorts of
| specific purposes"
|
| Have you had good results from any of these? I've not tried a
| model that's been fine-tuned for a specific purpose yet, I've
| just worked with the general purpose ones.
| boringg wrote:
| Can someone explain the use case? Is it so that I can run LLMs
| more readily in the terminal instead of having to use a chat
| interface?
|
| I'm not saying it isn't impressive being able to swap models,
| but I have trouble understanding how this integrates into my
| workflow, and I don't really want to put much effort into
| exploring, given that there are so many things to explore these
| days.
| sunshinesfbay wrote:
| It's an end-to-end solution that supports the same model from
| server (including an OpenAI-compatible API!) to mobile. To the
| extent that you just want to run on one specific platform,
| other solutions might work just as well?
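|
| That means existing OpenAI-client code should work against it;
| a minimal sketch (the base_url, port and model name here are
| placeholders -- check the torchchat docs for the real values):
|
|     from openai import OpenAI
|
|     # Point the standard OpenAI client at a locally running
|     # OpenAI-compatible server instead of api.openai.com.
|     client = OpenAI(base_url="http://localhost:5000/v1",
|                     api_key="not-needed-locally")
|     resp = client.chat.completions.create(
|         model="llama3.1",
|         messages=[{"role": "user",
|                    "content": "Say hello in one sentence."}],
|     )
|     print(resp.choices[0].message.content)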
| suyash wrote:
| This is cool. How can I go about using this with my own data --
| .pdf, .html files, etc.?
| ein0p wrote:
| Selling it as a "chat" is a mistake, IMO. Chatbots require very
| large models with a lot of stored knowledge about the world.
| Small models are useful for narrow tasks, but they are not, and
| will never be, useful for general-domain chat.
___________________________________________________________________
(page generated 2024-08-01 23:01 UTC)