[HN Gopher] Torchchat: Chat with LLMs Everywhere
       ___________________________________________________________________
        
       Torchchat: Chat with LLMs Everywhere
        
       Author : constantinum
       Score  : 226 points
       Date   : 2024-08-01 03:48 UTC (19 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | gleenn wrote:
        | This looks awesome. The instructions are basically a one-liner to
        | get a Python program to start up a chat session, and it's
        | optimized for a lot of hardware you can run locally, like Nvidia
        | GPUs or Apple M-series processors. Super cool work bringing this
        | functionality to local apps and making it easy to play with a lot
        | of popular models. Great work.
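        | 
        | For reference, the quick start in the README is roughly the
        | following (exact model aliases and flags may have changed since;
        | check the repo):
        | 
        |     # download a model, then chat with it locally
        |     python3 torchchat.py download llama3.1
        |     python3 torchchat.py chat llama3.1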
        
       | daghamm wrote:
       | Does pytorch have better acceleration on x64 CPUs nowadays?
       | 
       | Last time I played with LLMs on CPU with pytorch you had to
       | replace some stuff with libraries from Intel otherwise your
       | performance would be really bad.
        
         | gleenn wrote:
          | I can't find it again in the docs, but I'm pretty sure it
          | supports MKL, which is at least Intel's fast math library.
          | Better than a stick in the eye. It's certainly faster than a
          | plain CPU path, but almost certainly way slower than anything
          | with massively parallel matrix hardware.
        
         | sva_ wrote:
         | x86_64*
        
       | jiratemplates wrote:
       | looks great
        
       | fbuilesv wrote:
       | I'm not well versed in LLMs, can someone with more experience
       | share how this compares to Ollama (https://ollama.com/)? When
       | would I use this instead?
        
         | jerrygenser wrote:
          | Ollama currently has only one "supported backend", which is
          | llama.cpp. It makes it easy to download and run models on CPU,
          | and it arguably has a more mature server.
          | 
          | Torchchat allows running models on GPU as well.
        
           | darkteflon wrote:
           | Ollama runs on GPUs just fine - on Macs, at least.
        
             | Kelteseth wrote:
              | Works fine on Windows with an AMD 7600 XT.
        
             | amunozo wrote:
              | I use it on Ubuntu and it works fine too.
        
             | ekianjo wrote:
             | it runs on GPUs everywhere. On Linux, on Windows...
        
           | Zambyte wrote:
            | I have been running Ollama on AMD GPUs (support for which
            | came after NVIDIA GPUs) since February. Llama.cpp has
            | supported them even longer.
        
             | tarruda wrote:
              | How well does it run on AMD GPUs these days compared to
              | Nvidia or Apple silicon?
              | 
              | I've been considering buying one of those powerful Ryzen
              | mini PCs to use as an LLM server on my LAN, but I've read
              | before that the AMD backend (ROCm, IIRC) is kinda buggy.
        
               | SushiHippie wrote:
                | I have an RX 7900 XTX and never had AMD-specific
                | issues, except that I needed to set some environment
                | variable.
               | 
               | But it seems like integrated GPUs are not supported
               | 
               | https://github.com/ollama/ollama/issues/2637
        
         | JackYoustra wrote:
          | Probably if you need any of the more esoteric features that
          | PyTorch supports. FlashAttention-2, for example, was supported
          | way earlier in PyTorch than in llama.cpp, so if FlashAttention-3
          | follows the same path, it'll probably make more sense to use
          | this when targeting Nvidia GPUs.
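          | 
          | As a rough sketch of what that looks like in plain PyTorch (the
          | SDPA backend-selection API in recent releases; not torchchat-
          | specific, and kernel availability depends on your GPU, dtype,
          | and shapes):
          | 
          |     import torch
          |     import torch.nn.functional as F
          |     from torch.nn.attention import SDPBackend, sdpa_kernel
          | 
          |     # (batch, heads, seq_len, head_dim), fp16 on CUDA as
          |     # FlashAttention expects
          |     shape = (1, 8, 128, 64)
          |     q, k, v = (torch.randn(shape, device="cuda",
          |                            dtype=torch.float16) for _ in range(3))
          | 
          |     # restrict SDPA to the FlashAttention kernel; this errors
          |     # out if that backend can't handle the inputs
          |     with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
          |         out = F.scaled_dot_product_attention(q, k, v,
          |                                              is_causal=True)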
        
           | sunshinesfbay wrote:
            | It would appear that FlashAttention-3 already exists for
            | PyTorch, based on this joint blog post from Nvidia,
            | Together.ai and Princeton about enabling it in PyTorch:
            | https://pytorch.org/blog/flashattention-3/
        
             | JackYoustra wrote:
             | Right - my point about "follows the same path" mostly
             | revolves around llama.cpp's latency in adopting it.
        
         | Star_Ship_1010 wrote:
          | The best answer to this is from Reddit:
         | 
         | "how does a smart car compare to a ford f150? its different in
         | its intent and intended audience.
         | 
         | Ollama is someone who goes to walmart and buys a $100 huffy
         | mountain bike because they heard bikes are cool. Torchchat is
         | someone who built a mountain bike out of high quality
         | components chosen for a specific task/outcome with the
         | understanding of how each component in the platform functions
         | and interacts with the others to achieve an end goal."
         | https://www.reddit.com/r/LocalLLaMA/comments/1eh6xmq/comment...
         | 
          | A longer answer, with some more details:
          | 
          | If you don't care which quant you're using and just want easy
          | integration with desktop/laptop-based projects, use Ollama. If
          | you want to run on mobile, integrate into your own apps or
          | projects natively, don't want to use GGUF, want control over
          | quantization, or want to extend a PyTorch-based solution, use
          | torchchat.
         | 
          | Right now Ollama (based on llama.cpp) is a faster way to get
          | performance on a laptop or desktop, and a number of projects
          | are pre-integrated with Ollama thanks to the OpenAI API spec.
          | It's also more mature, with more fit and polish. That said, the
          | commands that make everything easy use 4-bit quant models, and
          | you have to do extra work to find a GGUF model with a higher
          | (or lower) bit quant and load it into Ollama. Also worth
          | noting: Ollama "containerizes" the models on disk, so you can't
          | share them with other projects without going through Ollama,
          | which is a hard pass for some users and use cases since
          | duplicating model files on disk isn't great.
         | https://www.reddit.com/r/LocalLLaMA/comments/1eh6xmq/comment...
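          | 
          | On the quantization point: torchchat exposes quantization as a
          | flag that takes a JSON config, so you pick the scheme at run or
          | export time. A rough sketch (the config below is illustrative;
          | the exact schema and supported modes are in the repo's
          | quantization docs):
          | 
          |     # generate with 4-bit grouped weight quantization
          |     python3 torchchat.py generate llama3.1 \
          |         --quantize '{"linear:int4": {"groupsize": 256}}' \
          |         --prompt "Hello"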
        
         | dagaci wrote:
          | If you're running Windows anywhere, then you're better off
          | using Ollama, LM Studio, and/or LLamaSharp for coding; these
          | are all cross-platform too.
        
           | sunshinesfbay wrote:
           | Pretty cool! What are the steps to use these on mobile?
           | Stoked about using ollama on my iPhone!
        
       | aklgh wrote:
       | A new PyTorch feature. Who knew!
       | 
       | How about making libtorch a first class citizen without crashes
       | and memory leaks? What happened to the "one tool, one job"
       | philosophy?
       | 
       | As an interesting thought experiment: Should PyTorch be
       | integrated into systemd or should systemd be integrated into
       | PyTorch? Both seem to absorb everything else like a black hole.
        
         | smhx wrote:
         | it's not a new PyTorch feature.
         | 
         | It's just a showcase of existing PyTorch features (including
         | libtorch) as an end-to-end example.
         | 
          | On the server side it uses libtorch, and on mobile it uses
          | PyTorch's ExecuTorch runtime (which is optimized for edge
          | devices).
        
           | BaculumMeumEst wrote:
           | Did not know executorch existed! That's so cool! I have it on
           | my bucket list to tinker with running LLMs on wearables after
           | I'm a little further along in learning, great to see official
           | tooling for that!
           | 
           | https://github.com/pytorch/executorch
        
           | sunshinesfbay wrote:
            | I think this is not about new PyTorch features, although it
            | requires the latest PyTorch and ExecuTorch, which makes me
            | think some features in PyTorch and ExecuTorch got extended or
            | optimized for this use case?
            | 
            | What makes this cool is that you can use the same model and
            | the same library and apply them to server, desktop, laptop,
            | and mobile on iOS and Android, with a variety of quantization
            | schemes and other features.
           | 
           | Definitely still some rough edges as I'd expect from any
           | first software release!
        
       | ipunchghosts wrote:
        | I have been using Ollama and am generally not that impressed
        | with these models for doing real work. I can't be the only
        | person who thinks this.
        
         | diggan wrote:
          | Same conclusion here so far. I've tested out various open
          | source models, maybe once or twice per month, comparing them
          | against GPT-4, and nothing has come close so far. Even closed
          | source models seem to not fare very well; so far Claude has
          | probably come closest to GPT-4, but I've yet to find something
          | that could surpass GPT-4 for coding help.
          | 
          | Of course, it could be that I've just gotten used to GPT-4 and
          | my prompting has been optimized for it, and I'm applying the
          | same techniques to other models where those prompts don't work
          | as well.
        
           | ekianjo wrote:
           | > various open source models
           | 
           | what models did you try? There's a ton of new ones every
           | month these days.
        
             | diggan wrote:
             | Most recently: Llama-3.1, Codestral, Gemma 2, Mistral NeMo.
        
               | codetrotter wrote:
               | Which parameter counts, and which quantization levels?
        
           | wongarsu wrote:
           | They won't beat Claude or GPT-4. If you want a model that
           | writes code or answers complex questions use one of those.
           | But for many simpler tasks like summarization, sentiment
           | analysis, data transformation, text completion, etc, self-
           | hosted models are perfectly suited.
           | 
           | And if you work on something where the commercial models are
           | trained to refuse answers and lecture the user instead, some
           | of the freely available models are much more pleasant to work
            | with. With 70B models you even get a decent amount of
            | reasoning capability.
        
         | bboygravity wrote:
          | I wrote an automated form-filling Firefox extension and tested
          | it with Llama 3.1 (via Ollama). Not perfect, and quite slow,
          | but better than any other form fillers I've tested.
          | 
          | I also tried hooking it up to Claude, and so far it's flawless
          | (I didn't do a lot of testing, though).
        
         | Dowwie wrote:
         | Can you share what kind of real work you're trying?
        
         | derefr wrote:
         | What's your example of "real work"?
         | 
         | Most "well-known-name" open-source ML models, are very much
         | "base models" -- they are meant to be flexible and generic, so
         | that they can be _fine-tuned_ with additional training for
         | task-specific purposes.
         | 
         | Mind you, you don't have to do that work yourself. There are
         | open-source fine-tunes as well, for all sorts of specific
         | purposes, that can be easily found on HuggingFace / found
         | linked on applicable subreddits / etc -- but these don't "make
         | the news" like the releases of new open-source base models do,
         | so they won't be top-of-mind when doing a search for a model to
         | solve a task. You have to actively look for them.
         | 
         | Heck, even focusing on the proprietary-model Inference-as-a-
         | Service space, it's only really OpenAI that purports to have a
         | "general" model that can be set to _every_ task with only
         | prompting. All the other proprietary-model Inf-aaS providers
         | also sell Fine-Tuning-as-a-Service of their models, because
         | they know people will need it.
         | 
         | ---
         | 
         | Also, if you're comparing e.g. ChatGPT-4o (~200b) with a local
         | model you can run _on your PC_ (probably 7b, or maybe 13b if
         | you have a 4090) then obviously the latter is going to be
         | "dumber" -- it's (either literally, or effectively) had 95+% of
         | its connections stripped out!
         | 
         | For production deployment of an open-source model with "smart
         | thinking" requirements (e.g. a customer-support chatbot), the
         | best-practice open-source-model approach would be to pay for
         | dedicated and/or serverless hosting where the instances have
          | direct-attached dedicated server-class GPUs, which can
          | therefore host the _largest_-parameter-size variants of the
         | open-source models. Larger-parameter-size open-source models
         | fare much better against the proprietary hosted models.
         | 
         | IMHO the models in the "hostable on a PC" parameter-size range,
         | mainly exist for two use-cases:
         | 
         | * doing local development and testing of LLM-based backend
         | systems (Due to the way pruning+quantizing parameters works, a
         | smaller spin of a larger model will be _probabilistically_
         | similar in behavior to its larger cousin -- giving you the
         | "smart" answer some percentage of the time, and a "dumb" answer
         | the rest of the time. For iterative development, this is no
         | problem -- regenerate responses until it works, and if it never
         | does, then you've got the wrong model/prompt.)
         | 
         | * "shrinking" an AI system that _doesn 't_ require so much
         | "smart thinking", to decrease its compute requirements and thus
         | OpEx. You start with the largest spin of the model; then you
         | keep taking it down in size until it stops producing acceptable
         | results; and then you take one step back.
         | 
          | The models of this size-range _don't_ exist to "prove out" the
          | applicability of a model family to a given ML task. You _can_
          | do it with them -- especially if there's an existing fine-
         | tuned model perfectly suited to the use-case -- but it'll be
         | frustrating, because "the absence of evidence is not evidence
         | of absence." You won't know whether you've chosen a bad model,
         | or your prompt is badly structured, or your prompt is
         | impossible for any model, etc.
         | 
         | When proving out a task, test with the largest spin of each
         | model you can get your hands on, using e.g. a serverless Inf-
         | aaS like Runpod. Once you know the model _family_ can do that
         | task to your satisfaction, _then_ pull a local model spin from
         | that family for development.
        
           | simonw wrote:
           | "There are open-source fine-tunes as well, for all sorts of
           | specific purposes"
           | 
           | Have you had good results from any of these? I've not tried a
           | model that's been fine-tuned for a specific purpose yet, I've
           | just worked with the general purpose ones.
        
       | boringg wrote:
       | Can someone explain the use case? Is it so that I can run LLMs
       | more readily in terminal instead of having to use a chat
       | interface?
       | 
       | I'm not saying it isn't impressive being able to swap but I have
       | trouble understanding how this integrates into my workflow and I
       | don't really want to put much effort into exploring given that
       | there are so many things to explore these days.
        
         | sunshinesfbay wrote:
          | It's an end-to-end solution that supports the same model from
          | server (including an OpenAI-compatible API!) to mobile. To the
          | extent that you just want to run on one specific platform,
          | other solutions might work just as well?
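          | 
          | For the server piece, a minimal sketch of calling it from
          | Python, assuming you've started it with something like
          | "python3 torchchat.py server llama3.1" (the host, port, and
          | model name below are assumptions; check the README for the
          | actual defaults):
          | 
          |     import requests
          | 
          |     # hypothetical local endpoint; point this at wherever the
          |     # torchchat server is actually listening
          |     resp = requests.post(
          |         "http://127.0.0.1:5000/v1/chat/completions",
          |         json={
          |             "model": "llama3.1",
          |             "messages": [{"role": "user", "content": "Hello!"}],
          |             "max_tokens": 64,
          |         },
          |     )
          |     print(resp.json()["choices"][0]["message"]["content"])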
        
       | suyash wrote:
       | This is cool, how can I go about using this for my own dataset -
       | .pdf, .html files etc?
        
       | ein0p wrote:
       | Selling it as a "chat" is a mistake imo. Chatbots require very
       | large models with a lot of stored knowledge about the world.
        | Small models are useful for narrow tasks, but they are not, and
        | will never be, useful for general-domain chat.
        
       ___________________________________________________________________
       (page generated 2024-08-01 23:01 UTC)