[HN Gopher] Ollama now supports AMD graphics cards
       ___________________________________________________________________
        
       Ollama now supports AMD graphics cards
        
       Author : tosh
       Score  : 383 points
       Date   : 2024-03-15 17:47 UTC (5 hours ago)
        
 (HTM) web link (ollama.com)
 (TXT) w3m dump (ollama.com)
        
       | sofixa wrote:
       | This is great news. The more projects do this, the less of a moat
       | CUDA is, and the less of a competitive advantage Nvidia has.
        
         | anonymous-panda wrote:
         | What does performance look like?
        
           | sevagh wrote:
           | Spoiler alert: not good enough to break CUDA's moat
        
             | qeternity wrote:
             | This is not CUDA's moat. That is on the R&D/training side.
             | 
             | Inference side is partly about performance, but mostly
             | about cost per token.
             | 
             | And given that there has been a ton of standardization
             | around LLaMA architectures, AMD/ROCm can target this much
             | more easily, and still take a nice chunk of the inference
             | market for non-SOTA models.
        
             | bornfreddy wrote:
             | Not sure why you're downvoted, but as far as I've heard AMD
             | cards can't beat 4090 - yet.
             | 
              | Still, I think AMD will catch up with or overtake NVidia
              | in hardware soon, but software is a bigger problem.
              | Hopefully the open-source strategy will pay off for them.
        
               | arein3 wrote:
                | Really hope so, maybe this time it will catch on and
                | last.
                | 
                | Usually when corps open source stuff to get adoption,
                | they squeeze the adopters after they gain enough market
                | share, and the cycle repeats again.
        
               | nerdix wrote:
                | An RTX 4090 is about twice the price of, and 50%-ish
                | faster than, AMD's most expensive consumer card, so I'm
                | not sure anyone really expects that card to ever surpass
                | a 4090.
               | 
               | A 7900 XTX beating a RTX 4080 at inference is probably a
               | more realistic goal though I'm not sure how they compare
               | right now.
        
           | Zambyte wrote:
           | I get 35tps on Mistral:7b-Instruct-Q6_K with my 6650 XT.
        
         | ixaxaar wrote:
         | Hey I did, and sorry for the self promo,
         | 
          | Please check out https://github.com/geniusrise - a tool for
          | running LLMs and other stuff. It behaves like docker compose
          | and works with whatever is supported by the underlying
          | engines:
          | 
          | Huggingface - MPS, CUDA
          | VLLM - CUDA, ROCm
          | llama.cpp, whisper.cpp - CUDA, MPS, ROCm
          | 
          | Also coming up: integration with Spark (TorchDistributor),
          | Kafka and Airflow.
        
       | Symmetry wrote:
       | Given the price of top line NVidia cards, if they can be had at
       | all, there's got to be a lot of effort going on behind the scenes
       | to improve AMD support in various places.
        
       | reilly3000 wrote:
       | I'm curious as to how they pulled this off. OpenCL isn't that
        | common in the wild relative to CUDA. Hopefully it can become
       | robust and widespread soon enough. I personally succumbed to the
       | pressure and spent a relative fortune on a 4090 but wish I had
       | some choice in the matter.
        
         | Apofis wrote:
         | I'm surprised they didn't speak about the implementation at
         | all. Anyone got more intel?
        
           | refulgentis wrote:
            | They're open source and based on llama.cpp so nothing's
            | secret.
           | 
           | My money, looking at nothing, would be on one of the two
           | Vulkan backends added in Jan/Feb.
           | 
           | I continue to be flummoxed by a mostly-programmer-forum
           | treating ollama like a magical new commercial entity breaking
           | new ground.
           | 
           | It's a CLI wrapper around llama.cpp so you don't have to
           | figure out how to compile it
        
             | washadjeffmad wrote:
             | I tried it recently and couldn't figure out why it existed.
              | It's just a very feature-limited app that doesn't require
             | you to know anything or be able to read a model card to "do
             | AI".
             | 
             | And that more or less answered it.
        
               | dartos wrote:
               | It's because most devs nowadays are new devs and probably
               | aren't very familiar with native compilation.
               | 
               | So compiling the correct version of llama.cpp for their
               | hardware is confusing.
               | 
               | Compound that with everyone's relative inexperience with
               | configuring any given model and you have prime grounds
               | for a simple tool to exist.
               | 
               | That's what ollama and their Modelfiles accomplish.
        
               | tracerbulletx wrote:
               | It's just because it's convenient. I wrote a rich text
               | editor front end for llama.cpp and I originally wrote a
               | quick go web server with streaming using the go bindings,
               | but now I just use ollama because it's just simpler and
               | the workflow for pulling down models with their registry
               | and packaging new ones in containers is simpler. Also
               | most people who want to play around with local models
               | aren't developers at all.
        
               | mypalmike wrote:
               | Eh, I've been building native code for decades and hit
               | quite a few roadblocks trying to get llama.cpp building
               | with cuda support on my Ubuntu box. Library version
               | issues and such. Ended up down a rabbit hole related to
               | codenames for the various Nvidia architectures... It's a
               | project on hold for now.
               | 
               | Weirdly, the Python bindings built without issue with
               | pip.
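                | 
                | (For anyone hitting the same wall, the happy path at the
                | time looks roughly like the sketch below; the pain is
                | usually in matching the CUDA toolkit to your driver, not
                | in the build flags themselves. The model path and flags
                | are just examples.)
                | 
                |     # assumes a CUDA toolkit that matches your driver
                |     git clone https://github.com/ggerganov/llama.cpp
                |     cd llama.cpp
                |     make LLAMA_CUBLAS=1
                |     # or: cmake -B build -DLLAMA_CUBLAS=ON && cmake --build build
                |     ./main -ngl 99 -p "Hello" \
                |       -m models/mistral-7b-instruct.Q4_K_M.gguf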
        
               | refulgentis wrote:
               | Edited it out of my original comment because I didn't
               | want to seem ranty/angry/like I have some personal
                | vendetta, as opposed to just being extremely puzzled, but
               | it legit took me months to realize it wasn't a GUI
               | because of how it's discussed on HN, i.e. as key to
               | democratizing, as a large, unique, entity, etc.
               | 
               | Hadn't thought about it recently. After seeing it again
               | here, and being gobsmacked by the # of genuine, earnest,
               | comments assuming there's extensive independent
               | development of large pieces going on in it, I'm going
               | with:
               | 
               | - "The puzzled feeling you have is simply because
               | llama.cpp is a challenge on the best of days, you need to
               | know a lot to get to fully accelerated on ye average
               | MacBook. and technical users don't want a GUI for an LLM,
               | they want a way to call an API, so that's why there isn't
                | content extolling the virtues of GPT4All*. So TL;DR
               | you're old and have been on computer too much :P"
               | 
               | but I legit don't know and still can't figure it out.
               | 
               | * picked them because they're the most recent example of
               | a genuinely democratizing tool that goes far beyond
               | llama.cpp and also makes large contributions back to
               | llama.cpp, ex. GPT4All landed 1 of the 2 vulkan backends
        
           | j33zusjuice wrote:
           | Ahhhh, I see what you did there.
        
           | harwoodr wrote:
           | ROCm: https://github.com/ollama/ollama/commit/6c5ccb11f993ccc
           | 88c47...
        
             | skipants wrote:
             | Another giveaway that it's ROCm is that it doesn't support
             | the 5700 series...
             | 
             | I'm really salty because I "upgraded" to a 5700XT from a
             | Nvidia GTX 1070 and can't do AI on the GPU anymore, purely
             | because the software is unsupported.
             | 
             | But, as a dev, I suppose I should feel some empathy that
             | there's probably some really difficult problem causing
             | 5700XT to be unsupported by ROCm.
        
               | JonChesterfield wrote:
                | I wrote a bunch of OpenMP code on a 5700XT a couple of
                | years ago; if you're building from source it'll probably
                | run fine.
        
         | programmarchy wrote:
            | Apple killed off OpenCL for their platforms when they created
            | Metal, which was disappointing. Sounds like ROCm will keep it
            | alive, but the fragmentation sucks. Gotta support CUDA, OpenCL,
            | and Metal now to be cross-platform.
        
           | jart wrote:
           | What is OpenCL? AMD GPUs support CUDA. It's called HIP. You
           | just need a bunch of #define statements like this:
            | 
            |     #ifndef __HIP__
            |     #include <cuda_fp16.h>
            |     #include <cuda_runtime.h>
            |     #else
            |     #include <hip/hip_fp16.h>
            |     #include <hip/hip_runtime.h>
            |     #define cudaSuccess hipSuccess
            |     #define cudaStream_t hipStream_t
            |     #define cudaGetLastError hipGetLastError
            |     #endif
           | 
           | Then your CUDA code works on AMD.
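            | 
            | (A rough sketch of how that builds, assuming the aliases
            | above sit in a header your .cu file includes; hipcc is the
            | clang-based compiler driver that ships with ROCm, and the
            | file names here are just examples:)
            | 
            |     # NVIDIA: nvcc leaves __HIP__ undefined -> CUDA headers
            |     nvcc -c kernels.cu -o kernels.o
            |     # AMD: hipcc defines __HIP__ -> HIP headers + aliases
            |     hipcc -x hip -c kernels.cu -o kernels.o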
        
             | jiggawatts wrote:
             | Can you explain why nobody knows this trick, for some
             | values of "nobody"?
        
               | jart wrote:
               | No idea. My best guess is their background is in graphics
               | and games rather than machine learning. When CUDA is all
               | you've ever known, you try just a little harder to find a
               | way to keep using it elsewhere.
        
               | wmf wrote:
               | People know; it just hasn't been reliable.
        
               | jart wrote:
               | What's not reliable about it? On Linux hipcc is about as
               | easy to use as gcc. On Windows it's a little janky
               | because hipcc is a perl script and there's no perl
                | interpreter, I'll admit. I'm otherwise happy with it
               | though. It'd be nice if they had a shell script installer
               | like NVIDIA, so I could use an OS that isn't a 2 year old
               | Ubuntu. I own 2 XTX cards but I'm actually switching back
               | to NVIDIA on my main workstation for that reason alone.
               | GPUs shouldn't be choosing winners in the OS world. The
               | lack of a profiler is also a source of frustration. I
               | think the smart thing to do is to develop on NVIDIA and
               | then distribute to AMD. I hope things change though and I
                | plan to continue doing everything I can to support AMD
               | since I badly want to see more balance in this space.
        
               | wmf wrote:
               | The compilation toolchain may be reliable but then you
               | get kernel panics at runtime.
        
               | jart wrote:
               | I've heard geohot is upset about that. I haven't tortured
               | any of my AMD cards enough to run into that issue yet. Do
               | you know how to make it happen?
        
         | moffkalast wrote:
         | OpenCL is as dead as OpenGL and the inference implementations
         | that exist are very unperformant. The only real options are
         | CUDA, ROCm, Vulkan and CPU. And Vulkan is a proper pain too,
          | takes forever to build compute shaders and has to do so for each
         | model. It only makes sense on Intel Arc since there's nothing
         | else there.
        
           | zozbot234 wrote:
           | SYCL is a fairly direct successor to the OpenCL model and is
           | not quite dead, Intel seems to be betting on it more than
           | others.
        
           | mpreda wrote:
           | ROCm includes OpenCL. And it's a very performant OpenCL
           | implementation.
        
           | taminka wrote:
           | why though? except for apple, most vendors still actively
           | support it and newer versions of OpenCL are released...
        
         | karmakaze wrote:
         | It would serve Nvidia right if their insistence on only running
         | CUDA workloads on their hardware results in adoption of
         | ROCm/OpenCL.
        
           | KeplerBoy wrote:
           | OpenCL is fine on Nvidia Hardware. Of course it's a second
           | class citizen next to CUDA, but then again everything is a
           | second class citizen on AMD hardware.
        
           | aseipp wrote:
           | You can use OpenCL just fine on Nvidia, but CUDA is just a
           | superior compute programming model overall (both in features
           | and design.) Pretty much every vendor offers something
           | superior to OpenCL (HIP, OneAPI, etc), because it simply
           | isn't very nice to use.
        
             | karmakaze wrote:
             | I suppose that's about right. The implementors are busy
             | building on a path to profit and much less concerned about
             | any sort-of lock-in or open standards--that comes much
             | later in the cycle.
        
         | shmerl wrote:
          | Maybe Vulkan compute? But yeah, interesting how they did it.
        
       | deadalus wrote:
        | I wish AMD did well on the Stable Diffusion front, because AMD is
        | never greedy on VRAM. The 4060 Ti 16GB (the minimum required for
        | Stable Diffusion in 2024) starts at $450.
       | 
       | AMD with ROCm is decent on Linux but pretty bad on Windows.
        
         | choilive wrote:
         | They bump up VRAM because they can't compete on raw compute.
        
           | risho wrote:
           | it doesn't matter how much compute you have if you don't have
           | enough vram to run the model.
        
             | Zambyte wrote:
             | Exactly. My friend was telling me that I was making a
             | mistake for getting a 7900 XTX to run language models, when
             | the fact of the matter is the cheapest NVIDIA card with 24
             | GB of VRAM is over 50% more expensive than the 7900 XTX.
             | Running a high quality model at like 80 tps is way more
             | important to me than running a way lower quality model at
             | like 120 tps.
        
           | wongarsu wrote:
           | Or rather Nvidia is purposefully restricting VRAM to avoid
            | gaming cards cannibalizing their supremely profitable
           | professional/server cards. AMD has no relevant server cards,
           | so they have no reason to hold back on VRAM in consumer cards
        
           | api wrote:
            | They lag on software a lot more than they lag on silicon.
        
         | Adverblessly wrote:
         | I run A1111, ComfyUI and kohya-ss on an AMD (6900XT which has
         | 16GB, the minimum required for Stable Diffusion in 2024 ;)),
          | though on Linux. Is it a Windows-specific issue for you?
         | 
         | Edit to add: Though apparently I still don't run ollama on AMD
         | since it seems to disagree with my setup.
        
       | simon83 wrote:
        | Does anyone know how the AMD consumer GPU support on Linux has
        | been implemented? It must use something other than ROCm, I
        | assume? Because ROCm only supports the 7900 XTX on Linux[1],
        | while on Windows[2] support goes from the RX 6600 upwards.
       | 
       | [1]:
       | https://rocblas.readthedocs.io/en/rocm-6.0.0/about/compatibi...
       | [2]:
       | https://rocblas.readthedocs.io/en/rocm-6.0.0/about/compatibi...
        
         | Symmetry wrote:
         | The newest release, 6.0.2, supports a number of other cards[1]
         | and in general people are able to get a lot more cards to work
         | than are officially supported. My 7900 XT worked on 6.0.0 for
         | instance.
         | 
         | [1]https://rocm.docs.amd.com/projects/install-on-
         | linux/en/lates...
        
         | bravetraveler wrote:
          | I wouldn't read too much into "support". It's more about
          | business/warranty/promises than about what the hardware can
          | actually do.
         | 
         | I've had a 6900XT since launch and this is the first I'm
         | hearing _" unsupported"_, having played with ROCM plenty over
         | the years with Fedora Linux.
         | 
          | I think, at most, it's taken a couple of key environment
          | variables.
        
           | HarHarVeryFunny wrote:
           | How hard would it be for AMD just to document the levels of
           | support of different cards the way NVIDIA does with their
           | "compute capability" numbers ?!
           | 
           | I'm not sure what is worse from AMD - the ML software support
           | they provide for their cards, or the utterly crap
           | documentation.
           | 
           | How about one page documenting AMD's software stack compared
           | to NVIDIA, one page documenting what ML frameworks support
           | AMD cards, and another documenting "compute capability" type
           | numbers to define the capabilities of different cards.
        
             | londons_explore wrote:
              | It _almost_ looks like they're deliberately trying not to
              | win any market share.
              | 
              | It's as if the CEO is mates with NVidia's CEO and has an
             | unwritten agreement not to try too hard to topple the
             | applecart...
             | 
             | Oh wait... They're cousins!
        
       | KronisLV wrote:
       | Feels like all of this local LLM stuff is definitely pushing
       | people in the direction of getting new hardware, since nothing
       | like RX 570/580 or other older cards sees support.
       | 
       | On one hand, the hardware nowadays is better and more powerful,
       | but on the other, the initial version of CUDA came out in 2007
       | and ROCm in 2016. You'd think that compute on GPUs wouldn't
       | require the latest cards.
        
         | hugozap wrote:
         | I'm a happy user of Mistral on my Mac Air M1.
        
           | isoprophlex wrote:
           | How many gbs of RAM do you have in your M1 machine?
        
             | hugozap wrote:
             | 8gb
        
               | isoprophlex wrote:
                | Thanks, wow, amazing that you can already run a small
                | model with so little RAM. I need to buy a new laptop;
                | guess more than 16 GB on a MacBook isn't really needed.
        
               | SparkyMcUnicorn wrote:
               | I would advise getting as much RAM as you possibly can.
               | You can't upgrade later, so get as much as you can
               | afford.
               | 
               | Mine is 64GB, and my memory pressure goes into the red
               | when running a quantized 70B model with a dozen Chrome
               | tabs open.
        
               | dartos wrote:
               | Mistral is _very_ small when quantized.
               | 
               | I'd still go with 16gbs
        
               | TylerE wrote:
               | I've run LLMs and some of the various image models on my
               | M1 Studio 32GB without issue. Not as fast as my old 3080
                | card, but considering the Mac all-in has about a 5th the
                | power draw, it's a _lot_ closer than I expected. I'm not
               | sure of the exact details but there is clearly some
               | secret sauce that allows it to leverage the onboard NN
               | hardware.
        
               | evilduck wrote:
               | I use several LLM models locally for chat UIs and IDE
               | autocompletions like copilot (continue.dev).
               | 
               | Between Teams, Chrome, VS Code, Outlook, and now LLMs my
               | RAM usage sits around 20-22GB. 16GB will be a bottleneck
               | to utility.
        
           | jonplackett wrote:
           | Is it easy to set this up?
        
             | LoganDark wrote:
             | Super easy. You can just head down to https://lmstudio.ai
             | and pick up an app that lets you play around. It's not
             | particularly advanced, but it works pretty well.
             | 
             | It's mostly optimized for M-series silicon, but it also
             | technically works on Windows, and isn't too difficult to
             | trick into working on Linux either.
        
               | glial wrote:
               | Also, https://jan.ai is open source and worth trying out
               | too.
        
               | LoganDark wrote:
               | Looks super cool, though it seems to be missing a good
               | chunk of features, like the ability to change the prompt
               | format. (Just installed it myself to check out all the
               | options.) All the other missing stuff I can see though is
               | stuff that LM Studio doesn't have either (such as a
               | notebook mode). If it has a good chat mode then that's
               | good enough for most!
        
             | hugozap wrote:
             | It is, it doesn't require any setup.
             | 
             | After installation:
             | 
             | > ollama run mistral:latest
        
         | mysteria wrote:
          | Ollama's backend, llama.cpp, definitely supports those older
         | cards with the OpenCL and Vulkan backends, though performance
         | is worse than ROCm or CUDA. In their Vulkan thread for instance
         | I see people getting it working with Polaris and even Hawaii
         | cards.
         | 
         | https://github.com/ggerganov/llama.cpp/pull/2059
         | 
         | Personally I just run it on CPU and several tokens/s is good
         | enough for my purposes.
        
         | bradley13 wrote:
         | No new hardware needed. I was shocked that Mixtral runs well on
         | my laptop, which has a so-so mobile GPU. Mixtral isn't hugely
         | fast, but definitely good enough!
        
         | superkuh wrote:
          | llama.cpp added first-class support for the RX 580 by
          | implementing the Vulkan backend. There are some issues in older
          | kernel amdgpu code where an LLM process's VRAM is never reloaded
          | if it gets kicked out to GTT (in 5.x kernels), but overall it's
          | much faster than the clBLAST OpenCL implementation.
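          | 
          | (If the build flag hasn't changed, enabling it is roughly the
          | sketch below; it assumes the Vulkan SDK/headers are installed,
          | and the exact flag name may differ between releases:)
          | 
          |     # build llama.cpp with the Vulkan backend
          |     make LLAMA_VULKAN=1
          |     # or: cmake -B build -DLLAMA_VULKAN=ON && cmake --build build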
        
         | jmorgan wrote:
         | The compatibility matrix is quite complex for both AMD and
          | NVIDIA graphics cards, and I completely agree: there is a lot of
          | work to do, but the hope is to gracefully fall back to older
          | cards... they still speed up inference quite a bit when they do
         | work!
        
       | Zambyte wrote:
       | It's pretty funny to see this blog post, when I have been running
       | Ollama on my AMD RX 6650 for weeks :D
       | 
       | They have shipped ROCm containers since 0.1.27 (21 days ago).
       | This blog post seems to be published along with the latest
       | release, 0.1.29. I wonder what they actually changed in this
       | release with regards to AMD support.
       | 
       | Also: see this issue[0] that I made where I worked through
       | running Ollama on an AMD card that they don't "officially"
       | support yet. It's just a matter of setting an environment
       | variable.
       | 
       | [0] https://github.com/ollama/ollama/issues/2870
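        | 
        | (For illustration, on RDNA2 cards like mine the workaround in
        | that issue amounts to overriding the GPU ISA that ROCm reports
        | before starting the server, along these lines:)
        | 
        |     # gfx1032 (RX 6650 XT) masquerading as the supported gfx1030
        |     HSA_OVERRIDE_GFX_VERSION=10.3.0 ollama serve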
       | 
        | Edit: I did notice one change: the starcoder2[1] model works
        | now. Before, it would crash[2].
       | 
       | [1] https://ollama.com/library/starcoder2
       | 
       | [2] https://github.com/ollama/ollama/issues/2953
        
         | throwaway5959 wrote:
         | I mean, it was 21 days ago. What's the difference?
        
           | yjftsjthsd-h wrote:
           | 2 versions, apparently
        
         | mchiang wrote:
         | While the PRs went in slightly earlier, much of the time was
         | spent on testing the integrations, and working with AMD
         | directly to resolve issues.
         | 
         | There were issues that we resolved prior to cutting the
         | release, and many reported by the community as well.
        
       | SushiHippie wrote:
       | Can anyone (maybe ollama contributors) explain to me the
       | relationship between llama.cpp and ollama?
       | 
       | I always thought that ollama basically just was a wrapper (i.e.
       | not much changes to inference code, and only built on top) around
       | llama.cpp, but this makes it seem like it is more than that?
        
         | Zambyte wrote:
         | llama.cpp also supports AMD cards:
         | https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#hi...
        
       | renewiltord wrote:
       | Wow, that's a huge feature. Thank you, guys. By the way, does
        | anyone have a preferred case that can fit 4 AMD 7900 XTXs?
        | There are a lot of motherboards and CPUs that support 128 lanes.
       | It's the physical arrangement that I have trouble with.
        
         | nottorp wrote:
         | Used crypto mining parts not available any more?
        
           | duskwuff wrote:
           | Crypto mining didn't require significant bandwidth to the
           | card. Mining-oriented motherboards typically only provisioned
           | a single lane of PCIe to each card, and often used anemic
           | host CPUs (like Celeron embedded parts).
        
             | renewiltord wrote:
             | Exactly. They'd use PCIe x1 to PCIe x16 risers with power
              | adapters. These require high bandwidth.
        
               | nottorp wrote:
               | Oh. Shows I wasn't into that.
               | 
               | I did once work with a crypto case, but yes, it was one
               | motherboard with a lot of wifis and we still didn't need
               | the pcie lanes.
        
       | kodarna wrote:
       | I wonder why they aren't supporting RX 6750 XT and lower yet, are
       | there architectural differences between these and RX 6800+?
        
         | Zambyte wrote:
         | They don't support it, but it works if you set an environment
         | variable.
         | 
         | https://github.com/ollama/ollama/issues/2870#issuecomment-19...
        
         | slavik81 wrote:
         | Those are Navi 22/23/24 GPUs while the RX 6800+ GPUs are Navi
         | 21. They have different ISAs... however, the ISAs are identical
         | in all but name.
         | 
         | LLVM has recently introduced a unified ISA for all RDNA 2 GPUs
         | (gfx10.3-generic), so the need for the environment variable
         | workaround mentioned in the other comment should eventually
         | disappear.
        
       | latchkey wrote:
       | If anyone wants to run some benchmarks on MI300x, ping me.
        
       | freedomben wrote:
       | I'm thrilled to see support for RX 6800/6800 XT / 6900 XT. I
       | bought one of those for an outrageous amount during the post-
       | covid shortage in hopes that I could use it for ML stuff, and
       | thus far it hasn't been very successful, which is a shame because
       | it's a beast of a card!
       | 
        | Many thanks to the ollama project and llama.cpp!
        
         | mey wrote:
          | Sad to see that the cutoff is just after the 6700 XT, which is
          | what's in my desktop. They indicate more devices are coming;
          | hopefully that includes some of the more modern all-in-one
          | chips with RDNA 2/3 from AMD as well.
        
           | mey wrote:
            | It appears that the cutoff lines up with HIP SDK support
           | from AMD, https://rocm.docs.amd.com/projects/install-on-
           | windows/en/lat...
        
       | rcarmo wrote:
       | Curious to see if this will work on APUs. Have a 7840HS to test,
       | will give it a go ASAP.
        
       | airocker wrote:
       | I heard "Nvidia for LLM today is similar to how Sun Microsystems
       | was for the web"
        
         | api wrote:
         | ... for a very brief period of time until Linux servers and
         | other options caught up.
        
       | rahimnathwani wrote:
       | Here is the commit that added ROCm support to llama.cpp back in
       | August:
       | 
       | https://github.com/ggerganov/llama.cpp/commit/6bbc598a632560...
        
       | tarruda wrote:
       | Does this work with integrated Radeon graphics? If so it might be
       | worth getting one of those Ryzen mini PCs to use as a local LLM
       | server.
        
         | stebalien wrote:
         | Not yet: https://github.com/ollama/ollama/issues/2637
        
       | eclectic29 wrote:
        | I'm not sure why Ollama garners so much attention. It has limited
        | value - it's only used for experimenting with models and cannot
        | support more than 1 model at a time. It's not meant for production
        | deployments. Granted, it makes the experimentation process
        | super easy, but for something that relies on llama.cpp completely
        | and whose main value proposition is easy model management, I'm not
        | sure it deserves the brouhaha people are giving it.
       | 
       | Edit: what do you do after the initial experimentation? you need
       | to deploy these models eventually to production. I'm not even
       | talking about giving credit to llama.cpp, just mentioning that
       | this product is gaining disproportionate attention and kudos
       | compared to the value it delivers. Not denying that it's a great
       | product.
        
         | reustle wrote:
         | Even for just running a model locally, Ollama provided a much
         | simpler "one click install" earlier than most tools. That in
         | itself is worth the support.
        
           | Aka456 wrote:
            | Koboldcpp is also very, very good: plug and play, nice UI,
            | nice API, Vulkan accelerated, and it has an AMD fork...
        
         | crooked-v wrote:
         | > Granted that it makes the experimentation process super easy
         | 
         | That's the answer to your question. It may have less space than
         | a Zune, but the average person doesn't care about technically
         | superior alternatives that are much harder to use.
        
         | davidhariri wrote:
         | As it turns out, making it faster and better to manage things
         | tends to get people's attention. I think it's well deserved.
        
         | vikramkr wrote:
         | It's nice for personal use which is what I think it was built
         | for, has some nice frontend options too. The tooling around it
          | is nice, and there are projects building in RAG etc. I don't
          | think people are intending to deploy services through
          | these tools.
        
         | andrewstuart wrote:
         | Sounds like you are dismissing Ollama as a "toy".
         | 
         | Refer:
         | 
         | https://paulgraham.com/startupideas.html
        
         | nerdix wrote:
         | The answer to your question is:
         | 
         | ollama run mixtral
         | 
         | That's it. You're running a local LLM. I have no clue how to
         | run llama.cpp
         | 
         | I got Stable Diffusion running and I wish there was something
         | like ollama for it. It was painful.
        
         | cbhl wrote:
         | In my opinion, pre-built binaries and an easy-to-use front-end
         | are things that should exist and are valid as a separate
         | project unto themselves (see, e.g., HandBrake vs ffmpeg).
         | 
         | Using the name of the authors or the project you're building on
         | can also read like an endorsement, which is not _necessarily_
         | desirable for the original authors (it can lead to ollama bugs
         | being reported against llama.cpp instead of to the ollama devs
         | and other forms of support request toil). Consider the third
         | clause of BSD 3-Clause for an example used in other projects
         | (although llama.cpp is licensed under MIT).
        
         | airocker wrote:
          | More than one is easy: put it behind a load balancer, with each
          | ollama instance in its own container or on its own port, as
          | sketched below.
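          | 
          | (A rough sketch of that with the official ollama/ollama image;
          | the names, ports, volumes and the exact ROCm image tag are
          | assumptions, just to show the shape of it:)
          | 
          |     docker run -d --name ollama-a -p 11434:11434 \
          |       -v ollama-a:/root/.ollama ollama/ollama
          |     docker run -d --name ollama-b -p 11435:11434 \
          |       -v ollama-b:/root/.ollama ollama/ollama
          |     # for AMD: add --device /dev/kfd --device /dev/dri
          |     # and use the ROCm image variant
          |     # then point the load balancer at :11434 and :11435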
        
         | Karrot_Kream wrote:
         | I mean, it takes something difficult like an LLM and makes it
         | easy to run. It's bound to get attention. If you've tried to
         | get other models like BERT based models to run you'll realize
         | just how big the usability gains are running ollama than
         | anything else in the space.
         | 
         | If the question you're asking is why so many folks are focused
         | on experimentation instead of productionizing these models,
         | then I see where you're coming from. There's the question of
         | how much LLMs are actually being used in prod scenarios right
         | now as opposed to just excited people chucking things at them;
         | that maybe LLMs are more just fun playthings than tools for
         | production. But in my experience as HN has gotten bigger, the
         | number of posters talking about productionizing anything has
         | really gone down. I suspect the userbase has become more
         | broadly "interested in software" rather than "ships production
         | facing code" and the enthusiasm in these comments reflects
         | those interests.
         | 
         | FWIW we use some LLMs in production and we do _not_ use ollama
         | at all. Our prod story is very different than what folks are
          | talking about here and I'd love to have a thread that focuses
         | more on language model prod deployments.
        
           | jart wrote:
           | Well you would be one of the few hundred people on the planet
           | doing that. With local LLMs we're just trying to create a way
           | for everyone else to use AI that doesn't require sharing all
            | their data with them. The first thing everyone asks for, of
            | course, is how to turn the open source local LLMs into their
            | own online service.
        
         | evilduck wrote:
         | I am 100% uninterested in your production deployment of rent
         | seeking behavior for tools and models I can run myself. Ollama
         | empowers me to do more of that easier. That's why it's popular.
        
         | brucethemoose2 wrote:
         | Interestingly, Ollama is not popular at all in the "localllama"
         | community (which also extends to related discords and repos).
         | 
          | And I think that's because of capabilities... Ollama is somewhat
          | restrictive compared to other frontends. I have a litany of
         | reasons I personally wouldn't run it over exui or koboldcpp,
         | both for performance and output quality.
         | 
         | This is a necessity of being stable and one-click though.
        
         | elwebmaster wrote:
         | You are not making any sense. I am running ollama and Open
         | WebUI (which takes care of auth) in production.
        
       | observationist wrote:
        | There's a thing somewhat conspicuous by its absence - why isn't
       | llama.cpp more directly credited and thanked for providing the
       | base technology powering this tool?
       | 
       | All the other cool "run local" software seems to have the
       | appropriate level of credit. You can find llama.cpp references in
       | the code, being set up in a kind of "as is" fashion such that it
       | might be OK as far as MIT licensing goes, but it seems kind of
       | petty to have no shout out or thank you anywhere in the
       | repository or blog or ollama website.
       | 
       | GPT4All - https://gpt4all.io/index.html
       | 
       | LM Studio - https://lmstudio.ai/
       | 
       | Both of these projects credit and attribute appropriately, Ollama
       | seems to bend over backwards so they don't have to?
        
         | jart wrote:
         | ollama has made a lot of nice contributions of their own. It's
         | a good look to give a hat tip to the great work llama.cpp is
         | also doing, but they're strictly speaking not required to do
         | that in their advertising any more than llama.cpp is required
         | to give credit to Google Brain, and I think that's because
         | llama.cpp has pulled off tricks in the execution that Brain
         | never could have accomplished, just as ollama has had great
         | success focusing on things that wouldn't make sense for
         | llama.cpp. Besides everyone who wants to know what's up can
         | read the source code, research papers, etc. then make their own
         | judgements about who's who. It's all in the open.
        
         | dheera wrote:
         | > why isn't llama.cpp more directly credited
         | 
         | Why don't people credit Bjarne Stroustrup every time a piece of
         | .cpp code is released?
        
           | observationist wrote:
           | Honestly, they probably should keep a little Bjarne shrine on
           | their desk and light a little votive candle every time they
           | release to production.
        
             | ronsor wrote:
             | Realistically that's the only way to reduce the amount of
             | bugs in your C++ code.
        
               | jon_richards wrote:
               | Do you want machine spirits? Because that's how you get
               | machine spirits.
        
       | jjice wrote:
       | Just downloaded this and gave it a go. I have no experience with
       | running any local models, but this just worked out of the box on
       | my 7600 on Ubuntu 22. This is fantastic.
        
       ___________________________________________________________________
       (page generated 2024-03-15 23:00 UTC)