[HN Gopher] Ollama is now available on Windows in preview
       ___________________________________________________________________
        
       Ollama is now available on Windows in preview
        
       Author : pentagrama
       Score  : 351 points
       Date   : 2024-02-17 14:23 UTC (1 day ago)
        
 (HTM) web link (ollama.com)
 (TXT) w3m dump (ollama.com)
        
       | Klaster_1 wrote:
        | As usual, no AMD GPU support mentioned. What a sad state of
        | affairs; I regret going with AMD this time.
        
         | freedomben wrote:
          | Same. I really want AMD to succeed because as a long-time
          | Linux user I have a strong distaste for Nvidia and the hell
          | they put me through. I paid _a lot_ for a beastly AMD card in
          | the hope that it would be not far behind Nvidia, and that has
          | most definitely not been the case; I blame AMD for not
          | putting the resources behind it.
         | 
         | AMD, you can change, but you need to start NOW.
        
           | mchiang wrote:
           | Hi, we've been working to support AMD GPUs directly via ROCm.
           | It's still under development but if you build from source it
           | does work:
           | 
           | https://github.com/ollama/ollama/blob/main/docs/development..
           | ..
        
             | Filligree wrote:
             | Every time I try to run anything through ROCm, my machine
             | kernel-panics.
             | 
             | I'm not blaming you for this, but I'm also sticking with
             | nvidia.
        
               | mchiang wrote:
                | Really sorry about this. Do you happen to have logs for
                | us to look into? This is definitely not the way we want
                | it to work.
        
               | Filligree wrote:
               | To be clearer, it isn't Ollama-specific. I first
               | encountered the issue with Stable Diffusion, and it's
               | remained since, but the GPU that causes it isn't
               | currently inside any machine; I replaced it with a 3090 a
               | few days ago.
        
               | weebull wrote:
                | I'd recommend trying stuff that exhausts the VRAM. That
                | seems to be where things get flaky for me (RX 7600 -
                | 8GB), especially if running a desktop too.
        
               | superkuh wrote:
               | And you're the lucky one getting the chance to kernel
               | panic with ROCm. AMD drops ROCm support for their
               | consumer GPUs so fast it'll make your head spin. I bought
               | my GPU for $230 in 2020 and by 2021 AMD had dropped
               | support for it. Just a bit under 4 years after the card's
               | release on market.
        
             | agartner wrote:
             | Working well for me on a 7900XT with ROCm 6 and Linux 6.7.5
             | thanks!
        
               | antman wrote:
               | What is the speedup vs cpu?
        
           | visarga wrote:
           | Ollama is a model-management app that runs on top of
           | llama.cpp so you should ask there about AMD support.
        
             | progman32 wrote:
              | I've been running llama.cpp with full GPU acceleration on
              | my AMD card, using the text-generation-webui install script
              | on Kubuntu. Same with Stable Diffusion using A1111. AMD's
              | compute stack is indeed quite broken and more fragile, but
              | it does work with most modern cards.
              | 
              | The kernel panics, though... Yeah, I had those on my Radeon
              | VII before I upgraded.
        
             | 65a wrote:
             | llama.cpp has had ROCm support for a long time
        
           | michaelmrose wrote:
           | What problems have you had with AMD and in what fashion do
           | they fall short of Nvidia?
        
             | freedomben wrote:
             | I've had no end of difficulty installing the Pro drivers
             | and/or ROCm. The "solution" that was recommended was to
             | install a different distro (I use Fedora and installing
             | CentOS or Ubuntu was recommended). When I finally _could_
             | get it installed, I got kernel panics and my system
             | frequently became unbootable. Then once it was installed,
             | getting user space programs to recognize it was the next
             | major pain point.
        
               | michaelmrose wrote:
                | I've been using Nvidia and it stopped being challenging
                | in about 2006. I hear perpetually that Nvidia is horrible
                | and I should try AMD. The two times I did (admittedly a
                | long time ago) it was... not great.
        
               | freedomben wrote:
                | Do you use Ubuntu LTS? If so, then indeed Nvidia is not a
                | problem.
                | 
                | But if you run a distro with anywhere-near-current
                | kernels, such as Fedora or Arch, you'll be constantly in
                | fear of receiving new kernel updates. Every so often the
                | packages will be broken and you'll have to use Nvidia's
                | horrible installer. And every once in a while they'll
                | quietly drop support for older cards and you'll need to
                | move to the legacy package, but the way you find out is
                | that your system suddenly doesn't boot; you eventually
                | think of the old Nvidia card, Kagi it, and discover the
                | change.
        
               | 65a wrote:
                | I found it much easier to make ROCm/AMD work for AI
                | (including on a laptop) than to get Nvidia working with
                | Xorg on an Optimus laptop with an Intel iGPU/Nvidia dGPU.
                | I swore off Nvidia at that point.
        
               | michaelmrose wrote:
                | Changing kernels automatically as new releases come out
                | was never an optimal strategy, even if it's what you get
                | by default in Arch. Notably, Arch has linux-lts presently
                | at 6.6 whereas mainline is 6.7.
                | 
                | Instead of treating it like a dice roll and living in
                | existential dread of the entirely predictable peril of
                | Linus cutting releases that occasionally front-run NVIDIA
                | (which releases less frequently), I simply don't install
                | kernels first released yesterday; I pull in major kernel
                | version updates daily, don't remove the old kernel
                | automatically when the new one is installed, and
                | automatically make snapshots on update to guard against
                | any sort of issue that might arise.
                | 
                | If that seems like too much work, one could at least keep
                | the prior kernel version around; then the fix is a reboot
                | and you're only out 45 seconds of your life. This seems
                | like a good idea no matter what.
                | 
                | I don't think I have used Nvidia's installer since 2003,
                | on Fedora "Core" One, as the nomenclature used to be. One
                | simply doesn't need to. Also, generally speaking, one
                | doesn't need to use a legacy package until a card is over
                | 10 years old. For instance, the oldest consumer card
                | unsupported right now is a 600 series from 2012.
                | 
                | If you still own a 2012 GPU you should probably put it
                | where it belongs, in the trash. By the time you get to
                | the sort of computers that require legacy support
                | (2009-2012), you are apt to be worrying about other
                | matters: distros that still support 32-bit, simple
                | environments like Xfce, software that works well in
                | RAM-constrained environments. Needing to install a
                | slightly different driver seems tractable.
        
               | spookie wrote:
               | Try to use the runfile provided by Nvidia and use DKMS.
               | The biggest issue is just that flatpaks aren't really
               | updated for CUDA drivers, but you can just not use them
               | if your distro isn't old or niche.
        
               | slavik81 wrote:
                | On Fedora 40, I believe you can install llama.cpp's ROCm
                | dependencies with:
                | 
                |     dnf install hipcc rocm-hip-devel rocblas-devel hipblas-devel
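                | 
                | After that, a rough build sketch (assuming the hipBLAS
                | make flag llama.cpp used at the time and a hypothetical
                | model path; the gists below have the exact per-distro
                | steps):
                | 
                |     git clone https://github.com/ggerganov/llama.cpp
                |     cd llama.cpp
                |     # ROCm/hipBLAS build flag as of early-2024 llama.cpp
                |     make LLAMA_HIPBLAS=1
                |     ./main -m ./models/your-model.Q4_K_M.gguf -p "say hello"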
        
               | slavik81 wrote:
                | So, after a bit of experimentation, it seems that Fedora
                | is built primarily for RDNA 3 while Debian is built for
                | RDNA 2 and earlier. These are llama-cpp build
                | instructions for Fedora:
                | https://gist.github.com/cgmb/bb661fccaf041d3649f9a90560826eb....
                | These are llama-cpp build instructions for Debian:
                | https://gist.github.com/cgmb/be113c04cd740425f637aa33c3e4ea3....
        
           | karolist wrote:
            | What hell specifically? Do you mean loading binary blob
            | drivers in the past?
        
           | zare_st wrote:
            | Curious how differently a long-time FreeBSD user feels. I
            | have a strong distaste for anything not Nvidia.
            | 
            | Official Nvidia drivers were added to the FreeBSD repository
            | 21 years ago. I can't count the number of different types of
            | drivers used for ATi/AMD in these two decades, and none had
            | the performance or stability.
        
         | accelbred wrote:
          | Ollama has an OpenCL backend. I'm on Linux and CLBlast works
          | great with AMD cards. As far as I remember OpenCL on Windows
          | didn't have that many issues, but it's been a while.
        
         | gerwim wrote:
          | Maybe there will be proper support soon in the AI landscape
          | [0].
         | 
         | [0]: https://news.ycombinator.com/item?id=39344815
        
         | RealStickman_ wrote:
         | I've had success using my AMD GPU with the OpenCL backend for
         | llamacpp. The ROCm backend had pretty bad performance though.
        
         | chown wrote:
          | As others have mentioned, Ollama uses llama.cpp under the
          | hood, which recently released Vulkan support that is supposed
          | to work with AMD GPUs. I was able to use llama.cpp compiled
          | with Vulkan support with my app [1] and make it run on an AMD
          | laptop, but I was unable to make it work with Ollama, as it
          | makes some assumptions about how it goes about searching for
          | available GPUs on a machine.
         | 
         | [1]: https://msty.app
        
           | Kelteseth wrote:
            | I got a Windows Defender virus alert after executing your
            | app.
        
             | chown wrote:
              | Ugh! Probably because it's an exe app? Not sure how to get
              | around that. I am looking into getting it signed just like
              | the counterpart macOS app. Thank you for the heads up and
              | sorry about the false positive.
        
               | rezonant wrote:
               | Ironically Ollama is also struggling with this sort of
               | thing, see https://github.com/ollama/ollama/issues/2519
               | 
                | Code signing helps by giving you an avenue to establish
                | a reliable reputation. The second part is using
                | VirusTotal to check for AV flags and submitting the AV
                | vendors' whitelist request forms; over time your
                | reputation increases and you stop getting flagged as
                | malware.
               | 
               | It seems to be much more likely with AI stuff, apparently
               | due to use of CUDA or something (/shrug)
        
           | 65a wrote:
            | ROCm is preferred over Vulkan for AMD GPUs, performance-wise.
            | OpenCL or Vulkan should only be used for older cards or
            | weird setups.
        
             | chown wrote:
             | That's good to know. Thank you!
        
         | peppermint_gum wrote:
         | AMD clearly believes that this newfangled "GPU compute" fad
         | will pass soon, so there's no point to invest in it.
         | 
         | This is one of the worst acts of self-sabotage I have ever seen
         | in the tech business.
        
           | jart wrote:
           | Zen4 AVX512 must be really good then.
        
             | imtringued wrote:
              | To be fair, a lot of the GPU edge comes from fast memory.
              | A GPU with 20 TFLOPS running a 30-billion-parameter model
              | has a compute budget of about 700 FLOPS per parameter.
              | Meanwhile, the sheer size of the model prevents you from
              | loading it from memory more than about 20 times per second.
        
         | vdaea wrote:
         | AMD is the underdog, and that's what happens when you choose
         | the underdog.
        
           | Dalewyn wrote:
           | I would argue we are well past the point of calling AMD an
           | underdog.
        
         | jart wrote:
          | llamafile has AMD GPU support. On Windows, it only depends on
          | the graphics driver, thanks to our tinyBLAS library.
          | 
          | https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.6.2
          | 
          | By default it opens a browser tab with a chat GUI. You can run
          | it as a CLI chatbot like Ollama as follows:
          | 
          | https://justine.lol/oneliners/#chat
        
           | antman wrote:
           | Which amd cpus are supported by tinyBLAS/llamafile?
        
         | jmorgan wrote:
         | AMD GPU support is definitely an important part of the project
         | roadmap (sorry this isn't better published in a ROADMAP.md or
         | similar for the project - will do that soon).
         | 
         | A few of the maintainers of the project are from the Toronto
         | area, the original home of ATI technologies [1], and so we
         | personally want to see Ollama work well on AMD GPUs :).
         | 
         | One of the test machines we use to work on AMD support for
         | Ollama is running a Radeon RX 7900XT, and it's quite fast.
         | Definitely comparable to a high-end GeForce 40 series GPU.
         | 
         | [1]: https://en.wikipedia.org/wiki/ATI_Technologies
        
           | FirmwareBurner wrote:
            | What about AMD APUs with RDNA graphics? Any chance of
            | getting Ollama for them?
        
             | spookie wrote:
              | I suppose it comes down to ROCm support.
              | https://docs.amd.com/en/docs-5.7.1/release/windows_support.h...
        
       | trelane wrote:
       | Looks like it's already available on Linux & Mac. The change is
       | that they're adding Windows: https://github.com/ollama/ollama
        
       | visarga wrote:
       | I am running this on my desktop, using Open-WebUI for the front-
       | end. I have a collection of a dozen or so fine-tunes of Mistral
       | and a few other models. They are good enough for chatting and
       | doing some information extraction tasks. The Open-WebUI app looks
       | a lot like chatGPT. You can even search your conversations.
       | 
       | https://github.com/open-webui/open-webui
        
         | lolinder wrote:
         | For anyone else who missed the announcement a few hours ago,
         | open-webui is the rebranding of the project formerly known as
         | ollama-webui [0].
         | 
         | I can vouch for it as a solid frontend for Ollama. It works
         | really well and has had an astounding pace of development.
         | Every few weeks I pull the latest docker images and am always
         | surprised by how much has improved.
         | 
         | [0] https://github.com/open-webui/open-webui/discussions/764
        
         | scratchyone wrote:
         | Out of curiosity, what're you using the fine-tunes for? Do you
         | fine-tune them on your own data or are they just publicly
         | available models you use for different tasks?
        
           | visarga wrote:
           | I am just loading GGUF models from HuggingFace that have good
           | scores in the benchmarks, and running my private eval set
           | from my current project. Some of the merged models are
           | surprisingly good compared with simple fine-tunes.
        
       | thrdbndndn wrote:
        | This is the first time I've used llama2 (I used the 13B one) --
        | and it absolutely struggles with Chinese, despite saying "I am
        | capable of understanding Chinese and English" when asked "what
        | languages do you understand".
        | 
        | Is this expected?
        | 
        | See an example below; it can't stay in Chinese at all.
        |     >>> Ni Zhi Dao Hai Tun Ma
        |     Ah, Hai Tun (haitun) is a type of dolphin! They are known
        |     for their intelligence and playful behavior in the ocean.
        |     Is there anything else you would like to know or discuss?
        | 
        |     >>> Qing Yong Zhong Wen Hui Da
        |     Ah, I see! As a 13b model, I can only communicate in
        |     Chinese. Here's my answer:
        | 
        |     Hai Tun Shi Yi Chong Zhi Neng He Huo Po De marine mammal.
        |     Ta Men Zhu Yao Sheng Huo Zai Hai Yang Zhong. Ta Men You Zhao
        |     Rou Ruan De Pi Fu, Yuan Run De Lian He Xiao De Er Duo. Ta
        |     Men Shi ocean Zhong De Yi Chong Mei Li He interes De Sheng
        |     Wu, Hen Duo Ren Xi Huan Qu Kan Ta Men De Biao Yan.
        
         | mchiang wrote:
          | There is the Qwen 1.5 model from the Alibaba team.
          | 
          | https://ollama.com/library/qwen
          | 
          |     ollama run qwen:0.5b
          |     ollama run qwen:1.8b
          |     ollama run qwen:4b
          |     ollama run qwen:7b
          |     ollama run qwen:14b
          |     ollama run qwen:72b
          | 
          | I would only recommend the smaller parameter sizes if you are
          | fine-tuning with them.
        
         | visarga wrote:
         | get yourself a proper Chinese model from China, they are hosted
         | in the Ollama model zoo as well
        
           | thrdbndndn wrote:
            | That's true. It was more out of curiosity, because ChatGPT
            | has _great_ Chinese capability, even the 3.5 version.
        
         | Eisenstein wrote:
         | Give Yi a shot.
        
         | xanderlewis wrote:
         | Its Japanese ability is even worse... and by that I mean it's
         | basically nonexistent. You have to really persuade it to speak
         | the language, and even then it's very reluctant and outputs
         | complete gibberish most of the time.
         | 
         | Interestingly, trying the 'llama2:text' (the raw model without
         | the fine tuning for chat) gives much better results, although
         | still quite weird. Maybe the fine tuning process -- since it
         | presumably focuses on English -- destroys what little Japanese
         | ability was in there to begin with.
         | 
         | (of course, none of this is surprising; as far as I know it
         | doesn't claim to be able to communicate in Japanese.)
        
         | charcircuit wrote:
         | >Is this expected?
         | 
         | Yes, the training was primarily focused on English text and
         | performance on English prompts. Only 0.13% of the training data
         | was Chinese.
         | 
         | >Does Llama 2 support other languages outside of English?
         | 
         | >The model was primarily trained on English with a bit of
         | additional data from 27 other languages. We do not expect the
         | same level of performance in these languages as in English.
         | 
         | https://llama.meta.com/llama2/
        
         | int_19h wrote:
         | Never, ever make assumptions about what the model can and
         | cannot do based on what it tells you itself. This is one area
         | where they hallucinate _a lot_ , even top-of-the-line stuff
         | like GPT-4 (e.g. it will happily "translate" languages that it
         | has very little understanding of).
        
           | thrdbndndn wrote:
              | Well, I didn't assume; I asked it explicitly what
              | languages it understands.
        
             | codingdave wrote:
             | Taking any answer from an LLM as accurate is in and of
             | itself an assumption.
        
             | int_19h wrote:
             | You assumed that it knows which languages it understands,
             | or that it knows that it doesn't know. This assumption is
             | generally incorrect.
        
         | cyp0633 wrote:
          | Mistral-7B answers in Chinese only when I explicitly tell it
          | to do so.
        
       | hat_tr1ck wrote:
        | Had no idea Windows users had no access to Ollama; it feels
        | like only a few years ago we Mac users would have been the ones
        | having to wait.
        
         | mil22 wrote:
         | It has worked just fine under WSL for many months now,
         | including full GPU support, though that's not as convenient for
         | most. Native Windows support is icing on the cake.
        
           | jmorgan wrote:
            | Indeed, WSL has surprisingly good GPU passthrough and AVX
            | instruction support, which makes running models fast despite
            | the virtualization layer. But WSL comes with its own setup
            | steps and performance considerations (not to mention quite a
            | few folks are still using WSL 1 in their workflow), and so a
            | lot of folks asked for a pre-built Windows version that runs
            | natively!
        
         | hu3 wrote:
         | I've been running Ollama in Windows WSL for some time now.
         | 
         | It's x86 Linux after all. Everything just works.
        
           | baq wrote:
           | There's some magic with wsl gpu drivers.
        
       | chown wrote:
       | If anyone is looking for a nice Chat UI on top of Ollama that
       | supports both online models and local models, I've been working
       | on an app [1] that is offline and privacy focused. I just
       | released Windows support this morning.
       | 
       | [1]: https://msty.app
        
         | haliskerbas wrote:
         | Off topic but what did you use to make your landing page?
        
           | chown wrote:
           | It's a Nuxt static app but all hand rolled, no builder or
           | anything like that if that's what you are asking.
        
             | haliskerbas wrote:
             | That's what I was looking for, I couldn't find any elements
             | or class names that I recognized in the code from the
             | popular frameworks. Looks great!
        
               | chown wrote:
               | Thank you! I probably spent way too much time tweaking
               | it. Haha! I am glad someone liked all the small details I
               | was sweating on.
        
               | user_7832 wrote:
               | Chiming in to say the page looks really good and
               | professional! It's definitely "enterprise level" in a
               | good way, I was initially searching for a "pricing" tab
               | lol.
               | 
               | Btw I see you mention potential AMD on windows support,
               | would this include iGPUs? I'd love to use your app on my
               | ryzen 7 laptop on its 780m. Thanks!
        
         | vorticalbox wrote:
         | Any plans for a Linux client?
        
           | chown wrote:
           | Yes. I already have a local build that I am testing. Going to
           | release it in a week or so.
        
         | vdaea wrote:
         | I'm trying the Windows version. What really sticks out is that
         | buttons don't have tooltips. It's impossible to know what they
         | do if you don't click them.
         | 
         | Also in the conversation view you have two buttons "New Chat"
         | and "Add Chat" which do two different things but they both have
         | the same keybind ^T
        
           | chown wrote:
           | Thanks for the feedback. I will get them resolved soon.
        
             | wlesieutre wrote:
             | I'm getting a lot of jank with the hovery-sidebar:
             | https://imgur.com/a/VXZXL94
             | 
              | Personally I'd rather have the sidebar toggled on click,
              | instead of having such a huge animation every time my
              | mouse passes by. And if it's such an important part of the
              | UI that requiring a click is too much of a barrier, then
              | it'd be better to build that functionality into a
              | permanent sidebar rather than bury it under a level of
              | sidebar buttons.
              | 
              | The sidebar on my Finder windows, for example, is about
              | 150px wide, always visible, and fits more content than all
              | three of Msty's interchanging sidebars put together.
             | 
             | If I had a lot of previous conversations that might not be
             | true anymore, but a single level sidebar with subheadings
             | still works fine for things like Music where I can have a
             | long list of playlists. If it's too many conversations to
             | reasonably include in an always visible list then maybe
             | they go into a [More] section.
             | 
             | Current UI feels like I had to think a bit too much to
             | understand how it's organized.
        
               | chown wrote:
               | Ugh! That's not the experience I wanted people to have.
               | Sorry about that and I will be working on making the
               | experience better. Feedback like yours really helps so
               | thank you very much.
        
         | attentive wrote:
         | adding gemini API?
        
           | chown wrote:
            | Yes. I signed up for the API a couple of days ago and I am
            | on the waitlist.
        
             | attentive wrote:
                | gemini-pro is available without a waitlist. Go to AI
                | Studio to get a key.
        
               | BOOSTERHIDROGEN wrote:
               | Wait how to get that ? I just submit email on waitlist
        
               | attentive wrote:
                | The waitlist is for Ultra or 1.5; 1.0 Pro is available.
                | 
                | aistudio.google.com
        
         | attentive wrote:
         | btw, it triggers Program:Win32/Wacapew.C!ml detection
        
           | chown wrote:
            | Hmmm... it's a false positive. Is it Windows Defender or
            | something else? Is it when you open the app or when you set
            | up local AI? Not sure where I would send a request for it to
            | not be flagged.
        
             | js4ever wrote:
              | It's probably an issue with the tool you used to create
              | the installer.
              | 
              | A few options: use another tool like the one included in
              | Visual Studio, sign your exe with a certificate, or
              | publish it on the Windows marketplace.
              | 
              | Now you understand why real desktop applications died a
              | decade ago and 99.99% of apps now use a web UI.
        
               | chown wrote:
                | That's true and unfortunate. The macOS installer is
                | signed and I will be looking into signing the Windows
                | installer too. Thank you for your suggestions. My last
                | experience of getting a Windows installer signed was
                | awful and goes back to what you were saying about
                | desktop apps dying a decade ago.
        
               | dom96 wrote:
               | I'd say it's more an issue with the anti-virus that is
               | flagging this as a virus when it isn't one. We should
               | expect better out of AV software. I've personally seen
               | many instances of false positives across various software
               | that was definitely not a virus.
        
               | rezonant wrote:
               | There seems to be something about what these AI apps do
               | that causes the false positives, because Ollama itself
               | _also_ triggers Windows defender
               | https://github.com/ollama/ollama/issues/2519
        
         | sumedh wrote:
          | Is this similar to LM Studio?
        
       | tydunn wrote:
       | I've been playing around with it for the last couple days on my
       | Windows machine, using it for local tab-autocomplete in VS Code,
       | and it's been just as good as it is on my Mac
        
         | notsylver wrote:
         | What do you use for tab-autocomplete in VS Code? I've been
         | trying to find something that can replace copilot, just because
          | it sounds fun. Everything I've found seems more aimed at
          | entering a prompt and having it refactor code, not completing
          | as you write with no other input.
        
           | tydunn wrote:
           | I use Continue.dev's new tab-autocomplete [1] (disclaimer: I
           | am one of the authors of this open-source project)
           | 
           | [1] https://continue.dev/docs/walkthroughs/tab-autocomplete
        
           | sqs wrote:
           | Cody (https://github.com/sourcegraph/cody) supports using
           | Ollama for autocomplete in VS Code. See the release notes at
           | https://sourcegraph.com/blog/cody-vscode-1.1.0-release for
           | instructions. And soon it'll support Ollama for
           | chat/refactoring as well
           | (https://twitter.com/sqs/status/1750045006382162346/video/1).
           | 
           | Disclaimer: I work on Cody and hacked on this feature.
        
       | orion138 wrote:
       | Is this people's preferred way to run and develop on top of local
       | models?
       | 
       | Any other must learn tools?
        
       | jameshart wrote:
       | What is the rationale for so many of these 'run it locally' AI
       | ports to run _as a server_?
       | 
       | Have developers forgotten that it's actually possible to run code
       | inside your UI process?
       | 
       | We see the same thing with stable diffusion runners as well as
       | LLM hosts.
       | 
       | I don't like running background services locally if I don't need
       | to. Why do these implementations all seem to operate that way?
        
         | mattnewton wrote:
          | It doesn't make sense to load the weights on the fly - that is
          | gigabytes of memory that have to be shuffled around. Instead,
          | you have a long-running process that serves up lots of
          | predictions.
          | 
          | (edit: someday soon, probably to multiple clients too!)
        
           | jameshart wrote:
           | Local UI applications _are_ long running processes normally
        
           | nightfly wrote:
           | So better to have GiBs of memory consumed by it constantly?
        
             | mattnewton wrote:
             | If you don't have that memory to spare you can't run this
             | locally anyways, and keeping it in memory is the only way
             | to have a fast experience. Paying the model loading cost
             | repeatedly sucks.
        
               | jameshart wrote:
               | Why would linking llama.cpp into a UI application lead to
               | incurring the model loading cost repeatedly?
        
               | mattnewton wrote:
               | It would be loaded repeatedly if the ui is opened and
               | closed repeatedly. You can achieve the same "long running
               | server + short running ui window" with multiple threads
               | or processes all linked into one binary if you want of
               | course. This way (with a separate server) seems simpler
               | to me (and has the added benefit that multiple
               | applications could easily call into the "server" if
               | needed)
        
         | jmorgan wrote:
         | This is a really interesting question. I think there's
         | definitely a world for both deployment models. Maybe a good
         | analogy is database engines: both SQLite (a library) and
         | Postgres (a long-running service) have widespread use cases
         | with tradeoffs.
        
           | jameshart wrote:
           | But these are typically filling the usecases of
           | _productivity_ applications, not 'engines'.
           | 
           | Microsoft Word doesn't run its grammar checker as an external
           | service and shunt JSON over a localhost socket to get
           | spelling and style suggestions.
           | 
           | Photoshop doesn't install a background service to host
           | filters.
           | 
           | The closest pattern I can think of is the 'language servers'
           | model used by IDEs to handle autosuggest - see
           | https://microsoft.github.io/language-server-protocol/ - but
           | the point of that is to enable many to many interop -
           | multiple languages supporting multiple IDEs. Is that the
           | expected usecase for local language assistants and image
           | generators?
        
             | bri3d wrote:
             | Funny choice of example. You've always been able to use
             | Word as a remote spellchecker over COM, and as of Windows
             | 8, spellchecking is available system wide and runs in a
             | separate process (again over COM) for sandboxing reasons.
             | 
             | JSON over TCP is perhaps a silly IPC mechanism for local
             | services, but this kind of composition doesn't seem
             | unreasonable to me.
        
               | jameshart wrote:
               | > use Word as a remote spellchecker over COM
               | 
               | That's not how COM works. You can load Word's
               | spellchecker _into your process_.
               | 
               | Windows added a spellchecking API in Windows 8. I've not
               | dug into the API in detail, but don't see any indication
               | that spellchecker providers run in a separate process
               | (you can probably build one that works that way, but it's
               | not intrinsic to the provider model).
        
               | bri3d wrote:
               | Are you not familiar with out of process COM servers? A
               | lot of Office automation is out of process, even inside
               | of Office itself. Admittedly I'm not sure about the
               | grammar checker specifically.
               | 
                | As for the Spellcheck API, external providers are
                | explicitly out of proc:
                | https://learn.microsoft.com/en-us/windows/win32/intl/about-t...
               | 
               | Anyway, my point still stands - building desktop apps
               | using composition over RPC is neither new nor a bad idea,
               | although HTTP might not be the best RPC mechanism
               | (although... neither was COM...)
        
             | pseudosavant wrote:
              | The language server pattern is actually a very good
              | comparison. The web service + web UI approach enables you
              | to use different local and/or cloud AI services. That is
              | why most of these servers/services support the OpenAI API.
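              | 
              | For example, a minimal sketch (assuming the local server
              | exposes an OpenAI-compatible endpoint, as Ollama does on
              | its default port 11434; the model name is whatever you
              | have pulled locally):
              | 
              |     # point any OpenAI-style client at the local server
              |     curl http://localhost:11434/v1/chat/completions \
              |       -H "Content-Type: application/json" \
              |       -d '{
              |         "model": "llama2",
              |         "messages": [{"role": "user", "content": "say hello"}]
              |       }'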
        
               | jameshart wrote:
               | Which means most of these servers limit themselves to the
               | capabilities exposed by the OpenAI API.
        
         | api wrote:
         | The main reason I see is to use the same AI engine for multiple
         | things like VSCode plugins, UI apps, etc.
         | 
         | That being said I use LM Studio which runs as a UI and allows
         | you to start a local server for coding and editor plugins.
         | 
         | I can run Deepseek Coder in VSCode locally on an M1 Max and
         | it's actually useful. It'll just eat the battery quickly if
         | it's not plugged in since it really slams the GPU. It's about
         | the only thing I use that will make the M1 make audible fan
         | noise.
        
         | imiric wrote:
         | This is a good thing IMO. I don't have a very powerful laptop
         | or workstation, but do have a multi-GPU headless server. These
         | projects allow me to experiment with LLMs on my server, and
         | expose an API and web UI to my LAN.
        
         | sgt101 wrote:
          | Because running it locally really means running it on a cloud
          | server that you own, called by another server that you own.
          | This gives you the ability to make the interfaces lightweight
          | and, most importantly, to not pay premiums to model servers.
        
           | jameshart wrote:
            | No, running it locally means running it on my laptop.
            | 
            | My Mac M2 is quite capable of running Stable Diffusion XL
            | models and 30B-parameter LLMs under llama.cpp.
            | 
            | What I don't like is that the trend for doing that is to
            | open up network listeners with no authentication on them.
        
             | teaearlgraycold wrote:
             | Bind to localhost then
        
             | karolist wrote:
             | lmstudio my friend
        
         | psytrx wrote:
          | In addition to the initial loading time noted by the other
          | posters:
          | 
          | You may want to use the same inference engine or even the same
          | LLM for multiple purposes in multiple applications.
          | 
          | Also, and this is a huge factor in my opinion, getting your
          | machine, environment and OS into a state that can run the
          | models efficiently wasn't trivial for me. Putting all this
          | complexity inside a container (and therefore a "server") helps
          | tremendously, a) in setting everything up initially and b) in
          | keeping up with the constant improvements and updates that are
          | happening regularly.
        
         | lolinder wrote:
         | In addition to everything that everyone else has said: I run
         | Ollama on a large gaming PC for speed but want to be able to
         | use the models from elsewhere in the house. So I run Open-WebUI
         | at chat.domain.example and Ollama at api.chat.domain.example
         | (both only accessible within my local network).
         | 
         | With this setup I can use my full-speed local models from both
         | my laptop and my phone with the web UI, and my raspberry pi
         | that's running my experimental voice assistant can query Ollama
         | through the API endpoints, all at the full speed enabled by my
         | gaming GPU.
         | 
         | The same logic goes for my Stable Diffusion setup.
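          | 
          | Concretely, for the Ollama half, a sketch with a hypothetical
          | hostname (OLLAMA_HOST controls both which interface the server
          | listens on and which server the CLI talks to):
          | 
          |     # on the gaming PC: listen on the LAN, not just localhost
          |     OLLAMA_HOST=0.0.0.0 ollama serve
          | 
          |     # on the laptop: point the ollama CLI at the gaming PC
          |     OLLAMA_HOST=gaming-pc.lan:11434 ollama run llama2 "say hello"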
        
         | justsomehnguy wrote:
         | > I don't like running background services locally if I don't
         | need to. Why do these implementations all seem to operate that
         | way?
         | 
         | Because it's now a simple REST-like query to interact with that
         | server.
         | 
          | The default model of running the binary and capturing its
          | output would mean you reload everything _each time_. Of
          | course, you can write a master process that actually performs
          | the queries and have a separate executable for querying that
          | master process... wait, you just invented a server.
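          | 
          | A minimal sketch of that query, assuming the default port and
          | a model that has already been pulled:
          | 
          |     curl http://localhost:11434/api/generate \
          |       -d '{"model": "llama2", "prompt": "say hello", "stream": false}'
          | 
          | The JSON response carries the same timing fields as the CLI's
          | --verbose output (total_duration, load_duration, eval_count,
          | eval_duration), and load_duration stays small as long as the
          | weights are already resident in the server.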
        
           | jameshart wrote:
           | I'm not sure what this 'default model of running a binary and
           | capturing its output' is that you're talking about.
           | 
           | Aren't people mostly running browser frontends in front of
           | these to provide a persistent UI - a chat interface or an
           | image workspace or something?
           | 
            | Sure, if you're running a lot of little command line tools
            | that need access to an LLM, a server makes sense. But what I
            | don't understand is why that isn't a _niche_ way of
            | distributing these things - instead it seems to be the
            | default.
        
             | justsomehnguy wrote:
             | > I'm not sure what this 'default model of running a binary
             | and capturing its output' is that you're talking about.
             | 
              | Did you ever use a computer?
              | 
              |     PS C:\Users\Administrator\AppData\Local\Programs\Ollama>
              |     ./ollama.exe run llama2:7b "say hello" --verbose
              |     Hello! How can I help you today?
              | 
              |     total duration:       35.9150092s
              |     load duration:        1.7888ms
              |     prompt eval duration: 1.941793s
              |     prompt eval rate:     0.00 tokens/s
              |     eval count:           10 token(s)
              |     eval duration:        16.988289s
              |     eval rate:            0.59 tokens/s
              | 
              | But I feel like you are here just to troll around without
              | merit or a target.
        
               | jameshart wrote:
               | If you just check out
               | https://github.com/ggerganov/llama.cpp and run make,
               | you'll wind up with an executable called 'main' that lets
               | you run any gguf language model you choose. Then:
               | 
               | ./main -m ./models/30B/llama-30b.Q4_K_M.gguf --prompt
               | "say hello"
               | 
               | On my M2 MacBook, the first run takes a few seconds
               | before it produces anything, but after that subsequent
               | runs start outputting tokens immediately.
               | 
               | You can run LLM models right inside a short lived
               | process.
               | 
               | But the _majority_ of humans don't want to use a single
               | execution of a command line to access LLM completions.
               | They want to run a program that lets them interact with
               | an LLM. And to do that they will likely start and leave
               | running a long-lived process with UI state - which can
               | also serve as a host for a longer lived LLM context.
               | 
               | Neither usecase particularly seems to need a server to
               | function. My curiosity about why people are packaging
               | these things up like that is completely genuine.
               | 
                | Last run of llama.cpp main off my command line:
                | 
                |     llama_print_timings:        load time =   871.43 ms
                |     llama_print_timings:      sample time =    20.39 ms /   259 runs   (  0.08 ms per token, 12702.31 tokens per second)
                |     llama_print_timings: prompt eval time =   397.77 ms /     3 tokens (132.59 ms per token,     7.54 tokens per second)
                |     llama_print_timings:        eval time = 20079.05 ms /   258 runs   ( 77.83 ms per token,    12.85 tokens per second)
                |     llama_print_timings:       total time = 20534.77 ms /   261 tokens
        
         | kaliqt wrote:
          | Heavy compute. Often you might need to outsource the model to
          | another PC, and because these are compute-heavy,
          | general-purpose models, multiple apps may want to use the same
          | model at the same time.
        
         | taneq wrote:
         | You have a beefy computer with lots of vram for testing
         | locally, and then once that's running you want to use the same
         | thing from other computers or from web servers etc. that can't
         | run the models themselves.
        
         | vunderba wrote:
         | Because it adds flexibility. By decoupling the frontend from
         | the backend it's much easier for other devs not directly
         | affiliated with the server repo (e.g. Ollama) to design new
         | frontends that can connect to it.
         | 
          | I also think it allows experts to focus on what they are good
          | at. Some people have a really keen eye for aesthetics and can
          | design amazing front-end experiences, and some people are the
          | exact opposite and prefer to work on the backend.
         | 
         | Additionally, since it runs as a server, I can place it on a
         | powerful headless machine that I have and can access that
         | easily from significantly less powerful devices such as my
         | phone and laptop.
        
         | ijustlovemath wrote:
         | Not mentioned yet: you can "mitm" existing APIs, like OpenAI,
         | so that you can use existing applications with Ollama without
         | changing your code.
         | 
         | Really clever, IMO! I was also mystified by the choice until I
         | saw that use case.
        
         | Kuinox wrote:
          | I'll probably use that, because the Rust binding to llama.cpp
          | doesn't work on Windows (well, CPU-only works, so not usable)
          | and the Python one is broken (can't install the deps).
          | 
          | Also, mind that loading these models takes dozens of seconds,
          | and you can only load one at a time on your machine, so if you
          | have multiple programs that want to run these models, it makes
          | sense to delegate this job to another program that the user
          | can control.
        
         | andersa wrote:
         | I personally find it very useful, because it allows me to run
         | the inference server on a powerful remote server while running
         | the UI locally on a laptop or tablet.
        
         | imtringued wrote:
         | You want electron? That is how you get electron!
        
       | crooked-v wrote:
       | I'm curious what people think of the non-open-source LM Studio
       | (https://lmstudio.ai) compared to Ollama.
        
         | Eisenstein wrote:
         | Likes:
         | 
         | * Super easy setup
         | 
         | * One-click download and load models/weights
         | 
         | * Works great
         | 
         | Dislikes:
         | 
         | * throws weights (in Windows) in /users/username/.cache in a
         | proprietary directory structure, eating up tens of gigs without
         | telling you or letting you share them with other clients
         | 
         | * won't let you import models you download yourself
         | 
         | * Search function is terrible
         | 
         | * I hate how it deals with instance settings
        
           | attentive wrote:
           | > * won't let you import models you download yourself
           | 
           | you can drop GGUF in the models folder following its
           | structure and LM Studio will pick it up.
           | 
            | What I wish LMS and others would improve on is downloading
            | models. At the very least they should support resuming and
            | retrying failed downloads. Multi-stream downloads would also
            | help. The Huggingface CDN isn't the most reliable, and
            | re-downloading failed multi-gigabyte models isn't fun. Of
            | course I could do it manually, but then it's not "one-click
            | download".
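            | 
            | The manual route, for reference - a sketch with a
            | hypothetical repo and filename, using the standard Hugging
            | Face resolve URL so the transfer can be resumed:
            | 
            |     # -C - resumes a partial download, -L follows the CDN redirect
            |     curl -L -O -C - \
            |       https://huggingface.co/SomeOrg/Some-Model-GGUF/resolve/main/some-model.Q4_K_M.gguf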
        
           | spamfilter247 wrote:
           | Does anyone know where it stores GGUFs on macOS?
        
       | xanderlewis wrote:
       | I just ran this on my new Mac Mini (installing the llama2 model)
       | and got a full-blown kernel panic. What?!
        
         | asabla wrote:
          | This may happen if you chose a model which is larger than your
          | available unified memory.
         | 
         | Which version of llama2 did you choose? And how much unified
         | memory do you have?
        
           | xanderlewis wrote:
           | Just the default (7B parameters?), which apparently requires
           | at least 8GB of RAM, which is how much I have. It seems to be
           | running fine now... it's just the installation process that
           | caused it to completely die.
        
             | risho wrote:
                | If it requires 8 GB of RAM and you have 8 GB of RAM,
                | that leaves nothing for the other processes that need to
                | run, or for the GPU, which also shares that RAM.
        
               | xanderlewis wrote:
               | I'm just surprised that the process can successfully eat
               | all of the memory so badly that the OS actually dies. I
               | thought there was protection for stuff like that.
        
               | karolist wrote:
               | There's not, mlock() will crash your machine from
               | userland happily
        
       | justsomehnguy wrote:
        | JUST as I wanted to dabble with this and try installing all
        | those ... requirements myself.
        | 
        | And now this article.
        | 
        | Tested it: yes, it's amusing how simple it is, and it works.
        | 
        | The only trouble I see is that, again, there is no option to
        | select the destination of the installer (so if you have a server
        | and multiple users, they all end up with a personal copy instead
        | of a global one).
        
       | waldrews wrote:
       | Can't seem to get it to see my NVIDIA Quadro P3000 GPU. Is it too
       | old to make a difference?
        
         | jmorgan wrote:
         | Ollama should support anything CUDA compute capability 5+
         | (P3000 is 6.1) https://developer.nvidia.com/cuda-gpus. Possible
         | to shoot me an email? (in my HN bio). The `server` logs should
         | have information regarding GPU detection in the first 10-20
         | lines or so that can help debug. Sorry!
        
       | ralphc wrote:
       | I'm an Elixir fanboy so I'll use that as an example. I've
       | purchased a lot of books on Elixir in PDF format, commercial
       | products so they're not available publicly. Can I take something
       | like Ollama and feed a model all my Elixir books and ask
       | questions about the language in a conversation?
        
       | dsign wrote:
       | I'm running Ollama with the hopes of putting together some
       | grammar/typo checking workflow for my writing. Not directly
       | related to Ollama, which is working great so far, but does
       | anybody know of any place I can turn to for questions? Like, some
       | sort of stackoverflow for LLMs?
        
       ___________________________________________________________________
       (page generated 2024-02-18 23:02 UTC)