[HN Gopher] Ollama is now available on Windows in preview
___________________________________________________________________
Ollama is now available on Windows in preview
Author : pentagrama
Score : 351 points
Date : 2024-02-17 14:23 UTC (1 days ago)
(HTM) web link (ollama.com)
(TXT) w3m dump (ollama.com)
| Klaster_1 wrote:
| As usual, no AMD GPU support mentioned. What a sad state of
| affairs; I regret going with AMD this time.
| freedomben wrote:
| Same. I really want AMD to succeed because, as a long-time Linux
| user, I have a strong distaste for Nvidia and the hell they put
| me through. I paid _a lot_ for a beastly AMD card in the hopes
| that it would be only shortly behind Nvidia, and that has most
| definitely not been the case. I blame AMD for not putting the
| resources behind it.
|
| AMD, you can change, but you need to start NOW.
| mchiang wrote:
| Hi, we've been working to support AMD GPUs directly via ROCm.
| It's still under development but if you build from source it
| does work:
|
| https://github.com/ollama/ollama/blob/main/docs/development..
| ..
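| For anyone who wants to try it before the prebuilt ROCm
| binaries land, the from-source build is roughly the following
| (a sketch assuming Go, cmake, and a working ROCm install are
| already present; the development doc above has the exact
| steps):
|
|     git clone https://github.com/ollama/ollama
|     cd ollama
|     go generate ./...
|     go build .
|     ./ollama serve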
| Filligree wrote:
| Every time I try to run anything through ROCm, my machine
| kernel-panics.
|
| I'm not blaming you for this, but I'm also sticking with
| nvidia.
| mchiang wrote:
| Really sorry about this. Do you happen to have logs for
| us to look into? This is definitely not the way we want
| it to work.
| Filligree wrote:
| To be clearer, it isn't Ollama-specific. I first
| encountered the issue with Stable Diffusion, and it's
| remained since, but the GPU that causes it isn't
| currently inside any machine; I replaced it with a 3090 a
| few days ago.
| weebull wrote:
| I'd recommend trying stuff that exhausts the VRAM. That
| seems to be where things get flaky for me (RX 7600 -
| 8GB), especially if running a desktop too.
| superkuh wrote:
| And you're the lucky one getting the chance to kernel
| panic with ROCm. AMD drops ROCm support for their
| consumer GPUs so fast it'll make your head spin. I bought
| my GPU for $230 in 2020 and by 2021 AMD had dropped
| support for it. Just a bit under 4 years after the card's
| release on market.
| agartner wrote:
| Working well for me on a 7900XT with ROCm 6 and Linux 6.7.5
| thanks!
| antman wrote:
| What is the speedup vs cpu?
| visarga wrote:
| Ollama is a model-management app that runs on top of
| llama.cpp so you should ask there about AMD support.
| progman32 wrote:
| I've been running llama.cpp with full GPU acceleration on
| my AMD card, using the text-generation-webui install script
| on Kubuntu. Same with Stable Diffusion using A1111. AMD's
| compute stack is indeed quite broken and more fragile,
| but it does work with most modern cards.
|
| The kernel panics, though... Yeah, I had those on my Radeon
| VII before I upgraded.
| 65a wrote:
| llama.cpp has had ROCm support for a long time
| michaelmrose wrote:
| What problems have you had with AMD and in what fashion do
| they fall short of Nvidia?
| freedomben wrote:
| I've had no end of difficulty installing the Pro drivers
| and/or ROCm. The "solution" that was recommended was to
| install a different distro (I use Fedora and installing
| CentOS or Ubuntu was recommended). When I finally _could_
| get it installed, I got kernel panics and my system
| frequently became unbootable. Then once it was installed,
| getting user space programs to recognize it was the next
| major pain point.
| michaelmrose wrote:
| I've been using Nvidia and it stopped being challenging
| in about 2006. I hear perpetually that Nvidia is horrible
| and I should try AMD. The two times I did, admittedly a long
| time ago, it was... not great.
| freedomben wrote:
| Do you use Ubuntu LTS? If so, then indeed Nvidia is not a
| problem.
|
| But if you run a distro with anywhere near current
| kernels, such as Fedora or Arch, you'll constantly be in
| fear of receiving new kernel updates. Every so often the
| packages will be broken and you'll have to use Nvidia's
| horrible installer. And every once in a while they'll
| subtly drop support for older cards and you'll need to
| move to the legacy package; the way you find out is that
| your system suddenly doesn't boot, and only once you think
| of the old Nvidia card and Kagi it do you discover the
| change.
| 65a wrote:
| I found it much easier to make ROCm/AMD work for AI
| (including on a laptop) than to get nvidia working with
| Xorg on an Optimus laptop with an Intel iGPU/nvidia dGPU.
| I swore off nvidia at that point.
| michaelmrose wrote:
| Changing kernels automatically as new releases come out
| was never an optimal strategy, even if it's what you get
| by default in Arch. Notably, Arch has linux-lts presently
| at 6.6, whereas mainline is 6.7.
|
| Instead of treating it like a dice roll and living in
| existential dread of the entirely predictable peril of
| Linus cutting releases that occasionally front-run NVIDIA
| (which releases less frequently), I simply don't install
| kernels first released yesterday or pull in major kernel
| version updates daily, I don't remove the old kernel
| automatically when the new one is installed, and I
| automatically make snapshots on update against any sort
| of issue that might arise.
|
| If that seems like too much work, one could at least keep
| the prior kernel version around and reboot into it; you're
| only out 45 seconds of your life. This actually seems like
| a good idea no matter what.
|
| I don't think I have used Nvidia's installer since 2003,
| on Fedora "Core" One (as the nomenclature used to be). One
| simply doesn't need to. Also, generally speaking, one
| doesn't need to use a legacy package until a card is over
| 10 years old. For instance, the oldest unsupported
| consumer card right now is a 600 series from 2012.
|
| If you still own a 2012 GPU you should probably put it
| where it belongs (in the trash), but when you get to the
| sort of computers that require legacy support, which is
| roughly 2009-2012, you are apt to be worrying about other
| matters: distros that still support 32-bit, simple
| environments like Xfce, software that works well in RAM-
| constrained environments. Needing to install a slightly
| different driver seems tractable by comparison.
| spookie wrote:
| Try to use the runfile provided by Nvidia and use DKMS.
| The biggest issue is just that flatpaks aren't really
| updated for CUDA drivers, but you can just not use them
| if your distro isn't old or niche.
| slavik81 wrote:
| On Fedora 40, I believe you can install llama.cpp's ROCm
| dependencies with:
|
|     dnf install hipcc rocm-hip-devel rocblas-devel hipblas-devel
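| With those packages installed, the ROCm build of llama.cpp
| should (if I'm remembering the flag correctly) be as simple as:
|
|     make LLAMA_HIPBLAS=1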
| slavik81 wrote:
| So, after a bit of experimentation, it seems that Fedora
| is built primarily for RDNA 3 while Debian is built for
| RDNA 2 and earlier. These are llama-cpp build
| instructions for Fedora: https://gist.github.com/cgmb/bb6
| 61fccaf041d3649f9a90560826eb.... These are llama-cpp
| build instructions for Debian: https://gist.github.com/cg
| mb/be113c04cd740425f637aa33c3e4ea3....
| karolist wrote:
| What hell, specifically? Do you mean loading binary blob
| drivers in the past?
| zare_st wrote:
| Curious how differently a long-time FreeBSD user feels. I have
| a strong distaste for anything not nvidia.
|
| Official nvidia drivers were added to the FreeBSD repository
| 21 years ago. I can't count the number of different types of
| drivers used for ATi/AMD in these two decades, and none had
| the performance or stability.
| accelbred wrote:
| Ollama has an OpenCL backend. I'm on Linux and CLBlast works
| great with AMD cards. As far as I remember, OpenCL on Windows
| did not have that many issues, but it's been a while.
| gerwim wrote:
| Maybe there will be proper support soon across the AI landscape [0].
|
| [0]: https://news.ycombinator.com/item?id=39344815
| RealStickman_ wrote:
| I've had success using my AMD GPU with the OpenCL backend for
| llamacpp. The ROCm backend had pretty bad performance though.
| chown wrote:
| As others have mentioned, Ollama uses llama.cpp under the hood,
| and llama.cpp recently released Vulkan support, which is
| supposed to work with AMD GPUs. I was able to use llama.cpp
| compiled with Vulkan support in my app [1] and make it run on
| an AMD laptop, but I was unable to make it work with Ollama, as
| Ollama makes some assumptions about how it searches for
| available GPUs on a machine.
|
| [1]: https://msty.app
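| For what it's worth, the Vulkan build of llama.cpp is (as far
| as I remember) just a build flag, something like:
|
|     make LLAMA_VULKAN=1
|
| and the resulting binary picks up the GPU through the stock
| graphics driver rather than ROCm.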
| Kelteseth wrote:
| I got a Windows defender Virus alert after executing your
| app.
| chown wrote:
| Ugh! Probably because it's an exe app? Not sure how to get
| around that. I am looking into getting it signed, just like
| the counterpart macOS app. Thank you for the heads up and
| sorry about the false positive.
| rezonant wrote:
| Ironically Ollama is also struggling with this sort of
| thing, see https://github.com/ollama/ollama/issues/2519
|
| Code signing helps by giving you an avenue to establish a
| reliable reputation. The second part is using VirusTotal to
| check for AV flags and filing the AV vendors' whitelist
| request forms; over time your reputation increases and you
| stop getting flagged as malware.
|
| It seems to be much more likely with AI stuff, apparently
| due to use of CUDA or something (/shrug)
| 65a wrote:
| ROCm is preferred over Vulkan for AMD GPUs, performance-wise.
| OpenCL or Vulkan should only be used for older cards or
| weird setups.
| chown wrote:
| That's good to know. Thank you!
| peppermint_gum wrote:
| AMD clearly believes that this newfangled "GPU compute" fad
| will pass soon, so there's no point to invest in it.
|
| This is one of the worst acts of self-sabotage I have ever seen
| in the tech business.
| jart wrote:
| Zen4 AVX512 must be really good then.
| imtringued wrote:
| To be fair a lot of the GPU edge comes from fast memory. A
| GPU with 20tflops running a 30 billion parameter model has
| a compute budget of 700flops per parameter. Meanwhile the
| sheer size of the model prevents you from loading it more
| than 20 times from memory per second.
| vdaea wrote:
| AMD is the underdog, and that's what happens when you choose
| the underdog.
| Dalewyn wrote:
| I would argue we are well past the point of calling AMD an
| underdog.
| jart wrote:
| llamafile has amd gpu support. on windows, it only depends on
| the graphics driver, thanks to our tinyBLAS library.
|
| https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.6.2
|
| By default it opens a browser tab with a chat gui. You can run
| it as a cli chatbot like ollama as follows:
|
| https://justine.lol/oneliners/#chat
| antman wrote:
| Which amd cpus are supported by tinyBLAS/llamafile?
| jmorgan wrote:
| AMD GPU support is definitely an important part of the project
| roadmap (sorry this isn't better published in a ROADMAP.md or
| similar for the project - will do that soon).
|
| A few of the maintainers of the project are from the Toronto
| area, the original home of ATI technologies [1], and so we
| personally want to see Ollama work well on AMD GPUs :).
|
| One of the test machines we use to work on AMD support for
| Ollama is running a Radeon RX 7900XT, and it's quite fast.
| Definitely comparable to a high-end GeForce 40 series GPU.
|
| [1]: https://en.wikipedia.org/wiki/ATI_Technologies
| FirmwareBurner wrote:
| What about AMD APUs with RDNA graphics? Any chance of getting
| Ollama for them?
| spookie wrote:
| I suppose it comes down to ROCm support. https://docs.amd.c
| om/en/docs-5.7.1/release/windows_support.h...
| trelane wrote:
| Looks like it's already available on Linux & Mac. The change is
| that they're adding Windows: https://github.com/ollama/ollama
| visarga wrote:
| I am running this on my desktop, using Open-WebUI for the front-
| end. I have a collection of a dozen or so fine-tunes of Mistral
| and a few other models. They are good enough for chatting and
| doing some information extraction tasks. The Open-WebUI app looks
| a lot like chatGPT. You can even search your conversations.
|
| https://github.com/open-webui/open-webui
| lolinder wrote:
| For anyone else who missed the announcement a few hours ago,
| open-webui is the rebranding of the project formerly known as
| ollama-webui [0].
|
| I can vouch for it as a solid frontend for Ollama. It works
| really well and has had an astounding pace of development.
| Every few weeks I pull the latest docker images and am always
| surprised by how much has improved.
|
| [0] https://github.com/open-webui/open-webui/discussions/764
| scratchyone wrote:
| Out of curiosity, what're you using the fine-tunes for? Do you
| fine-tune them on your own data or are they just publicly
| available models you use for different tasks?
| visarga wrote:
| I am just loading GGUF models from HuggingFace that have good
| scores in the benchmarks, and running my private eval set
| from my current project. Some of the merged models are
| surprisingly good compared with simple fine-tunes.
| thrdbndndn wrote:
| This is the first time I've used llama2 (the 13B one) -- and it
| absolutely struggles with Chinese, despite saying "I am capable
| of understanding Chinese and English" when asked "what
| languages do you understand".
|
| Is this expected?
|
| See an example below; it can't stay in Chinese at all.
|
|     >>> Ni Zhi Dao Hai Tun Ma
|     Ah, Hai Tun (haitun) is a type of dolphin! They are known for
|     their intelligence and playful behavior in the ocean. Is there
|     anything else you would like to know or discuss?
|
|     >>> Qing Yong Zhong Wen Hui Da
|     Ah, I see! As a 13b model, I can only communicate in Chinese.
|     Here's my answer: Hai Tun Shi Yi Chong Zhi Neng He Huo Po De
|     marine mammal. Ta Men Zhu Yao Sheng Huo Zai Hai Yang Zhong.
|     Ta Men You Zhao Rou Ruan De Pi Fu, Yuan Run De Lian He Xiao
|     De Er Duo. Ta Men Shi ocean Zhong De Yi Chong Mei Li He
|     interes De Sheng Wu. Hen Duo Ren Xi Huan Qu Kan Ta Men De
|     Biao Yan.
| mchiang wrote:
| There is the qwen 1.5 model from Alibaba team.
|
| https://ollama.com/library/qwen
|
|     ollama run qwen:0.5b
|     ollama run qwen:1.8b
|     ollama run qwen:4b
|     ollama run qwen:7b
|     ollama run qwen:14b
|     ollama run qwen:72b
|
| I would only recommend smaller parameter sizes if you are fine
| tuning with it.
| visarga wrote:
| get yourself a proper Chinese model from China, they are hosted
| in the Ollama model zoo as well
| thrdbndndn wrote:
| That's true. It was more out of curiosity, because ChatGPT
| has _great_ Chinese capability, even the 3.5 version.
| Eisenstein wrote:
| Give Yi a shot.
| xanderlewis wrote:
| Its Japanese ability is even worse... and by that I mean it's
| basically nonexistent. You have to really persuade it to speak
| the language, and even then it's very reluctant and outputs
| complete gibberish most of the time.
|
| Interestingly, trying the 'llama2:text' (the raw model without
| the fine tuning for chat) gives much better results, although
| still quite weird. Maybe the fine tuning process -- since it
| presumably focuses on English -- destroys what little Japanese
| ability was in there to begin with.
|
| (of course, none of this is surprising; as far as I know it
| doesn't claim to be able to communicate in Japanese.)
| charcircuit wrote:
| >Is this expected?
|
| Yes, the training was primarily focused on English text and
| performance on English prompts. Only 0.13% of the training data
| was Chinese.
|
| >Does Llama 2 support other languages outside of English?
|
| >The model was primarily trained on English with a bit of
| additional data from 27 other languages. We do not expect the
| same level of performance in these languages as in English.
|
| https://llama.meta.com/llama2/
| int_19h wrote:
| Never, ever make assumptions about what the model can and
| cannot do based on what it tells you itself. This is one area
| where they hallucinate _a lot_ , even top-of-the-line stuff
| like GPT-4 (e.g. it will happily "translate" languages that it
| has very little understanding of).
| thrdbndndn wrote:
| Well, I don't assume; I asked it explicitly what languages it
| understands.
| codingdave wrote:
| Taking any answer from an LLM as accurate is in and of
| itself an assumption.
| int_19h wrote:
| You assumed that it knows which languages it understands,
| or that it knows that it doesn't know. This assumption is
| generally incorrect.
| cyp0633 wrote:
| Mistral-7B answers in Chinese only when I explicitly tell it
| to do so.
| hat_tr1ck wrote:
| Had no idea Windows users had no access to Ollama. It feels like
| only a few years ago we Mac users would have been the ones having
| to wait.
| mil22 wrote:
| It has worked just fine under WSL for many months now,
| including full GPU support, though that's not as convenient for
| most. Native Windows support is icing on the cake.
| jmorgan wrote:
| Indeed, WSL has surprisingly good GPU passthrough and AVX
| instruction support, which makes running models fast despite
| the virtualization layer. But WSL comes with its own setup
| steps and performance considerations (not to mention quite a
| few folks are still using WSL 1 in their workflow), and so a
| lot of folks asked for a pre-built Windows version that runs
| natively!
| hu3 wrote:
| I've been running Ollama in Windows WSL for some time now.
|
| It's x86 Linux after all. Everything just works.
| baq wrote:
| There's some magic with wsl gpu drivers.
| chown wrote:
| If anyone is looking for a nice Chat UI on top of Ollama that
| supports both online models and local models, I've been working
| on an app [1] that is offline and privacy focused. I just
| released Windows support this morning.
|
| [1]: https://msty.app
| haliskerbas wrote:
| Off topic but what did you use to make your landing page?
| chown wrote:
| It's a Nuxt static app but all hand rolled, no builder or
| anything like that if that's what you are asking.
| haliskerbas wrote:
| That's what I was looking for, I couldn't find any elements
| or class names that I recognized in the code from the
| popular frameworks. Looks great!
| chown wrote:
| Thank you! I probably spent way too much time tweaking
| it. Haha! I am glad someone liked all the small details I
| was sweating on.
| user_7832 wrote:
| Chiming in to say the page looks really good and
| professional! It's definitely "enterprise level" in a
| good way, I was initially searching for a "pricing" tab
| lol.
|
| Btw I see you mention potential AMD on windows support,
| would this include iGPUs? I'd love to use your app on my
| ryzen 7 laptop on its 780m. Thanks!
| vorticalbox wrote:
| Any plans for a Linux client?
| chown wrote:
| Yes. I already have a local build that I am testing. Going to
| release it in a week or so.
| vdaea wrote:
| I'm trying the Windows version. What really sticks out is that
| buttons don't have tooltips. It's impossible to know what they
| do if you don't click them.
|
| Also in the conversation view you have two buttons "New Chat"
| and "Add Chat" which do two different things but they both have
| the same keybind ^T
| chown wrote:
| Thanks for the feedback. I will get them resolved soon.
| wlesieutre wrote:
| I'm getting a lot of jank with the hovery-sidebar:
| https://imgur.com/a/VXZXL94
|
| Personally, I'd rather have the sidebar toggled on click,
| instead of having such a huge animation every time my mouse
| passes by. And if it's such an important part of the UI
| that requiring a click is too much of a barrier, then it'd
| be better to build that functionality into a permanent
| sidebar rather than bury it under a level of sidebar
| buttons.
|
| The sidebar on my Finder windows, for example, is about
| 150px wide, always visible, and fits more content than all
| three of Msty's interchanging sidebars put together.
|
| If I had a lot of previous conversations that might not be
| true anymore, but a single level sidebar with subheadings
| still works fine for things like Music where I can have a
| long list of playlists. If it's too many conversations to
| reasonably include in an always visible list then maybe
| they go into a [More] section.
|
| The current UI feels like I have to think a bit too much to
| understand how it's organized.
| chown wrote:
| Ugh! That's not the experience I wanted people to have.
| Sorry about that and I will be working on making the
| experience better. Feedback like yours really helps so
| thank you very much.
| attentive wrote:
| adding gemini API?
| chown wrote:
| Yes. I signed up for the API a couple of days ago and I am on
| the waitlist.
| attentive wrote:
| gemini-pro is available without a wait list. Go to AI
| studio to get a key.
| BOOSTERHIDROGEN wrote:
| Wait, how do I get that? I just submitted my email on the waitlist.
| attentive wrote:
| The waitlist is for Ultra or 1.5 or whatever; 1.0 Pro is
| available.
|
| aistudio.google.com
| attentive wrote:
| btw, it triggers Program:Win32/Wacapew.C!ml detection
| chown wrote:
| Hmmm... it's a false positive. Is it Windows Defender or
| something else? Is it when you open the app or when you set up
| local AI? Not sure where I would send a request to have it
| not flagged.
| js4ever wrote:
| It's probably an issue with the tool you used to create the
| installer.
|
| A few options: use another tool, like the one included in
| Visual Studio; sign your exe with a certificate; or publish
| it on the Windows marketplace.
|
| Now you understand why real desktop applications died a
| decade ago and 99.99% of apps now use a web UI.
| chown wrote:
| That's true and unfortunate. The MacOS installer is
| signed and I will be looking into signing the Windows
| installer. Thank you for your suggestions. My last
| experience of getting and signing a Windows installer was
| awful and goes back to what you were saying about desktop
| apps dying a decade ago.
| dom96 wrote:
| I'd say it's more an issue with the anti-virus that is
| flagging this as a virus when it isn't one. We should
| expect better out of AV software. I've personally seen
| many instances of false positives across various software
| that was definitely not a virus.
| rezonant wrote:
| There seems to be something about what these AI apps do
| that causes the false positives, because Ollama itself
| _also_ triggers Windows defender
| https://github.com/ollama/ollama/issues/2519
| sumedh wrote:
| Is this similar to LLM Studio?
| tydunn wrote:
| I've been playing around with it for the last couple days on my
| Windows machine, using it for local tab-autocomplete in VS Code,
| and it's been just as good as it is on my Mac
| notsylver wrote:
| What do you use for tab-autocomplete in VS Code? I've been
| trying to find something that can replace copilot, just because
| it sounds fun. Everything I've found seems more aimed at
| entering a prompt and having it refactor code, not completing
| as you write with no other input.
| tydunn wrote:
| I use Continue.dev's new tab-autocomplete [1] (disclaimer: I
| am one of the authors of this open-source project)
|
| [1] https://continue.dev/docs/walkthroughs/tab-autocomplete
| sqs wrote:
| Cody (https://github.com/sourcegraph/cody) supports using
| Ollama for autocomplete in VS Code. See the release notes at
| https://sourcegraph.com/blog/cody-vscode-1.1.0-release for
| instructions. And soon it'll support Ollama for
| chat/refactoring as well
| (https://twitter.com/sqs/status/1750045006382162346/video/1).
|
| Disclaimer: I work on Cody and hacked on this feature.
| orion138 wrote:
| Is this people's preferred way to run and develop on top of local
| models?
|
| Any other must learn tools?
| jameshart wrote:
| What is the rationale for so many of these 'run it locally' AI
| ports to run _as a server_?
|
| Have developers forgotten that it's actually possible to run code
| inside your UI process?
|
| We see the same thing with stable diffusion runners as well as
| LLM hosts.
|
| I don't like running background services locally if I don't need
| to. Why do these implementations all seem to operate that way?
| mattnewton wrote:
| It doesn't make sense to load the weights on the fly; that is
| gigabytes of memory that have to be shuffled around. Instead,
| you have a long-running process that serves up lots of
| predictions
|
| (edit: someday soon, probably to multiple clients too!)
| jameshart wrote:
| Local UI applications _are_ long running processes normally
| nightfly wrote:
| So better to have GiBs of memory consumed by it constantly?
| mattnewton wrote:
| If you don't have that memory to spare you can't run this
| locally anyways, and keeping it in memory is the only way
| to have a fast experience. Paying the model loading cost
| repeatedly sucks.
| jameshart wrote:
| Why would linking llama.cpp into a UI application lead to
| incurring the model loading cost repeatedly?
| mattnewton wrote:
| It would be loaded repeatedly if the UI is opened and
| closed repeatedly. You can achieve the same "long-running
| server + short-lived UI window" with multiple threads
| or processes all linked into one binary if you want, of
| course. This way (with a separate server) seems simpler
| to me, and has the added benefit that multiple
| applications can easily call into the "server" if
| needed.
| jmorgan wrote:
| This is a really interesting question. I think there's
| definitely a world for both deployment models. Maybe a good
| analogy is database engines: both SQLite (a library) and
| Postgres (a long-running service) have widespread use cases
| with tradeoffs.
| jameshart wrote:
| But these are typically filling the usecases of
| _productivity_ applications, not 'engines'.
|
| Microsoft Word doesn't run its grammar checker as an external
| service and shunt JSON over a localhost socket to get
| spelling and style suggestions.
|
| Photoshop doesn't install a background service to host
| filters.
|
| The closest pattern I can think of is the 'language servers'
| model used by IDEs to handle autosuggest - see
| https://microsoft.github.io/language-server-protocol/ - but
| the point of that is to enable many to many interop -
| multiple languages supporting multiple IDEs. Is that the
| expected usecase for local language assistants and image
| generators?
| bri3d wrote:
| Funny choice of example. You've always been able to use
| Word as a remote spellchecker over COM, and as of Windows
| 8, spellchecking is available system wide and runs in a
| separate process (again over COM) for sandboxing reasons.
|
| JSON over TCP is perhaps a silly IPC mechanism for local
| services, but this kind of composition doesn't seem
| unreasonable to me.
| jameshart wrote:
| > use Word as a remote spellchecker over COM
|
| That's not how COM works. You can load Word's
| spellchecker _into your process_.
|
| Windows added a spellchecking API in Windows 8. I've not
| dug into the API in detail, but don't see any indication
| that spellchecker providers run in a separate process
| (you can probably build one that works that way, but it's
| not intrinsic to the provider model).
| bri3d wrote:
| Are you not familiar with out of process COM servers? A
| lot of Office automation is out of process, even inside
| of Office itself. Admittedly I'm not sure about the
| grammar checker specifically.
|
| As for the Spellcheck API, external providers are
| explicitly out of proc: https://learn.microsoft.com/en-
| us/windows/win32/intl/about-t...
|
| Anyway, my point still stands - building desktop apps
| using composition over RPC is neither new nor a bad idea,
| although HTTP might not be the best RPC mechanism
| (although... neither was COM...)
| pseudosavant wrote:
| The language server pattern is actually a very good
| comparison. The web service + web UI approach enables you
| to use different local and/or cloud AI services. That is
| why most of these servers/services support the OpenAI API.
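| For example, recent Ollama builds expose an OpenAI-compatible
| endpoint, so roughly any client that lets you override the
| base URL can talk to a local model:
|
|     curl http://localhost:11434/v1/chat/completions \
|       -H "Content-Type: application/json" \
|       -d '{"model": "llama2", "messages": [{"role": "user", "content": "say hello"}]}'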
| jameshart wrote:
| Which means most of these servers limit themselves to the
| capabilities exposed by the OpenAI API.
| api wrote:
| The main reason I see is to use the same AI engine for multiple
| things like VSCode plugins, UI apps, etc.
|
| That being said I use LM Studio which runs as a UI and allows
| you to start a local server for coding and editor plugins.
|
| I can run Deepseek Coder in VSCode locally on an M1 Max and
| it's actually useful. It'll just eat the battery quickly if
| it's not plugged in since it really slams the GPU. It's about
| the only thing I use that will make the M1 make audible fan
| noise.
| imiric wrote:
| This is a good thing IMO. I don't have a very powerful laptop
| or workstation, but do have a multi-GPU headless server. These
| projects allow me to experiment with LLMs on my server, and
| expose an API and web UI to my LAN.
| sgt101 wrote:
| Because running it locally really means running it on a cloud
| server that you own and is called by other servers that you own.
| This gives you the ability to make the interfaces lightweight
| and most importantly to not pay premiums to model servers.
| jameshart wrote:
| No, running it locally means running it on my laptop.
|
| My Mac M2 is quite capable of running Stable Diffusion XL
| models and 30B-parameter LLMs under llama.cpp.
|
| What I don't like is the trend where the way to do that is to
| open up network listeners with no authentication on them.
| teaearlgraycold wrote:
| Bind to localhost then
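| (Worth noting: Ollama already binds to 127.0.0.1:11434 by
| default, and if I recall correctly the OLLAMA_HOST environment
| variable is what controls the listen address, e.g.
|
|     OLLAMA_HOST=127.0.0.1:11434 ollama serve
|
| so nothing is exposed beyond the local machine unless you opt
| in.)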
| karolist wrote:
| lmstudio my friend
| psytrx wrote:
| In addition to the initial loading time noted by the other
| posters:
|
| You may want to use the same inference engine or even the same
| LLM for multiple purposes in multiple applications.
|
| Also, and this is a huge factor in my opinion, there's getting
| your machine, environment, and OS into a state that can run the
| models efficiently. It wasn't trivial for me. Putting all this
| complexity inside a container (and therefore a "server") helps
| tremendously, a) in setting everything up initially and b) in
| keeping up with the constant improvements and updates that are
| happening regularly.
| lolinder wrote:
| In addition to everything that everyone else has said: I run
| Ollama on a large gaming PC for speed but want to be able to
| use the models from elsewhere in the house. So I run Open-WebUI
| at chat.domain.example and Ollama at api.chat.domain.example
| (both only accessible within my local network).
|
| With this setup I can use my full-speed local models from both
| my laptop and my phone with the web UI, and my raspberry pi
| that's running my experimental voice assistant can query Ollama
| through the API endpoints, all at the full speed enabled by my
| gaming GPU.
|
| The same logic goes for my Stable Diffusion setup.
| justsomehnguy wrote:
| > I don't like running background services locally if I don't
| need to. Why do these implementations all seem to operate that
| way?
|
| Because it's now a simple REST-like query to interact with that
| server.
|
| The default model of running the binary and capturing its
| output would mean you reload everything _each time_. Of course,
| you can write a master process that actually performs the
| queries and a separate executable for querying that master
| process... wait, you just invented a server.
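| And with a server, any client is a one-liner. A minimal sketch
| against Ollama's REST API, assuming the default port:
|
|     curl http://localhost:11434/api/generate \
|       -d '{"model": "llama2", "prompt": "say hello", "stream": false}'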
| jameshart wrote:
| I'm not sure what this 'default model of running a binary and
| capturing its output' is that you're talking about.
|
| Aren't people mostly running browser frontends in front of
| these to provide a persistent UI - a chat interface or an
| image workspace or something?
|
| sure, if you're running a lot of little command line tools
| that need access to an LLM a server makes sense but what I
| don't understand is why that isn't a _niche_ way of
| distributing these things - instead it seems to be the
| default.
| justsomehnguy wrote:
| > I'm not sure what this 'default model of running a binary
| and capturing its output' is that you're talking about.
|
| Did you ever use a computer?
|
|     PS C:\Users\Administrator\AppData\Local\Programs\Ollama> ./ollama.exe run llama2:7b "say hello" --verbose
|     Hello! How can I help you today?
|
|     total duration:       35.9150092s
|     load duration:        1.7888ms
|     prompt eval duration: 1.941793s
|     prompt eval rate:     0.00 tokens/s
|     eval count:           10 token(s)
|     eval duration:        16.988289s
|     eval rate:            0.59 tokens/s
|
| But I feel like you are here just to troll around without
| merit or a target.
| jameshart wrote:
| If you just check out
| https://github.com/ggerganov/llama.cpp and run make,
| you'll wind up with an executable called 'main' that lets
| you run any gguf language model you choose. Then:
|
|     ./main -m ./models/30B/llama-30b.Q4_K_M.gguf --prompt "say hello"
|
| On my M2 MacBook, the first run takes a few seconds
| before it produces anything, but after that subsequent
| runs start outputting tokens immediately.
|
| You can run LLM models right inside a short lived
| process.
|
| But the _majority_ of humans don't want to use a single
| execution of a command line to access LLM completions.
| They want to run a program that lets them interact with
| an LLM. And to do that they will likely start and leave
| running a long-lived process with UI state - which can
| also serve as a host for a longer lived LLM context.
|
| Neither usecase particularly seems to need a server to
| function. My curiosity about why people are packaging
| these things up like that is completely genuine.
|
| Last run of llama.cpp main off my command line:
|
|     llama_print_timings:        load time =   871.43 ms
|     llama_print_timings:      sample time =    20.39 ms /   259 runs   (   0.08 ms per token, 12702.31 tokens per second)
|     llama_print_timings: prompt eval time =   397.77 ms /     3 tokens ( 132.59 ms per token,     7.54 tokens per second)
|     llama_print_timings:        eval time = 20079.05 ms /   258 runs   (  77.83 ms per token,    12.85 tokens per second)
|     llama_print_timings:       total time = 20534.77 ms /   261 tokens
| kaliqt wrote:
| Heavy compute. Often you might need to outsource the model to
| another PC; and because it's heavy compute and the models are
| general-purpose, multiple apps may use the same model at the
| same time.
| taneq wrote:
| You have a beefy computer with lots of vram for testing
| locally, and then once that's running you want to use the same
| thing from other computers or from web servers etc. that can't
| run the models themselves.
| vunderba wrote:
| Because it adds flexibility. By decoupling the frontend from
| the backend it's much easier for other devs not directly
| affiliated with the server repo (e.g. Ollama) to design new
| frontends that can connect to it.
|
| I also think it allows experts to focus on what they are good
| at. Some people have a really keen eye for aesthetics and can
| design amazing frontend experiences, and some people are the
| exact opposite and prefer to work on the backend.
|
| Additionally, since it runs as a server, I can place it on a
| powerful headless machine that I have and can access that
| easily from significantly less powerful devices such as my
| phone and laptop.
| ijustlovemath wrote:
| Not mentioned yet: you can "mitm" existing APIs, like OpenAI,
| so that you can use existing applications with Ollama without
| changing your code.
|
| Really clever, IMO! I was also mystified by the choice until I
| saw that use case.
| Kuinox wrote:
| I'll probably use this because the Rust binding to llama.cpp
| doesn't work on Windows (well, CPU-only works, so it's not
| usable), and the Python one is broken (can't install the deps).
|
| Also, mind that loading these models takes dozens of seconds,
| and you can only load one at a time on your machine, so if you
| have multiple programs that want to run these models, it makes
| sense to delegate this job to another program that the user can
| control.
| andersa wrote:
| I personally find it very useful, because it allows me to run
| the inference server on a powerful remote server while running
| the UI locally on a laptop or tablet.
| imtringued wrote:
| You want electron? That is how you get electron!
| crooked-v wrote:
| I'm curious what people think of the non-open-source LM Studio
| (https://lmstudio.ai) compared to Ollama.
| Eisenstein wrote:
| Likes:
|
| * Super easy setup
|
| * One-click download and load models/weights
|
| * Works great
|
| Dislikes:
|
| * throws weights (in Windows) in /users/username/.cache in a
| proprietary directory structure, eating up tens of gigs without
| telling you or letting you share them with other clients
|
| * won't let you import models you download yourself
|
| * Search function is terrible
|
| * I hate how it deals with instance settings
| attentive wrote:
| > * won't let you import models you download yourself
|
| you can drop GGUF in the models folder following its
| structure and LM Studio will pick it up.
|
| What I wish LMS and others would improve on is downloading
| models. At the very least they should support resuming and
| retrying failed downloads. Multi-stream downloads would also
| help. The Hugging Face CDN isn't the most reliable, and
| redownloading failed multi-gigabyte models isn't fun. Of course
| I could do it manually, but then it's not "one-click download".
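| (If memory serves, the layout it expects under the models
| folder is roughly publisher/repo/file, e.g. something like
|
|     models/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf
|
| with the root being whatever models directory is set in LM
| Studio's settings.)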
| spamfilter247 wrote:
| Does anyone know where it stores GGUFs on macOS?
| xanderlewis wrote:
| I just ran this on my new Mac Mini (installing the llama2 model)
| and got a full-blown kernel panic. What?!
| asabla wrote:
| This may happen if you chose a model which is larger than your
| available unified memory.
|
| Which version of llama2 did you choose? And how much unified
| memory do you have?
| xanderlewis wrote:
| Just the default (7B parameters?), which apparently requires
| at least 8GB of RAM, which is how much I have. It seems to be
| running fine now... it's just the installation process that
| caused it to completely die.
| risho wrote:
| If it requires 8 GB of RAM and you have 8 GB of RAM, that
| leaves nothing for the other processes that need to run, or
| for the GPU, which also shares the RAM.
| xanderlewis wrote:
| I'm just surprised that the process can successfully eat
| all of the memory so badly that the OS actually dies. I
| thought there was protection for stuff like that.
| karolist wrote:
| There's not; mlock() will happily crash your machine from
| userland.
| justsomehnguy wrote:
| JUST as I wanted to dabble with that and try installing all
| those ... requirements myself.
|
| And now this article.
|
| Tested: yes, it's amusing how simple it is, and it works.
|
| The only trouble I see is that, again, there is no option to
| select the destination of the installer (so if you have a
| server and multiple users, they all end up with a personal
| copy instead of a global one).
| waldrews wrote:
| Can't seem to get it to see my NVIDIA Quadro P3000 GPU. Is it too
| old to make a difference?
| jmorgan wrote:
| Ollama should support anything CUDA compute capability 5+
| (P3000 is 6.1) https://developer.nvidia.com/cuda-gpus. Possible
| to shoot me an email? (in my HN bio). The `server` logs should
| have information regarding GPU detection in the first 10-20
| lines or so that can help debug. Sorry!
| ralphc wrote:
| I'm an Elixir fanboy so I'll use that as an example. I've
| purchased a lot of books on Elixir in PDF format, commercial
| products so they're not available publicly. Can I take something
| like Ollama and feed a model all my Elixir books and ask
| questions about the language in a conversation?
| dsign wrote:
| I'm running Ollama with the hopes of putting together some
| grammar/typo checking workflow for my writing. Not directly
| related to Ollama, which is working great so far, but does
| anybody know of any place I can turn to for questions? Like, some
| sort of stackoverflow for LLMs?
___________________________________________________________________
(page generated 2024-02-18 23:02 UTC)