[HN Gopher] Show HN: Ollama - Run LLMs on your Mac
___________________________________________________________________
Show HN: Ollama - Run LLMs on your Mac
Hi HN! A few folks and I have been working on this project for a
couple of weeks now. Having previously worked on the Docker project
for a number of years (both on the container runtime and image
registry side), we felt the recent rise of open-source language
models meant something similar needed to exist for large language
models too. While not exactly the same as running Linux containers,
running LLMs shares quite a few of the same challenges. There are
"base layers" (e.g. models like Llama 2) and specific configuration
needed to run correctly (parameters, temperature, context window
sizes, etc.). There are also embeddings that a model can use at
runtime to look up data - we don't support this yet, but it's
something we're looking at doing soon. It's an early project, and
there's still lots to do!
Author : jmorgan
Score : 99 points
Date : 2023-07-20 16:06 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| mrbonner wrote:
| I like to troll LLMs :-)
|
| >>> if you drop a feather and a cannon ball together, at the same
| height, which one would hit the ground first?
|
| Thank you for your question! I'm happy to help. However, I must
| point out that the question is based on a false premise. A
| feather and a cannon ball cannot be dropped together at the same
| height, as they have different weights and densities, which would
| cause them to fall at different rates. The feather would fall
| much slower than the cannon ball due to its light weight and
| lower density. So, to answer your question directly, the cannon
| ball would hit the ground first. But please keep in mind that
| this scenario is purely hypothetical, as it is not possible to
| drop a feather and a cannon ball together at the same height in
| real life. Is there anything else I can help you with?
|
| >>> if you drop a feather and a cannon ball together, at the same
| height on the moon, which one would hit the ground first?
|
| Thank you for your question! I'm happy to help. However, I must
| point out that the scenario you've described is not possible as
| the moon does not have an atmosphere, and therefore, there is no
| "ground" on the moon for either the feather or the cannon ball to
| hit. Both the feather and the cannon ball would remain suspended
| in mid-air due to the lack of gravity on the moon. So, neither
| one would hit the ground first. Is there anything else I can help
| you with?
| quadrature wrote:
| Well it did get the first one right.
| lagniappe wrote:
| >cannon ball would remain suspended in mid-air due to the lack
| of gravity
|
| Surface gravity 1.622 m/s2 (0.1654 g; 5.318 ft/s2)
| anigbrowl wrote:
| Cartoon physics are inherently superior to reality
| [deleted]
| mrbonner wrote:
| I wanted to reply to another response here but it got deleted.
| Yes, I think we all (hopefully) understand LLM is not a logic
| model. But, its answers are so convincing to the point we are
| fooled into thinking it is a SME, in every thing! At least, I
| found myself fooled in a few occasions.
| magundu wrote:
| It is really amazing.
|
| With this command "ollama run llama2", I can run llama2
| instantly.
|
| Do you know how much RAM and disk space are needed to run this
| llama2 model?
| mchiang wrote:
| By default the `llama2` model is the 7B model, and it's
| recommended you have at least 16GB of memory to run it.
|
| Regarding the disk space, the model itself is 3.8GB.
| jmorgan wrote:
| For disk size, you can see how big models are locally by
| running "ollama list":
|
|       NAME              SIZE      MODIFIED
|       llama2:latest     3.8 GB    3 hours ago
|
| Now, if you make new models that are "FROM llama2" for example,
| Ollama will re-use this "base model" so you don't store it
| twice.
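| A minimal sketch of what that looks like (the model name here is
| just illustrative):
|
|       # Modelfile containing a single line:
|       #   FROM llama2
|       ollama create my-llama2 -f ./Modelfile
|       ollama list    # both models are listed, but the 3.8 GB
|                      # llama2 base layer is only stored once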
| zenlikethat wrote:
| One Q for the maintainers, in
| https://github.com/jmorganca/ollama/blob/main/library/modelf...
| does that imply you can override TEMPLATE directly? I am
| experimenting with adding some dynamic context between user
| instructions
| jmorgan wrote:
| You can, although this is definitely "hard mode" when writing
| Modelfiles.
|
| Easier is to override SYSTEM, which will customize the system
| prompt.
|
| SYSTEM "You are an expert in writing python."
|
| If you do want to modify the template, you can start by looking
| at the "base images":
| https://github.com/jmorganca/ollama/blob/main/library/modelf...
|
| For example, with Llama 2 as a base model you can do this, but
| YMMV:
|
|       FROM llama2
|       TEMPLATE """
|       {{- if .First }}
|       <<SYS>>
|       A conversation between three friends
|       <</SYS>>
|       {{- end }}
|       Friend 1: hello
|       Friend 2: hey
|       Friend 3: {{ .Prompt }}
|       """
|
| edit: we're looking at how to make this easier so you don't need
| {{ .First }} to provide the system prompt on the first message
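| To try a Modelfile like the one above, the flow is the same as
| for any custom model (the name "friends" is just an example):
|
|       ollama create friends -f ./Modelfile
|       ollama run friends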
| nidnogg wrote:
| I feel like I've seen a project just like this pop up almost
| every other day here on HN. Sorry for my ignorance, but how is
| this different when compared to other "LLM on your mac CLI"
| solutions nowadays? (please bear in mind I'm not up to speed with
| these sibling projects)
|
| Is it mostly in the `Modelfile` approach with expandable models
| and quick setup?
| zenlikethat wrote:
| So far for me it's been nice not having to spend one second
| mucking with Python (which is slow, and can get complex managing
| venvs, etc.), but the Modelfile is what's started to pique my
| interest. Bundling up metadata about prompts, parameters, models
| and hopefully later embeddings, LoRAs, etc. seems really
| promising. (Considering most people are just sharing prompts
| around atm.)
|
| I added a PR that supports dynamic command output injection,
| among other interesting things:
| https://github.com/jmorganca/ollama/pull/132. So you can
| imagine taking the output of, say, top and having the LLM parse
| it into useful information for you. Or having a chat session
| where you roll a d20 between each user interaction to determine
| their luck. Then being able to share that around to other
| people generically.
|
| I'm also hoping at some point this project or a similar one
| will enable easy passing of full blown pipelines/chains of
| stuff around, as well as a ChatGPT style conversation sync and
| sharing tool.
| [deleted]
| barefeg wrote:
| Could you explain how it works?
| jmorgan wrote:
| Under the hood it does a few things I'll shed some light on (at
| a high level!):
|
| 1. Running the model: it's built on the open-source (and
| amazing) llama.cpp project for running quantized (i.e.
| compressed) models like Llama 2 (launched yesterday) that will
| fit in memory on even a commodity Mac. It's similar to their
| "server" example as a starting point.
|
| 2. Downloading and storing models: models are distributed in a
| way that ensures their integrity and re-usability as much as
| possible (since they are large files!). For this we use a
| similar approach as Docker
| (https://github.com/distribution/distribution)
|
| 3. Creating custom models: models can be extended with this new
| idea we're experimenting with: a Modelfile. What this will do
| is effectively add "layers" to a model so you can distribute
| model data together and keep them self-contained. This builds
| on what I mentioned in 2 - our hope is this will make it easier
| to extend models like Llama 2 to your own use cases (e.g. a
| character).
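| As a rough sketch of point 3 (the exact directives may evolve,
| and the values here are just illustrative):
|
|       FROM llama2
|       # configuration "layers" on top of the base weights
|       PARAMETER temperature 0.8
|       SYSTEM "You are Mario from Super Mario Bros."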
| joebiden2 wrote:
| What does this add over llama.cpp? Is it just an "easier" way to
| set up llama.cpp locally?
|
| If so, I don't really get it, because setting up llama.cpp
| locally is quite easy and well documented. And this appears to be
| a fork. Seems a bit fishy to me, when looking at the other "top"
| comments (with this one having no upvotes, but still #2 right
| now).
|
| (llama.cpp's original intention is identical to yours: The main
| goal of llama.cpp is to run the LLaMA model using 4-bit integer
| quantization on a MacBook1)
|
| 1 https://github.com/ggerganov/llama.cpp#description
| jmorgan wrote:
| The llama.cpp project is absolutely amazing. Our goal was to
| build with/extend the project (vs try to be an alternative).
| Ollama was originally inspired by the "server" example:
| https://github.com/ggerganov/llama.cpp/tree/master/examples/...
|
| This project builds on llama.cpp in a few ways:
|
| 1. Easy install! Precompiled for Mac (Windows and Linux coming
| soon)
|
| 2. Run 2+ models: loading and unloading models as users need
| them, including via a REST API (a request sketch follows below).
| Lots to do here, but even small models are memory hogs and they
| take quite a while to load, so the hope is to provide basic
| "scheduling"
|
| 3. Packaging: content-addressable packaging that bundles GGML-
| based weights with prompts, parameters, licenses and other
| metadata. Later the goal is to bundle embeddings and other larger
| files that custom models (for specific use cases, a la
| PrivateGPT) would need to run.
|
| edit: formatting
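| As a rough sketch of what a request to the REST API mentioned in
| point 2 looks like (the request shape here is illustrative, not a
| spec):
|
|       curl http://127.0.0.1:11434/api/generate -d '{
|         "model": "llama2",
|         "prompt": "Why is the sky blue?"
|       }'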
| tough wrote:
| Error: Post "http://127.0.0.1:11434/api/generate": EOF
| mchiang wrote:
| Without seeing the logs, this is most likely an out-of-memory
| problem.
|
| We have a fix in the works that will be released soon.
|
| May I ask what Mac and how much memory you're running this on?
| mchiang wrote:
| Try running the orca model (the default is 3B); it requires much
| less memory:
|
|       ollama run orca
| politelemon wrote:
| Surprised to see no Linux support, considering that it's a
| project that manages... docker.
| williamstein wrote:
| Is this a project that "manages docker" or is it a project that
| has an overall design and command line interface that is
| _inspired by Docker_?
| jmorgan wrote:
| Definitely the latter!
| mchiang wrote:
| Linux support is coming, you can build it right now by running:
|
| `CGO_ENABLED=1 go build . `
| jmorgan wrote:
| While we haven't built Linux binaries yet (waiting to make sure
| we have _fast_ GPU support before publishing Linux and Windows
| binaries), it does run on Linux. For example, it's used to create
| and publish the "library" images (e.g. "llama2") on a Linux
| server today.
|
| If you have the Go toolchain installed, there are some steps for
| building it from source here:
| https://github.com/jmorganca/ollama#building
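| Roughly, it looks like this (treat it as a sketch - the exact
| steps are in the link above):
|
|       git clone https://github.com/jmorganca/ollama
|       cd ollama
|       CGO_ENABLED=1 go build .
|       ./ollama serve &     # start the local server first
|                            # (see the repo README)
|       ./ollama run llama2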
| zenlikethat wrote:
| I've been playing with this lately and it's been loads of fun (on
| OSX in particular, on Windows/WSL I don't think it's rigged up
| with GPTQ yet).
|
| Of note is the experimental "Modelfile" that allows you to ship
| around an AI character or "stack" like a Docker image. I can see
| the future around this shaping up to be really exciting, since
| I'll be able to hand you something much more replicable in terms
| of model, prompt, etc., and once we are able to chain calls and
| bundle embeddings, LoRAs, etc., it seems like it could be really
| powerful.
| danryan wrote:
| I came to the comments to see if anyone shared our opinion; +1
|
| The Modelfile is an exciting abstraction, OP.
| littlestymaar wrote:
| This is doing inference on the CPU, right?
| mchiang wrote:
| On the mac, we have enabled Metal support.
| willietran wrote:
| Whoa, this is really neat. Does this mean that I can essentially
| try any LLM on my local machine?
| Patrick_Devine wrote:
| Yep. Right now we've packaged llama2, vicuna, wizardlm, and
| orca. The idea is to make it crazy easy to get started though.
| You do need quite a bit of RAM (16GB should work for the smaller
| models, 32GB+ for the bigger ones), and for now a newer Mac.
| We're working on versions for Windows and Linux too though.
|
| EDIT: We don't let you run stuff from HF, but we are trying to
| repackage the popular models. The plan is to let you upload
| your own in the future to share them.
| willietran wrote:
| Awesome! Thanks for this. Trying this out now.
| zenlikethat wrote:
| Not "any" yet per se, but the groundwork is there. It helped me
| try out the GGML stuff after failing to get it online
| previously.
| jmorgan wrote:
| One fun use case of local models is creating "characters" or
| companions (a la character.ai and other tools).
|
| It's early, but for those interested, this project has a way to
| do this with the concept of a "Modelfile". Here's an example for
| Mario:
| https://github.com/jmorganca/ollama/tree/main/examples/mario
|
|       ollama create mario -f ./Modelfile
|       ollama run mario
|       >>> Hi!
|       WAAH-HOO! *jumps up and down* IT'S-A ME, MARIO!
|       *grinning* HOW-A YOU DOIN'? *wink wink*
| jmorgan wrote:
| To add: the easiest way to get up and running is to download the
| Mac app:
|
| https://ollama.ai/download
|
| Then running a model (e.g. Llama 2) can be done with
| ollama run llama2
|
| For Macs with less memory (<= 8GB) you'll want to try a smaller
| model - orca is the smallest in the "model registry" right now:
| ollama run orca
| mikeravkine wrote:
| Maybe mention that GGML is under the hood. I see you've forked
| the library, did you pick up the GQA changes needed for
| llama2-70b?
|
| Also seems like you are picking quants on people's behalf, which
| ones?
| jmorgan wrote:
| This is super helpful! Currently it runs 4-bit quantized models
| (with names often ending in q4_0). Sorry this isn't more obvious.
| Will fix that and soon update the hosted models to support
| different quantization levels, so you can choose, including
| k-quant models. More to come there.
|
| It does use GGML under the hood. Will definitely make sure
| that's more obvious for compatibility. There's about to be an
| update as well that includes those recent changes. We do vendor
| the source code since it made building with Go a lot easier,
| but definitely don't want to be "forking" it (vs supporting it)
| - it's a brilliant piece of software!
| ioman wrote:
| Looks cool. I downloaded it, moved it to the applications
| directory and opened it. It's asking for Administrator access to
| install the command line. Why?
| jmorgan wrote:
| It adds a symlink for the "ollama" CLI (you can see the source
| here:
| https://github.com/jmorganca/ollama/blob/main/app/src/instal...)
|
|       /usr/local/bin/ollama ->
|       /Applications/Ollama.app/Contents/Resources/ollama
|
| Sorry this was startling! We'll find a way to make it optional
| or easier to do yourself w/o needing to trust a dialog window.
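| For reference, doing it by hand is just a one-line symlink (it
| may need sudo since /usr/local/bin is root-owned):
|
|       ln -s /Applications/Ollama.app/Contents/Resources/ollama \
|         /usr/local/bin/ollama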
| ioman wrote:
| Further, apparently it's going to start up whenever I log in.
| Why?
| [deleted]
| mschuster91 wrote:
| Wow, this is _awesome_. Huge respect. Are you planning to extend
| this to more use cases? Stable Diffusion in particular comes to
| mind; it's a _constant_ nightmare dealing with all the Python
| mess, libraries randomly updating leading to issues, and
| whatnot...
___________________________________________________________________
(page generated 2023-07-20 23:01 UTC)