[HN Gopher] LM Studio - Discover, download, and run local LLMs
       ___________________________________________________________________
        
       LM Studio - Discover, download, and run local LLMs
        
       Author : victormustar
       Score  : 394 points
       Date   : 2023-11-22 09:38 UTC (13 hours ago)
        
 (HTM) web link (lmstudio.ai)
 (TXT) w3m dump (lmstudio.ai)
        
       | d_meeze wrote:
       | So if you suspect I am using this for business related purposes
       | you may take any action you want to spy on me? Such Terms. Very
       | Use.
        
         | tmikaeld wrote:
          | It's a standard clause for most apps. If there's a breach of the
          | terms and conditions (such as using it for commercial purposes,
          | like selling the software), they are allowed to launch an
          | investigation. Nowhere does this mention "spying" or modifying
          | the app for such use.
        
           | isilofi wrote:
           | Maybe the app is already modified for such use and just needs
           | to be given a "start spying" command.
        
             | tmikaeld wrote:
             | Maybe, personally I don't trust closed source apps like
             | these for that reason.
             | 
             | And when I do try them, it's with Little Snitch blocking
             | outgoing connections.
        
         | busssard wrote:
          | That's why, as a business, I would rather use a trusted FOSS LLM
          | interface like text-generation-webui:
          | https://github.com/oobabooga/text-generation-webui
        
           | nacs wrote:
           | Or a simpler alternative: https://ollama.ai/
        
             | hobofan wrote:
             | ollama doesn't come packaged with an easy to invoke UI
             | though.
        
               | manyoso wrote:
               | Gpt4all does
        
         | moffkalast wrote:
         | Much Sus. Wow.
        
         | ionwake wrote:
          | Sorry, I haven't used this product yet - do the chat messages
          | get uploaded to a server?
        
       | sumedh wrote:
       | How does this App make money?
        
         | tmikaeld wrote:
          | Looks like they'll charge for a "Pro" version that can be used
          | commercially. Though they'd have to confirm that; I only
          | deduced it from their ToS.
        
       | eurekin wrote:
       | This app could use some simple UI improvements:
       | 
       | - The chatbox field has a normal "write here" state, when no chat
       | is really selected. I thought my keyboard broke until I
       | discovered that
       | 
        | - I didn't find a way to set CUDA acceleration before loading a
        | model; I only managed to set GPU-offloaded layers and use
        | "relaunch to apply"
        | 
        | - Some HuggingFace models are simply not listed and there's no
        | indication of why. I guess models are really curated, but
        | somehow presented as a HuggingFace browser?
       | 
        | - Scrolling in the accordion parts of the interface seems to
        | respond to the mouse wheel only. I have a mouse with a damaged
        | wheel and couldn't find a way to reliably navigate to the
        | bottom drawers
       | 
       | That said, I really liked the server tab, which allowed for
       | initial debugging very easily
        
         | easygenes wrote:
         | It's basically a front end for llama.cpp, so it will only show
         | models with GGUF quantizations.
        
           | eurekin wrote:
           | Ah, makes sense
        
         | hahn-kev wrote:
         | Those are really weird bugs, how do you even manage that these
         | days
        
           | eurekin wrote:
           | Not sure if sarcastic or not. Assuming not.
           | 
            | I did find it quite useful for opening a socket for remote
            | tooling (played around with the Continue plugin).
            | 
            | The quirky UI did slow me down, but nothing really
            | showstopping.
        
       | pointlessone wrote:
        | The M1 is only 3 years old and no one cares to support Intel Macs
        | anymore. There are surely a lot of them out there. Are they that
        | much worse to run LLMs on?
        
         | amrrs wrote:
          | Ollama works just fine on an Intel Mac.
          | 
          | The demo in this video is on an Intel Mac:
          | https://youtu.be/C0GmAmyhVxM?si=puTCpGWButsNvKA5
          | 
          | It also supports an OpenAI-compatible API and is completely
          | open-source, unlike LM Studio.
        
           | eurekin wrote:
            | Yeah, Ollama is way nicer in UX
        
           | urtrs wrote:
            | Ollama's API is not OpenAI-compatible.
        
             | rsolva wrote:
              | Correct, but LocalAI has a compatible API for those who
              | need it.
        
       | isilofi wrote:
        | Disappointing, no proper Linux support. Just "ask on Discord".
        
         | apstls wrote:
         | from a pinned message on their discord:
         | https://s3.amazonaws.com/releases.lmstudio.ai/prerelease/LM+...
        
       | kgeist wrote:
        | For my experiments with new self-hostable models on Linux, I've
        | been using a script to download GGUF models from TheBloke on
        | HuggingFace (currently, TheBloke's repository has 657 models in
        | the GGUF format), which I feed to a simple program I wrote that
        | invokes llama.cpp compiled with GPU support. The GGUF format and
        | TheBloke are a blessing, because I'm able to check out new models
        | basically on the day of their release (TheBloke is very fast) and
        | without issues. However, the only frontend I have is the console.
        | Judging by their site, their setup is exactly the same as mine
        | (which I implemented over a weekend), except that they also added
        | a React-based UI on top. I wonder how they're planning to
        | commercialize it, because it's pretty trivial to replicate, and
        | there are already open-source UIs like oobabooga.
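        | 
        | The whole pipeline is roughly this (a minimal sketch, not my
        | exact script; the repo and file names are just examples, and the
        | llama.cpp flags assume a build with GPU offload):
        | 
        |     import subprocess
        |     from huggingface_hub import hf_hub_download
        | 
        |     # Grab one quantization of a GGUF model from TheBloke
        |     model_path = hf_hub_download(
        |         repo_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
        |         filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
        |     )
        | 
        |     # Hand it to llama.cpp's main binary, offloading all layers
        |     # to the GPU and generating 256 tokens
        |     subprocess.run([
        |         "./main", "-m", model_path,
        |         "-ngl", "99", "-n", "256",
        |         "-p", "Explain the GGUF format in one paragraph.",
        |     ], check=True)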
        
         | bambax wrote:
         | I'd like to build myself a headless server to run models, that
         | could be queried from various clients locally on my LAN, but am
          | unsure where to start and what the hardware requirements would
         | be. Software can always be changed later but I'd rather buy the
         | hardware parts only once.
         | 
         | Do you have recommendations about this? or blog posts to get
         | started? What would be a decent hardware configuration?
        
           | fintechie wrote:
            | You can currently do this on an M2 Max with Ollama and a
            | Next.js UI [0] running in a Docker container. Any device on
            | the network can use the UI... and I guess if you want a LAN
            | API you just need to run another container with an OpenAI-
            | compatible API that can query Ollama, e.g. [1]
            | 
            | [0] https://github.com/ivanfioravanti/chatbot-ollama
            | 
            | [1] https://github.com/BerriAI/litellm
        
           | rsolva wrote:
           | Ollama does this. I run it in a container on my homelab
            | (Proxmox on an HP EliteDesk 800 G2 SFF) and 7B models run
            | decently fast on CPU only. Ollama has a nice API and makes it
           | easy to manage models.
           | 
           | Together with ollama-webui, it can replace ChatGPT 3.5 for
           | most tasks. I also use it in VSCode and nvim with plugins,
           | works great!
           | 
           | I have been meaning to write a short blog post about my
           | setup...
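            | 
            | Hitting it from anywhere on the LAN is just one HTTP call;
            | something like this (rough sketch - the hostname and model
            | tag are whatever you run on your own network):
            | 
            |     import json
            |     import requests
            | 
            |     # Ollama listens on port 11434 by default and streams
            |     # JSON lines back from /api/generate
            |     resp = requests.post(
            |         "http://homelab.local:11434/api/generate",
            |         json={"model": "llama2:7b",
            |               "prompt": "Why is the sky blue?"},
            |         stream=True,
            |     )
            |     for line in resp.iter_lines():
            |         if line:
            |             print(json.loads(line).get("response", ""), end="")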
        
             | mfalcon wrote:
              | I've been trying Ollama locally. I've yet to see how it'll
              | behave in a production setting.
        
               | kkielhofner wrote:
               | Depending on what you mean by "production" you'll
               | probably want to look at "real" serving implementations
               | like HF TGI, vLLM, lmdeploy, Triton Inference Server
               | (tensorrt-llm), etc. There are also more bespoke
               | implementations for things like serving large numbers of
               | LoRA adapters[0].
               | 
               | These are heavily optimized for more efficient memory
               | usage, performance, and responsiveness when serving large
               | numbers of concurrent requests/users in addition to
               | things like model versioning/hot load/reload/etc,
               | Prometheus metrics, things like that.
               | 
               | One major difference is at this level a lot of the more
               | aggressive memory optimization techniques and support for
               | CPU aren't even considered. Generally speaking you get
               | GPTQ and possibly AWQ quantization + their optimizations
               | + CUDA only. Their target users and their use cases are
               | often using A100/H100 and just trying to need fewer of
               | them. Support for lower VRAM cards, older CUDA compute
               | architectures, etc come secondary to that (for the most
               | part).
               | 
               | [0] - https://github.com/S-LoRA/S-LoRA
        
               | mfalcon wrote:
                | Thanks! Really helpful. I've a 3090 at home and my idea
                | is to do some testing on a similar config in the cloud to
                | get an idea of the number of requests that could be
                | served.
        
               | kkielhofner wrote:
               | The good news is the number of requests and performance
               | is very impressive. For example, on my RTX 4090 from
               | testing many months ago with lmdeploy (it was the first
               | to support AWQ) I was getting roughly 70 tokens/s each
               | across 10 simultaneous sessions with LLama2-13b-Chat -
               | almost 700 tokens/s total. If I were to test again now
               | with all of the impressive stuff that's been added to all
               | of these I'm sure it would only be better (likely
               | dramatically).
               | 
               | The bad news is because "low VRAM cards" like the 24GB
                | RTX 3090 and RTX 4090 aren't really targeted by these
               | frameworks you'll eventually run into "Yeah you're going
               | to need more VRAM for that model/configuration. That's
               | just how it is." as opposed to some of the approaches for
               | local/single session serving that emphasize memory
               | optimization first and tokens/s for a single session
               | next. Often with no consideration or support at all for
               | multiple simultaneous sessions.
               | 
               | It's certainly possible that with time these serving
               | frameworks will deploy more optimizations and strategies
               | for low VRAM cards but if you look at timelines to even
               | implement quantization support (as one example) it's
                | definitely an afterthought and typically only
               | implemented when it aligns with the overall "more tokens
               | for more users across more sessions on the same hardware"
               | goals.
               | 
               | Loading a 70B model on CPU and getting 3 tokens/s (or
               | whatever) is basically seen as an interesting yet
               | completely impractical and irrelevant curiosity to these
               | projects.
               | 
               | In the end "the right tool for the job" always applies.
        
             | Loic wrote:
             | If I may ask, which plugins are you using in VSCode?
        
               | rsolva wrote:
               | I'm using the extension Continue: https://marketplace.vis
               | ualstudio.com/items?itemName=Continue...
               | 
               | The setup of connecting to Ollama is a bit clunky, but
               | once it's set up it works well!
        
           | akx wrote:
           | Just compile llama.cpp's server example, and you have a local
           | HTTP API. It also has a simple UI (disclaimer: to which I've
           | contributed).
           | 
           | https://github.com/ggerganov/llama.cpp/blob/master/examples/.
           | ..
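            | 
            | Once it's running (port 8080 by default) you can hit it from
            | anything on your network; a rough sketch of a call against
            | the /completion endpoint:
            | 
            |     import requests
            | 
            |     # n_predict caps the number of generated tokens
            |     r = requests.post(
            |         "http://localhost:8080/completion",
            |         json={"prompt": "Building a website in 10 steps:",
            |               "n_predict": 128},
            |     )
            |     print(r.json()["content"])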
        
           | guntars wrote:
            | LM Studio can start a local server with an OpenAI-compatible
            | API. You can't do concurrent requests, but that should get
            | you started.
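            | 
            | That means the official openai Python client works pointed at
            | localhost; a minimal sketch (the port is whatever the Server
            | tab shows - 1234 by default, I believe - and the API key is
            | ignored):
            | 
            |     from openai import OpenAI
            | 
            |     client = OpenAI(base_url="http://localhost:1234/v1",
            |                     api_key="not-needed")
            |     resp = client.chat.completions.create(
            |         model="local-model",  # whichever model is loaded
            |         messages=[{"role": "user", "content": "Hello!"}],
            |     )
            |     print(resp.choices[0].message.content)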
        
           | Havoc wrote:
           | > usure where to start and what the hardware requirements
           | would be
           | 
           | Have a look at the localllama subreddit
           | 
            | In short though, dual 3090s are common, as are a single 4090
            | or various flavours of M1/M2/M3 Macs. Alternatively a P40 can
            | be jury-rigged too, but research that carefully. In fact
            | anything with more than one GPU is going to require careful
            | research.
        
         | ekianjo wrote:
          | You don't need TheBloke. It's trivial to make GGUF files from
          | the original .bin models yourself.
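          | 
          | Roughly, with a llama.cpp checkout (a sketch - the paths and
          | quant type are just examples, and this assumes convert.py and
          | the quantize binary are built/available):
          | 
          |     import subprocess
          | 
          |     # HF / .bin weights -> f16 GGUF
          |     subprocess.run(["python", "convert.py", "models/my-model",
          |                     "--outfile", "my-model-f16.gguf"],
          |                    check=True)
          | 
          |     # f16 GGUF -> 4-bit quantized GGUF
          |     subprocess.run(["./quantize", "my-model-f16.gguf",
          |                     "my-model-Q4_K_M.gguf", "Q4_K_M"],
          |                    check=True)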
        
           | cooperaustinj wrote:
            | What a comment. Why do it the easy way when the more
            | difficult and slower way gets you to the same result!? For
            | people who just want to USE models and not tinker with them,
            | TheBloke is exactly the right place to go.
           | 
           | Like telling someone interested in 3D printing minis to build
           | a 3D printer instead of buying one. Obviously that helps them
           | get to their goal of printing minis faster right?
        
             | paul_n wrote:
             | Actually, consider that the commenter may have helped un-
             | obfuscate this world a little bit by saying that it is in
             | fact easy. To be honest the hardest part about the local
             | LLM scene is the absurd amount of jargon introduced -
              | everything looks a bit more complex than it is. It really
              | is easy with llama.cpp; someone even wrote a tutorial here:
             | https://github.com/ggerganov/llama.cpp/discussions/2948 .
             | 
             | But yes, TheBloke tends to have conversions up very quickly
             | as well and has made a name for himself for doing this
             | (+more)
        
       | andy99 wrote:
       | I don't mean this as a criticism, I'm just curious because I work
       | in this space too: who is this for? What is the niche of people
        | savvy enough to use this who can't run one of the many open-
        | source local LLM tools? It looks in the screenshot like it's
       | exposing much of the complexity of configuration anyway. Is the
       | value in the interface and management of conversation and models?
       | It would be nice to see info or even speculation about the
       | potential market segments of LLM users.
        
         | pridkett wrote:
         | In most workplaces that deal with LLMs you've got a few classes
         | of people:
         | 
          | 1. People who understand LLMs and know how to run them and have
          | access to run them on the cloud.
          | 
          | 2. People who understand LLMs well enough but don't have access
          | to cloud resources - but still have a decent MacBook Pro. Or
          | maybe access to cloud resources goes through overly tight
          | pipelines.
          | 
          | 3. People who are interested in LLMs but don't have enough
          | technical chops/time to get things going with llama.cpp.
          | 
          | 4. People who are fans of LLMs but can't even install stuff on
          | their computer.
         | 
         | This is clearly for #3 and it works well for that group of
         | people. It could also be for #2 when they don't want to spin up
         | their own front end.
        
         | wiihack wrote:
         | I got Mistral-7b running locally, and although it wasn't hard,
         | it did take some time nonetheless. I just wanted to try it out
         | and was not that interested in the technical details.
        
         | 3cats-in-a-coat wrote:
          | It's for people who want to discover LLMs and either don't have
          | the skill to deploy them, or value their time and prefer not to
         | fool around for hours getting it to work before they can try
         | it.
         | 
         | The fact it has configuration is good, as long as it has some
         | defaults.
        
           | transcriptase wrote:
           | Exactly. People like me have been waiting for a tool like
           | this.
           | 
           | I'm more than capable of compiling/installing/running pretty
            | much any software, but all I want is the ability to chat with
            | an LLM of my choice without spending an afternoon tabbing back
            | to a 30-step esoteric GitHub .md full of caveats,
           | assumptions, and requiring dependencies to be installed and
           | configured according to preferences I don't have.
        
           | paul_n wrote:
           | Yeah, I think I fit into this category. If I see a new model
           | announced, it's been nice to just click and evaluate for
            | myself if it's useful for me. If anyone knows other tools for
            | this kind of workflow I'd love to hear about them. Right now I
           | just keep my "test" prompts in a text file.
        
         | mannyv wrote:
         | It's actually quite handy. I built all the various things by
         | hand at one point, but had to wipe it all. Instead of following
         | the directions again I just downloaded this.
         | 
         | Being able to swap out models is also handy. This probably
         | saved a couple of hours of my life, which I appreciate.
        
       | smusamashah wrote:
        | Why is purple, or some shade of purple, the color of all AI
        | products? For some reason, the landing pages of AI products
        | immediately remind me of Crypto products. This one does not have
        | Crypto vibes but the colour is purple. I don't get why.
        
         | 3cats-in-a-coat wrote:
         | Because apps mostly prefer dark theme now, and dark red, brown,
         | dark green and so on look weird, and gray is OK, but very
         | boring, like someone desaturated the UI. Which leaves shades of
         | blue and purple.
        
         | chatmasta wrote:
          | It's a default color in Tailwind CSS and is used in a lot of
         | the templates and examples. Nine times out of ten, if you check
         | the source of a page with this flavor of purple, you'll see
         | it's using Tailwind, as the OP site in fact does.
        
           | smusamashah wrote:
            | Ah! That makes more sense. New startup, new tech, and
            | therefore the new default color. I hope it's just that, and
            | that because I only tend to notice AI startups, purple is
            | what I end up seeing.
        
       | replete wrote:
       | RIP Intel users
        
       | hugovie wrote:
        | LM Studio is great for running local LLMs, and it also supports an
        | OpenAI-compatible API. In case you need a more advanced UI/UX, you
        | can use LM Studio with MindMac (https://mindmac.app); just check
        | this video for details: https://www.youtube.com/watch?v=3KcVp5QQ1Ak
        
         | smcleod wrote:
         | Thanks for sharing MindMac - just tried it out and it's exactly
         | what I was looking for, great to see Ollama integration is
         | coming soon!
        
           | hugovie wrote:
            | Thank you for your support. I just found a workaround to use
            | Ollama with MindMac. Please check this video
            | https://www.youtube.com/watch?v=bZfV70YMuH0 for more details.
            | I will integrate Ollama more deeply in a future version.
        
         | shanusmagnus wrote:
         | MindMac is the first example I've seen where the UI for working
         | w/ LLMs is not complete and utter horseshit and starts to
         | support workflows that are sensible.
         | 
         | I will buy this with so much enthusiasm if it holds up. Argh,
         | this has been such a pain point.
        
       | hnuser123456 wrote:
       | This works, but I've noticed that my CPU use goes up to about 30
        | percent, all in kernel time (Windows), after installing and
       | opening this, even when it's not doing anything, on two separate
       | machines... I also hear the fan spinning fast on my laptop.
       | 
       | Killed the LM studio process and re-opened it and the ghost
       | background usage is down to about 5%.
        
       | manyoso wrote:
       | For those looking for an open source alternative with Mac,
       | Windows, Linux support check out GPT4All.io
        
         | thejohnconway wrote:
         | Terrible name, given that its value is that it runs locally,
         | and you can't do that with ChatGPT.
        
       | reustle wrote:
       | This looks great!
       | 
       | If you're looking to do the same with open source code, you could
       | likely run Ollama and a UI.
       | 
       | https://github.com/jmorganca/ollama + https://github.com/ollama-
       | webui/ollama-webui
        
         | nigma wrote:
         | I'm having a lot of fun chatting with characters using Faraday
         | and koboldcpp. Faraday has a great UI that lets you adjust
         | character profiles, generate alternative model responses, undo,
         | or edit dialogue, and experiment with how models react to your
         | input. There's also SillyTavern that I have yet to try out.
         | 
         | - https://faraday.dev/
         | 
         | - https://github.com/LostRuins/koboldcpp
         | 
         | - https://github.com/SillyTavern/SillyTavern
        
         | eshack94 wrote:
         | Ollama is fantastic. I cannot recommend this project highly
         | enough. Completely open-source, lightweight, and a great
         | community.
        
         | activatedgeek wrote:
         | I looked at Ollama before, but couldn't quite figure something
         | out from the docs [1]
         | 
         | It looks like a lot of the tooling is heavily engineered for a
         | set of modern popular LLM-esque models. And looks like
         | llama.cpp also supports LoRA models, so I'd assume there is a
         | way to engineer a pipeline from LoRA to llama.cpp deployments,
         | which probably covers quite a broad set of possibilities.
         | 
         | Beyond llama.cpp, can someone point me to what the broader
         | community uses for general PyTorch model deployments?
         | 
          | I haven't quite ever self-hosted models, and am really keen to
          | do so. Ideally, I am looking for something that stays close to
         | the PyTorch core, and therefore allows me the flexibility to
         | take any nn.Module to production.
         | 
         | [1]:
         | https://github.com/jmorganca/ollama/blob/main/docs/import.md.
        
         | seydor wrote:
         | oobabooga is no longer popular?
        
           | oceanplexian wrote:
           | oobabooga is the king.
           | 
           | As far as I know, ollama doesn't support exllama, qlora fine
           | tuning, multi-GPU, etc. Text-generation-webui might seem like
           | a science project, but it's leagues ahead (Like 2-4x faster
           | inference with the right plugins) of everything else. Also
           | has a nice openai mock API that works great.
        
         | smcleod wrote:
         | With a couple of other folks I'm currently working on an Ollama
         | GUI, it's in early stages of development -
         | https://github.com/ai-qol-things/rusty-ollama
        
       | omneity wrote:
       | I really like LM Studio and had it open when I came across this
       | post. LM Studio is an interesting mixture of:
       | 
       | - A local model runtime
       | 
       | - A model catalog
       | 
       | - A UI to chat with the models easily
       | 
        | - An OpenAI-compatible API
       | 
       | And it has several plugins such as for RAG (using ChromaDB) and
       | others.
       | 
       | Personally I think the positioning is very interesting. They're
       | well positioned to take advantage of new capabilities in the OS
       | ecosystem.
       | 
       | It's still unfortunate that it is not itself open-source.
        
         | addandsubtract wrote:
         | Does it also let you connect to the ChatGPT API and use it?
        
           | omneity wrote:
           | I haven't found that option. I know it exists in Gpt4all
           | though.
           | 
           | Personally I use a locally served frontend to use ChatGPT via
           | API.
        
         | SOLAR_FIELDS wrote:
         | How does it compare with something like FastChat?
         | https://github.com/lm-sys/FastChat
         | 
         | Feature set seems like a decent amount of overlap. One
         | limitation of FastChat, as far as I can tell, is that one is
          | limited to the models that FastChat supports (though I think it
          | would be a minor change to make it support arbitrary models?)
        
       | pentagrama wrote:
        | Curious about this, so I just downloaded it.
        | 
        | I want to try uncensored models.
        | 
        | I have a question: looking for the most popular "uncensored"
        | model I found "TheBloke/Luna-AI-Llama2-Uncensored-GGML", but
        | it has 14 files to download, between 2 and 7 GB, so I just
        | downloaded the first one: https://imgur.com/a/DE2byOB
        | 
        | I tried the model and it works: https://imgur.com/a/2vtPcui
        | 
        | Should I download all 14 files to get better results?
        | 
        | Also, asking how to make a bomb, it looks like at least this model
        | isn't "uncensored": https://imgur.com/a/iYz7VYQ
        
         | jwblackwell wrote:
          | Basically no open source fine-tunes are censored; you can get
         | an idea of popular models people are using here:
         | https://openrouter.ai/models?o=top-weekly
         | 
         | teknium/openhermes-2.5-mistral-7b is a good one
         | 
          | You don't need all 14 files; just pick one that is recommended
          | as having only a slight loss of quality - hover over the little
          | (i) icon to find out.
        
           | pentagrama wrote:
           | Thank you! Now I checked the (i) tooltips. Just downloaded
           | the bigger file (7GB) that says "Minimal loss of quality".
        
         | nerdenough wrote:
          | The readme of each of TheBloke's repositories has a table that
          | details the quality of each file. Q4_K_M and Q5_K_M seem to be
          | the two main recommended ones for low quality loss without
          | being too large.
          | 
          | You only need one of the files, but I recommend checking out
          | the GGUF version of the model (just replace GGML with GGUF in
          | the URL) instead of GGML. llama.cpp no longer supports GGML,
          | and I'm not sure TheBloke still uploads new GGML versions of
          | models.
        
         | 0xEF wrote:
         | Honest question from someone new to exploring and using these
         | models; why do you need uncensored? What are the use-cases that
         | would call for it?
         | 
         | Again, not questioning your motives or anything, just straight
         | up curious. To use your example, any of us can find bomb
         | building info online fairly easily, and has been a point of
          | social contention since the Anarchist Cookbook. Nobody needs
         | an uncensored LLM for that, of course.
        
           | xcv123 wrote:
           | For the entertainment value
        
           | lloydatkinson wrote:
            | It's very easy to hit absurd "moral" limits on ChatGPT for
           | the most stupid things.
           | 
           | Earlier I was looking for "a phrase that is used as an insult
           | for someone who writes with too much rambling" and all I got
           | was some bullshit about how it's sorry but it can't do that
           | because it's allegedly against its OpenAI rules.
           | 
           | So I asked again "a phrase negatively used to mean someone
           | that writes too much while rambling" and it worked.
           | 
           | I simply cannot be bothered to deal with stupid insipid
           | "corporate friendly language" and other dumb restrictions.
           | 
           | Imagine having a real conversation with someone and they
           | freaked out any time anything negative was discussed?
           | 
           | TLDR: Thought police ruining LLMs
        
           | pentagrama wrote:
           | Curiosity and entertainment. Also have an experience that
           | isn't available on the current popular consumer products like
           | this.
        
           | CamperBob2 wrote:
           | None of your business.
        
           | rgbrgb wrote:
           | perhaps they are in a war zone and must learn to make molotov
           | cocktails without connecting to the internet
        
           | OJFord wrote:
           | Where do you typically have the 'safe search' setting when
           | you web search? Personally I have it 'off' ('moderate' when I
           | worked in an office I think) even though I'm not looking for
           | anything that _ought_ to be filtered by it.
           | 
           | I'm not using models censored or otherwise, but I imagine I'd
           | feel the same way there - I won't be offended, so don't try
           | to be too clever, just give me the unfiltered results and let
           | me decide what's correct.
           | 
           | (Bing AI actually banned me for trying to generate images a
           | rough likeness of myself by combining traits of minor
           | celebrities - in combination it shouldn't have looked like
           | those people either, so I don't think should violate ToS,
           | certainly it didn't in intention (I wanted _myself_ , but it
           | doesn't know what I look like and I couldn't provide a photo
           | at the time, idk if you can now (banned!)) so it does happen,
           | 'false positive censoring' if you like.)
        
         | mark_l_watson wrote:
         | No, each download is just a different quantization of the same
         | model.
        
       | chatmasta wrote:
       | Am I missing something here? I'm on a recent M2 machine. Every
       | model I've downloaded fails to load immediately when trying to
       | load it. Is there some way to get feedback on the reason for
       | failure, like a log file or something?
       | 
       | EDIT: The problem is I'm on macOS 13.2 (Ventura). According to a
       | message in Discord, the minimum version for some (most?) models
       | is 13.6.
        
       | rgbrgb wrote:
       | LMStudio is great, if a bit daunting. If you're on Mac and want a
       | native open source interface, try out FreeChat
       | https://www.freechat.run
        
         | NoMoreNicksLeft wrote:
         | Thanks for the link.
         | 
          | I expected it to not let me run this. I have an Intel MacBook,
          | and was expecting that I'd need Apple Silicon... am I
         | misunderstanding something? I get fairly fast results at the
         | prompt with the default model. How's this thing running with
         | whatever shitty GPU I have in my laptop?
        
           | rgbrgb wrote:
           | that's the magic of llama.cpp!
           | 
           | I include a universal binary of llama.cpp's server example to
           | do inference. What's your machine? The lowest spec I've heard
           | it running on is a 2017 iMac with 8GB RAM (~5.5 tokens/s). On
           | my m1 with 64GB RAM I get ~30 tokens per second on the
           | default 7B model.
        
             | NoMoreNicksLeft wrote:
              | MacBook Pro 2020 with 16GB of system RAM. I think the GPU
              | is Iris Plus? But I don't much keep up on those.
             | 
             | I'm now delving into getting this running in Terminal...
             | there are a few things I want to try that I don't think the
             | simple interface allows.
             | 
             | Also, I've noticed that when chats get a few kilobytes
             | long, it just seizes up and can't go further. I complained
             | to it, it spent a sentence apologizing, started up where it
             | left off... and got about 12 words further.
        
               | rgbrgb wrote:
               | hm yeah i think I need to update llama.cpp to get this
               | fix https://github.com/ggerganov/llama.cpp/pull/3996
               | 
               | Thanks for trying it!
        
       | cloudking wrote:
       | Is anyone using open source models to actually get work done or
       | solving problems in their software architecture? So far I haven't
       | found anything near the quality of GPT-4.
        
         | voiceblue wrote:
         | WizardCoder is probably close to state of the art for open
         | models as of right now.
         | 
         | https://github.com/nlpxucan/WizardLM/tree/main/WizardCoder
         | 
         | https://huggingface.co/WizardLM/WizardCoder-Python-34B-V1.0
         | 
         | Top of the line consumer machines can run this at a good clip,
         | though most machines will need to use a quantized model
         | (ExLlamaV2 is quite fast). I found a model for that as well,
         | though I haven't used it myself:
         | 
         | https://huggingface.co/oobabooga/CodeBooga-34B-v0.1-EXL2-4.2...
        
         | gsuuon wrote:
         | Zephyr is coherent enough to bounce ideas off of, but I'm
         | eagerly awaiting when open-source models are on par
         | productivity wise with the big providers. I imagine some folks
         | are utilizing codellama 34b somehow, but I haven't been able to
         | effectively.
        
         | xcv123 wrote:
         | > So far I haven't found anything near the quality of GPT-4
         | 
         | GPT-4 has an estimated 1.8 trillion parameters. Orders of
         | magnitude beyond open source models and ~10x GPT-3.5 which has
         | 175 billion parameters.
         | 
         | https://the-decoder.com/gpt-4-architecture-datasets-costs-an...
        
         | ramoz wrote:
          | The reality is there is no general use case for open source
          | models the way there is for GPT-4.
          | 
          | The decent chat ones are based on GPT data and they're
          | basically shitty distilled models.
         | 
         | The best use case is a narrow one that you decide and can
         | create adequate fine-tuning data around. Plenty of real
         | production ability here.
        
       | cvhashim04 wrote:
       | Wow this is sleek, good job.
        
       | machiaweliczny wrote:
       | https://github.com/enricoros/big-agi seems better and is open
       | source
        
       | RecycledEle wrote:
       | This is what I use on Windows 10.
       | 
       | I have an HP z440 with an E5-1630 v4 and 64GB DDR4 quad channel
       | RAM.
       | 
       | I run LLMs on my CPU, and the 7 billion parameter models spit out
       | text faster than I can read it.
       | 
        | I wish it supported LMMs (multi-modal models).
        
       | deskamess wrote:
       | Newbie question... Is this purely for hosting _text_ language
       | models? Is there something similar for image models? i.e., upload
        | an image and have some local model provide some
        | detection/feedback on it.
        
       | stuckinhell wrote:
        | After the latest ChatGPT debacles and the poor performance I'm
        | getting from GPT-4 Turbo, I'd really like a local version of GPT-4
        | or equivalent. I'd even buy a new PC if I had to.
        
         | ramoz wrote:
         | Wouldn't everyone.
        
       | sunsmile wrote:
        | Considering the code is closed source and they can change the ToS
        | anytime to send conversation data to their servers whenever they
        | want, I would like to know: what would be the benefit of using
        | this over ChatGPT?
        
       | porcoda wrote:
       | Amusing qualifications for the senior engineering roles they're
       | hiring for:
       | 
       | "Deep understanding of what is a computer, what is computer
       | software, and how the two relate."
       | 
       | Right after the senior ML role that requires people understand
       | how to write "algorithms and programs."
       | 
       | Kinda hard to take those kinds of requirements seriously.
        
         | wongarsu wrote:
         | The second one isn't _that_ bad in context. But the Senior
         | Systems Software Engineer is wild, with  "Deep understanding of
         | what is a computer, what is computer software, and how the two
         | relate" followed by "Experience writing and maintaining
         | production code in C++14 or newer". You'd think the latter
         | would imply the former, but maybe not...
         | 
         | They seem to have even lowered expectations a bit. Two months
         | ago [1] they were already hiring for that role (or a very, very
         | similar one), but back then you needed experience with
         | "mission-critical code in C++17", now just "production code in
         | C++14".
         | 
         | 1:
         | http://web.archive.org/web/20230922170941/https://lmstudio.a...
        
           | xcv123 wrote:
           | > You'd think the latter would imply the former, but maybe
           | not...
           | 
           | I wouldn't put C++ devs on too high of a pedestal. I got away
           | with writing shitty C++ code for years before I really knew
           | what I was doing. It still worked though.
        
         | xcv123 wrote:
         | > "Deep understanding of what is a computer, what is computer
         | software, and how the two relate."
         | 
         | Seems like a joke, but many developers do not really understand
         | what's going on behind the scenes. This gets straight to the
         | point. They don't care about HR keyword matching on your CV, or
         | how many years of experience you have of being a mediocre
         | developer with language X or framework Y. I guess during the
         | interview they will investigate whether you truly understand
         | the fundamentals.
        
       | FooBarWidget wrote:
       | The settings say saving chats can take 2 GB. Why? What states do
       | chat LLMs have? Isn't the only state the chat history text?
        
       | whartung wrote:
       | I'm late to the game about this. So I'll ask a stupid question.
       | 
       | As a contrived example, what happens if you feed the LoTR books,
       | the Hobbit, the Silmarillion, and whatever else is germane, into
       | an LLM?
       | 
       | Is there a base, empty, "ignorant" LLM that is used as a seed?
       | 
       | Do you end up with a Middle Earth savant?
       | 
       | Just how does all this work?
        
         | LeoPanthera wrote:
          | There isn't enough text in the Tolkien works to generate a
         | functional LLM. You need to start with a base model that
         | contains enough English (or language of your choice) to become
         | functional.
         | 
         | Such a model (LLaMa is a good example) is not "ignorant," but
         | rather a generalized model capable of a wide range of language
         | tasks. This base model does not have specialized knowledge in
         | any one area but has a broad understanding based on the diverse
         | training data.
         | 
         | If you were to "feed" Tolkien's specific books into this
         | general LLM, the model wouldn't become a Middle Earth savant.
         | It would still provide responses based on its broad training.
         | It might generate text that reflects the style or themes of
         | Tolkien's work if it has learned this from the broader training
         | data, but its responses would be based on patterns learned from
         | the entire dataset, not just those books.
        
           | whartung wrote:
           | So, it wouldn't necessarily "know" much about Middle Earth,
           | but might take a stab at writing like Tolkien?
        
       | jhoechtl wrote:
       | Is this an alternative to privategpt or GPT4All?
        
       | cztomsik wrote:
       | Also, if you don't know what all the toggles are for, this is a
       | simpler attempt by me: https://www.avapls.com/
        
       | MuffinFlavored wrote:
       | 1. Mistral
       | 
       | 2. Llama 2
       | 
       | 3. Code Llama
       | 
       | 4. Orca Mini
       | 
       | 5. Vicuna
       | 
       | What can I do with any of these models that won't result in 50%
       | hallucinations/it recommending code with APIs that don't exist/it
       | recommending basically regurgitated StackOverflow historical out
       | of date answers (that it was trained on) for libraries that have
       | had their versions/APIs change, etc?
       | 
       | Can somebody share one real use case they are using any of these
       | models for?
        
         | shironandonon_ wrote:
         | Why don't you wrap it with a verification system (eg: web
         | scraper) and auto-regenerate / jailbreak any things you don't
         | like?
        
           | MuffinFlavored wrote:
           | Because I pay $20/mo for GPT-4 and don't understand why
           | anybody would run a "less-good" version locally that you can
           | trust less/that has less functionality.
           | 
           | That's why I wanted to try to understand, what am I missing
           | about local-toy LLMs. How are they not just noise/nonsense
           | generators?
        
       ___________________________________________________________________
       (page generated 2023-11-22 23:01 UTC)