[HN Gopher] Large language models are having their Stable Diffus...
___________________________________________________________________
Large language models are having their Stable Diffusion moment
Author : simonw
Score : 186 points
Date : 2023-03-11 19:19 UTC (3 hours ago)
(HTM) web link (simonwillison.net)
(TXT) w3m dump (simonwillison.net)
| homarp wrote:
| There is even an r/LocalLLaMA/ subreddit.
| minimaxir wrote:
| Right now there are too many caveats to running even the 7B
| model via the workflows mentioned in the article.
|
| The big difference between it and Stable Diffusion, which caused
| the latter to go megaviral, is that a) SD can run on a typical
| GPU that gamers likely already have without hitting a perf
| ceiling and b) it can run easily on a free Colab GPU. Hugging
| Face transformers can run a 7B model on a T4 GPU with 8-bit
| loading, but that comes with its own caveats too.
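|
| As a rough illustration of that 8-bit loading path (a minimal
| sketch assuming the bitsandbytes integration in transformers;
| the checkpoint name below is just a placeholder):
|
|   from transformers import AutoModelForCausalLM, AutoTokenizer
|
|   name = "some-org/some-7b-model"  # placeholder, not a real repo
|   tokenizer = AutoTokenizer.from_pretrained(name)
|   model = AutoModelForCausalLM.from_pretrained(
|       name,
|       device_map="auto",  # spread layers across the T4 / CPU
|       load_in_8bit=True,  # needs bitsandbytes + accelerate
|   )
|   inputs = tokenizer("The robots will", return_tensors="pt").to(0)
|   out = model.generate(**inputs, max_new_tokens=32)
|   print(tokenizer.decode(out[0], skip_special_tokens=True))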
|
| There's a big difference between "can run" and "can run _well_".
| VQGAN + CLIP had a lot of friction too and that's partially why
| AI image generation didn't go megaviral then.
| bestcoder69 wrote:
| Then this is SD for Apple silicon users. 13B runs on my m1 air
| at 200-300 ms/token using llama.cpp. Outputs feel like the
| original GPT-3, unlike any of the competitors I've tried.
| Granted, these are non-scientific first impressions.
| j45 wrote:
| Agreed. For those who have been quietly sitting on a base
| Mac Studio or a reasonably capable Mac Mini, the possibilities
| have changed on some fronts, though GPT's extremely low API
| pricing remains a good option.
| aaomidi wrote:
| The difference is that ChatGPT is not privacy friendly.
| staticautomatic wrote:
| Is it still not privacy friendly on Azure?
| aaomidi wrote:
| Azure has access to your queries. Running locally really
| is the only way of having a privacy friendly LLM.
| [deleted]
| ddren wrote:
| They have recently merged support for x86. I get 230 ms/token
| on the 13B model on an 8-core 9900K under WSL2.
| [deleted]
| [deleted]
| simonw wrote:
| By caveats do you mean the licensing terms or the difficulty of
| prompting the model?
|
| Unless it's relicensed I don't expect LLaMA to be a long-term
| foundation model. But it's shown that yes, you can run a GPT-3
| class model on an M1 Mac with 8GB of RAM (or maybe 16GB for the
| 13B one?)
|
| I fully expect other models to follow, from other
| organizations, with better capabilities and more friendly
| licensing terms.
| zamnos wrote:
| But is anyone actually making money off of StableDiffusion?
| Maybe the shovel-sellers (runpod.io et al), but AFAIK no one is
| using it as the foundation for a revenue-generating company.
| I ask, because yes, technically, you can't get LLaMA legally
| unless you're a researcher and get it directly from Facebook.
| But that's not going to stop the faithful from finding a copy
| and working on it.
| simonw wrote:
| I believe Midjourney may have used bits of Stable Diffusion
| in their product, which is definitely profitable.
| logifail wrote:
| > is anyone actually making money off of StableDiffusion?
|
| We're all still waiting to hear about (non-shovel-selling)
| successes in this space.
| pmoriarty wrote:
| I don't know about Stable Diffusion in particular, but
| three examples of AI-generated art making money
| immediately spring to mind:
|
| 1 - some guy won hundreds of dollars in an art contest
| from AI generated art (and this made big news, so it
| should be easy to find)
|
| 2 - one person reported using midjourney's images as a
| starting point for images that wound up being used in a
| physical magazine
|
| 3 - another artist has used midjourney images that they
| modify to sell in all sorts of contexts (like background
| images on stock illustration sites)
|
| You'd probably find many other examples in midjourney's
| #in-the-world discord channel.
|
| I'd also be shocked if stock image sites, clipart sites
| and freelance design/illustration sites weren't already
| flooded with AI generated images that have been sold for
| money.
|
| That being said, because high quality AI-generated images
| are so easy to make, the value of images of all types is
| likely to plummet soon, if it hasn't already.
| minimaxir wrote:
| Ignoring the licensing issues, there are a few other
| constraints that make it harder for the model to go viral
| outside of developers who already spend a lot of time in this
| space:
|
| 1) Model weights are heavy for just experimentation, although
| quantizing them down to 4-bit might put them on par with SD
| FP16.
|
| 2) Requires extreme CLI shenanigans (and likely configuration
| since you have to run make) compared to just running a Colab
| Notebook or a .bat Windows Installer for the A1111 UI.
|
| 3) Hardware, again: an M1 Pro or an RTX 4090 is not super
| common among people who are just curious about text generation.
|
| 4) It is possible the extreme quantization could be affecting
| text output quality; although the examples are coherent for
| simple queries, more complex GPT-3-esque queries might become
| relatively incoherent. That matters particularly now that
| ChatGPT and its cheap API (timely!) are out, so even nontechies
| already have a strong baseline for good output. The viral
| moment for SD was that it was easy to use _and_ it was a
| significant quality leap over VQGAN + CLIP.
|
| I was going to say inference speed since that's usually
| another constraint for new LLMs but given the 61.41 ms/token
| cited for the 7B model in the repo/your GIF, that seems on
| par with the inference speed from OPT-6.7B FP16 in
| transformers on a T4.
|
| Some of these caveats are fixable, but even then I don't
| think LLaMA will have its Stable Diffusion moment.
| simonw wrote:
| The 4-bit quantized models are 4GB for 7B and 8GB for 13B.
|
| I'm not too worried about CLI shenanigans, because of what
| happened with whisper.cpp - it resulted in apps like
| https://goodsnooze.gumroad.com/l/macwhisper - wouldn't be
| at all surprised to see the same happen with llama.cpp
|
| A regular M1 with 8GB of RAM appears to be good enough to
| run that 7B model. I wonder at what point it will run on an
| iPhone... the Stable Diffusion model was 4GB when they
| first released it, and that runs on iOS now after some more
| optimization tricks.
|
| For me though, the "Stable Diffusion" moment isn't
| necessarily about the LLaMA model itself. It's not licensed
| for commercial use, so it won't see nearly the same level
| of things built on top of it.
|
| The key moment for me is that I've now personally seen a
| GPT-3 scale model running on my own personal laptop. I know
| it can be done! Now I just need to wait for the inevitable
| openly-licensed, instruction-tuned model that runs on the
| same hardware.
|
| It's that, but also the forthcoming explosion of developer
| innovation that a local model will unleash. llama.cpp is
| just the first hint of that.
| smoldesu wrote:
| > The key moment for me is that I've now personally seen
| a GPT-3 scale model running on my own personal laptop.
|
| I hate to pooh-pooh it for everyone, but this was
| possible before LLaMa. GPT-J-125m/6b have been around for
| a while, and are frankly easier to install and get
| results out of. The smaller pruned model even fits on an
| iPhone.
|
| The problem is more that these smaller models won't ever
| compete with GPT-scale APIs. Tomorrow's local LLaMa might
| beat yesterday's ChatGPT, but I think those optimistic
| for the democratization of chatbot intelligence are
| setting their hopes a bit high. LLaMa _really_ isn't breaking
| new ground.
| simonw wrote:
| I'm not particularly interested in beating ChatGPT: I'm
| looking for a "calculator for words" which I can use for
| things like summarization, term extraction, text
| rephrasing etc - maybe translation between languages too.
|
| There are all kinds of things I want to be able to do
| with a LLM that are a lot tighter than general chatbots.
|
| I'd love to see a demo of GPT-J on an iPhone!
| tracyhenry wrote:
| Another big difference is the quality of the results. I haven't
| tried it myself, but I've seen many complaints that it's
| nowhere near GPT-3 (at least for the 7B version). Correct me if
| I'm wrong!
| bestcoder69 wrote:
| 13B feels on-par with the base non-instruction davinci.
| People might not realize that it was a bit trickier to prompt
| GPT-3 when it was first released.
| simonw wrote:
| That doesn't bother me so much. GPT-3 had instruction tuning,
| which makes it MUCH easier to use.
|
| Now that I've seen that LLaMA can work I'm confident someone
| will release an openly licensed instruction-tuned model that
| works on the same hardware at some point soon.
|
| I also expect that there are prompt engineering tricks which
| can be used to get really great results out of LLaMA. I'm
| hoping someone will come up with a good prompt to get it to do
| summarization, for example.
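|
| One plausible shape for such a prompt (purely illustrative, a
| few-shot pattern for a base model rather than anything tested
| against LLaMA):
|
|   article = "..."  # text to summarize
|   prompt = (
|       "Article: The city council voted to expand the bike lane\n"
|       "network, citing rising commuter demand.\n"
|       "Summary: The council approved more bike lanes.\n\n"
|       f"Article: {article}\n"
|       "Summary:"
|   )
|   # feed `prompt` to the model and stop at the next blank line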
| sp332 wrote:
| ChatGPT had an estimated 20,000 hours of human feedback.
| That's not going to be easy to replicate in an open source
| way.
| jacooper wrote:
| Does anybody know how to run this on Linux with an AMD GPU?
|
| Also do I have to bother with their crappy driver module that
| doesn't support most GPUs?
| patricktlo wrote:
| That's amazing. Any chance of running it on my trusty GTX 1060
| 6GB, or is that not enough VRAM?
| [deleted]
| ilovefood wrote:
| This is really great, very good write-up.
|
| Seems it now also supports AVX2 for x86 architectures.
| https://twitter.com/ggerganov/status/1634588951821393922
| bilsbie wrote:
| How's it looking for a six year old MacBook?
|
| Not there yet?
|
| Does this still use the gpu?
| simonw wrote:
| I believe llama.cpp has been designed for at least an M1 - no
| idea if there are options for running LLaMA on older hardware.
| astrange wrote:
| It doesn't use CoreML so it should work on Intel machines at
| some speed.
|
| If it used the GPU/ANE and was a true large language model,
| then it would only work on M1 systems, because they have
| unified memory (which nothing except an A100 can match).
| Spiwux wrote:
| People have been running large language models locally for a
| while now. For now the general consensus is that LLaMA is not
| fundamentally better than local models with similar resource
| requirements, and in all the comparisons it falls short of an
| instruction-tuned model like ChatGPT.
| version_five wrote:
| But LLaMA is the most performant model with weights available
| in the wild.
|
| Personally, I hope we quickly get to the stage where there's a
| real open LLM, like SD is to DALL-E. It sucks to have to bother
| with Facebook's core model, and give it more attention than it
| deserves, just because it's out there.
|
| If Facebook had actually released it as an open model, I would
| have said that all the credit should go to them. But instead
| people are doing great open source work on top of their un-free
| model just because it's available, and in the popular
| conception they're going to get credit that they shouldn't.
| bestcoder69 wrote:
| What instruction tuned LLM is better?
| yunyu wrote:
| FLAN-UL2
| loufe wrote:
| I've been following LLaMa closely since release and I'm
| surprised to see the claim that it's "general consensus" that
| it isn't superior. I've seen benchmark and anecdotal evidence
| to the contrary. I'm not suggesting you're lying, but I am
| curious: can you point me to something you're reading?
| simonw wrote:
| My argument here is that this represents a tipping point.
|
| Prior to LLaMA + llama.cpp you could maybe run a large language
| model locally... if you had the right GPU rig, and if you
| really knew what you were doing, and were willing to put in a
| lot of effort to find and figure out how to run a model.
|
| My hunch is that the ability to run on a M1/M2 MacBook is going
| to open this up to a lot more people.
|
| (I'm exposing my bias here as a M2 Mac owner.)
|
| I think the race is now on to be the first organization to
| release a good instruction-tuned model that can run on personal
| hardware.
| stonerri wrote:
| As someone who just got the 7B running on a base MacBook
| M1/8GB, I strongly agree. The rate of tool development &
| prompt generation should see the same increase that Stable
| Diffusion did a few months (weeks?) ago.
|
| And given how early the cpp port is, there is likely plenty
| of performance headroom with more m1/m2-specific
| optimization.
| seydor wrote:
| I wonder why we don't have external "neural processing" devices
| like we once had soundcards. Is anyone working on hardware
| implementations of transformers?
|
| Kudos to Yann LeCun for getting his revenge for Galactica.
| jhrmnn wrote:
| https://en.wikipedia.org/wiki/Tensor_Processing_Unit
| seydor wrote:
| But those are not for sale, and not transformer-specific.
| There must be some optimizations that can be done in hardware,
| and transformers are several years old now.
| ruuda wrote:
| You likely already bought one.
|
| https://blog.google/products/pixel/introducing-google-
| tensor...
|
| https://apple.fandom.com/wiki/Neural_Engine
| jhrmnn wrote:
| Computation-wise, transformers are really just a bunch of
| matrix multiplications, nothing more to it. (Which is
| partially why they're so efficient and scalable.) Also,
| Nvidia's GPU architectures are moving in the TPU direction
| (https://www.nvidia.com/en-us/data-center/tensor-cores/).
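|
| For intuition, a single attention block really is just a few
| matmuls plus a softmax (a minimal numpy sketch, ignoring
| multi-head splitting, masking and the MLP):
|
|   import numpy as np
|
|   def softmax(x):
|       e = np.exp(x - x.max(axis=-1, keepdims=True))
|       return e / e.sum(axis=-1, keepdims=True)
|
|   seq_len, d_model = 8, 64
|   x = np.random.randn(seq_len, d_model)       # token embeddings
|   Wq, Wk, Wv = [np.random.randn(d_model, d_model) for _ in range(3)]
|
|   Q, K, V = x @ Wq, x @ Wk, x @ Wv             # three matmuls
|   attn = softmax(Q @ K.T / np.sqrt(d_model))   # one more matmul
|   out = attn @ V                               # and one more
|   print(out.shape)                             # (8, 64)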
| zenogantner wrote:
| > wonder why we don't have external "neural processing" devices
| like we once had soundcards.
|
| Some video cards/GPUs have become just that, becoming more and
| more geared towards non-graphics workloads ...
| valine wrote:
| Nvidia A100 is exactly that. It has lower CUDA performance than
| an RTX 4090, and is almost entirely geared toward ML workloads.
| rvz wrote:
| There you go - very unsurprising to see that happen so quickly,
| unless you have an Apple Silicon machine and want to download
| the model to try it yourself.
|
| I still think that open source LLMs have to be much smaller
| than 200GB and much better than ChatGPT to be more accessible
| and highly disruptive to OpenAI.
|
| It is a much-needed happy accident, thanks to Meta. For now one
| can run it as a service and offer it as a SaaS rather than
| depend fully on OpenAI. Open source (or even free, binary-only)
| LLMs will eventually disrupt OpenAI's business plans.
| simonw wrote:
| The 4-bit quantized version of LLaMA 7B used by llama.cpp is a
| 4GB file. The 13B model is under 8GB.
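|
| Those numbers line up with simple arithmetic (a rough sketch;
| the exact per-block overhead of the quantization format is an
| assumption here):
|
|   # 4-bit weights are half a byte each, plus scaling overhead
|   for params in (7e9, 13e9):
|       raw_gb = params * 0.5 / 1e9
|       print(f"{params / 1e9:.0f}B: ~{raw_gb:.1f} GB of raw weights")
|   # 7B:  ~3.5 GB raw  -> ~4 GB on disk
|   # 13B: ~6.5 GB raw  -> a bit under 8 GB on disk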
| Mathnerd314 wrote:
| > This all changed yesterday, thanks to the combination of
| Facebook's LLaMA model and llama.cpp by Georgi Gerganov.
|
| George Hotz was so confident that he was riding the wave with his
| Python implementation:
| https://github.com/geohot/tinygrad/blob/master/examples/llam....
| But I guess not, pure C++ seems better.
| quotemstr wrote:
| Isn't the win more the four-bit quantization than the choice of
| C++ as an orchestrator? It's not as if, in either the C++ or
| the Python case, the high-level code is actually doing the
| matrix multiplications.
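|
| For reference, the core idea of 4-bit quantization is simple (a
| toy blockwise sketch, not llama.cpp's actual format):
|
|   import numpy as np
|
|   def quantize_q4(w, block=32):
|       w = w.reshape(-1, block)
|       scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
|       q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
|       return q, scale       # 4-bit codes + one scale per block
|
|   def dequantize_q4(q, scale):
|       return (q * scale).reshape(-1)
|
|   w = np.random.randn(1024).astype(np.float32)
|   q, s = quantize_q4(w)
|   print(np.abs(w - dequantize_q4(q, s)).mean())  # small error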
|
| That basically the whole AI revolution is powered by CPython of
| all things (not even PyPy) is the 100 megaton nuke that should
| end language warring forever.
|
| That the first AGI will likely be running under a VM so
| inefficient that even integers are reference counted is God
| laughing in the face of all the people who've spent the past
| decades arguing that this language or that language is
| "faster". Amdahl was right: only inner loops matter.
| minimaxir wrote:
| > That basically the whole AI revolution is powered by
| CPython of all things (not even PyPy) is the 100 megaton nuke
| that should end language warring forever.
|
| And a lot of new AI tooling, such as tokenization, has been
| developed for Python using Rust (pyo3).
| camjohnson26 wrote:
| Are there any online communities running these models on non-
| professional hardware? I keep running into issues with poor
| documentation or outdated scripts with GPT-NeoX, BLOOM, and
| even Stable Diffusion 2. It seems like most of the support is
| either for professionals with clusters of A100s, or consumers
| who aren't using code. I have three 16GB Quadro GPUs, but
| getting this stuff running on them has been surprisingly
| difficult.
| moyix wrote:
| There's a group of folks on 4chan doing this on gaming class
| hardware (4080s etc). They have a doc here:
| https://rentry.org/llama-tard-v2
| BaculumMeumEst wrote:
| Would I have better luck with a GTX 1070 with 8GB of VRAM or a
| MacBook M1 Pro with 16GB of RAM?
| techstrategist wrote:
| M1 Pro for sure
| rahimnathwani wrote:
| The latter.
___________________________________________________________________
(page generated 2023-03-11 23:00 UTC)