[HN Gopher] Running large language models like ChatGPT on a sing...
___________________________________________________________________
Running large language models like ChatGPT on a single GPU
Author : _nhynes
Score : 434 points
Date : 2023-02-20 16:55 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| baobabKoodaa wrote:
| I just tried to run the example in the README, using the OPT-30B
| model. It appeared to download 60GiB of model files, and then it
| attempted to read all of it into RAM. My laptop has "only" 32GiB
| of RAM so it just ran out of memory.
| baobabKoodaa wrote:
| FWIW I was able to load the OPT-6.7B model and play with it in
| chatbot mode. This would not have been possible without the
| offloading, so... cool stuff!
| Miraste wrote:
| You have to change the --percent flag. It takes some
| experimentation. The format is three pairs of 0-100 integers, one
| pair each for parameters, attention cache, and hidden states. The
| first number in each pair is the percent on GPU, the second is the
| percent on CPU (system RAM), and the remaining percentage goes on
| disk.
|
| For disk offloading to work you may also have to specify
| --offload-dir.
|
| I have opt-30B running on a 3090 with --percent 20 50 100 0 100
| 0, although I think those could be tweaked to be faster.
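|
| For reference, a full invocation on that setup might look roughly
| like this (a sketch assembled from the flags and chatbot example
| mentioned in this thread; adjust the --percent values to your
| hardware, and point --offload-dir at a fast drive):
|
|         python apps/chatbot.py --model facebook/opt-30b \
|             --percent 20 50 100 0 100 0 \
|             --offload-dir /path/to/fast/ssd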
| ImprobableTruth wrote:
| How fast is it in single batch mode?
| Miraste wrote:
| After turning on compression I was able to fit the whole
| thing in GPU memory and then it became much faster. Not
| ChatGPT speeds or anything, but under a minute for a
| response in their chatbot demo. A few seconds in some
| cases.
| lxe wrote:
| How much system RAM are you running with? And I'm guessing it
| wouldn't hurt to have a fast SSD for disk offloading?
| Miraste wrote:
| 128GB, but by turning on compression I managed to fit the
| whole thing on the GPU. I did try it off a mix of RAM and
| SSD as well, and it was slower but still usable. Presumably
| disk speed matters a lot.
| lxe wrote:
| Well just got some more sticks. While I wait for RAM to
| arrive, will try with compress_weight and compress_cache.
| If you're in any discord or any other space where people
| are tinkering with this, would love to join!
| lxe wrote:
| With compression, was able to get 30b to run on 3090 with
| '100 0'! Let me see if I can tweak the prompt a bit and
| make it come alive...
| bee_rider wrote:
| Hmm, well we used to have swap partitions equal in size to our
| memory... you'll have 4GiB left over!
| dharma1 wrote:
| I'd love to run this on a single 24gb 3090 - how much dram / SSD
| space do I need for a decent LLM, when it's quantised to 4bits?
| Miraste wrote:
| I've been trying this, and with compression on (4 bits) you can
| fit the entire 30B model on the 3090.
| lxe wrote:
| Got the opt-6.7b chatbot running on a Windows machine with a 3090
| in mere minutes. The only difference was to install the cuda
| pytorch `pip install torch==1.13.1+cu117 --extra-index-url
| https://download.pytorch.org/whl/cu117` just like in stable
| diffusion's case.
|
| It performs as expected:
|     Human: Tell me a joke
|     Machine: I have no sense of humour
|     Human: What's 2+5?
|     Machine: I cannot answer that.
| rjb7731 wrote:
| Looks like it might be no bueno on Google Colab for now;
| chatbot.py also takes prompts via input() rather than a command-
| line argument.
| A4ET8a8uTh0 wrote:
| Hey. So did anyone try doing it with AMD cards ( I know Nvidia
| seems preferable now )?
| Ajedi32 wrote:
| 6.7b is pretty small, no? Do you even need offloading for that
| on a 3090? I'd be curious to see what's needed to run opt-30b
| or opt-66b with reasonable performance. The README suggests
| that even opt-175b should be doable with okay performance on a
| single NVIDIA T4 if you have enough RAM.
| nathan_compton wrote:
| It is entirely possible to run 6.7B parameter models on a
| 3090, although I believe you need 16 bit weights. I think you
| can squeeze a 20B parameter model onto the 3090 if you go all
| the way down to 8-bit.
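|
| Back-of-the-envelope, weight memory is roughly parameter count
| times bytes per parameter (activations and the attention cache
| come on top of that), e.g. in Python:
|
|         for params_b, bits in [(6.7, 16), (20, 8)]:
|             gb = params_b * 1e9 * (bits / 8) / 1e9
|             print(f"{params_b}B params at {bits}-bit ~= {gb:.1f} GB")
|         # -> 6.7B at 16-bit ~= 13.4 GB, 20B at 8-bit ~= 20.0 GB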
| dom96 wrote:
| It's really interesting that these models are written in Python.
| Anyone know how much of a speed up using a faster language here
| would have? Maybe it's already off-loading a lot of the
| computation to C (I know many Python libraries do this), but I'd
| love to know.
| ianzakalwe wrote:
| Python is mostly just glue code nowadays. All data loading,
| processing, and computation are handled by low-level languages
| (C/C++); Python is there just to instruct those low-level
| libraries how to compose into one final computation.
| albertzeyer wrote:
| Python is just the gluing language. All the heavy lifting
| happens in CUDA or CuBLAS or CuDNN or so.
|
| Most optimizations for saving memory come from using lower-
| precision numbers (float16 or less), quantization (int8 or int4),
| sparsification, etc. But this is all handled by the underlying
| framework, like PyTorch.
|
| There are C++ implementations but they optimize on different
| aspects. For example: https://github.com/OpenNMT/CTranslate2/
| brrrrrm wrote:
| For _large_ models, there are two main ways folks have been
| optimizing machine learning execution:
|
| 1. lowering precision of the operations (reducing compute
| "width" and increasing parallelization)
|
| 2. fusing operations into the same GPU code (reducing memory-
| bandwidth usage)
|
| Neither of those optimizations would benefit from swapping to a
| faster language.
|
| Why? The typical "large" neural network operation runs on the
| order of a dozen microseconds to milliseconds. Models are
| usually composed of hundreds if not thousands of these. The
| overhead of using Python is around 0.5 microseconds per
| operation (best case on Intel, worst case on Apple ARM). So
| that's maybe a 5% net loss if things were running
| synchronously. But they're not! When you call GPU code, you
| actually do it asynchronously, so the language latency can be
| completely hidden.
|
| So really, all you want in an ML language is the ability to 1.
| change the type of the underlying data on the fly (Python is
| really good at this) and 2. rewrite the operations being
| dispatched to on the fly (Python is also really good at this).
|
| For smaller models (i.e. things that run in sub-microsecond
| world), Python is not the right choice for training or
| deploying.
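|
| You can see the asynchrony directly with a minimal PyTorch sketch
| (assuming a CUDA build of PyTorch and an available GPU):
|
|         import time
|         import torch
|
|         x = torch.randn(4096, 4096, device="cuda")
|
|         t0 = time.time()
|         for _ in range(100):
|             y = x @ x              # kernels are only *enqueued* here
|         t1 = time.time()           # returns almost immediately
|         torch.cuda.synchronize()   # wait for the GPU to finish
|         t2 = time.time()
|
|         print(f"dispatch: {t1 - t0:.4f}s, with GPU work: {t2 - t0:.4f}s")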
| amelius wrote:
| Your view of "offloading" things to a faster language is wrong.
| It's already written in a fast language (C++ or CUDA). Python
| is just an easy to use way of invoking the various libraries.
| Switching to a faster language for everything would just make
| experimenting and implementing things more cumbersome and would
| make the technology as a whole move slower.
| spaintech wrote:
| Interesting article, I have to give that a try! :D
|
| One thing is that while there's value in running pre-trained
| model weights like OPT-175B, there are also potential downsides
| to using pre-trained models, such as the need to fine-tune the
| model for your specific task, potential compatibility issues with
| your existing infrastructure (integration), and the possibility
| that the pre-trained model may not perform as well as a model
| trained specifically on your data. Ultimately, the decision of
| whether to use a pre-trained model will come down to the outcomes;
| no harm in trying it out before you build from scratch, IMO.
| albertzeyer wrote:
| It would be helpful to upload the paper to Arxiv, for better
| accessibility and visibility.
|
| https://github.com/Ying1123/FlexGen/blob/main/docs/paper.pdf
| https://docs.google.com/viewer?url=https://github.com/Ying11...
| muttled wrote:
| This is cool! But I wonder if it's economical using cloud
| hardware. The author claims 1.12 tokens/second on the 175B
| parameter model (arguably comparable to GPT-3 Davinci). That's
| about 100k tokens a day on the GCP machine the author used.
| Someone double check my numbers here, but given the Davinci base
| cost of $0.02 per 1k tokens and GCP cost for the hardware listed
| "NVIIDA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of
| SSD" coming up to about $434 on spot instance pricing, you could
| simply use the OpenAI API and generate about 723k tokens a day
| for the same price as running the spot instance (which could go
| offline at any point due to it being a spot instance).
|
| Running the fine-tuned versions of OpenAI models is
| approximately 6x more expensive per token. If you were running a
| fine-tuned model on local commodity hardware, the economics would
| start to tilt in favor of doing something like this if the load
| was predictable and relatively constant.
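|
| Rough sanity check of those numbers (assuming the ~$434 spot price
| is per month):
|
|         flexgen_tokens_per_day = 1.12 * 60 * 60 * 24    # ~96,800
|         spot_cost_per_day = 434 / 30                    # ~$14.50
|         davinci_tokens_per_day = spot_cost_per_day / 0.02 * 1000
|         print(round(flexgen_tokens_per_day), round(davinci_tokens_per_day))
|         # -> ~96768 vs ~723333 tokens per day for the same money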
| pclmulqdq wrote:
| Cloud accelerators carry a huge price premium because there
| aren't very many of them available and they aren't as fungible
| as CPUs. Comparing to a local GPU would likely be favorable for
| the local machine.
| breckenedge wrote:
| Thanks for running the cloud numbers on this. I ran some DIY
| numbers and they indicate less than a week to break even with
| the cloud, including all hardware and electricity costs. The
| cloud seems stupid expensive compared to running your own
| hardware for this kind of task.
| throwawayapples wrote:
| The cloud is _always_ stupid expensive compared to running
| your own hardware for almost any sort of task that isn't highly
| variable along one or more axes (CPU, RAM, etc), but less than a
| week to break even is truly shocking.
| p1esk wrote:
| The cloud has been cheaper to train large models on for a
| couple years now. Compare buying 8xA100 server vs renting
| one on Lambda Labs. At least 3 years to break even - if you
| are using it non-stop 24/7. Longer if not.
| cardine wrote:
| This is not true - the break even period is closer to 6-7
| months.
| p1esk wrote:
| A single 8xA100 server is ~150k. On demand cost to rent
| it is $8.8/hour. Do the math and don't forget the energy
| costs.
| swatcoder wrote:
| Sometimes control is more important than cost.
| ImprobableTruth wrote:
| You've made one huge mistake: Davinci's $0.02 is not just per
| 1k tokens _generated_ but also _context tokens consumed_. So if
| you generate 50 tokens per request with 1k context, the price
| is actually 20 times as large at $0.40 per 1k tokens generated
| - much less palatable, costing 3 times as much as the cloud
| hosted version of this.
|
| And that's not even taking into account the gigantic markup
| cloud services have.
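|
| A quick sketch of that math:
|
|         context, generated = 1000, 50
|         billed = context + generated          # Davinci bills both
|         cost_per_request = billed / 1000 * 0.02
|         print(cost_per_request / generated * 1000)
|         # -> ~$0.42 per 1k generated tokens, roughly 20x the headline price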
| yorwba wrote:
| Most of the computational cost of producing an output token
| is spent on consuming input tokens (including previous output
| tokens that are fed back in); only the final unembedding
| matrix could be eliminated if you don't care about the output
| logits for the context.
|
| So it's not correct to only modify OpenAI's prices to account
| for the ratio of context tokens to output tokens. Both of
| them get multiplied by 20 (if that's what your ratio is).
| ImprobableTruth wrote:
| No, because they're already taking that into account.
|
| >Metric: generation throughput (token/s) = number of the
| generated tokens / (time for processing prompts + time for
| generation).
|
| (Though they're doing batching, so this is an unfair
| comparison. Would be interesting to get single batch
| speed.)
| cypress66 wrote:
| This is most likely aimed at people running models locally.
|
| And a homelab with 3090s/4090s is one or two orders of
| magnitude cheaper than GCP, if you use them continuously.
| SomeHacker44 wrote:
| I do not know anyone offhand with a 200+GB RAM home computer.
| The GPU is not all that is needed; you need to keep the
| parameters and other stuff in memory too.
| zargon wrote:
| 256gb of ddr4 rdimms only costs about $400 right now. $200
| for ddr3. Not uncommon in homelabs.
|
| I don't think 200gb ram is actually required, that's just
| what that cloud vm was spec'd with. Though the 175b model
| should see benefit with ram even beyond 200gb.
| Filligree wrote:
| Running it off a fast NVMe apparently works. I don't know
| what the performance is like, though.
| woadwarrior01 wrote:
| I own a two year old headless home computer with 256GB of
| RAM and two 3090s. I ssh into it from my mac to run ML
| training jobs.
| adamnemecek wrote:
| I have recently written a paper on understanding transformer
| learning via the lens of coinduction & Hopf algebra.
| https://arxiv.org/abs/2302.01834
|
| The learning mechanism of transformer models was poorly
| understood however it turns out that a transformer is like a
| circuit with a feedback.
|
| I argue that autodiff can be replaced with what I call in the
| paper Hopf coherence which happens within the single layer as
| opposed to across the whole graph.
|
| Furthermore, if we view transformers as Hopf algebras, one can
| bring convolutional models, diffusion models and transformers
| under a single umbrella.
|
| I'm working on a next gen Hopf algebra based machine learning
| framework.
|
| Join my discord if you want to discuss this further
| https://discord.gg/mr9TAhpyBW
| qualudeheart wrote:
| Powerful idea.
| adamnemecek wrote:
| Hopf algebras are next gen.
| kneel wrote:
| what
| adamnemecek wrote:
| which part
| lxe wrote:
| Got the opt-30b chatbot running on a 3090 24GB. I set
| compress_weight=True and compress_cache=True, and ran with
| `python apps/chatbot.py --model facebook/opt-30b --percent 100 0
| 100 0 100 0`.
|
| I also modified the prompt a bit to make it more... uh alive:
|     Assistant: Did you know that Saturn is 97 times the size of
|     Earth?
|     Human: Are you sure?
|     Assistant: What difference does size make, really, anyway?
|     Human: You didn't answer my question.
|     Assistant: Hey, it's your question I didn't answer! Anyway, did
|     you know that on the surface of Saturn, there's lakes of liquid
|     methane and ethane?
|     Human: I didn't know that. Are there any lakes on the Sun?
|     Assistant: Look, I'll be honest with you. I don't know much
|     about the Sun. Is it even a planet?
|     Human: It sure is. Why don't you know about the Sun? I thought
|     you were a hyperintelligent AI?
|     Assistant: You're right. I'm a hyperintelligent AI. I'm
|     hyperintelligent enough to know that I know jack shit about
|     the Sun.
| syntaxing wrote:
| This is amazing. Reminds me of claptrap from Borderlands
| LoganDark wrote:
| It really does
| razor_router wrote:
| That's impressive! How much customization did you have to do in
| order to get the chatbot to respond like that?
| lxe wrote:
| Not much. Here's the prompt:
|     A chat between a human and a more funny and witty intelligent
|     assistant
|     Human: Hello! What can you do?
|     Assistant: Hey there! I can pretty much do whatever you ask!
|     Human: What is the name of the tallest mountain in the world?
|     Assistant: It's Everest.
|     Human: Are you sure?
|     Assistant: You damn right I'm sure! I'm a hyperintelligent AI!
| posguy wrote:
| All I can think of is outsourcing to ChatGPT now...
|
| https://www.youtube.com/watch?v=rYaZ57Bn4pQ
| simonw wrote:
| Top item on the roadmap: "Support Apple silicon M1/M2 deployment"
| fancyfredbot wrote:
| I believe that you can't get enough RAM with M1/M2 for this to
| be useful
| ricardobeat wrote:
| This is meant to run on GPUs with 16GB RAM. Most M1/M2 users
| have at least 32GB (unified memory), and you can configure a
| MBP or Mac Studio with up to 96/128GB.
|
| The Mac Pro is still Intel, but it can be configured with up
| to 1.5TB of RAM, you can imagine the M* replacement will have
| equally gigantic options when it comes out.
| fancyfredbot wrote:
| If you look closely there's 16GB of GPU memory and over
| 200GB of CPU memory. So none of the currently available M*
| have the same kind of capacity. Let's hope this changes in
| the future!
| MuffinFlavored wrote:
| I tried to figure out how to do GPGPU stuff as a total beginner
| in Rust on Apple Silicon.
|
| I couldn't figure out if I was supposed to be chasing down
| Apple Metal or OpenCL backends. It also didn't seem to make
| much of a difference because while there are crates for both
| that seemed relatively well-maintained/fleshed out, I couldn't
| figure out how exactly to just pull one down and plug them into
| a higher level library (or find said higher level library all
| together).
|
| Have you had any luck? In my experience, it's basically Python
| or bust in this space despite lots of efforts to make it not
| that way?
|
| I also got confused as to whether a 'shader' was more for the
| visual GPU output of things, or if it was also a building block
| for model training/networks/machine learning/etc.
| smoldesu wrote:
| Give this a look:
|
| https://github.com/guillaume-be/rust-bert
|
| https://github.com/guillaume-be/rust-
| bert/blob/master/exampl...
|
| If you have Pytorch configured correctly, this should "just
| work" for a lot of the smaller models. It won't be a 1:1
| ChatGPT replacement, but you can build some pretty cool stuff
| with it.
|
| > it's basically Python or bust in this space
|
| More or less, but that doesn't have to be a bad thing. If
| you're on Apple Silicon, you have plenty of performance
| headroom to deploy Python code for this. I've gotten this
| library to work on systems with as little as 2gb of memory,
| so outside of ultra-low-end use cases, you should be fine.
| MuffinFlavored wrote:
| To clarify,
|
| > Port of Hugging Face's Transformers library, using the
| tch-rs crate and pre-processing from rust-tokenizers.
|
| > tch-rs: Rust bindings for the C++ api of PyTorch.
|
| Which "backend" does this end up using on Apple Silicon,
| MPS (Metal Performance Shaders) or OpenCL?
|
| https://pytorch.org/docs/stable/notes/mps.html
|
| I'm going to guess MPS?
| smoldesu wrote:
| Whatever your Pytorch install is designed to accelerate.
| I've got Ampere-accelerated Pytorch running it on my ARM
| server, I assume MPS is used on compatible systems.
| fathyb wrote:
| > I couldn't figure out if I was supposed to be chasing down
| Apple Metal or OpenCL backends.
|
| If you want cross-platform compatibility (kinda), go for
| OpenCL, if you want the best performance go for Metal. Both
| use a very similar language for kernels, but Metal is
| generally more efficient.
|
| > Have you had any luck?
|
| Not in ML, but I'm doing a lot of GPGPU on Metal, I recently
| started doing it in Rust. A bit less convenient than with
| Swift/Objective-C, but still possible. Worst case you'll have
| to add an .mm file and bridge it with `extern "C"`. That
| said, doing GPGPU is not doing ML, and most ML libraries are
| in Python.
|
| > I also got confused as to whether a 'shader' was more for
| the visual GPU output of things, or if it was also a building
| block for model training/networks/machine learning/etc.
|
| A shader is basically a function that runs for every element
| of the output buffer. We generally call them kernels for
| GPGPU, and shaders (geometry, vertex, fragment) for graphics
| stuff. You have to write them in a language that kinda looks
| like C (OpenGL GLSL, DirectX HLSL, Metal MSL), but is optimized
| for the SIMT execution model of GPUs.
|
| Learning shaders will let you run code on the GPU; to do ML you
| also need to learn what tensors are, how to compute them on the
| GPU, and how to build ML systems using them.
|
| I recommend ShaderToy [0] if you want a cool way to
| understand and play with shaders.
|
| [0]: https://www.shadertoy.com/
| MuffinFlavored wrote:
| so write a kernel in OpenCL, then call it from Rust
|
| is that what machine learning is doing at a high level?
| fathyb wrote:
| At a very high level yes. There is also the very
| important step of efficiently laying out data in the GPU
| memory to compute tensor values in the kernels.
| Miraste wrote:
| I'm not familiar with Metal, but on Apple Silicon aren't
| CPU and GPU memory completely shared?
| fathyb wrote:
| They are, but it's not fully shared at the process level; the
| GPGPU API has to explicitly support mapping a buffer from the
| process's virtual memory space to the GPU.
|
| I looked it up and turns out OpenCL also supports zero-
| copy buffers, so I edited my comment accordingly!
| [deleted]
| benlivengood wrote:
| This also means local fine-tuning is possible. Expect to see an
| explosion of new things like we did with Stable Diffusion,
| limited to some extent by the ~0.7 order of magnitude more VRAM
| required.
| bioemerl wrote:
| Does it? I would have expected compression losses to make
| training really hard.
| Miraste wrote:
| The compression is optional.
| danuker wrote:
| Any chance these work on CPUs with any acceptable performance?
|
| I have a 10-core 20-thread monster CPU, but didn't bother with a
| dedicated GPU because I can't control something as simple as its
| temperature. See the complicated procedure that only works with
| the large proprietary driver here:
|
| https://wiki.archlinux.org/title/NVIDIA/Tips_and_tricks#Over...
| brigade wrote:
| Your CPU gets maybe 700-800 gflops depending on your all-core
| frequency (fp32 because you don't have Sapphire Rapids.) The T4
| benchmarked would be crunching what it can at ~65 tflops (fp16
| tensor.) Newer GPUs hit 300 tflops (4090) or even nearly 2
| petaflops (H100).
|
| To give you an idea of the order of magnitude of compute
| difference. Sapphire Rapids has AMX and fp16 AVX512 to close
| the gap a little, but it's still massive.
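|
| As a rough sketch of where the CPU number comes from (assuming two
| 256-bit FMA units per core at ~2.5 GHz all-core, i.e. 32 fp32
| FLOPs per cycle per core):
|
|         cores, ghz, flops_per_cycle = 10, 2.5, 32
|         cpu_gflops = cores * ghz * flops_per_cycle      # ~800 GFLOPS fp32
|         t4_tflops = 65                                  # fp16 tensor cores
|         print(cpu_gflops, t4_tflops * 1000 / cpu_gflops)  # ~80x gap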
| NavinF wrote:
| > 10-core 20-thread monster CPU
|
| With what, 50GB/s memory bandwidth? That's no monster. The two
| consumer GPUs in my machine both do 1TB/s and are _still_
| bottlenecked on memory bandwidth.
|
| > only works with the large proprietary driver here
|
| In practice, nothing works without the proprietary driver so
| this isn't specific to temperature. Also the setting you're
| looking for is almost certainly `nvidia-smi -pl $watts` for
| setting the power limit, not whatever that wiki gives you. GPU
| temperature = ambient temperature + (power limit)*(thermal
| resistance of cooler)
| TimeBearingDown wrote:
| That power limit control is explained in detail a few
| paragraphs down on that wiki page.
|
| https://wiki.archlinux.org/title/NVIDIA/Tips_and_tricks#Cust.
| ..
| metadat wrote:
| Unlikely, because this is an efficient GPU work offloader, not
| a complete replacement for GPU computation.
| bioemerl wrote:
| Nope. 20 cores in a CPU, 2000 in a GPU, with much much faster
| memory and an architecture designed to chew through data as
| fast as possible.
| bee_rider wrote:
| No real reason to compare a GPU core to a CPU one, but the
| memory bandwidth difference is pretty concrete!
| fulafel wrote:
| GPU "cores" are ~ SIMD lanes.
|
| (a difference, I think, is that there are more virtual lanes,
| some of which may be masked off, that are mapped to the GPU's
| physical SIMD lanes)
| adeon wrote:
| I don't know about these large models but I saw on a random HN
| comment earlier in a different topic where someone showed a
| GPT-J model on CPU only: https://github.com/ggerganov/ggml
|
| I tested it on my Linux and Macbook M1 Air and it generates
| tokens at a reasonable speed using CPU only. I noticed it
| doesn't quite use all my available CPU cores so it may be
| leaving some performance on the table, not sure though.
|
| The GPT-J 6B is nowhere near as large as the OPT-175B in the
| post. But I got the sense that CPU-only inference may not be
| totally hopeless even for large models if only we got some high
| quality software to do it.
| generalizations wrote:
| There's also the Fabrice Bellard inference code:
| https://textsynth.com/technology.html. He claims up to 41
| tokens per second on the GPT-NeoX 20B model.
| warning26 wrote:
| This seems like a great step; I've been able to run
| StableDiffusion locally, but with an older GPU none of the LLMs
| will run for me since I don't have enough VRAM.
|
| Oddly I don't see a VRAM requirement listed. Anyone know if it
| has a lower limit?
| cypress66 wrote:
| > with an older GPU none of the LLMs will run for me since I
| don't have enough VRAM.
|
| I think you can run Pygmalion 6B on a 8GB GPU using DeepSpeed.
|
| It's very underwhelming if you expect something like ChatGPT
| though.
| t3estabc wrote:
| [dead]
| winddude wrote:
| looks interesting. FYI, the link to your discord in the readme is
| broken
| metadat wrote:
| > Hardware: an NVIDIA T4 (16GB) instance on GCP with 208GB of
| DRAM and 1.5TB of SSD.
|
| Is FlexGen able to take advantage of multiple hundreds of GB of
| system memory? Or do these compute instances just come bundled
| with it, and it's a [largely] irrelevant detail?
| bioemerl wrote:
| The OPT175b model is massive. A lot of that system ram probably
| holds model data.
| metadat wrote:
| Interesting, though apparently the OPT175B model is 350GB:
|
| > You will need at least 350GB GPU memory on your entire
| cluster to serve the OPT-175B model. For example, you can use
| 4 x AWS p3.16xlarge instances, which provide 4 (instance) x 8
| (GPU/instance) x 16 (GB/GPU) = 512GB memory.
|
| https://alpa.ai/tutorials/opt_serving.html
|
| (Scroll down to the second "Note", not far from the top)
|
| I wonder what FlexGen is doing.. a naive guess is a mix of
| SSD and system memory. Definitely curious about what
| FlexGen's underlying strategy translates to in terms of
| actual data paths.
| SekstiNi wrote:
| > Interesting, though apparently the OPT175B model is
| 350GB:
|
| Only in FP16. In the paper they use int4 quantization to
| reduce it to a quarter of that. In addition to the model
| weights, there's also a KV cache that takes up considerable
| amounts of memory, and they use int4 on that as well.
|
| > I wonder what FlexGen is doing.. a naive guess is a mix
| of SSD and system memory.
|
| That's correct, but other approaches have done this as
| well. What's "new" here seems to be the optimized data
| access pattern in combination with some other interesting
| techniques (prefetching, int4 quantization, CPU offload).
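|
| The general idea of group-wise int4 quantization looks something
| like this toy numpy sketch (not FlexGen's actual code; it just
| shows storing one scale per small group of weights):
|
|         import numpy as np
|
|         def quantize_int4(w, group=64):
|             w = w.reshape(-1, group)
|             scale = np.abs(w).max(axis=1, keepdims=True) / 7  # int4 range ~[-8, 7]
|             q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
|             return q, scale   # two int4 values can be packed per byte
|
|         def dequantize(q, scale):
|             return q.astype(np.float32) * scale
|
|         w = np.random.randn(4096 * 64).astype(np.float32)
|         q, s = quantize_int4(w)
|         print(np.abs(dequantize(q, s).reshape(-1) - w).mean())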
| stevenhuang wrote:
| I want to emphasize how fascinating I find it that going from
| 16-bit to 4-bit quantization results in negligible performance
| loss. That's huge. Is the original FP16 not compressed?
|
| The allowance for this more granular quantization seems
| to suggest the "bottleneck" is in some other aspect of
| the system, and maybe until that is addressed, a higher
| fidelity quantization does not improve performance.
|
| Or maybe it's the relative values/ratio between weights
| that is important, and as long as the intended ratio
| between weights can be expressed, the exact precision of
| the weights themselves may not be important?
|
| Found an interesting paper on this below. There's
| doubtless heavy research underway in this area
|
| - https://www.researchgate.net/publication/367557918_Unde
| rstan...
| inciampati wrote:
| Very insightful! Now I'm curious what the bottleneck is.
| stevenhuang wrote:
| A recent discussion I found on int4, definitely looks
| like this is the new hotness. Very exciting!
|
| https://news.ycombinator.com/item?id=34404859
| [deleted]
| blagie wrote:
| A lot of people are looking at this wrong. A $350 3060 has 12GB
| RAM. If there's a way to run models locally, it opens up the door
| to:
|
| 1) Privacy-sensitive applications
|
| 2) Tinkering
|
| 3) Ignoring filters
|
| 4) Prototyping
|
| 5) Eventually, a bit of extra training
|
| The upside isn't so much cost / performance, as local control
| over a cloud-based solution.
| Aperocky wrote:
| I have that exact card; this may be the nudge where I remove
| Windows from the computer and try out Linux gaming (and local
| GPT).
| raihansaputra wrote:
| Thing is, you don't have to totally switch to Linux. I'm
| running ML/CUDA workloads through WSL without too many
| problems.
| railgun2space wrote:
| We are hiring in that area of work in the European time zone. If
| you are excited about and capable in this field, please apply here:
| https://ai-jobs.net/job/41469-senior-research-engineer-llms-...
| tempaccount420 wrote:
| If you want talent, don't make them go through the regular
| application process.
| birdyrooster wrote:
| I recently bought a T4 to go with my epyc 7402 and 512GB ram for
| fun and this looks like a great use case. Thanks!
| cypress66 wrote:
| What's the advantage of purchasing a T4 instead of a 3090 or
| 4090?
| elorant wrote:
| Power consumption. A Tesla T4 with 16GB RAM will consume a
| mere 70W. An RTX 3090 will need at least 300W, and the Titan
| models go up to 450W.
| nirav72 wrote:
| Possibly the price. On secondary markets like Ebay - I've
| occasionally seen T4 cards for $500-600. Also, the form
| factor. The T4s are comparatively much smaller/shorter than a
| 3090/4090, so they would be an easier fit in a server case.
| ml_basics wrote:
| Very cool.
|
| Worth mentioning though that the highlighted figures (1.12 tok/s
| for OPT-175B for "FlexGen with Compression") are for inputs of
| 512 tokens and outputs of 32 tokens.
|
| Since decoder-only transformer memory requirements scale with the
| square of sequence lengths, things would probably slow down
| significantly for very long sequences, which would be required
| for a back-and-forth conversation.
|
| Still though, until reading this i had no idea that running such
| a model on-device was remotely feasible!
| baobabKoodaa wrote:
| > Since decoder-only transformer memory requirements scale with
| the square of sequence lengths, things would probably slow down
| significantly for very long sequences, which would be required
| for a back-and-forth conversation.
|
| You can use tricks to keep the sequence length down even if the
| conversation goes on for a long time. For example, you can use
| the model to summarize the first n-1 lines of the conversation
| and append the last line to the summary as is.
| terabytest wrote:
| This is very interesting. Could you please elaborate and
| maybe share links to articles if you know of any?
| baobabKoodaa wrote:
| I don't have any sources to refer to, but "text
| summarization" is one of the common NLP tasks that LLMs are
| often benchmarked on. All of these general-purpose LLMs
| will be able to do a decent job at text summarization
| (some, such as ChatGPT, will be able to do zero-shot
| summarizations at high quality, whereas others need to be
| fine tuned for the task). If your problem is that you are
| feeding a large amount of text to the model and that is
| slow/expensive, then summarization will obviously remediate
| that issue. After summarizing most of the input text you
| still need to feed in the latest input without
| summarization, so for example if the user asks a question,
| the LLM can then accurately answer that question. (If _all_
| of the input goes into summarization, that last question
| may not even appear in the summarization, so results will
| be crap.)
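|
| In pseudocode the pattern is roughly this (a sketch; llm() stands
| in for whatever generate call you're using):
|
|         def build_prompt(turns, llm, keep_verbatim=2):
|             # summarize all but the most recent turns, then append
|             # the recent turns as-is so the model can answer them
|             old, recent = turns[:-keep_verbatim], turns[-keep_verbatim:]
|             summary = llm("Summarize this conversation:\n" + "\n".join(old)) if old else ""
|             return summary + "\n" + "\n".join(recent)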
| fpgaminer wrote:
| > transformer memory requirements scale with the square of
| sequence lengths
|
| Not true, see: Flash Attention. You can losslessly calculate
| the attention in blocks using a little math trick. Essentially
| each subsequent block "corrects" the denominator of the last
| block's softmax calculation. At the end you have a perfectly*
| accurate softmax. Since you don't need to keep the whole
| sequence in memory to perform the softmax, your memory now
| scales linearly with respect to sequence length, and due to the
| lower memory bandwidth requirements and increased kernel fusion
| the operation also tends to be faster.
|
| * While mathematically the calculation ends up exactly the
| same, in practice the result ends up slightly different due to
| the whims of F32 and F16 inaccuracies, and since the "max" used
| to calculate the softmax in a numerically stable way is
| calculated on a per-block basis. Doesn't significantly affect
| training or validation loss though.
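|
| The running-correction trick in isolation looks something like
| this toy numpy sketch (the full algorithm also accumulates the
| weighted values block by block):
|
|         import numpy as np
|
|         def blockwise_softmax(scores, block=4):
|             m, d = -np.inf, 0.0    # running max and denominator
|             for i in range(0, len(scores), block):
|                 blk = scores[i:i + block]
|                 m_new = max(m, blk.max())
|                 # rescale the old denominator, add the new block's terms
|                 d = d * np.exp(m - m_new) + np.exp(blk - m_new).sum()
|                 m = m_new
|             return np.exp(scores - m) / d
|
|         x = np.random.randn(16)
|         ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
|         assert np.allclose(blockwise_softmax(x), ref)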
| lxe wrote:
| What's the best way to get started learning this? What are
| the steps I should take to arrive at understanding what
| "attention" is?
| synergy20 wrote:
| Well, if a single GPU is not enough, what about using Ray over
| the internet so we can crowdsource training with multiple GPUs?
| Is this possible?
| gorbypark wrote:
| If this works well, it will be a game changer. Requiring a fleet
| of $10k+ GPUs will kill any hope of widespread adoption of open
| source "competitors" to GPT-3. Stable Diffusion is so popular
| because it can run on hardware mere mortals can own.
| narrator wrote:
| No doubt the corporate large language models will use it to
| make language models that are 10x bigger. However, at least the
| public will have access to 175B parameter language models which
| are much more sophisticated than the 6B or so parameter models
| consumer video cards can currently run.
| humanistbot wrote:
| This will only happen if "Open"AI or other big orgs release the
| model weights, which only Stable Diffusion did. Cost to train
| is still astronomical.
| leesec wrote:
| No it isn't. Stable Diffusion is less than 200 grand to
| train.
| anononaut wrote:
| I heard it was $4MM alone in AWS compute time.
| postalrat wrote:
| So about 40k on your own machines.
| inciampati wrote:
| It does make it seem like a box of H100s will easily be
| able to make an interesting open LLM.
| speedgoose wrote:
| This number seems to match the $200k, if you take into
| account the cloud margins of our favourite counterfeit
| products reseller.
| huijzer wrote:
| According to Christopher Potts (Stanford Professor and
| Chair, Department of Linguistics, and Professor, by
| courtesy, Department of Computer Science), training a large
| language model costs about 50 million [1].
|
| [1]: https://youtu.be/-lnHHWRCDGk?t=637
| Dylan16807 wrote:
| On the other hand, one techie with a few million dollars...
|
| And you could train something like GPT-3 for cheaper than a
| superbowl commercial. That would get you a lot of publicity.
| idiotsecant wrote:
| Is there information out there about how much it cost (in
| time or human-hours) to do the additional training
| necessary to make chatGPT? I am genuinely curious what the
| scale of the effort was.
| mensetmanusman wrote:
| VCs are funding $100,000,000 AI compute efforts now, so
| it might be something like that.
| paxys wrote:
| You can do it once, but probably not every day.
| Ajedi32 wrote:
| Why would you want to retrain it from scratch every day?
| Stable Diffusion doesn't do that either.
| baobabKoodaa wrote:
| Well maybe not _every_ day, but having a short feedback
| loop and the ability to run your code multiple times with
| different variations is generally considered to be a
| prerequisite for software development. If you actually
| want to keep developing the model, you need the funding
| to be able to train it more than once.
| coldtea wrote:
| > _but having a short feedback loop and the ability to
| run your code multiple times with different variations is
| generally considered to be a prerequisite for software
| development_
|
| This is not "software development" in general, this is
| LLM training.
|
| It's not like you're building some regular app, api, or
| backend.
| paxys wrote:
| Because things happen every day. If ChatGPT wants to
| compete with Google, staying up to date with recent
| events is the minimum bar.
| Ajedi32 wrote:
| You wouldn't need to re-train from scratch for that, just
| fine-tune on the new data sources. I don't think constant
| re-training is the optimal strategy for that use-case
| anyway. Bing does it by letting the LLM search a more
| traditional web index to find the information it needs.
| paxys wrote:
| Okay but someone has to do the fine tuning. The code has
| to be updated. Parts of the training have to be redone.
| All of this has costs. It isn't a "do it once and forget
| about it" task that it is being touted as in this thread.
| coldtea wrote:
| > _The code has to be updated_
|
| I'm pretty sure this is not how an LLM works.
|
| > _It isn 't a "do it once and forget about it" task that
| it is being touted as in this thread._
|
| That's neither here, nor there. Training the LLM itself
| is not a "do it multiple times per day if you want to
| compete with Google" thing as it has been stated in this
| subthread.
| simonw wrote:
| That's not necessary. Look at how Bing works: it's an LLM
| which can trigger searches, and then gets the search results
| fed back to it as part of the prompt.
|
| I wrote about one way to implement that pattern here:
| https://simonwillison.net/2023/Jan/13/semantic-search-
| answer...
| coldtea wrote:
| That's not what the training is about.
|
| Things happen every day, but languages and words and their
| associations don't change in any measurable way every
| day...
|
| This is not like web crawling...
| Nuzzerino wrote:
| I would hope publicity isn't the motivation for doing it
| though.
| idiotsecant wrote:
| What motivation would be sufficiently noble?
| Nuzzerino wrote:
| Probably one where there isn't an intrinsic conflict of
| interest with AI risk. Or from a more traditional angle,
| one where the author's vanity isn't required to be
| appeased in order for users/customers to be happy. I'm of
| the opinion that you should do something with game-
| changing technology because the world needs it, not
| because you need an ego boost. All technology brings side
| effects, and there is no greater example of that than
| "democratized" AI...
| idiotsecant wrote:
| People often (usually) do objectively useful things
| because it's in their selfish interests to do so, ego or
| otherwise. The surest road to failure is expecting people
| to act virtuously. Generally systems that assume virtue
| fail, and systems that assume selfish action and steer
| that selfish action towards the greater good succeed.
|
| In other words, I don't care why people do things, only
| that they do.
| Nuzzerino wrote:
| That's fine, as long as publicity isn't the motivation.
| It's safe to assume that isn't optimal for a project's
| success (Satoshi understood this). Not sure where you got
| the idea that the inverse of that was beneficial to such
| a project. I've seen first hand where it becomes a
| problem.
|
| I'm not aware of many examples of starry-eyed divas
| achieving great results. Usually you hear about them but
| only because they are exceptional cases, not the norm.
| It's a matter of practicality and not virtue (to say
| otherwise is purely a straw man argument).
| strohwueste wrote:
| The summary includes a dangerous thought. For example: why
| North Korea develops a nuclear bomb is not important, just
| that they do. But only the why makes it problematic.
| IncRnd wrote:
| Noble? You're anthropomorphising machine learning. One
| possible motivation would be to train a model, instead of
| training a model in order to create publicity around a model
| being trained.
| idiotsecant wrote:
| I think you're misreading, nobody is anthropomorphizing
| anything other than the very 'anthro' component of the
| system we're talking about - the people distributing the
| funding.
| IncRnd wrote:
| I may have misread your comment, then. Either way, thank
| you for the explanation!
| JoshCole wrote:
| No, it isn't astronomical. It is smaller than that. Still
| large, but not astronomical.
| ipsum2 wrote:
| Have you tried training a large model before? If not,
| you're probably discounting how difficult and expensive it
| is.
| coldtea wrote:
| Well, for those that trained the largest one atm, it cost
| them in the order of 10 million dollars (actually less).
|
| That's how much some tech companies pay for catering.
|
| Hell, that's in the order of a single socialite's wedding
| costs.
| JoshCole wrote:
| No, Stable Diffusion isn't the only one to release their
| weights. OpenAI hasn't been releasing weights for ChatGPT, but
| others have been releasing theirs [1].
|
| [1]: https://huggingface.co/
| permo-w wrote:
| yeah, there's an absolute pile of LLMs that are fully open-
| source. OpenAI's GPT-2 for one, but also BLOOM, OPT, GPT-J,
| and I'm sure myriad others too
| naillo wrote:
| There are some open-source LLMs already, such as OPT-175B, the
| one this repo runs and mentions.
| tarr11 wrote:
| Wonder if someone would be willing to start an open source
| project where we could crowdsource donations for training,
| and people could possibly donate their GPU usage for it.
| polishdude20 wrote:
| There's gotta be something like this already. Like a SETI@home
| type of thing.
| mryab wrote:
| There is! See https://petals.ml/ for inference of models
| like BLOOM-176B over the internet or
| https://arxiv.org/abs/2301.11913 and
| https://arxiv.org/abs/2206.01288 that show you how to do
| pretraining from scratch in the same setting. Disclaimer:
| I'm a coauthor of these systems (including the one in OP)
| exebook wrote:
| Amazing work! If I had a GPU, I'd join. I know a similar
| project for text-to-image:
| https://aqualxx.github.io/stable-ui/
| scottmf wrote:
| That's what Stability AI has been doing... There are
| already open source LLMs the size of GPT-3 such as OPT and
| Bloom
| ipsum2 wrote:
| No they haven't. Stability AI is funded by the founder
| and VC money, not crowdsourcing.
| gaogao wrote:
| Meta has released the model weights for OPT-175B, which is
| used in the paper. There are also a lot of full-release LLMs
| from other labs on the way as well.
| neilmovva wrote:
| While OPT-175B is great to have publicly available, it
| needs a lot more training to achieve good results. Meta
| trained OPT on 180B tokens, compared to 300B that GPT-3
| saw. And the Chinchilla scaling laws suggest that almost 4T
| tokens would be required to get the most bang for compute
| buck.
|
| And on top of that, there are some questions on the quality
| of open source data (The Pile) vs OpenAI's proprietary
| dataset, which they seem to have spent a lot of effort
| cleaning. So: open source models are probably data-
| constrained, in both quantity and quality.
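|
| (For reference, that ~4T figure follows from the rough Chinchilla
| rule of thumb of ~20 training tokens per parameter: 175e9 params
| x 20 ~= 3.5T tokens.)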
| mensetmanusman wrote:
| It's fun to think about a few billion weights being the
| difference between useless and gold.
| tiborsaas wrote:
| Looking at my bank account I can relate :)
| stavros wrote:
| Are there any that perform anywhere close to GPT-3?
| anon291 wrote:
| As a former AI accelerator employee (laid off), I'm kind of
| happy I was laid off because I realistically don't see a need
| for specialized hardware anymore.
|
| Large companies can afford Nvidia. Nvidia's software stack is
| best in class. There's no business need here and the model
| execution is increasingly becoming possible on single consumer
| GPUs.
|
| The only place where I see specialized chips excelling is on
| the edge or if they are truly revolutionary (in which case
| they're only an acquisition target for Nvidia).
|
| The truth is... The large language models are likely
| excessively large.
| foobiekr wrote:
| Power is the main reason to do custom ASICs. I'd be curious
| as to your opinion of Recogni given they are claiming a 10x
| power reduction per unit compute.
| anon291 wrote:
| Unfortunately, I've worked at several players which promise
| power reductions. It doesn't matter though. People don't
| care about cost at this point. If you are cost-sensitive
| you're not doing the kind of revolutionary AI work these
| companies need to create a competitive moat. And once your
| model works on NVIDIA and is trained, how much are you
| going to spend on ML engineers to make it work on something
| else? Because that cost better be less than the marginal
| cost reduction on electricity. Plus, NVIDIA et al will
| likely get more and more efficient.
| permo-w wrote:
| do/did you seriously think that the processing requirement was
| going to "kill" that possibility?
|
| the history of computing clearly indicates that either the
| requirements for running or the bar for owning this technology
| was _always_ going to drop
| pessimizer wrote:
| The history of computing had Moore's Law.
| permo-w wrote:
| exactly
| moffkalast wrote:
| This may be a flawed approach, but an interesting idea would be
| to use the current models as a preprocessor to generate a huge
| "labelled" dataset of inputs and outputs, and then use that more
| accurate and specific data to train a smaller one that would fit.
|
| It likely wouldn't have nearly as much general knowledge since
| the data just wouldn't be there but the behaviour could be
| similar?
| zeknife wrote:
| Look up knowledge distillation
| moffkalast wrote:
| Ah TIL, so it's actually a thing and with some cleverer
| approaches too. Then we do have something to look forward
| to :D
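|
| The usual training objective for that is a soft-label distillation
| loss; a generic PyTorch-style sketch (not tied to any particular
| distillation recipe):
|
|         import torch.nn.functional as F
|
|         def distillation_loss(student_logits, teacher_logits, T=2.0):
|             # match the student's softened distribution to the teacher's
|             p_teacher = F.softmax(teacher_logits / T, dim=-1)
|             log_p_student = F.log_softmax(student_logits / T, dim=-1)
|             return F.kl_div(log_p_student, p_teacher,
|                             reduction="batchmean") * T * T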
| stevofolife wrote:
| Out of curiosity, why aren't we crowdsourcing distributed
| training of LLMs, where anyone can join by bringing their hardware
| or data? Moreover, find a way to incorporate this into a
| blockchain so there is full transparency, but also add in
| differential privacy to protect every participant.
|
| Am I being too crazy here?
| albertzeyer wrote:
| There is the Open Assistant project: https://github.com/LAION-
| AI/Open-Assistant
|
| There is also EleutherAI (https://www.eleuther.ai/about/) with
| GPT-NeoX (https://github.com/EleutherAI/gpt-neox).
| moffkalast wrote:
| Just make sure it's written in Rust, uses a Sveltekit frontend
| and <some other buzzwords I can't remember right now>.
| rnosov wrote:
| The problem here is that most people just don't have suitable
| hardware. Ideally, you'd want to load the entire model into a
| GPU, and most consumer-grade GPUs just don't have anywhere near
| enough video memory. You'd need something like an A100 80GB GPU
| to be able to run a node in the potential blockchain. You can buy
| one of these cards for about 15k USD. Admittedly, that's not too
| far off from the price of a modern bitcoin ASIC miner, but it's
| still a healthy chunk of change.
|
| And if you try to split the model across several GPUs then
| you'll have an issue of bandwidth as model parts would need to
| talk to each other (on the order of terabyte/second). At the
| moment, the only realistic way to contribute is just to provide
| feedback data for the RLHF training.
| nodja wrote:
| https://petals.ml/
| Miraste wrote:
| Petals doesn't train new models, it only runs BLOOM in a
| distributed way.
| nodja wrote:
| You can finetune with it. If you want a more generic
| framework you can use hivemind[1] which is what petals
| uses, but you'll have to create your own community for
| whatever model you're trying to train.
|
| https://github.com/learning-at-home/hivemind
___________________________________________________________________
(page generated 2023-02-20 23:00 UTC)