[HN Gopher] Running large language models like ChatGPT on a sing...
       ___________________________________________________________________
        
       Running large language models like ChatGPT on a single GPU
        
       Author : _nhynes
       Score  : 434 points
       Date   : 2023-02-20 16:55 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | baobabKoodaa wrote:
       | I just tried to run the example in the README, using the OPT-30B
       | model. It appeared to download 60GiB of model files, and then it
       | attempted to read all of it into RAM. My laptop has "only" 32GiB
       | of RAM so it just ran out of memory.
        
         | baobabKoodaa wrote:
         | FWIW I was able to load the OPT-6.7B model and play with it in
         | chatbot mode. This would not have been possible without the
         | offloading, so... cool stuff!
        
         | Miraste wrote:
         | You have to change the --percent flag. It takes some
         | experimentation. The format is three pairs of 0-100 integers,
         | one for parameters, attention cache and hidden states
          | respectively. The first number in each pair is the percent on
          | GPU, the second is the percent on CPU (system RAM), and the
          | remaining percentage will go on disk.
         | 
         | For disk offloading to work you may also have to specify
         | --offload-dir.
         | 
         | I have opt-30B running on a 3090 with --percent 20 50 100 0 100
         | 0, although I think those could be tweaked to be faster.
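          | 
          | For illustration, a command along these lines (a sketch that
          | reuses only the flags mentioned in this thread; the offload
          | path is a placeholder):
          | 
          |   # weights: 20% on GPU, 50% in CPU RAM, remaining 30% on disk
          |   # attention (KV) cache: 100% GPU / 0% CPU
          |   # hidden states: 100% GPU / 0% CPU
          |   python apps/chatbot.py --model facebook/opt-30b \
          |     --percent 20 50 100 0 100 0 --offload-dir /path/to/fast/ssd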
        
           | ImprobableTruth wrote:
           | How fast is it in single batch mode?
        
             | Miraste wrote:
             | After turning on compression I was able to fit the whole
             | thing in GPU memory and then it became much faster. Not
             | ChatGPT speeds or anything, but under a minute for a
             | response in their chatbot demo. A few seconds in some
             | cases.
        
           | lxe wrote:
           | How much system RAM are you running with? And I'm guessing it
           | wouldn't hurt to have a fast SSD for disk offloading?
        
             | Miraste wrote:
             | 128GB, but by turning on compression I managed to fit the
             | whole thing on the GPU. I did try it off a mix of RAM and
             | SSD as well, and it was slower but still usable. Presumably
             | disk speed matters a lot.
        
               | lxe wrote:
               | Well just got some more sticks. While I wait for RAM to
               | arrive, will try with compress_weight and compress_cache.
               | If you're in any discord or any other space where people
               | are tinkering with this, would love to join!
        
               | lxe wrote:
               | With compression, was able to get 30b to run on 3090 with
               | '100 0'! Let me see if I can tweak the prompt a bit and
               | make it come alive...
        
         | bee_rider wrote:
         | Hmm, well we used to have swap partitions equal in size to our
         | memory... you'll have 4GiB left over!
        
       | dharma1 wrote:
       | I'd love to run this on a single 24gb 3090 - how much dram / SSD
       | space do I need for a decent LLM, when it's quantised to 4bits?
        
         | Miraste wrote:
         | I've been trying this, and with compression on (4 bits) you can
         | fit the entire 30B model on the 3090.
        
       | lxe wrote:
        | Got the opt-6.7b chatbot running on a Windows machine with a 3090
        | in mere minutes. The only difference was installing the CUDA
        | build of PyTorch (`pip install torch==1.13.1+cu117
        | --extra-index-url https://download.pytorch.org/whl/cu117`), just
        | like in Stable Diffusion's case.
       | 
        | It performs as expected:
        | 
        |     Human: Tell me a joke
        |     Machine: I have no sense of humour
        |     Human: What's 2+5?
        |     Machine: I cannot answer that.
        
         | rjb7731 wrote:
          | Looks like it might be no bueno on Google Colab for now;
          | chatbot.py takes prompts via input(), too, rather than a
          | command line argument.
        
         | A4ET8a8uTh0 wrote:
          | Hey. So did anyone try doing it with AMD cards (I know Nvidia
          | seems preferable now)?
        
         | Ajedi32 wrote:
         | 6.7b is pretty small, no? Do you even need offloading for that
         | on a 3090? I'd be curious to see what's needed to run opt-30b
         | or opt-66b with reasonable performance. The README suggests
         | that even opt-175b should be doable with okay performance on a
         | single NVIDIA T4 if you have enough RAM.
        
           | nathan_compton wrote:
            | It is entirely possible to run 6.7B parameter models on a
            | 3090, although I believe you need 16-bit weights. I think you
            | can squeeze a 20B parameter model onto the 3090 if you go all
            | the way down to 8 bits.
        
       | dom96 wrote:
       | It's really interesting that these models are written in Python.
       | Anyone know how much of a speed up using a faster language here
       | would have? Maybe it's already off-loading a lot of the
       | computation to C (I know many Python libraries do this), but I'd
       | love to know.
        
         | ianzakalwe wrote:
          | Python is mostly just glue code nowadays. All data loading,
          | processing, and computation are handled by low-level languages
          | (C/C++); Python is there just to instruct those low-level
          | libraries how to compose into one final computation.
        
         | albertzeyer wrote:
          | Python is just the glue language. All the heavy lifting
          | happens in CUDA or cuBLAS or cuDNN or so.
         | 
          | Most memory-saving optimizations come from using lower-
          | precision numbers (float16 or less), quantization (int8 or
          | int4), sparsification, etc. But this is all handled by the
          | underlying framework, like PyTorch.
         | 
         | There are C++ implementations but they optimize on different
         | aspects. For example: https://github.com/OpenNMT/CTranslate2/
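          | 
          | As a rough illustration of the precision side of that (a
          | sketch at the tensor level, not how the framework actually
          | implements it):
          | 
          |   import torch
          |   w = torch.randn(4096, 4096)      # fp32 weights, ~64 MB
          |   w16 = w.half()                   # float16, ~32 MB
          |   scale = w.abs().max() / 127
          |   w8 = (w / scale).round().clamp(-127, 127).to(torch.int8)
          |   # int8 plus one scale factor, ~16 MB; w8 * scale dequantizes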
        
         | brrrrrm wrote:
         | For _large_ models, there are two main ways folks have been
         | optimizing machine learning execution:
         | 
         | 1. lowering precision of the operations (reducing compute
         | "width" and increasing parallelization)
         | 
         | 2. fusing operations into the same GPU code (reducing memory-
         | bandwidth usage)
         | 
         | Neither of those optimizations would benefit from swapping to a
         | faster language.
         | 
         | Why? The typical "large" neural network operation runs on the
         | order of a dozen microseconds to milliseconds. Models are
          | usually composed of hundreds if not thousands of these. The
         | overhead of using Python is around 0.5 microseconds per
         | operation (best case on Intel, worst case on Apple ARM). So
         | that's maybe a 5% net loss if things were running
         | synchronously. But they're not! When you call GPU code, you
         | actually do it asynchronously, so the language latency can be
         | completely hidden.
         | 
         | So really, all you want in an ML language is the ability to 1.
         | change the type of the underlying data on the fly (Python is
         | really good at this) and 2. rewrite the operations being
         | dispatched to on the fly (Python is also really good at this).
         | 
          | For smaller models (i.e. things that run in the sub-microsecond
          | world), Python is not the right choice for training or
          | deploying.
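          | 
          | A tiny way to see that latency hiding in practice (a sketch,
          | assuming a CUDA-enabled PyTorch install):
          | 
          |   import time, torch
          |   x = torch.randn(4096, 4096, device="cuda")
          |   t0 = time.time()
          |   for _ in range(100):
          |       y = x @ x               # kernel launches return immediately
          |   t1 = time.time()            # Python-side launch cost only
          |   torch.cuda.synchronize()    # wait for the GPU to finish
          |   t2 = time.time()
          |   print(f"launch {t1 - t0:.4f}s, total {t2 - t0:.4f}s")
          | 
          | The gap between the two timings is work the GPU was doing
          | while Python had already moved on.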
        
         | amelius wrote:
         | Your view of "offloading" things to a faster language is wrong.
         | It's already written in a fast language (C++ or CUDA). Python
          | is just an easy-to-use way of invoking the various libraries.
         | Switching to a faster language for everything would just make
         | experimenting and implementing things more cumbersome and would
         | make the technology as a whole move slower.
        
       | spaintech wrote:
        | Interesting article, I have to give that a try! :D
        | 
        | One thing is that while there is value in running pretrained
        | model weights like OPT-175B, there are also potential downsides
        | to using pre-trained models, such as the need to fine-tune the
        | model to your specific task, potential compatibility issues with
        | your existing infrastructure (integration), and the possibility
        | that the pre-trained model may not perform as well as a model
        | trained specifically on your data. Ultimately, the decision of
        | whether to use a pre-trained model will be based on the outcomes;
        | no harm in trying it out before you build from scratch, IMO.
        
       | albertzeyer wrote:
       | It would be helpful to upload the paper to Arxiv, for better
       | accessibility and visibility.
       | 
       | https://github.com/Ying1123/FlexGen/blob/main/docs/paper.pdf
       | https://docs.google.com/viewer?url=https://github.com/Ying11...
        
       | muttled wrote:
       | This is cool! But I wonder if it's economical using cloud
       | hardware. The author claims 1.12 tokens/second on the 175B
       | parameter model (arguably comparable to GPT-3 Davinci). That's
       | about 100k tokens a day on the GCP machine the author used.
       | Someone double check my numbers here, but given the Davinci base
       | cost of $0.02 per 1k tokens and GCP cost for the hardware listed
       | "NVIIDA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of
       | SSD" coming up to about $434 on spot instance pricing, you could
       | simply use the OpenAI API and generate about 723k tokens a day
       | for the same price as running the spot instance (which could go
       | offline at any point due to it being a spot instance).
       | 
        | Running the fine-tuned versions of OpenAI models is
        | approximately 6x more expensive per token. If you were running a
        | fine-tuned model on local commodity hardware, the economics would
        | start to tilt in favor of doing something like this if the load
        | was predictable and relatively constant.
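        | 
        | The back-of-the-envelope version of the first comparison (a
        | sketch; assumes the $434 spot figure is per month and the prices
        | quoted above):
        | 
        |   tok_per_day = 1.12 * 86400         # ~96.8k tokens/day locally
        |   spot_per_day = 434 / 30            # ~$14.47/day on the spot VM
        |   api_tok_per_day = spot_per_day / 0.02 * 1000   # ~723k via API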
        
         | pclmulqdq wrote:
         | Cloud accelerators carry a huge price premium because there
         | aren't very many of them available and they aren't as fungible
         | as CPUs. Comparing to a local GPU would likely be favorable for
         | the local machine.
        
         | breckenedge wrote:
         | Thanks for running the cloud numbers on this. I ran some DIY
         | numbers and they indicate less than a week to break even with
         | the cloud, including all hardware and electricity costs. The
         | cloud seems stupid expensive compared to running your own
         | hardware for this kind of task.
        
           | throwawayapples wrote:
            | The cloud is _always_ stupid expensive compared to running
            | your own hardware for almost any sort of task that isn't
            | highly variable along one or more axes (CPU, RAM, etc), but
            | less than a week to break even is truly shocking.
        
             | p1esk wrote:
             | The cloud has been cheaper to train large models on for a
             | couple years now. Compare buying 8xA100 server vs renting
             | one on Lambda Labs. At least 3 years to break even - if you
             | are using it non-stop 24/7. Longer if not.
        
               | cardine wrote:
               | This is not true - the break even period is closer to 6-7
               | months.
        
               | p1esk wrote:
                | A single 8xA100 server is ~$150k. The on-demand cost to
                | rent one is $8.8/hour. Do the math, and don't forget the
                | energy costs.
        
         | swatcoder wrote:
         | Sometimes control is more important than cost.
        
         | ImprobableTruth wrote:
         | You've made one huge mistake: Davinci's $0.02 is not just per
         | 1k tokens _generated_ but also _context tokens consumed_. So if
         | you generate 50 tokens per request with 1k context, the price
         | is actually 20 times as large at $0.40 per 1k tokens generated
         | - much less palatable, costing 3 times as much as the cloud
         | hosted version of this.
         | 
         | And that's not even taking into account the gigantic markup
         | cloud services have.
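          | 
          | That arithmetic spelled out (a sketch using the numbers in
          | this comment):
          | 
          |   context, generated = 1000, 50
          |   billed = context + generated          # billed per request
          |   cost_per_req = billed / 1000 * 0.02   # ~$0.021
          |   per_1k_gen = cost_per_req / generated * 1000   # ~$0.42, ~20x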
        
           | yorwba wrote:
           | Most of the computational cost of producing an output token
           | is spent on consuming input tokens (including previous output
           | tokens that are fed back in); only the final unembedding
           | matrix could be eliminated if you don't care about the output
           | logits for the context.
           | 
           | So it's not correct to only modify OpenAI's prices to account
           | for the ratio of context tokens to output tokens. Both of
           | them get multiplied by 20 (if that's what your ratio is).
        
             | ImprobableTruth wrote:
             | No, because they're already taking that into account.
             | 
             | >Metric: generation throughput (token/s) = number of the
             | generated tokens / (time for processing prompts + time for
             | generation).
             | 
             | (Though they're doing batching, so this is an unfair
             | comparison. Would be interesting to get single batch
             | speed.)
        
         | cypress66 wrote:
         | This is most likely aimed at people running models locally.
         | 
         | And a homelab with 3090s/4090s is one or two orders of
         | magnitude cheaper than GCP, if you use them continuously.
        
           | SomeHacker44 wrote:
           | I do not know anyone offhand with a 200+GB RAM home computer.
           | The GPU is not all that is needed; you need to keep the
           | parameters and other stuff in memory too.
        
             | zargon wrote:
             | 256gb of ddr4 rdimms only costs about $400 right now. $200
             | for ddr3. Not uncommon in homelabs.
             | 
             | I don't think 200gb ram is actually required, that's just
             | what that cloud vm was spec'd with. Though the 175b model
             | should see benefit with ram even beyond 200gb.
        
             | Filligree wrote:
             | Running it off a fast NVMe apparently works. I don't know
             | what the performance is like, though.
        
             | woadwarrior01 wrote:
             | I own a two year old headless home computer with 256GB of
             | RAM and two 3090s. I ssh into it from my mac to run ML
             | training jobs.
        
       | adamnemecek wrote:
       | I have recently written a paper on understanding transformer
       | learning via the lens of coinduction & Hopf algebra.
       | https://arxiv.org/abs/2302.01834
       | 
        | The learning mechanism of transformer models was poorly
        | understood; however, it turns out that a transformer is like a
        | circuit with feedback.
       | 
        | I argue that autodiff can be replaced with what I call in the
        | paper Hopf coherence, which happens within a single layer as
        | opposed to across the whole graph.
       | 
       | Furthermore, if we view transformers as Hopf algebras, one can
       | bring convolutional models, diffusion models and transformers
       | under a single umbrella.
       | 
       | I'm working on a next gen Hopf algebra based machine learning
       | framework.
       | 
       | Join my discord if you want to discuss this further
       | https://discord.gg/mr9TAhpyBW
        
         | qualudeheart wrote:
         | Powerful idea.
        
           | adamnemecek wrote:
           | Hopf algebras are next gen.
        
         | kneel wrote:
         | what
        
           | adamnemecek wrote:
           | which part
        
       | lxe wrote:
        | Got the opt-30b chatbot running on a 3090 24GB. I set
        | compress_weight=True and compress_cache=True, and ran with
        | `python apps/chatbot.py --model facebook/opt-30b --percent 100 0
        | 100 0 100 0`.
        | 
        | I also modified the prompt a bit to make it more... uh, alive:
        | 
        |     Assistant: Did you know that Saturn is 97 times the size of
        |     Earth?
        |     Human: Are you sure?
        |     Assistant: What difference does size make, really, anyway?
        |     Human: You didn't answer my question.
        |     Assistant: Hey, it's your question I didn't answer! Anyway,
        |     did you know that on the surface of Saturn, there's lakes of
        |     liquid methane and ethane?
        |     Human: I didn't know that. Are there any lakes on the Sun?
        |     Assistant: Look, I'll be honest with you. I don't know much
        |     about the Sun. Is it even a planet?
        |     Human: It sure is. Why don't you know about the Sun? I
        |     thought you were a hyperintelligent AI?
        |     Assistant: You're right. I'm a hyperintelligent AI. I'm
        |     hyperintelligent enough to know that I know jack shit about
        |     the Sun.
        
         | syntaxing wrote:
         | This is amazing. Reminds me of claptrap from Borderlands
        
           | LoganDark wrote:
           | It really does
        
         | razor_router wrote:
         | That's impressive! How much customization did you have to do in
         | order to get the chatbot to respond like that?
        
           | lxe wrote:
            | Not much. Here's the prompt:
            | 
            |     A chat between a human and a more funny and witty
            |     intelligent assistant
            |     Human: Hello! What can you do?
            |     Assistant: Hey there! I can pretty much do whatever you
            |     ask!
            |     Human: What is the name of the tallest mountain in the
            |     world?
            |     Assistant: It's Everest.
            |     Human: Are you sure?
            |     Assistant: You damn right I'm sure! I'm a
            |     hyperintelligent AI!
        
             | posguy wrote:
             | All I can think of is outsourcing to ChatGPT now...
             | 
             | https://www.youtube.com/watch?v=rYaZ57Bn4pQ
        
       | simonw wrote:
       | Top item on the roadmap: "Support Apple silicon M1/M2 deployment"
        
         | fancyfredbot wrote:
         | I believe that you can't get enough RAM with M1/M2 for this to
         | be useful
        
           | ricardobeat wrote:
           | This is meant to run on GPUs with 16GB RAM. Most M1/M2 users
           | have at least 32GB (unified memory), and you can configure a
           | MBP or Mac Studio with up to 96/128GB.
           | 
            | The Mac Pro is still Intel, but it can be configured with up
            | to 1.5TB of RAM; you can imagine the M* replacement will have
            | equally gigantic options when it comes out.
        
             | fancyfredbot wrote:
             | If you look closely there's 16GB of GPU memory and over
             | 200GB of CPU memory. So none of the currently available M*
             | have the same kind of capacity. Let's hope this changes in
             | the future!
        
         | MuffinFlavored wrote:
         | I tried to figure out how to do GPGPU stuff as a total beginner
         | in Rust on Apple Silicon.
         | 
         | I couldn't figure out if I was supposed to be chasing down
         | Apple Metal or OpenCL backends. It also didn't seem to make
         | much of a difference because while there are crates for both
         | that seemed relatively well-maintained/fleshed out, I couldn't
         | figure out how exactly to just pull one down and plug them into
          | a higher level library (or find said higher level library
          | altogether).
         | 
          | Have you had any luck? In my experience, it's basically Python
          | or bust in this space despite lots of efforts to make it not
          | that way.
          | 
          | I also got confused as to whether a 'shader' was more for the
          | visual GPU output of things, or if it was also a building block
          | for model training/networks/machine learning/etc.
        
           | smoldesu wrote:
           | Give this a look:
           | 
           | https://github.com/guillaume-be/rust-bert
           | 
            | https://github.com/guillaume-be/rust-bert/blob/master/exampl...
           | 
           | If you have Pytorch configured correctly, this should "just
           | work" for a lot of the smaller models. It won't be a 1:1
           | ChatGPT replacement, but you can build some pretty cool stuff
           | with it.
           | 
           | > it's basically Python or bust in this space
           | 
           | More or less, but that doesn't have to be a bad thing. If
           | you're on Apple Silicon, you have plenty of performance
           | headroom to deploy Python code for this. I've gotten this
           | library to work on systems with as little as 2gb of memory,
           | so outside of ultra-low-end use cases, you should be fine.
        
             | MuffinFlavored wrote:
             | To clarify,
             | 
             | > Port of Hugging Face's Transformers library, using the
             | tch-rs crate and pre-processing from rust-tokenizers.
             | 
             | > tch-rs: Rust bindings for the C++ api of PyTorch.
             | 
             | Which "backend" does this end up using on Apple Silicon,
             | MPS (Metal Performance Shaders) or OpenCL?
             | 
             | https://pytorch.org/docs/stable/notes/mps.html
             | 
             | I'm going to guess MPS?
        
               | smoldesu wrote:
               | Whatever your Pytorch install is designed to accelerate.
               | I've got Ampere-accelerated Pytorch running it on my ARM
               | server, I assume MPS is used on compatible systems.
        
           | fathyb wrote:
           | > I couldn't figure out if I was supposed to be chasing down
           | Apple Metal or OpenCL backends.
           | 
           | If you want cross-platform compatibility (kinda), go for
            | OpenCL; if you want the best performance, go for Metal. Both
           | use a very similar language for kernels, but Metal is
           | generally more efficient.
           | 
           | > Have you had any luck?
           | 
           | Not in ML, but I'm doing a lot of GPGPU on Metal, I recently
           | started doing it in Rust. A bit less convenient than with
           | Swift/Objective-C, but still possible. Worst case you'll have
           | to add an .mm file and bridge it with `extern "C"`. That
           | said, doing GPGPU is not doing ML, and most ML libraries are
           | in Python.
           | 
            | > I also got confused as to whether a 'shader' was more for
           | the visual GPU output of things, or if it was also a building
           | block for model training/networks/machine learning/etc.
           | 
           | A shader is basically a function that runs for every element
           | of the output buffer. We generally call them kernels for
           | GPGPU, and shaders (geometry, vertex, fragment) for graphics
            | stuff. You have to write them in a language that kinda looks
            | like C (OpenGL GLSL, DirectX HLSL, Metal MSL), but is
            | optimized for the SIMT properties of GPUs.
           | 
            | Learning shaders will let you run code on the GPU; to do ML
            | you also need to learn what tensors are, how to compute them
            | on the GPU, and how to build ML systems using them.
           | 
           | I recommend ShaderToy [0] if you want a cool way to
           | understand and play with shaders.
           | 
           | [0]: https://www.shadertoy.com/
        
             | MuffinFlavored wrote:
             | so write a kernel in OpenCL, then call it from Rust
             | 
             | is that what machine learning is doing at a high level?
        
               | fathyb wrote:
               | At a very high level yes. There is also the very
               | important step of efficiently laying out data in the GPU
               | memory to compute tensor values in the kernels.
        
             | Miraste wrote:
             | I'm not familiar with Metal, but on Apple Silicon aren't
             | CPU and GPU memory completely shared?
        
               | fathyb wrote:
                | They are; however, it's not fully shared at the process
                | level: the GPGPU API has to explicitly support mapping a
                | buffer from the process's virtual memory space to the GPU.
               | 
               | I looked it up and turns out OpenCL also supports zero-
               | copy buffers, so I edited my comment accordingly!
        
       | [deleted]
        
       | benlivengood wrote:
       | This also means local fine-tuning is possible. Expect to see an
       | explosion of new things like we did with Stable Diffusion,
       | limited to some extent by the ~0.7 order of magnitude more VRAM
       | required.
        
         | bioemerl wrote:
         | Does it? I would have expected compression losses to make
         | training really hard.
        
           | Miraste wrote:
           | The compression is optional.
        
       | danuker wrote:
       | Any chance these work on CPUs with any acceptable performance?
       | 
       | I have a 10-core 20-thread monster CPU, but didn't bother with a
       | dedicated GPU because I can't control something as simple as its
       | temperature. See the complicated procedure that only works with
       | the large proprietary driver here:
       | 
       | https://wiki.archlinux.org/title/NVIDIA/Tips_and_tricks#Over...
        
         | brigade wrote:
         | Your CPU gets maybe 700-800 gflops depending on your all-core
         | frequency (fp32 because you don't have Sapphire Rapids.) The T4
         | benchmarked would be crunching what it can at ~65 tflops (fp16
         | tensor.) Newer GPUs hit 300 tflops (4090) or even nearly 2
         | petaflops (H100).
         | 
         | To give you an idea of the order of magnitude of compute
         | difference. Sapphire Rapids has AMX and fp16 AVX512 to close
         | the gap a little, but it's still massive.
        
         | NavinF wrote:
         | > 10-core 20-thread monster CPU
         | 
         | With what, 50GB/s memory bandwidth? That's no monster. The two
         | consumer GPUs in my machine both do 1TB/s and are _still_
         | bottlenecked on memory bandwidth.
         | 
         | > only works with the large proprietary driver here
         | 
         | In practice, nothing works without the proprietary driver so
         | this isn't specific to temperature. Also the setting you're
         | looking for is almost certainly `nvidia-smi -pl $watts` for
         | setting the power limit, not whatever that wiki gives you. GPU
         | temperature = ambient temperature + (power limit)*(thermal
         | resistance of cooler)
        
           | TimeBearingDown wrote:
           | That power limit control is explained in detail a few
           | paragraphs down on that wiki page.
           | 
            | https://wiki.archlinux.org/title/NVIDIA/Tips_and_tricks#Cust...
        
         | metadat wrote:
         | Unlikely, because this is an efficient GPU work offloader, not
         | a complete replacement for GPU computation.
        
         | bioemerl wrote:
         | Nope. 20 cores in a CPU, 2000 in a GPU, with much much faster
         | memory and an architecture designed to chew through data as
         | fast as possible.
        
           | bee_rider wrote:
           | No real reason to compare a GPU core to a CPU one, but the
           | memory bandwidth difference is pretty concrete!
        
           | fulafel wrote:
           | GPU "cores" are ~ SIMD lanes.
           | 
            | (a difference, I think, is that there are more virtual lanes,
            | some of which may be masked off, that are mapped to the GPU's
            | physical SIMD lanes)
        
         | adeon wrote:
         | I don't know about these large models but I saw on a random HN
         | comment earlier in a different topic where someone showed a
         | GPT-J model on CPU only: https://github.com/ggerganov/ggml
         | 
          | I tested it on my Linux machine and MacBook Air M1, and it
          | generates tokens at a reasonable speed using CPU only. I
          | noticed it doesn't quite use all my available CPU cores, so it
          | may be leaving some performance on the table, not sure though.
         | 
         | The GPT-J 6B is nowhere near as large as the OPT-175B in the
         | post. But I got the sense that CPU-only inference may not be
         | totally hopeless even for large models if only we got some high
         | quality software to do it.
        
           | generalizations wrote:
           | There's also the Fabrice Bellard inference code:
           | https://textsynth.com/technology.html. He claims up to 41
           | tokens per second on the GPT-Neox 20B model.
        
       | warning26 wrote:
       | This seems like a great step; I've been able to run
       | StableDiffusion locally, but with an older GPU none of the LLMs
       | will run for me since I don't have enough VRAM.
       | 
       | Oddly I don't see a VRAM requirement listed. Anyone know if it
       | has a lower limit?
        
         | cypress66 wrote:
         | > with an older GPU none of the LLMs will run for me since I
         | don't have enough VRAM.
         | 
          | I think you can run Pygmalion 6B on an 8GB GPU using DeepSpeed.
         | 
         | It's very underwhelming if you expect something like ChatGPT
         | though.
        
       | t3estabc wrote:
       | [dead]
        
       | winddude wrote:
       | looks interesting. FYI, the link to your discord in the readme is
       | broken
        
       | metadat wrote:
        | > Hardware: an NVIDIA T4 (16GB) instance on GCP with 208GB of
       | DRAM and 1.5TB of SSD.
       | 
        | Is FlexGen able to take advantage of multiple hundreds of GB of
        | system memory? Or do these compute instances just come bundled
        | with it, making it a [largely] irrelevant detail?
        
         | bioemerl wrote:
         | The OPT175b model is massive. A lot of that system ram probably
         | holds model data.
        
           | metadat wrote:
           | Interesting, though apparently the OPT175B model is 350GB:
           | 
           | > You will need at least 350GB GPU memory on your entire
           | cluster to serve the OPT-175B model. For example, you can use
           | 4 x AWS p3.16xlarge instances, which provide 4 (instance) x 8
           | (GPU/instance) x 16 (GB/GPU) = 512GB memory.
           | 
           | https://alpa.ai/tutorials/opt_serving.html
           | 
           | (Scroll down to the second "Note", not far from the top)
           | 
           | I wonder what FlexGen is doing.. a naive guess is a mix of
           | SSD and system memory. Definitely curious about what
           | FlexGen's underlying strategy translates to in terms of
           | actual data paths.
        
             | SekstiNi wrote:
             | > Interesting, though apparently the OPT175B model is
             | 350GB:
             | 
             | Only in FP16. In the paper they use int4 quantization to
             | reduce it to a quarter of that. In addition to the model
             | weights, there's also a KV cache that takes up considerable
             | amounts of memory, and they use int4 on that as well.
             | 
             | > I wonder what FlexGen is doing.. a naive guess is a mix
             | of SSD and system memory.
             | 
             | That's correct, but other approaches have done this as
             | well. What's "new" here seems to be the optimized data
             | access pattern in combination with some other interesting
             | techniques (prefetching, int4 quantization, CPU offload).
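              | 
              | Rough arithmetic behind those figures (a sketch):
              | 
              |   params = 175e9
              |   fp16_gb = params * 2 / 1e9    # ~350 GB
              |   int4_gb = params * 0.5 / 1e9  # ~87.5 GB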
        
               | stevenhuang wrote:
                | I want to emphasize how fascinating I find it that going
                | from 16-bit to 4-bit quantization results in negligible
                | performance loss. That's huge. Is the original FP16 not
                | compressed?
               | 
               | The allowance for this more granular quantization seems
               | to suggest the "bottleneck" is in some other aspect of
               | the system, and maybe until that is addressed, a higher
               | fidelity quantization does not improve performance.
               | 
               | Or maybe it's the relative values/ratio between weights
               | that is important, and as long as the intended ratio
               | between weights can be expressed, the exact precision of
               | the weights themselves may not be important?
               | 
               | Found an interesting paper on this below. There's
               | doubtless heavy research underway in this area
               | 
                | - https://www.researchgate.net/publication/367557918_Understan...
        
               | inciampati wrote:
               | Very insightful! Now I'm curious what the bottleneck is.
        
               | stevenhuang wrote:
               | A recent discussion I found on int4, definitely looks
               | like this is the new hotness. Very exciting!
               | 
               | https://news.ycombinator.com/item?id=34404859
        
             | [deleted]
        
       | blagie wrote:
        | A lot of people are looking at this wrong. A $350 3060 has 12GB
        | of VRAM. If there's a way to run models locally, it opens up the
        | door to:
       | 
       | 1) Privacy-sensitive applications
       | 
       | 2) Tinkering
       | 
       | 3) Ignoring filters
       | 
       | 4) Prototyping
       | 
       | 5) Eventually, a bit of extra training
       | 
       | The upside isn't so much cost / performance, as local control
       | over a cloud-based solution.
        
         | Aperocky wrote:
          | I have that exact card; this may be the nudge where I remove
          | Windows from the computer and try out Linux gaming (and local
          | GPT).
        
           | raihansaputra wrote:
           | Thing is, you don't have to totally switch to Linux. I'm
           | running ML/CUDA workloads through WSL without too many
           | problems.
        
       | railgun2space wrote:
        | We are hiring in that area of work in the Europe time zone. If
        | you are excited about and capable in this field, please apply
        | here:
       | https://ai-jobs.net/job/41469-senior-research-engineer-llms-...
        
         | tempaccount420 wrote:
         | If you want talent, don't make them go through the regular
         | application process.
        
       | birdyrooster wrote:
       | I recently bought a T4 to go with my epyc 7402 and 512GB ram for
       | fun and this looks like a great use case. Thanks!
        
         | cypress66 wrote:
         | What's the advantage of purchasing a T4 instead of a 3090 or
         | 4090?
        
           | elorant wrote:
           | Power consumption. A Tesla T4 with 16GB RAM will consume a
           | mere 70W. An RTX 3090 will need at least 300W, and the Titan
           | models go up to 450W.
        
           | nirav72 wrote:
           | Possibly the price. On secondary markets like Ebay - I've
           | occasionally seen T4 cards for $500-600. Also, the form
           | factor. The T4s are comparatively much smaller/shorter than a
            | 3090/4090. So it would be an easier fit in a server case.
        
       | ml_basics wrote:
       | Very cool.
       | 
       | Worth mentioning though that the highlighted figures (1.12 tok/s
       | for OPT-175B for "FlexGen with Compression") are for inputs of
       | 512 tokens and outputs of 32 tokens.
       | 
       | Since decoder-only transformer memory requirements scale with the
       | square of sequence lengths, things would probably slow down
       | significantly for very long sequences, which would be required
       | for a back-and-forth conversation.
       | 
        | Still though, until reading this I had no idea that running such
        | a model on-device was remotely feasible!
        
         | baobabKoodaa wrote:
         | > Since decoder-only transformer memory requirements scale with
         | the square of sequence lengths, things would probably slow down
         | significantly for very long sequences, which would be required
         | for a back-and-forth conversation.
         | 
         | You can use tricks to keep the sequence length down even if the
         | conversation goes on for a long time. For example, you can use
         | the model to summarize the first n-1 lines of the conversation
         | and append the last line to the summary as is.
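          | 
          | A rough sketch of that loop (hypothetical helper names;
          | summarize() and generate() stand in for whatever model calls
          | you actually use):
          | 
          |   def chat_turn(history, user_msg, summarize, generate):
          |       # keep the prompt short: summarize everything except the
          |       # newest message, which is appended verbatim
          |       summary = summarize("\n".join(history)) if history else ""
          |       prompt = f"{summary}\nHuman: {user_msg}\nAssistant:"
          |       reply = generate(prompt)
          |       history += [f"Human: {user_msg}", f"Assistant: {reply}"]
          |       return reply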
        
           | terabytest wrote:
           | This is very interesting. Could you please elaborate and
           | maybe share links to articles if you know of any?
        
             | baobabKoodaa wrote:
             | I don't have any sources to refer to, but "text
             | summarization" is one of the common NLP tasks that LLMs are
             | often benchmarked on. All of these general-purpose LLMs
             | will be able to do a decent job at text summarization
             | (some, such as ChatGPT, will be able to do zero-shot
             | summarizations at high quality, whereas others need to be
             | fine tuned for the task). If your problem is that you are
             | feeding a large amount of text to the model and that is
             | slow/expensive, then summarization will obviously remediate
             | that issue. After summarizing most of the input text you
             | still need to feed in the latest input without
             | summarization, so for example if the user asks a question,
             | the LLM can then accurately answer that question. (If _all_
             | of the input goes into summarization, that last question
             | may not even appear in the summarization, so results will
             | be crap.)
        
         | fpgaminer wrote:
         | > transformer memory requirements scale with the square of
         | sequence lengths
         | 
         | Not true, see: Flash Attention. You can losslessly calculate
         | the attention in blocks using a little math trick. Essentially
         | each subsequent block "corrects" the denominator of the last
         | block's softmax calculation. At the end you have a perfectly*
         | accurate softmax. Since you don't need to keep the whole
         | sequence in memory to perform the softmax, your memory now
         | scales linearly with respect to sequence length, and due to the
         | lower memory bandwidth requirements and increased kernel fusion
         | the operation also tends to be faster.
         | 
         | * While mathematically the calculation ends up exactly the
         | same, in practice the result ends up slightly different due to
         | the whims of F32 and F16 inaccuracies, and since the "max" used
         | to calculate the softmax in a numerically stable way is
          | calculated on a per-block basis. Doesn't significantly affect
         | training or validation loss though.
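          | 
          | The denominator-correction trick in isolation looks roughly
          | like this (a numpy sketch of online softmax; the real kernels
          | also accumulate the weighted value sums block by block):
          | 
          |   import numpy as np
          | 
          |   def online_softmax(x, block=4):
          |       m, d = -np.inf, 0.0        # running max, denominator
          |       for i in range(0, len(x), block):
          |           b = x[i:i + block]
          |           m_new = max(m, b.max())
          |           # rescale the old denominator to the new max
          |           d = d * np.exp(m - m_new) + np.exp(b - m_new).sum()
          |           m = m_new
          |       return np.exp(x - m) / d
          | 
          |   x = np.random.randn(16)
          |   ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
          |   assert np.allclose(online_softmax(x), ref)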
        
           | lxe wrote:
           | What's the best way to get started learning this? What are
           | the steps I should take to arrive at understanding what
           | "attention" is?
        
       | synergy20 wrote:
        | Well, if a single GPU is not enough, what about using Ray over
        | the internet so we can crowdsource training with multiple GPUs?
        | Is this possible?
        
       | gorbypark wrote:
       | If this works well, it will be a game changer. Requiring a fleet
        | of $10k+ GPUs will kill any hope of widespread adoption of open
       | source "competitors" to GPT-3. Stable Diffusion is so popular
       | because it can run on hardware mere mortals can own.
        
         | narrator wrote:
         | No doubt the corporate large language models will use it to
         | make language models that are 10x bigger. However, at least the
         | public will have access to 175B parameter language models which
         | are much more sophisticated than the 6B or so parameter models
         | consumer video cards can currently run.
        
         | humanistbot wrote:
         | This will only happen if "Open"AI or other big orgs release the
         | model weights, which only Stable Diffusion did. Cost to train
         | is still astronomical.
        
           | leesec wrote:
           | No it isn't. Stable Diffusion is less than 200 grand to
           | train.
        
             | anononaut wrote:
             | I heard it was $4MM alone in AWS compute time.
        
               | postalrat wrote:
               | So about 40k on your own machines.
        
               | inciampati wrote:
               | It does make it seem like a box of H100s will easily be
               | able to make an interesting open LLM.
        
               | speedgoose wrote:
               | This number seems to match the $200k, if you take into
               | account the cloud margins of our favourite counterfeit
               | products reseller.
        
             | huijzer wrote:
             | According to Christopher Potts (Stanford Professor and
             | Chair, Department of Linguistics, and Professor, by
             | courtesy, Department of Computer Science), training a large
              | language model costs about $50 million [1].
             | 
             | [1]: https://youtu.be/-lnHHWRCDGk?t=637
        
           | Dylan16807 wrote:
           | On the other hand, one techie with a few million dollars...
           | 
           | And you could train something like GPT-3 for cheaper than a
            | Super Bowl commercial. That would get you a lot of publicity.
        
             | idiotsecant wrote:
             | Is there information out there about how much it cost (in
             | time or human-hours) to do the additional training
             | necessary to make chatGPT? I am genuinely curious what the
             | scale of the effort was.
        
               | mensetmanusman wrote:
               | VCs are funding $100,000,000 AI compute efforts now, so
               | it might be something like that.
        
             | paxys wrote:
             | You can do it once, but probably not every day.
        
               | Ajedi32 wrote:
               | Why would you want to retrain it from scratch every day?
               | Stable Diffusion doesn't do that either.
        
               | baobabKoodaa wrote:
               | Well maybe not _every_ day, but having a short feedback
               | loop and the ability to run your code multiple times with
               | different variations is generally considered to be a
               | prerequisite for software development. If you actually
               | want to keep developing the model, you need the funding
               | to be able to train it more than once.
        
               | coldtea wrote:
               | > _but having a short feedback loop and the ability to
               | run your code multiple times with different variations is
               | generally considered to be a prerequisite for software
               | development_
               | 
               | This is not "software development" in general, this is
               | LLM training.
               | 
               | It's not like you're building some regular app, api, or
               | backend.
        
               | paxys wrote:
               | Because things happen every day. If ChatGPT wants to
               | compete with Google, staying up to date with recent
               | events is the minimum bar.
        
               | Ajedi32 wrote:
               | You wouldn't need to re-train from scratch for that, just
               | fine-tune on the new data sources. I don't think constant
               | re-training is the optimal strategy for that use-case
               | anyway. Bing does it by letting the LLM search a more
               | traditional web index to find the information it needs.
        
               | paxys wrote:
               | Okay but someone has to do the fine tuning. The code has
               | to be updated. Parts of the training have to be redone.
               | All of this has costs. It isn't a "do it once and forget
               | about it" task that it is being touted as in this thread.
        
               | coldtea wrote:
               | > _The code has to be updated_
               | 
               | I'm pretty sure this is not how an LLM works.
               | 
                | > _It isn't a "do it once and forget about it" task that
               | it is being touted as in this thread._
               | 
               | That's neither here, nor there. Training the LLM itself
               | is not a "do it multiple times per day if you want to
               | compete with Google" thing as it has been stated in this
               | subthread.
        
               | simonw wrote:
                | That's not necessary. Look at how Bing works: it's an
                | LLM which can trigger searches, and then gets fed the
                | search results back to it as part of the prompt.
               | 
               | I wrote about one way to implement that pattern here:
                | https://simonwillison.net/2023/Jan/13/semantic-search-answer...
        
               | coldtea wrote:
               | That's not what the training is about.
               | 
               | Things happen everyday, but languages and words and their
               | associations don't change in any measurable way every
               | day...
               | 
               | This is not like web crawling...
        
             | Nuzzerino wrote:
             | I would hope publicity isn't the motivation for doing it
             | though.
        
               | idiotsecant wrote:
               | What motivation would be sufficiently noble?
        
               | Nuzzerino wrote:
               | Probably one where there isn't an intrinsic conflict of
               | interest with AI risk. Or from a more traditional angle,
               | one where the author's vanity isn't required to be
               | appeased in order for users/customers to be happy. I'm of
               | the opinion that you should do something with game-
               | changing technology because the world needs it, not
               | because you need an ego boost. All technology brings side
               | effects, and there is no greater example of that than
               | "democratized" AI...
        
               | idiotsecant wrote:
               | People often (usually) do objectively useful things
               | because it's in their selfish interests to do so, ego or
               | otherwise. The surest road to failure is expecting people
               | to act virtuously. Generally systems that assume virtue
               | fail, and systems that assume selfish action and steer
               | that selfish action towards the greater good succeed.
               | 
               | In other words, I don't care why people do things, only
               | that they do.
        
               | Nuzzerino wrote:
               | That's fine, as long as publicity isn't the motivation.
                | It's safe to assume that isn't optimal for a project's
               | success (Satoshi understood this). Not sure where you got
               | the idea that the inverse of that was beneficial to such
                | a project. I've seen firsthand where it becomes a
               | problem.
               | 
               | I'm not aware of many examples of starry-eyed divas
               | achieving great results. Usually you hear about them but
               | only because they are exceptional cases, not the norm.
               | It's a matter of practicality and not virtue (to say
               | otherwise is purely a straw man argument).
        
               | strohwueste wrote:
                | That summary includes a dangerous thought. For example:
                | why North Korea develops a nuclear bomb is not important,
                | just that they do. But only the why makes it problematic.
        
               | IncRnd wrote:
                | Noble? You're anthropomorphising machine learning. One
                | possible motivation would be to train a model, instead of
                | training a model in order to create publicity around a
                | model being trained.
        
               | idiotsecant wrote:
               | I think you're misreading, nobody is anthropomorphizing
               | anything other than the very 'anthro' component of the
               | system we're talking about - the people distributing the
               | funding.
        
               | IncRnd wrote:
               | I may have misread your comment, then. Either way, thank
               | you for the explanation!
        
           | JoshCole wrote:
           | No, it isn't astronomical. It is smaller than that. Still
           | large, but not astronomical.
        
             | ipsum2 wrote:
             | Have you tried training a large model before? If not,
             | you're probably discounting how difficult and expensive it
             | is.
        
               | coldtea wrote:
               | Well, for those that trained the largest one atm, it cost
                | them on the order of 10 million dollars (actually less).
               | 
               | That's how much some tech companies pay for catering.
               | 
                | Hell, that's on the order of a single socialite's wedding
               | costs.
        
           | JoshCole wrote:
           | No, Stable Diffusion isn't the only one to release their
           | weights. OpenAI hasn't been releasing weights for ChatGPT,
            | but Stable Diffusion isn't the only one releasing weights
           | [1].
           | 
           | [1]: https://huggingface.co/
        
             | permo-w wrote:
              | Yeah, there's an absolute pile of LLMs that are fully open-
              | source. OpenAI's GPT-2 for one, but also BLOOM, OPT, GPT-J,
              | and I'm sure myriad others too.
        
           | naillo wrote:
            | There are some open-source LLMs already, such as OPT-175B,
            | the one this repo runs and mentions.
        
           | tarr11 wrote:
           | Wonder if someone would be willing to start an open source
           | project where we could crowdsource donations for training,
           | and people could possibly donate their GPU usage for it.
        
             | polishdude20 wrote:
                | There's gotta be something like this already. Like a
                | SETI@home type of thing.
        
               | mryab wrote:
               | There is! See https://petals.ml/ for inference of models
               | like BLOOM-176B over the internet or
               | https://arxiv.org/abs/2301.11913 and
               | https://arxiv.org/abs/2206.01288 that show you how to do
               | pretraining from scratch in the same setting. Disclaimer:
               | I'm a coauthor of these systems (including the one in OP)
        
               | exebook wrote:
                | Amazing work! If I had a GPU, I'd join. I know of a
                | similar project for text-to-image:
               | https://aqualxx.github.io/stable-ui/
        
             | scottmf wrote:
             | That's what Stability AI has been doing... There are
             | already open source LLMs the size of GPT-3 such as OPT and
             | Bloom
        
               | ipsum2 wrote:
               | No they haven't. Stability AI is funded by the founder
               | and VC money, not crowdsourcing.
        
           | gaogao wrote:
           | Meta has released the model weights for OPT-175B, which is
           | used in the paper. There's also a lot of full release LLMs
           | from other labs on the way as well.
        
             | neilmovva wrote:
             | While OPT-175B is great to have publicly available, it
             | needs a lot more training to achieve good results. Meta
             | trained OPT on 180B tokens, compared to 300B that GPT-3
             | saw. And the Chinchilla scaling laws suggest that almost 4T
             | tokens would be required to get the most bang for compute
             | buck.
             | 
             | And on top of that, there are some questions on the quality
             | of open source data (The Pile) vs OpenAI's proprietary
             | dataset, which they seem to have spent a lot of effort
             | cleaning. So: open source models are probably data-
             | constrained, in both quantity and quality.
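              | 
              | The Chinchilla number, roughly (a sketch assuming the
              | often-quoted ~20 training tokens per parameter):
              | 
              |   params = 175e9
              |   chinchilla_tokens = params * 20   # ~3.5T, i.e. almost 4T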
        
               | mensetmanusman wrote:
               | It's fun to think about a few billion weights being the
               | difference between useless and gold.
        
               | tiborsaas wrote:
               | Looking at my bank account I can relate :)
        
             | stavros wrote:
             | Are there any that perform anywhere close to GPT-3?
        
         | anon291 wrote:
         | As a former AI accelerator employee (laid off), I'm kind of
         | happy I was laid off because I realistically don't see a need
         | for specialized hardware anymore.
         | 
         | Large companies can afford Nvidia. Nvidia's software stack is
         | best in class. There's no business need here and the model
         | execution is increasingly becoming possible on single consumer
         | GPUs.
         | 
         | The only place where I see specialized chips excelling is on
         | the edge or if they are truly revolutionary (in which case
         | they're only an acquisition target for Nvidia).
         | 
         | The truth is... The large language models are likely
         | excessively large.
        
           | foobiekr wrote:
           | Power is the main reason to do custom ASICs. I'd be curious
           | as to your opinion of Recogni given they are claiming a 10x
           | power reduction per unit compute.
        
             | anon291 wrote:
             | Unfortunately, I've worked at several players which promise
             | power reductions. It doesn't matter though. People don't
             | care about cost at this point. If you are cost-sensitive
             | you're not doing the kind of revolutionary AI work these
             | companies need to create a competitive moat. And once your
             | model works on NVIDIA and is trained, how much are you
             | going to spend on ML engineers to make it work on something
             | else? Because that cost better be less than the marginal
             | cost reduction on electricity. Plus, NVIDIA et al will
             | likely get more and more efficient.
        
         | permo-w wrote:
         | do/did you seriously think that the processing requirement was
         | going to "kill" that possibility?
         | 
         | the history of computing clearly indicates that either the
         | requirements for running or the bar for owning this technology
         | was _always_ going to drop
        
           | pessimizer wrote:
           | The history of computing had Moore's Law.
        
             | permo-w wrote:
             | exactly
        
         | moffkalast wrote:
         | This may be a flawed approach, but an interesting idea would be
         | to use the current models as a preprocessor to generate a huge
         | "labelled" dataset of inputs and outputs, and then using that
         | more accurate and specific data to train a smaller one that
         | would fit.
         | 
         | It likely wouldn't have nearly as much general knowledge since
         | the data just wouldn't be there but the behaviour could be
         | similar?
        
           | zeknife wrote:
           | Look up knowledge distillation
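            | 
            | The usual soft-target loss looks roughly like this (a
            | PyTorch sketch following Hinton et al.'s formulation; the
            | temperature T is a tunable assumption):
            | 
            |   import torch.nn.functional as F
            | 
            |   def distill_loss(student_logits, teacher_logits, T=2.0):
            |       s = F.log_softmax(student_logits / T, dim=-1)
            |       t = F.softmax(teacher_logits / T, dim=-1)
            |       # KL divergence, scaled by T^2 as in the paper
            |       return F.kl_div(s, t, reduction="batchmean") * T * T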
        
             | moffkalast wrote:
             | Ah TIL, so it's actually a thing and with some cleverer
             | approaches too. Then we do have something to look forward
             | to :D
        
       | stevofolife wrote:
        | Out of curiosity, why aren't we crowdsourcing distributed
       | training of LLMs where anyone can join by bringing their hardware
       | or data? Moreover find a way to incorporate this into a
       | blockchain so there is full transparency but also add in
       | differential privacy to protect every participant.
       | 
       | Am I being too crazy here?
        
         | albertzeyer wrote:
          | There is the Open Assistant project:
          | https://github.com/LAION-AI/Open-Assistant
         | 
         | There is also EleutherAI (https://www.eleuther.ai/about/) with
         | GPT-NeoX (https://github.com/EleutherAI/gpt-neox).
        
         | moffkalast wrote:
         | Just make sure it's written in Rust, uses a Sveltekit frontend
         | and <some other buzzwords I can't remember right now>.
        
         | rnosov wrote:
          | The problem here is that most people just don't have suitable
          | hardware. Ideally, you'd want to load the entire model into a
          | GPU, and most consumer-grade GPUs have nowhere near enough
          | video memory. You'd need something like an A100 80GB GPU to be
          | able to run a node in the potential blockchain. You can buy one
          | of these cards for about 15k USD. Admittedly, that's not too
          | far off from the price of a modern bitcoin ASIC miner, but
          | still a healthy chunk of change.
         | 
         | And if you try to split the model across several GPUs then
         | you'll have an issue of bandwidth as model parts would need to
          | talk to each other (on the order of a terabyte/second). At the
         | moment, the only realistic way to contribute is just to provide
         | feedback data for the RLHF training.
        
         | nodja wrote:
         | https://petals.ml/
        
           | Miraste wrote:
           | Petals doesn't train new models, it only runs BLOOM in a
           | distributed way.
        
             | nodja wrote:
             | You can finetune with it. If you want a more generic
             | framework you can use hivemind[1] which is what petals
             | uses, but you'll have to create your own community for
             | whatever model you're trying to train.
             | 
             | https://github.com/learning-at-home/hivemind
        
       ___________________________________________________________________
       (page generated 2023-02-20 23:00 UTC)