[HN Gopher] Accelerating Generative AI with PyTorch II: GPT, Fast
___________________________________________________________________
Accelerating Generative AI with PyTorch II: GPT, Fast
Author : polyrand
Score : 159 points
Date : 2023-11-30 18:35 UTC (4 hours ago)
(HTM) web link (pytorch.org)
(TXT) w3m dump (pytorch.org)
| AmazingTurtle wrote:
| 240 tok/s is crazy
| chillee wrote:
| Hey, author of the blog post here. It's mentioned in the blog
| post, but one of the intentions of this repo is that it's more of
| a "tutorial" than it is a library/framework. My hope is that
| people will copy-paste and modify it for their own needs :)
|
| Code can also be found here:
| https://github.com/pytorch-labs/gpt-fast
|
| And a twitter thread summary here:
| https://twitter.com/cHHillee/status/1730293330213531844
| buildbot wrote:
| Great work and a really useful resource! Comprehensive guides
| on improving PyTorch performance are pretty hard to come by,
| and I learned a couple new tricks from this!
| ilaksh wrote:
| What GPU was used when testing this?
|
| Is this faster than HuggingFace's Text Generation Inference
| container?
| chillee wrote:
| We used an A100-80GB GPU. We didn't compare explicitly to
| Huggingface TGI but I think you should be able to compare the
| tokens/s achieved.
|
| One note is that this release is optimized for _latency_,
| while I think HF TGI might be more optimized for
| _throughput_.
| smith7018 wrote:
| Great work! Do you know if it's possible to port this over to
| pytorch's Apple Silicon/MPS support?
| Dowwie wrote:
| What kind of workstation would you build/buy for local GPT
| development with a budget of $3000? Is remote dev a viable
| alternative to local workstations?
| woodson wrote:
| I'd go with a remote dev solution. Training/finetuning of
| large models requires far more resources anyway, so the GPUs
| in a local machine would sit unused most of the time.
| leobg wrote:
| Not OP, but I asked myself that same question two years ago.
| Then I looked at the energy prices in Germany and knew I had
| no chance against cloud GPUs. Maybe you live in a country
| with lower energy prices, like Bermuda (or any other country
| on earth), in which case this may not be as important to you.
| A side benefit of going cloud is that you can pick and choose
| the right GPU for whatever project you're working on, and
| you're only paying while you're running them. Also, no
| hardware or CUDA drivers to divert your attention.
| ftufek wrote:
| A local workstation is much cheaper in the long run.
|
| Even ignoring that, most of the development is running
| experiments. You're gonna be hesitant to run lots of
| experiments if each one costs money, whereas when you pay
| upfront for the hardware, you have an incentive to fully
| utilize it with lots of experiments.
|
| I'd go with an RTX 4090 and deal with the memory limitation
| through software tricks. It's an underrated card that's as
| performant as cards an order of magnitude pricier. It's a
| great way to get started with that budget.
| Philpax wrote:
| Depending on what you're doing, 2x used 3090s are the same
| price and offer you more VRAM. That's what I'm planning on
| doing, in any case - being able to run 70B LLMs entirely on
| the GPU is more useful than being able to run 34B faster.
| biddit wrote:
| Agreed. I recently completed a new build with two 3090
| GPUs and really appreciate being able to run 70b models.
| Dowwie wrote:
| which cpu did you go with?
| biddit wrote:
| i7-14700K
|
| Z790 chipset w/ a mobo that supports x8/x8 bifurcation
|
| 96GB DDR5 @ 5600MHz
| icelancer wrote:
| Yeah, multiple 3090s are the best budget way to go for sure.
| Also consider older server boards with tons of PCIe lanes if
| you can swing rack-mounted hardware and have some technical
| skills.
| biddit wrote:
| I agree with you, but right now RTX 4090 cards are pushing
| $2000, which doesn't leave much budget left. I'd suggest
| picking up a used 3090 from eBay; they're currently around
| $800. That still gives you 24GB of VRAM, same as the 4090.
| icelancer wrote:
| Strong endorse here. I pick up used RTX 3090s from
| Facebook Marketplace and eBay for $800 maximum. You can
| usually find them locally for $700-750, and you can typically
| test them in person too, which is a plus (though I've had no
| issues yet).
| dharmab wrote:
| I'm using an AMD 6900XT with ROCm and it's fast enough to be
| usable, for a fraction of the price of a 3090 or 4090.
| icelancer wrote:
| I would do remote dev using vast.ai and other cheap cloud
| computing resources to ensure you want to do this and have
| utility for it, then build your own. 3090s are typically the
| most budget friendly, and if you have any IT chops (and
| tolerance for noise), then server rack-mounted hardware,
| PSUs, and riser cables tend to be the most efficient with
| tons of PCIe lanes (which is a hidden issue people have with
| consumer-grade gaming PCs as they scale).
| modeless wrote:
| I got a 13900k + 4090 workstation for ~$3500. But I hear what
| people are doing is getting 2x (or more) 3090s instead,
| because they are cheap used, and having more VRAM and VRAM
| bandwidth is the important thing at the moment, even if it is
| split between cards.
|
| I'm happy with my 4090 though. Dealing with splitting between
| GPUs sounds like a chore and also I like the gaming abilities
| of the 4090.
| wolftickets wrote:
| Just wanted to share, the charts and gifs are exceptionally
| well done. Informative, concise, and easy to read.
| chillee wrote:
| Thanks! I've also written a couple of other things in a
| similar vein that you might like at https://horace.io/writing.html
| (particularly https://horace.io/brrr_intro.html), as well as
| some of the things I've tweeted:
| https://twitter.com/cHHillee/highlights
| xmichael909 wrote:
| Holy hotdogs, this looks amazing. So, ahh, I'll jump right to
| it - where can I run this online without having to do a bunch
| of work setting it up? I have several Python projects that
| could take advantage of this! (;
| andy99 wrote:
| This is a great article. Regarding
|
| > While these projects are performant, they often come with
| tradeoffs in ease of use, such as requiring model conversion to
| specific formats or building and shipping new dependencies.
|
| I think it should be acknowledged that (at least IMO) PyTorch
| model formats are not very portable, and this is a big part of
| the problem. It would be nice to see the industry move towards
| a better format (gguf?) that can easily be ported between
| frameworks and doesn't leave you stuck using torch to load it.
| Likewise, PyTorch is a massive dependency to ship with a
| project, especially for simple inference, so while other
| projects introduce new dependencies, those are often a lot
| lighter than PyTorch, again particularly for inference code.
| chillee wrote:
| Yeah, for sure. I think for deployment purposes, these model
| conversions are often necessary (such as when you don't want
| to use Python).
|
| However, I do think these model conversions are often a
| significant pain for users.
|
| So, in some sense, the goal here is to show that the
| performance component and the "convert your model for
| deployment" component can be disentangled.
|
| We also have work on allowing you to "export" an AOT-compiled
| version of your model with torch.compile, and that should allow
| you to deploy your models to run in other settings.
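|
| For reference, a minimal sketch of the two paths being
| contrasted (illustrative only; the export API was still a
| prototype as of late 2023, so treat the exact call as an
| assumption rather than the shipped interface):
|
|     import torch
|
|     class Toy(torch.nn.Module):
|         def forward(self, x):
|             return torch.nn.functional.silu(x @ x)
|
|     model = Toy().eval()
|     example = (torch.randn(8, 8),)
|
|     # Path 1: JIT-compile in-process, the approach gpt-fast leans on.
|     fast = torch.compile(model, mode="reduce-overhead")
|     out = fast(*example)
|
|     # Path 2 (prototype): capture a standalone graph that can be
|     # serialized and deployed without the original Python code.
|     exported = torch.export.export(model, example)
|     exported.graph_module.print_readable()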
| andy99 wrote:
| Thanks for the reply. "show that the performance component
| and the "convert your model for deployment" component can be
| disentangled" makes sense.
|
| Also, I liked the part of the article about torch.compile
| producing faster matrix-vector multiplication than cuBLAS.
| I've seen the same thing on CPU: it's way faster to write and
| manually optimize a loop over a bunch of dot products than to
| call BLAS routines, because of how simple the "matmul"
| actually is. I don't know how widely known that is.
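|
| As a toy version of that experiment (illustrative only, assumes
| a CUDA GPU; the eager matmul goes through cuBLAS, while
| torch.compile generates its own kernel for the same
| memory-bound product):
|
|     import torch
|
|     w = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
|     x = torch.randn(4096, device="cuda", dtype=torch.bfloat16)
|
|     def matvec(w, x):
|         return w @ x  # memory-bound: reads every weight once
|
|     compiled = torch.compile(matvec, mode="reduce-overhead")
|     compiled(w, x)  # warm up / trigger compilation
|
|     for name, fn in [("eager", matvec), ("compiled", compiled)]:
|         start = torch.cuda.Event(enable_timing=True)
|         end = torch.cuda.Event(enable_timing=True)
|         start.record()
|         for _ in range(100):
|             fn(w, x)
|         end.record()
|         torch.cuda.synchronize()
|         print(f"{name}: {start.elapsed_time(end) / 100:.3f} ms/iter")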
| dnnssl2 wrote:
| What are some of the better use cases of fast inference? From my
| experience using ChatGPT, I don't need it to generate faster than
| I can read, but waiting for code generation is painful because
| I'm waiting for the whole code block to format correctly and
| become available to copy or execute (in the case of the code
| interpreter). Does anything else fall under this pattern?
| wedn3sday wrote:
| One obvious use case is that it makes per-token generation much
| cheaper.
| dnnssl2 wrote:
| That's not so much a use case, but I get what you're saying.
| It's nice that you can find optimizations that shift the
| Pareto frontier down across the cost and latency dimensions.
| The hard tradeoffs are for cases like inference batching,
| where it's cheaper and higher throughput but slower for the
| end consumer.
|
| What's a good use case for an order of magnitude decrease in
| price per token? Web scale "analysis" or cleaning of
| unstructured data?
| jasonjmcghee wrote:
| Programmatic and multi-step use cases. If you need chain-of-
| thought or similar, tool use, etc. Generating data.
|
| Most use cases outside of classic chat.
|
| For example, I made an on-demand educational video project, and
| the slowest part by far was the content generation. RAG, TTS,
| image generation, text rendering, and video processing were
| all a drop in the bucket in comparison.
|
| The gap would be even wider now, since TTS is faster than
| realtime and image generation can be done in a single step.
| rfw300 wrote:
| The main thing is that chat is just one application of LLMs. Other
| applications are much more latency sensitive. Imagine, for
| instance, an LLM-powered realtime grammar checker in an editor.
| ClarityJones wrote:
| Perhaps this is naive, but in my mind it can be useful for
| learning.
|
| - Hook LLM to VMs
|
| - Ask for code that [counts to 10]
|
| - Run code on VM
|
| - Ask different LLM to Evaluate Results.
|
| - Repeat for sufficient volume.
|
| - Train.
|
| The faster it can generate results, the faster those results
| can be tested against the real world, e.g. a VM, users on X,
| or other models with known accuracies.
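|
| Very roughly, that loop might look like the sketch below; every
| name here is a hypothetical placeholder passed in as a
| callable, not a real library API:
|
|     # Hypothetical sketch of the generate -> execute -> evaluate ->
|     # train loop described above. All callables are placeholders.
|     def synthetic_coding_loop(prompts, generate, run_in_vm, judge,
|                               finetune):
|         dataset = []
|         for prompt in prompts:
|             code = generate(prompt)              # fast LLM writes code
|             result = run_in_vm(code)             # sandboxed execution
|             score = judge(prompt, code, result)  # second LLM grades it
|             dataset.append((prompt, code, score))
|         finetune(dataset)                        # train on graded data
|         return dataset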
| dnnssl2 wrote:
| If you were to serve this on a datacenter server, is the
| client-to-server network roundtrip the slowest part of the
| inference? Curious whether it would be faster to run this on
| cloud GPUs with better but farther-away hardware, or locally
| with worse hardware.
| chillee wrote:
| Surprisingly, no. And part of this is that text generation is
| _really_ expensive. Unlike traditional ML inference (like with
| resnets), you don't just pass your data through your model
| once. You need to pass it over and over again (once for each
| token you generate).
|
| So, in practice, a full "text completion request" can often
| take on the order of seconds, which dwarfs the client <->
| server roundtrip.
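|
| To make that concrete, a bare-bones decode loop looks roughly
| like this (illustrative sketch, not the gpt-fast code; assumes
| `model` maps token ids to logits of shape (batch, seq, vocab)):
|
|     import torch
|
|     @torch.no_grad()
|     def greedy_decode(model, prompt_ids, max_new_tokens):
|         tokens = prompt_ids  # (batch, prompt_len)
|         for _ in range(max_new_tokens):
|             logits = model(tokens)  # one full forward pass per token
|             next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
|             tokens = torch.cat([tokens, next_token], dim=-1)
|         return tokens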
| dnnssl2 wrote:
| Is this still the case for sliding window attention/streaming
| LLMs, where you have a fixed length attention window rather
| than infinitely passing in new tokens for quadratic scaling?
| You even get better performance due to purposely downsampling
| non-meaningful attention sink tokens.
| chillee wrote:
| I cover it a bit in the blog post, but unless you have a
| _really_ long context length (like 32k+), your primary
| computational cost doesn't come from attention but rather
| from loading your weights from VRAM into registers.
|
| I mean, practically speaking, completions from say, ChatGPT
| or Claude take seconds to finish :)
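|
| The back-of-envelope version of that argument (numbers are
| approximate):
|
|     # Rough roofline arithmetic: if decoding is memory-bandwidth
|     # bound, tokens/s is capped by bandwidth divided by the bytes
|     # of weights read per generated token.
|     params = 7e9              # a Llama-7B-class model
|     bytes_per_param = 2       # bf16/fp16 weights
|     bandwidth = 2.0e12        # ~2 TB/s for an A100-80GB
|
|     weight_bytes = params * bytes_per_param   # ~14 GB per token
|     print(f"~{bandwidth / weight_bytes:.0f} tokens/s upper bound")
|     # -> ~143 tokens/s before any quantization or batching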
| dnnssl2 wrote:
| How does one select a good candidate for the draft model in
| speculative decoding? I imagine that there's some better
| intuition than just selecting the next parameter count down
| (e.g. 70B -> 13B, 13B -> 7B).
|
| Also how does that interact with MoE models? Do you have a mini
| version of the MoE, with smaller experts?
| chillee wrote:
| This is indeed a bit of a dark art. Essentially, you want a
| balance between "is significantly faster than base model" and
| "generates similar stuff to the base model".
|
| Anecdotally, folks often seem to use, say, a 70B base model
| as the verifier with a 7B draft model. But I think there's a
| lot of room for experimentation and improvement here.
|
| You could, say, take a 70B model, maybe just chop off the last
| 90% of its layers, and then fine-tune. Or perhaps you could
| use a model that's trained to generate 8 tokens at once. Or
| perhaps you could just use a statistical "n-gram" predictor.
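|
| A stripped-down sketch of the mechanism (illustrative only, not
| the gpt-fast code; it uses greedy acceptance instead of the
| rejection-sampling rule that preserves the base model's
| distribution, omits appending the base model's correction on a
| mismatch, and assumes batch size 1):
|
|     import torch
|
|     @torch.no_grad()
|     def speculative_step(base, draft, tokens, k=8):
|         # 1. The cheap draft model guesses k tokens autoregressively.
|         guessed = tokens
|         for _ in range(k):
|             nxt = draft(guessed)[:, -1].argmax(-1, keepdim=True)
|             guessed = torch.cat([guessed, nxt], dim=-1)
|         # 2. The base model scores all k guesses in ONE forward pass.
|         base_preds = base(guessed)[:, -(k + 1):-1].argmax(-1)
|         guesses = guessed[:, -k:]
|         # 3. Keep the longest prefix where draft and base agree.
|         agree = (guesses == base_preds)[0].long()
|         n_accept = int(agree.cumprod(0).sum())
|         return torch.cat([tokens, guesses[:, :n_accept]], dim=-1)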
| brucethemoose2 wrote:
| This is similar to exllamav2, and exllamav2's quantization is
| also excellent.
| claytonjy wrote:
| One of the notable tricks the various LLM serving frameworks
| provide is a set of special approaches to batching, e.g.
| continuous, persistent, or in-flight batching, depending on
| the inference framework. At some level they each allow you to
| start a new generation while in the middle of one or more
| previous generations.
|
| Is that possible with "just" pytorch? Could it be added to gpt-
| fast?
| chillee wrote:
| Yeah, it's certainly possible, but it's not the focus of this
| implementation, which is more latency-focused (so batch size
| 1).
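|
| For what it's worth, the core idea is easy to sketch in plain
| PyTorch even if a production scheduler is far more involved
| (illustrative only: it loops per sequence instead of building
| one padded batched forward, and ignores KV-cache management):
|
|     import torch
|
|     @torch.no_grad()
|     def serve_step(model, active, waiting, finished, eos_id,
|                    max_batch=8):
|         # Admit new requests between decode steps, up to the limit.
|         while waiting and len(active) < max_batch:
|             active.append(waiting.pop(0))
|         still_going = []
|         for seq in active:                    # seq: 1-D tensor of ids
|             logits = model(seq.unsqueeze(0))  # (1, T, vocab)
|             nxt = logits[0, -1].argmax().view(1)
|             seq = torch.cat([seq, nxt])
|             (finished if nxt.item() == eos_id else
|              still_going).append(seq)
|         return still_going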
___________________________________________________________________
(page generated 2023-11-30 23:00 UTC)