[HN Gopher] vLLM: 24x faster LLM serving than HuggingFace Transf...
___________________________________________________________________
vLLM: 24x faster LLM serving than HuggingFace Transformers
Author : wskwon
Score : 132 points
Date : 2023-06-20 19:17 UTC (3 hours ago)
(HTM) web link (vllm.ai)
(TXT) w3m dump (vllm.ai)
| two_in_one wrote:
| I wonder if this sort of memory management could be added to
| PyTorch transformers as an under-the-hood optimization.
| thewataccount wrote:
| This is really cool to see.
|
| > Large: Takes up to 1.7GB for a single sequence in LLaMA-13B.
|
| > Dynamic: Its size depends on the sequence length, which is
| highly variable and unpredictable. As a result, efficiently
| managing the KV cache presents a significant challenge. We find
| that existing systems waste 60% - 80% of memory due to
| fragmentation and over-reservation.
|
| This mentions improvements for throughput which is great, and it
| mentions memory savings. I'm a bit confused how 80% of the memory
| could be wasted by the KV cache when the vast majority of the
| memory is usually holding the model itself?
|
| How much memory savings does this translate to effectively for
| say a 30B 4bit model?
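|
| (Back-of-the-envelope, assuming the usual LLaMA-13B dimensions
| -- 40 layers, hidden size 5120, fp16 cache -- the per-sequence
| figure roughly checks out:
|
|   # rough KV-cache sizing sketch; the model dims are assumptions
|   num_layers, hidden_size = 40, 5120
|   bytes_per_elem = 2                    # fp16
|   max_seq_len = 2048
|
|   # K and V each store hidden_size values per token per layer
|   per_token = 2 * num_layers * hidden_size * bytes_per_elem
|   per_seq = per_token * max_seq_len
|   print(per_token / 1e6, "MB per token")     # ~0.8 MB
|   print(per_seq / 1e9, "GB per sequence")    # ~1.7 GB
|
| so at large batch sizes the cache can rival or exceed the ~26GB
| of fp16 weights.)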
| zhisbug wrote:
| This really depends on what GPUs you use. If your GPUs have a
| small amount of memory, vLLM will help more.
|
| vLLM addresses the memory bottleneck for saving KV caches and
| hence increases the throughput.
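|
| Roughly, the idea works like OS paging: the KV cache is carved
| into fixed-size blocks and a per-sequence block table maps
| logical token positions to physical blocks, so nothing is
| reserved up front for the full context length. A simplified toy
| sketch (not the actual implementation):
|
|   BLOCK_SIZE = 16  # tokens per physical KV block
|
|   class SequenceBlocks:
|       """Toy block table: token positions -> physical blocks."""
|       def __init__(self, pool):
|           self.pool = pool      # shared free list of block ids
|           self.blocks = []      # blocks owned by this sequence
|           self.num_tokens = 0
|
|       def append_token(self):
|           # grab a new block only when the last one is full, so
|           # waste is at most BLOCK_SIZE - 1 slots per sequence
|           if self.num_tokens % BLOCK_SIZE == 0:
|               self.blocks.append(self.pool.pop())
|           self.num_tokens += 1
|
|       def release(self):
|           # finished sequences return blocks to the shared pool
|           self.pool.extend(self.blocks)
|           self.blocks.clear()
|           self.num_tokens = 0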
| Solvency wrote:
| Semi-related question: this page is full of little charts and
| diagrams. There are thousands of similar
| projects/sites/experiment sites with their own charts and
| diagrams. But it seems like there are always subtle-to-large
| differences in them that indicate they're made with totally
| different libraries.
|
| Are there just thousands of homebrewed, non-standard chart &
| diagram builders out there? How does one even begin to pick a
| standard to whip out quickies like these? Google SEO makes it
| virtually impossible to get to substance.
| daedbe wrote:
| I often see charts produced using matplotlib or plotly - often
| you can tell based on the colour schemes used. For example, the
| bar chart at the bottom of this paper looks like it was made
| with plotly. I think the reason for such variance in the style
| of charts is largely due to the flexibility frameworks such as
| matplotlib provide: you can control basically every aspect of a
| chart and use any number of predefined or custom stylesheets to
| change the look and feel.
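|
| E.g., switching the whole look is one line with a built-in
| stylesheet (tiny sketch with made-up numbers):
|
|   import matplotlib.pyplot as plt
|
|   plt.style.use("ggplot")   # or "bmh", or a custom .mplstyle
|
|   systems = ["HF", "TGI", "vLLM"]   # illustrative values only
|   throughput = [1.0, 2.5, 24.0]
|
|   fig, ax = plt.subplots(figsize=(4, 3))
|   ax.bar(systems, throughput)
|   ax.set_ylabel("Relative throughput")
|   fig.tight_layout()
|   fig.savefig("throughput.png", dpi=150)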
| wskwon wrote:
| We used matplotlib for the performance charts, and a free
| website to convert Google Slides into the animated GIFs.
| kristjansson wrote:
| The color scheme on these implies Google Drawing, but I don't
| know how they made them into animations - maybe just manually?
| mattnewton wrote:
| Google slides I think.
| marcopicentini wrote:
| Is there a hosted demo available?
|
| What are the use cases for which open-source models are
| equivalent to GPT-3.5?
| wskwon wrote:
| You can think of LMSYS Vicuna: https://chat.lmsys.org as our
| hosted demo, as it actually uses vLLM as the backend.
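|
| If you want to try it locally, offline batched inference looks
| roughly like this (simplified sketch; see the docs for the exact
| API and pick whatever HF model you like):
|
|   from vllm import LLM, SamplingParams
|
|   prompts = ["Hello, my name is", "The capital of France is"]
|   sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
|
|   llm = LLM(model="facebook/opt-125m")  # any supported HF model
|   outputs = llm.generate(prompts, sampling_params)
|   for out in outputs:
|       print(out.prompt, out.outputs[0].text)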
| bioemerl wrote:
| I'm spoiled by 4-bit and unfortunately it doesn't appear to be
| supported here, so this isn't of much use to me, but it's awesome
| to see people working on the inference speed side of things
| regardless.
| george_123 wrote:
| this approach to managing KV cache can work with 4bit. imagine
| the speedup of pagedattention with quantization..
| zhisbug wrote:
| yep, it is agnostic to 4-bit. You can deploy a 4-bit model
| and still use vllm + pagedattention to double or even triple
| your serving throughput.
| ynniv wrote:
| If this were submitted as a new comment it would be at the
| top of the page.
| brucethemoose2 wrote:
| Reading between the lines, it sounds like some of the speedup
| comes from VRAM savings on an otherwise close to full GPU?
|
| This is definitely cool and needed, but it might not be so
| dramatic running a 3-5 bit quant on a less-full GPU.
| scv119 wrote:
| Pretty cool stuff and the results are amazing. Hoping we will see
| virtual memory get standardized in PyTorch or CUDA.
| gwph wrote:
| Ion Stoica's lab continues to be a powerhouse of innovation.
| Previous successes of Stoica and his students include (but are
| certainly not limited to) Apache Spark, Ray, Apache Mesos and
| Alluxio.
| kossTKR wrote:
| Does this mean that GPT-4/65B-level performance is closer to
| running on, say, an M1/M2 with only 24+ gigabytes of RAM?
| wskwon wrote:
| Not really. vLLM optimizes the throughput of your LLM, but does
| not reduce the minimum amount of resources required to run your
| model.
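|
| Rough arithmetic: just the weights of a 65B model are far beyond
| 24GB, with or without vLLM:
|
|   # weight memory only, ignoring KV cache and activations
|   params = 65e9
|   print(params * 2 / 1e9, "GB at fp16")     # ~130 GB
|   print(params * 0.5 / 1e9, "GB at 4-bit")  # ~32.5 GB, > 24 GB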
| jokoon wrote:
| Now do the same for image classifiers. I tried a few of them,
| they're just horribly slow.
|
| This is pretty outrageous considering the first robust image
| classifiers appeared around 2007.
| wskwon wrote:
| vLLM has been adopted by LMSYS for serving Vicuna and Chatbot
| Arena.
___________________________________________________________________
(page generated 2023-06-20 23:00 UTC)