[HN Gopher] SVDQuant: 4-Bit Quantization Powers 12B Flux on a 16...
___________________________________________________________________
SVDQuant: 4-Bit Quantization Powers 12B Flux on a 16GB 4090 GPU
with 3x Speedup
Author : lmxyy
Score : 145 points
Date : 2024-11-09 07:46 UTC (15 hours ago)
(HTM) web link (hanlab.mit.edu)
(TXT) w3m dump (hanlab.mit.edu)
| mesmertech wrote:
| Demo on actual 4090 with flux schnell for next few hours:
| https://5jkdpo3rnipsem-3000.proxy.runpod.net/
|
| It's basically H100 speeds on a 4090: 4.80 it/s, 1.1 sec for Flux
| Schnell (4 steps) and 5.5 seconds for Flux Dev (25 steps), compared
| to normal speeds (ComfyUI fp8 with the "--fast" optimization) of
| 3 seconds for Schnell and 11.5 seconds for Dev.
| yakorevivan wrote:
| Hey, can you share the inference code please? Thanks..
| superkuh wrote:
| https://github.com/mit-han-lab/nunchaku
| oneshtein wrote:
| Cannot compile it locally on Fedora 40:
|
| nunchaku/third_party/spdlog/include/spdlog/common.h(144):
| error: namespace "std" has no member "function"
|     using err_handler = std::function<void(const std::string &err_msg)>;
|     ^
| mesmertech wrote:
| Yeah, it's a pain. I'm trying to make an API endpoint for a
| website I own, and working on a Docker image. This is what I
| have for now that "just" works:
|
| The conda always-yes setting makes sure you can just paste the
| script and it all works, instead of having to press "y" for
| each install. Also, if you don't feel like installing a wheel
| from a random person on the internet, replace that step with
| "pip install -e ." as the repo suggests. I compiled that one
| with CUDA 12.4 because that's the part that takes the most time
| and is what most often seems to break.
|
| Also, I'm not sure if this will work on Fedora. I tried this on
| a RunPod machine with a 4090 (apparently it only works on a few
| cards: 3090, 4090, A100, etc.) with CUDA 12.4 on the host
| machine and "runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-
| ubuntu22.04" as the base image.
|
| EDIT: using pastebin instead as HN doesn't seem to jive
| with code blocks: https://pastebin.com/zK1z0UdM
| oneshtein wrote:
| Almost working:
|
| [2024-11-09 19:33:55.214] [info] Initializing QuantizedFluxModel
| [2024-11-09 19:33:55.359] [info] Loading weights from
| ~/.cache/huggingface/hub/models--mit-han-lab--svdquant-models/snapshots/d2a46e82a378ec70e3329a2219ac4331a444a999/svdq-int4-flux.1-schnell.safetensors
| [2024-11-09 19:34:01.432] [warning] Unable to pin memory: invalid argument
| [2024-11-09 19:34:02.143] [info] Done.
| terminate called after throwing an instance of 'CUDAError'
|   what(): CUDA error: pointer does not correspond to a registered
|   memory region (at /nunchaku/src/Serialization.cpp:32)
| mesmertech wrote:
| Probably make sure your host machine's CUDA is also 12.4, and
| if not, update the other CUDA versions in the pastebin to the
| one you have. I don't think it works with CUDA 11.8 though; I
| remember trying it once.
|
| But yeah, I can't help you outside of RunPod; I haven't even
| tried this on my home PCs yet. For my use case of a serverless
| API, it seems to work.
| bufferoverflow wrote:
| Damn, it runs very fast.
| AzN1337c0d3r wrote:
| It's worth noting that this is the laptop 4090 GPU, which is more
| in the range of desktop 4070 performance.
| mesmertech wrote:
| The specific link I shared is the quant running on a desktop 4090
| I rented on RunPod. I have no affiliation with the repo itself.
| qeternity wrote:
| The compute differential between an H100 and a 4090 is not
| huge. The main single GPU benefits are larger memory (and thus
| memory bandwidth) and native fp8. But these matter less for
| diffusion models.
| mesmertech wrote:
| That's what I thought as well, but FP8 is much faster on the
| H100, like 2x-3x. You can check it/s here:
| https://github.com/aredden/flux-fp8-api
|
| It's why fal, Replicate, and pretty much all the big diffusion
| API providers use H100s.
|
| tl;dr: the 4090 maxes out at 3.51 it/s even with all the current
| optimizations; the H100 does 11.5 it/s with all optimizations,
| and even without them it's 6.1 it/s.
| notarealllama wrote:
| I'm convinced the path to ubiquity (such as embedded in
| smartphones) is quantization.
|
| I had to quantize a Llama model to int4 to get it to run properly
| on my 3060.
|
| I'm curious, how much resolution / significant digits do we
| actually need for most genAI work? If you can draw a circle with
| 3.14, maybe it's good enough for fast and ubiquitous usage.
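|
| For reference, "int4-ing" the weights mostly boils down to
| something like this toy absmax rounding (just a sketch; real
| schemes like GPTQ/AWQ/NF4 use group-wise scales, smarter
| rounding, and actually pack two 4-bit values per byte):
|
|       import torch
|
|       def quantize_int4_absmax(w: torch.Tensor, group_size: int = 64):
|           # Toy symmetric 4-bit quantization with per-group absmax scales.
|           groups = w.reshape(-1, group_size)
|           scale = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0  # int4 range [-8, 7]
|           q = torch.clamp(torch.round(groups / scale), -8, 7).to(torch.int8)
|           return q, scale
|
|       def dequantize(q, scale, shape):
|           return (q.float() * scale).reshape(shape)
|
|       w = torch.randn(4096, 4096)
|       q, s = quantize_int4_absmax(w)
|       print((w - dequantize(q, s, w.shape)).abs().mean())  # mean quantization error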
| sigmoid10 wrote:
| Earlier this year there was a paper from Microsoft where they
| trained a 1.58-bit (every parameter being ternary) LLM that
| matched the performance of 16-bit models. There's also other
| research showing that you can prune up to 50% of layers with
| minimal loss of performance. Our current training methods are
| just incredibly crude, and we will probably look back on them in
| the future and wonder how this ever worked at all.
| llm_trw wrote:
| None of those papers actually use quantized training; they are
| all about quantized inference.
|
| Which is rather unfortunate as it means that the difference
| between what you can train locally and what you can run
| locally is growing ever larger.
| danielEM wrote:
| Indeed. I think the "AI gold rush" sucks anyone with skills in
| this area into it with relatively good pay, so there are few or
| no people outside of big tech and startups to counterbalance
| the direction it moves in. And as a side note, big tech has
| always put its own agenda first when developing any tech or
| standard, which usually means milking investments for as long
| as possible, not necessarily moving things forward.
| llm_trw wrote:
| There's more to it than that.
|
| If you could train models faster, you'd be able to build
| larger, more powerful models that outperform the
| competition.
|
| The fact that Llama 3 is trained far beyond what was considered
| compute-optimal even three years ago shows there's a strong
| appetite for efficient training. The lack of progress isn't due
| to a lack of effort; no one has managed to do this yet because
| no one has figured out how.
|
| I built 1-trit quantized models as a side project nearly
| a decade ago. Back then, no one cared because models
| weren't yet using all available memory, and on devices
| where memory was fully utilized, compute power was the
| limiting factor. I spent much longer trying to figure out how to
| get 1-trit training to work and never could. Of
| all the papers and people in the field I've talked to, no
| one else has either.
| sixfiveotwo wrote:
| > I spent much longer trying to figure out how to get
| 1-trit training to work and never could.
|
| What did you try? What were the research directions at
| the time?
| llm_trw wrote:
| This is a big question that needs a research paper's worth of
| explanation. Feel free to email me if you care enough to have a
| more in-depth discussion.
| sixfiveotwo wrote:
| Sorry, I understand that was a bit intrusively direct. For
| context, I toyed a little with neural networks a few years ago
| and wondered about this topic of training a so-called quantized
| network myself (I wanted to write a small multilayer-perceptron
| library parameterized by the coefficient type: floating point
| or integers of different precision), but never implemented it.
| Since you mentioned your own work in that area, it piqued my
| interest, but I don't want to waste your time unnecessarily.
| p1esk wrote:
| People did care back then. This paper jumpstarted the whole
| model-compression field (which had been a hot area of research
| in the early 90s): https://arxiv.org/abs/1511.00363
|
| Before that, in 2012, AlexNet had to be partially split into
| two submodels running on two GPUs (using a form of inter-layer
| grouped convolutions) because it could not fit in the 3GB of a
| single GTX 580.
|
| Ternary networks appeared in 2016. Unless you mean you actually
| tried to train in ternary precision, which is clearly not
| possible with any gradient-based optimization method.
| sigmoid10 wrote:
| That's wrong. I don't know where you got that information
| from, because it is literally the opposite of what is shown
| in the Microsoft paper mentioned above. They explicitly
| introduced this extreme quantization during training from
| scratch and show how it can be made stable.
| llm_trw wrote:
| I got it from section 2.2:
|
| > The number of model parameters is slightly higher in
| the BitLinear setting, as we both have 1.58-bit weights
| as well as the 16-bit shadow weights. However, this fact
| does not change the number of trainable/optimized
| parameters in practice.
|
| https://arxiv.org/html/2407.09527v1
| buildbot wrote:
| Exactly what XNOR-Net was doing way back in 2016: shadow 32-bit
| weights, quantized to 1 bit during the forward pass.
|
| https://arxiv.org/abs/1603.05279
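|
| The shadow-weight trick is roughly this in code: a toy PyTorch
| sketch of a ternary linear layer with a straight-through
| estimator. The absmean scaling is my reading of the b1.58
| recipe, not the paper's exact code.
|
|       import torch
|       import torch.nn as nn
|       import torch.nn.functional as F
|
|       class TernaryLinear(nn.Module):
|           # Full-precision "shadow" weights are ternarized on the forward
|           # pass; the straight-through estimator (STE) lets gradients flow
|           # back to the shadow weights as if no rounding had happened.
|           def __init__(self, in_features, out_features):
|               super().__init__()
|               self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
|
|           def forward(self, x):
|               w = self.weight
|               scale = w.abs().mean().clamp(min=1e-5)             # absmean scale
|               w_q = torch.clamp(torch.round(w / scale), -1, 1) * scale
|               w_q = w + (w_q - w).detach()                       # STE: quantized forward, identity backward
|               return F.linear(x, w_q)
|
|       layer = TernaryLinear(16, 8)
|       layer(torch.randn(4, 16)).sum().backward()
|       print(layer.weight.grad.shape)  # gradients land on the fp32 shadow weights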
|
| I personally have a pretty negative opinion of the bitnet
| paper.
| xrd wrote:
| Can someone explain this sentence from the article?
|
| "Diffusion models, however, are computationally bound, even for
| single batches, so quantizing weights alone yields limited
| gains."
| flutetornado wrote:
| GPU workloads are either compute bound (floating-point
| operations) or memory bound (bytes transferred across the
| memory hierarchy).
|
| Quantization generally helps with the memory bottleneck but
| does not reduce computational cost, so it's not as useful for
| improving the performance of diffusion models. That's what it's
| saying.
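|
| A back-of-envelope roofline makes this concrete (illustrative
| numbers for a 4090-class card, not measurements):
|
|       # Rough roofline estimate: whichever of compute time or weight-traffic
|       # time is larger dominates the layer's runtime.
|       PEAK_FLOPS = 165e12   # ~165 TFLOPS of fp16/bf16 tensor-core compute (illustrative)
|       PEAK_BW = 1.0e12      # ~1 TB/s of memory bandwidth (illustrative)
|
|       def matmul_time(m, n, k, bytes_per_weight):
|           flops = 2 * m * n * k                    # multiply-accumulates
|           weight_bytes = n * k * bytes_per_weight  # weight traffic dominates at small batch
|           return max(flops / PEAK_FLOPS, weight_bytes / PEAK_BW)
|
|       # Batch-1 LLM decode: m=1, weight loading dominates, so 4-bit weights help a lot.
|       print(matmul_time(1, 4096, 4096, 2), matmul_time(1, 4096, 4096, 0.5))
|
|       # Diffusion transformer layer: m is thousands of image tokens, compute dominates,
|       # so shrinking weights alone barely helps; activations/compute must go low-bit too.
|       print(matmul_time(4096, 4096, 4096, 2), matmul_time(4096, 4096, 4096, 0.5))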
| pkAbstract wrote:
| Exactly. The smaller bit widths from quantization might
| marginally decrease the compute required for each operation,
| but they do not reduce the overall volume of operations. So,
| the effect of quantization is generally more impactful on
| memory use than compute.
| superkuh wrote:
| Except in this case they quantized both the weights and the
| activations, leading to decreased compute time too.
| llm_trw wrote:
| Diffusion requires a lot more computation to get results than
| transformers do. Naively, when I'm running a transformer
| locally I get about 30% GPU utilization; when I'm running a
| diffusion model I get 100%.
|
| This means the only speed saving you get for a diffusion model
| is more effective flops because the numbers are smaller, e.g.
| instead of doing one 32-bit multiplication you do eight 4-bit
| ones.
|
| By comparison, for transformers you not only gain the flop
| increase but also the improvement in memory shuffling, e.g. it
| also takes an eighth of the time to load the weights from VRAM
| into working memory.
|
| The above is a vast oversimplification and in practice has more
| asterisks than you can shake a stick at.
| DeathArrow wrote:
| But doesn't quantization give worse results? Don't you trade
| quality for memory footprint?
| timnetworks wrote:
| They're saying this method essentially does not, even when
| low-rank adapters (LoRAs) are mixed in on top. "Notably, while
| the original BF16 model requires per-layer CPU offloading on the
| 16GB laptop 4090, our INT4 model fits entirely in GPU memory,
| resulting in a 10.1x speedup by avoiding offloading."
|
| This is the whole magic: the rest of the workflow doesn't need
| to unload and flush memory, which causes big delays for jobs.
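|
| For a sense of what "per-layer CPU offloading" means, this is
| roughly the BF16 baseline being referred to, sketched with the
| stock diffusers FluxPipeline (loading the INT4 SVDQuant model
| instead goes through the nunchaku repo's own sample code, which
| I'm not reproducing here):
|
|       import torch
|       from diffusers import FluxPipeline
|
|       pipe = FluxPipeline.from_pretrained(
|           "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
|       )
|
|       # Sequential (per-layer) offloading: minimal VRAM use, but every layer
|       # is copied over PCIe on every step; this is the overhead the INT4
|       # model avoids by fitting entirely in GPU memory.
|       pipe.enable_sequential_cpu_offload()
|
|       image = pipe("a photo of an astronaut riding a horse",
|                    num_inference_steps=4, guidance_scale=0.0).images[0]
|       image.save("schnell.png")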
| scottmas wrote:
| Possible to run this in ComfyUI?
| vergessenmir wrote:
| The repo has sample code, and it is fairly easy to create a node
| that will do it.
|
| You won't, however, have access to the usual sampler, latent
| image, or LoRA nodes, so you can't do anything beyond basic t2i.
| doctorpangloss wrote:
| Why? There is nothing to customize with Flux.
| djoldman wrote:
| This is one in a long line of posts saying "we took a model and
| made it smaller" and now it can run with different requirements.
|
| It is important to keep in mind that modifying a model changes
| the performance of the resulting model, where performance is
| "correctness" or "quality" of output.
|
| Just because the base model is very performant does not mean the
| smaller model is.
|
| This means that another model that is the same size as the new
| quantized model may outperform the quantized model.
|
| Suppose there are equal-sized big models A and B with smaller
| quantized variants a and b. A being more performant than B does
| not guarantee that a is more performant than b.
| superkuh wrote:
| Not really. They quantized the activations here too, with their
| inference engine, which decreases compute as well as RAM usage
| (and required bandwidth). That's a big step.
| ttul wrote:
| While I think I agree that there are many posts here on
| HackerNews announcing a new model compression technique, your
| characterization above understates the technical innovations
| and practical impacts described in this MIT paper.
|
| Unlike traditional model compression work that simply applies
| existing techniques, SVDQuant synthesizes several ideas in a
| comprehensive new approach to model quantization:
|
| - Developing a novel outlier absorption mechanism using low-
| rank decomposition -- this aspect alone seems quite novel,
| although the math is admittedly way beyond my level
|
| - Combining SVD with smoothing in a way that specifically
| addresses the unique challenges of diffusion models
|
| - Creating an innovative kernel fusion technique (they call it
| "Nunchaku") that makes the theoretical benefits practically
| realizable, because without this, the extra computation
| required to implement the above steps would simply slow the
| model back down to baseline
|
| This isn't just incremental improvement - the paper achieves
| several breakthrough results:
|
| - First successful 4-bit quantization of both weights AND
| activations for diffusion models
|
| - 3.5x memory reduction for 12B parameter models while
| maintaining image quality
|
| - 3.0x speedup over existing 4-bit weight-only quantization
| approaches
|
| - Enables running 12B parameter models on consumer GPUs that
| previously couldn't handle them
|
| And, I'll add, as someone who has been following the diffusion
| space quite actively for the last two years, the amount of
| creativity that can be unleashed when models are accessible to
| people with consumer GPUs is nothing short of astonishing.
|
| The authors took pains to validate their approach by testing it
| against three models (Flux, PixArt-Sigma, and SDXL) and along
| several quality-comparison axes (FID score, Image Reward,
| LPIPS, and PSNR). They also did a proper ablation study to see
| the contribution of each component in their approach to image
| quality.
|
| What particularly excites me about this paper is not the
| ability to run a model that eats 22GB of VRAM in just 7GB. The
| exciting thing is the prospect of running a 60GB model in 20GB
| of VRAM. I'm not sure whether anyone has or is planning to
| train such a monster, but I suspect that Midjourney, OpenAI,
| and Google all have significantly larger models running in
| their infrastructure than what can be run on consumer hardware.
| The more dimensions you can throw at image and video
| generation, the better things get.
| djoldman wrote:
| I definitely agree that there may be some interesting
| advancements here.
|
| I am trying to call attention to the models used for
| evaluation comparison. There are 3 factors: inference
| speed/latency, model size in total loaded VRAM, and model
| performance in terms of output.
|
| Comparisons should address all of these considerations,
| otherwise it's easy to hide deficiencies.
| Jackson__ wrote:
| The site literally has a quick visual comparison near the
| top, which shows that theirs is the closest to 16bit
| performance compared to the others. I don't get what more
| you'd want.
|
| https://cdn.prod.website-files.com/64f4e81394e25710d22d042e/...
| djoldman wrote:
| These are comparisons to other quantizing methods. That
| is fine.
|
| What I want to see is comparisons to NON-quantized models
| all with around the same VRAM along with associated
| inference latencies.
|
| Also, we would want to see the same quantization scheme applied
| to other base models, because perhaps the paper's proposed
| scheme only beats the others on a particular base model.
| snovv_crash wrote:
| They tested the quantisation on 3 different models.
|
| They also show it has little to no effect relative to
| fp16 on these models.
|
| IMO that's enough. Comparison against smaller models is
| much less useful because you can't use the same random
| seeds. So you end up with a very subjective "this is worse"
| based purely on the aesthetic preferences of one person vs.
| another. You already see this with Flux Schnell vs. the larger
| Flux models.
| refulgentis wrote:
| I'm really confused; this looks like concern trolling, because
| there's a live demo for exactly this A/B testing that, IIRC,
| was near the top of the article, close enough that it was the
| first link I clicked.
|
| But you're quite persistent that they need to address this, so
| it seems more likely that they silently added it after your
| original post, or that you didn't click through; concern
| trolling would stay vaguer.
| aaronblohowiak wrote:
| >What I want to see is comparisons to NON-quantized
| models
|
| isn't that the first image in the diagram / the 22GB model that
| took 111 seconds?
| boulos wrote:
| As others have replied, this is reasonable general feedback,
| but in this specific case the work was done carefully. Table 1
| from the linked paper (https://arxiv.org/pdf/2411.05007)
| includes a variety of metrics, while an entire appendix is
| dedicated to quality comparisons.
|
| By showing their work side-by-side with other quantization
| schemes, you can also see a great example of the flavor of
| different results you can get with these slight tweaks (e.g.,
| ViDiT INT8) _and_ that their quantization does a much better
| job in _reproducing_ the "original" (Figure 15).
|
| In this application, it's not strictly true that you _care_ to
| have the same results, but this work does a pretty good job of
| it.
| djoldman wrote:
| Agreed.
|
| Once a model has been trained, I believe the main metrics
| people care about are
|
| 1. inference speed
|
| 2. memory requirements
|
| 3. quality of output.
|
| There are usually tradeoffs here. Generally you get a lower
| memory requirement (a good thing), sometimes faster inference
| (a good thing), but usually a lower quality of output.
|
| I don't think reproduction of original output is the typical
| goal.
| tbalsam wrote:
| Did you... did you read the technical details? This is almost
| all they talk about; it's exactly what this method was created
| to get around.
| Take a look, it's good stuff! Basically a LoRA to reconstruct
| outliers lost by quantization, helping keep the performance of
| the original model.
| atlex2 wrote:
| Seriously, nobody thought to use SVD on these weight matrices
| before?
| liuliu wrote:
| I did try, but in the wrong way: I tried to SVD the quantization
| error to recover quality, i.e. SVD(W - Q(W)). The lightbulb
| moment in this paper is to do SVD on W first and then quantize
| the remainder.
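|
| In code, the difference between the two orderings is roughly
| this (a toy sketch: per-tensor 4-bit fake-quantization, a
| synthetic outlier-heavy W for illustration, and none of the
| paper's activation smoothing):
|
|       import torch
|
|       def q4(w):
|           # Toy per-tensor symmetric 4-bit quantize/dequantize.
|           scale = w.abs().max() / 7.0
|           return torch.clamp(torch.round(w / scale), -8, 7) * scale
|
|       def low_rank(w, rank):
|           U, S, Vh = torch.linalg.svd(w, full_matrices=False)
|           return U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank]
|
|       def svd_first(w, rank=32):
|           # SVDQuant-style ordering: pull the dominant singular directions out
|           # of W, then quantize only the (smaller-magnitude) residual.
|           L = low_rank(w, rank)
|           return L + q4(w - L)
|
|       def quant_first(w, rank=32):
|           # The other ordering: quantize W, then try to recover the error with SVD.
|           w_q = q4(w)
|           return w_q + low_rank(w - w_q, rank)
|
|       # Synthetic weight matrix whose energy concentrates in a few directions.
|       w = torch.randn(1024, 16) @ torch.randn(16, 1024) + 0.1 * torch.randn(1024, 1024)
|       print("SVD first:  ", (w - svd_first(w)).norm().item())
|       print("quant first:", (w - quant_first(w)).norm().item())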
___________________________________________________________________
(page generated 2024-11-09 23:00 UTC)