[HN Gopher] SVDQuant: 4-Bit Quantization Powers 12B Flux on a 16...
       ___________________________________________________________________
        
       SVDQuant: 4-Bit Quantization Powers 12B Flux on a 16GB 4090 GPU
       with 3x Speedup
        
       Author : lmxyy
       Score  : 145 points
       Date   : 2024-11-09 07:46 UTC (15 hours ago)
        
 (HTM) web link (hanlab.mit.edu)
 (TXT) w3m dump (hanlab.mit.edu)
        
       | mesmertech wrote:
       | Demo on actual 4090 with flux schnell for next few hours:
       | https://5jkdpo3rnipsem-3000.proxy.runpod.net/
       | 
        | It's basically H100 speeds on a 4090: 4.80 it/s, i.e. 1.1 sec for
        | flux schnell (4 steps) and 5.5 seconds for flux dev (25 steps).
        | Compare that to normal speeds (comfyui fp8 with the "--fast"
        | optimization), which are 3 seconds for schnell and 11.5 seconds
        | for dev.
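        | 
        | As a rough sanity check on those numbers (a back-of-the-envelope
        | Python calculation that ignores text-encoder/VAE overhead):
        | 
        |   it_per_s = 4.80                    # reported 4090 speed with this quant
        |   for name, steps in [("schnell", 4), ("dev", 25)]:
        |       denoise_s = steps / it_per_s   # pure denoising time per image
        |       print(name, round(denoise_s, 2), "s of denoising")
        |   # ~0.83 s and ~5.2 s; the reported 1.1 s / 5.5 s totals add the
        |   # text encoders, VAE decode and other per-image overhead on top.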
        
         | yakorevivan wrote:
         | Hey, can you share the inference code please? Thanks..
        
           | superkuh wrote:
           | https://github.com/mit-han-lab/nunchaku
        
             | oneshtein wrote:
              | Cannot compile it locally on Fedora 40:
              | 
              |   nunchaku/third_party/spdlog/include/spdlog/common.h(144): error: namespace "std" has no member "function"
              |     using err_handler = std::function<void(const std::string &err_msg)>;
              |                         ^
        
               | mesmertech wrote:
                | Yeah, it's a pain. I'm trying to make an API endpoint for
                | a website I own and am working on a Docker image. This is
                | what I have for now that "just" works:
               | 
                | The conda always-yes setting makes sure you can just
                | paste the script and it all works, instead of having to
                | press "y" for each install. Also, if you don't feel like
                | installing a wheel from a random person on the internet,
                | replace that step with "pip install -e ." as the repo
                | suggests. I compiled that wheel with CUDA 12.4, since
                | that's the part that takes the most time and is what most
                | often seems to break.
               | 
                | Also, I'm not sure if this will work on Fedora. I tried
                | this on a runpod machine with a 4090 (apparently it only
                | works on a few cards: 3090, 4090, A100, etc.) with CUDA
                | 12.4 on the host machine and
                | "runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-
                | ubuntu22.04" as the base image.
               | 
                | EDIT: using pastebin instead, as HN doesn't seem to play
                | nicely with code blocks: https://pastebin.com/zK1z0UdM
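                | 
                | Before building, a quick sanity check that the
                | container's PyTorch matches the host CUDA and that the
                | GPU is one of the supported architectures (a minimal
                | sketch, assuming a CUDA-enabled PyTorch install):
                | 
                |   import torch
                |   print(torch.__version__, torch.version.cuda)  # expect a 12.4 build here
                |   print(torch.cuda.get_device_name(0))
                |   # 3090 -> (8, 6), 4090 -> (8, 9), A100 -> (8, 0)
                |   print(torch.cuda.get_device_capability(0))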
        
               | oneshtein wrote:
                | Almost working:
                | 
                |   [2024-11-09 19:33:55.214] [info] Initializing QuantizedFluxModel
                |   [2024-11-09 19:33:55.359] [info] Loading weights from ~/.cache/huggingface/hub/models--mit-han-lab--svdquant-models/snapshots/d2a46e82a378ec70e3329a2219ac4331a444a999/svdq-int4-flux.1-schnell.safetensors
                |   [2024-11-09 19:34:01.432] [warning] Unable to pin memory: invalid argument
                |   [2024-11-09 19:34:02.143] [info] Done.
                |   terminate called after throwing an instance of 'CUDAError'
                |     what():  CUDA error: pointer does not correspond to a registered memory region (at /nunchaku/src/Serialization.cpp:32)
        
               | mesmertech wrote:
                | Probably make sure your host machine's CUDA is also 12.4,
                | and if not, update the other CUDA versions I have in the
                | pastebin to the one you have. I don't think it works with
                | CUDA 11.8 though; I remember trying it once.
                | 
                | But yeah, I can't help you outside of runpod; I haven't
                | even tried this on my home PCs yet. For my use case of a
                | serverless API, it seems to work.
        
         | bufferoverflow wrote:
         | Damn, it runs very fast.
        
         | AzN1337c0d3r wrote:
          | It's worth noting this is the laptop 4090 GPU, which is more in
          | the range of desktop 4070 performance.
        
           | mesmertech wrote:
            | This specific link I shared is the quant running on a 4090 I
            | rented on runpod; I have no affiliation with the repo itself.
        
         | qeternity wrote:
         | The compute differential between an H100 and a 4090 is not
         | huge. The main single GPU benefits are larger memory (and thus
         | memory bandwidth) and native fp8. But these matter less for
         | diffusion models.
        
           | mesmertech wrote:
            | That's what I thought as well, but FP8 is much faster on the
            | H100, like 2x-3x. You can check it/s here:
            | https://github.com/aredden/flux-fp8-api
            | 
            | It's why fal, replicate, and pretty much all the big
            | diffusion API providers use H100s.
            | 
            | tl;dr: the 4090 maxes out at 3.51 it/s even with all the
            | current optimizations. The H100 is 11.5 it/s with all
            | optimizations, and even without them it's 6.1 it/s.
        
       | notarealllama wrote:
       | I'm convinced the path to ubiquity (such as embedded in
       | smartphones) is quantization.
       | 
       | I had to int4 a llama model to get it to properly run on my 3060.
       | 
        | I'm curious: how much resolution / how many significant digits do
        | we actually need for most genAI work? If you can draw a circle
        | with 3.14, maybe that's good enough for fast and ubiquitous usage.
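        | 
        | A minimal sketch of that kind of weight-only 4-bit load (using
        | bitsandbytes NF4 through transformers; the model name here is
        | just an example):
        | 
        |   import torch
        |   from transformers import AutoModelForCausalLM, BitsAndBytesConfig
        | 
        |   bnb = BitsAndBytesConfig(
        |       load_in_4bit=True,                      # store weights as 4-bit NF4
        |       bnb_4bit_quant_type="nf4",
        |       bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
        |   )
        |   model = AutoModelForCausalLM.from_pretrained(
        |       "meta-llama/Llama-2-7b-hf",
        |       quantization_config=bnb, device_map="auto")
        | 
        | Weight storage drops roughly 4x versus fp16, which is what lets a
        | 7B model fit comfortably on a 12GB card.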
        
         | sigmoid10 wrote:
          | Earlier this year there was a paper from Microsoft where they
          | trained a 1.58-bit (every parameter being ternary) LLM that
          | matched the performance of 16-bit models. There's also other
          | research showing that you can prune up to 50% of layers with
          | minimal loss of performance. Our current training methods are
          | just incredibly crude, and we will probably look back on them
          | in the future and wonder how this ever worked at all.
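          | 
          | The ternary ("1.58-bit", since log2(3) ~ 1.58) weight scheme
          | from that paper boils down to something like this (a simplified
          | sketch of absmean quantization; the actual training keeps
          | full-precision shadow weights and a straight-through
          | estimator):
          | 
          |   import torch
          | 
          |   def ternarize(w: torch.Tensor, eps: float = 1e-8):
          |       # Scale by the mean absolute value, then round each
          |       # weight to -1, 0, or +1.
          |       gamma = w.abs().mean().clamp_min(eps)
          |       w_t = (w / gamma).round().clamp_(-1, 1)
          |       return w_t, gamma      # reconstruct as w_t * gamma
          | 
          |   w = torch.randn(4096, 4096)
          |   w_t, gamma = ternarize(w)
          |   print(w_t.unique())        # tensor([-1., 0., 1.])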
        
           | llm_trw wrote:
           | None of those papers actually use quantized training, they
           | are all about quantized inference.
           | 
           | Which is rather unfortunate as it means that the difference
           | between what you can train locally and what you can run
           | locally is growing ever larger.
        
             | danielEM wrote:
              | Indeed. I think the "AI gold rush" sucks anyone with any
              | skills in this area into it with relatively good pay, so
              | there are few or no people outside of big tech and startups
              | to counterbalance the direction it moves in. And as a side
              | note, big tech is and always has been putting its own
              | agenda first when developing any tech or standards, and
              | that usually means milking investments for as long as
              | possible, not necessarily moving things forward.
        
               | llm_trw wrote:
               | There's more to it than that.
               | 
               | If you could train models faster, you'd be able to build
               | larger, more powerful models that outperform the
               | competition.
               | 
                | The fact that Llama 3 is significantly over-trained
                | relative to what was considered ideal even three years
                | ago shows there's a strong appetite for efficient
                | training. The lack of progress isn't due to a lack of
                | effort. No one has managed to do this yet because no one
                | has figured out how.
               | 
                | I built 1-trit quantized models as a side project nearly
                | a decade ago. Back then, no one cared because models
                | weren't yet using all available memory, and on devices
                | where memory was fully utilized, compute power was the
                | limiting factor. I spent much longer trying to figure out
                | how to get 1-trit training to work, and I never could. Of
                | all the papers and people in the field I've talked to, no
                | one else has either.
        
               | sixfiveotwo wrote:
                | > I spent much longer trying to figure out how to get
                | > 1-trit training to work, and I never could.
               | 
               | What did you try? What were the research directions at
               | the time?
        
               | llm_trw wrote:
               | This is a big question that needs a research paper worth
               | of explanation. Feel free to email me if you care enough
               | to have a more in-depth discussion.
        
               | sixfiveotwo wrote:
                | Sorry, I understand it was a bit intrusively direct. For
                | some context, I toyed a little with neural networks a few
                | years ago and wondered myself about this topic of
                | training a so-called quantized network (I wanted to write
                | a small multilayer-perceptron-based library parameterized
                | by the coefficient type: floating point or integer of
                | different precision), but didn't implement it. Since you
                | mentioned your own work in that area, it piqued my
                | interest, but I don't want to waste your time
                | unnecessarily.
        
               | p1esk wrote:
                | People did care back then. This paper jumpstarted the
                | whole model compression field (which used to be a hot
                | area of research in the early 90s):
                | https://arxiv.org/abs/1511.00363
                | 
                | Before that, in 2012, AlexNet had to be partially split
                | into two submodels running on two GPUs (using a form of
                | interlayer grouped convolutions) because it could not fit
                | in the 3GB of a single GTX 580 card.
                | 
                | Ternary networks appeared in 2016. Unless you mean you
                | actually tried to train in ternary precision - clearly
                | not possible with any gradient-based optimization
                | methods.
        
             | sigmoid10 wrote:
             | That's wrong. I don't know where you got that information
             | from, because it is literally the opposite of what is shown
             | in the Microsoft paper mentioned above. They explicitly
             | introduced this extreme quantization during training from
             | scratch and show how it can be made stable.
        
               | llm_trw wrote:
               | I got it from section 2.2
               | 
               | > The number of model parameters is slightly higher in
               | the BitLinear setting, as we both have 1.58-bit weights
               | as well as the 16-bit shadow weights. However, this fact
               | does not change the number of trainable/optimized
               | parameters in practice.
               | 
               | https://arxiv.org/html/2407.09527v1
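                | 
                | A minimal sketch of that shadow-weight setup (quantize on
                | the forward pass, pass gradients straight through to the
                | full-precision weights; simplified, not the paper's exact
                | recipe):
                | 
                |   import torch
                |   import torch.nn as nn
                | 
                |   class TernaryLinear(nn.Module):
                |       def __init__(self, d_in, d_out):
                |           super().__init__()
                |           # full-precision "shadow" weights, which the optimizer updates
                |           self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02)
                | 
                |       def forward(self, x):
                |           w = self.weight
                |           gamma = w.abs().mean()
                |           w_q = (w / gamma).round().clamp(-1, 1) * gamma
                |           # Straight-through estimator: forward uses w_q,
                |           # backward treats the quantizer as identity.
                |           w_ste = w + (w_q - w).detach()
                |           return x @ w_ste.t()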
        
               | buildbot wrote:
                | Exactly what XNOR-Net was doing way back in 2016: shadow
                | 32-bit weights, quantized to 1 bit during the forward
                | pass.
               | 
               | https://arxiv.org/abs/1603.05279
               | 
               | I personally have a pretty negative opinion of the bitnet
               | paper.
        
       | xrd wrote:
        | Can someone explain this sentence from the article?
        | 
        |   "Diffusion models, however, are computationally bound, even
        |   for single batches, so quantizing weights alone yields limited
        |   gains."
        
         | flutetornado wrote:
          | GPU workloads are either compute bound (floating-point
          | operations) or memory bound (bytes being transferred across
          | the memory hierarchy).
          | 
          | Quantizing in general helps with the memory bottleneck but does
          | not help in reducing computational cost, so it's not as useful
          | for improving the performance of diffusion models. That's what
          | it's saying.
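          | 
          | A rough back-of-the-envelope illustration (the throughput and
          | bandwidth figures are assumed ballpark numbers for a 4090):
          | 
          |   # A layer is memory bound when reading its weights takes
          |   # longer than doing its math.
          |   flops_per_s = 165e12   # assumed ~4090 fp16 tensor throughput
          |   bytes_per_s = 1.0e12   # assumed ~4090 memory bandwidth
          | 
          |   def layer_times(tokens, d_in, d_out, bytes_per_weight):
          |       flops = 2 * tokens * d_in * d_out       # one GEMM
          |       weight_bytes = d_in * d_out * bytes_per_weight
          |       return flops / flops_per_s, weight_bytes / bytes_per_s
          | 
          |   # LLM decoding, one token at a time: weight reads dominate,
          |   # so shrinking weights to 4-bit is a direct win.
          |   print(layer_times(1, 4096, 4096, 2))      # ~0.2 us compute, ~34 us memory
          | 
          |   # Diffusion step, thousands of image tokens at once: the math
          |   # dominates, so you also need cheaper math (4-bit activations
          |   # too), not just smaller weights.
          |   print(layer_times(4096, 4096, 4096, 2))   # ~830 us compute, ~34 us memory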
        
           | pkAbstract wrote:
           | Exactly. The smaller bit widths from quantization might
           | marginally decrease the compute required for each operation,
           | but they do not reduce the overall volume of operations. So,
           | the effect of quantization is generally more impactful on
           | memory use than compute.
        
             | superkuh wrote:
              | Except in this case they quantized both the parameters and
              | the activations, leading to decreased compute time too.
        
         | llm_trw wrote:
          | Diffusion requires a lot more computation to get results
          | compared to transformers. Naively, when I'm running a
          | transformer locally I get about 30% GPU utilization; when I'm
          | running a diffusion model I'm getting 100%.
          | 
          | This means that the only speed saving you're getting for a
          | diffusion model is being able to do more effective flops since
          | the numbers are smaller, e.g. instead of doing one 32-bit
          | multiplication, you're doing eight 4-bit ones.
          | 
          | By comparison, for transformers you not only gain the flop
          | increase but also the improvement in memory movement, e.g. it
          | also takes you roughly an eighth of the time to load the
          | weights from VRAM into working memory.
          | 
          | The above is a vast oversimplification and in practice will
          | have more asterisks than you can shake a stick at.
        
       | DeathArrow wrote:
       | But doesn't quantization give worse results? Don't you trade
       | quality for memory footprint?
        
         | timnetworks wrote:
          | They're saying this method essentially does not, even when
          | mixed with low-rank models on top. "Notably, while the original
          | BF16 model requires per-layer CPU offloading on the 16GB laptop
          | 4090, our INT4 model fits entirely in GPU memory, resulting in
          | a 10.1x speedup by avoiding offloading."
          | 
          | This is the whole magic: the rest of the workflow doesn't need
          | to unload and flush memory, which causes big delays for jobs.
        
       | scottmas wrote:
       | Possible to run this in ComfyUI?
        
         | vergessenmir wrote:
          | The repo has sample code, and it is fairly easy to create a
          | node that will do it.
          | 
          | You won't, however, have access to the usual sampler, latent
          | image, and LoRA nodes to do anything beyond basic t2i.
        
         | doctorpangloss wrote:
         | Why? There is nothing to customize with Flux.
        
       | djoldman wrote:
       | This is one in a long line of posts saying "we took a model and
       | made it smaller" and now it can run with different requirements.
       | 
       | It is important to keep in mind that modifying a model changes
       | the performance of the resulting model, where performance is
       | "correctness" or "quality" of output.
       | 
       | Just because the base model is very performant does not mean the
       | smaller model is.
       | 
       | This means that another model that is the same size as the new
       | quantized model may outperform the quantized model.
       | 
        | Suppose there are equal-sized big models A and B with their
        | smaller quantized variants a and b. A being a more performant
        | model than B does not guarantee that a is more performant than b.
        
         | superkuh wrote:
          | Not really. They quantized the activations here with their
          | inference program, which decreased compute as well as RAM
          | usage (and the required bandwidth). That's a big step.
        
         | ttul wrote:
         | While I think I agree that there are many posts here on
         | HackerNews announcing a new model compression technique, your
         | characterization above understates the technical innovations
         | and practical impacts described in this MIT paper.
         | 
         | Unlike traditional model compression work that simply applies
         | existing techniques, SVDQuant synthesizes several ideas in a
         | comprehensive new approach to model quantization:
         | 
         | - Developing a novel outlier absorption mechanism using low-
         | rank decomposition -- this aspect alone seems quite novel,
         | although the math is admittedly way beyond my level
         | 
         | - Combining SVD with smoothing in a way that specifically
         | addresses the unique challenges of diffusion models
         | 
         | - Creating an innovative kernel fusion technique (they call it
         | "Nunchaku") that makes the theoretical benefits practically
         | realizable, because without this, the extra computation
         | required to implement the above steps would simply slow the
         | model back down to baseline
         | 
         | This isn't just incremental improvement - the paper achieves
         | several breakthrough results:
         | 
         | - First successful 4-bit quantization of both weights AND
         | activations for diffusion models
         | 
         | - 3.5x memory reduction for 12B parameter models while
         | maintaining image quality
         | 
         | - 3.0x speedup over existing 4-bit weight-only quantization
         | approaches
         | 
         | - Enables running 12B parameter models on consumer GPUs that
         | previously couldn't handle them
         | 
         | And, I'll add, as someone who has been following the diffusion
         | space quite actively for the last two years, the amount of
         | creativity that can be unleashed when models are accessible to
         | people with consumer GPUs is nothing short of astonishing.
         | 
         | The authors took pains to validate their approach by testing it
         | against three models (Flux, PixArt-Sigma, and SDXL) and along
         | several quality-comparison axes (FID score, Image Reward,
         | LPIPS, and PSNR). They also did a proper ablation study to see
         | the contribution of each component in their approach to image
         | quality.
         | 
         | What particularly excites me about this paper is not the
         | ability to run a model that eats 22GB of VRAM in just 7GB. The
         | exciting thing is the prospect of running a 60GB model in 20GB
         | of VRAM. I'm not sure whether anyone has or is planning to
         | train such a monster, but I suspect that Midjourney, OpenAI,
         | and Google all have significantly larger models running in
         | their infrastructure than what can be run on consumer hardware.
         | The more dimensions you can throw at image and video
         | generation, the better things get.
        
           | djoldman wrote:
           | I definitely agree that there may be some interesting
           | advancements here.
           | 
           | I am trying to call attention to the models used for
           | evaluation comparison. There are 3 factors: inference
           | speed/latency, model size in total loaded VRAM, and model
           | performance in terms of output.
           | 
            | Comparisons should address all of these considerations;
            | otherwise it's easy to hide deficiencies.
        
             | Jackson__ wrote:
             | The site literally has a quick visual comparison near the
             | top, which shows that theirs is the closest to 16bit
             | performance compared to the others. I don't get what more
             | you'd want.
             | 
             | https://cdn.prod.website-
             | files.com/64f4e81394e25710d22d042e/...
        
               | djoldman wrote:
               | These are comparisons to other quantizing methods. That
               | is fine.
               | 
                | What I want to see is comparisons to NON-quantized
                | models, all with around the same VRAM footprint, along
                | with the associated inference latencies.
                | 
                | Also, we would want to see the same quantizing schemes
                | applied to other base models, because perhaps the
                | paper's proposed quantizing scheme only beats others
                | on a particular base model.
        
               | snovv_crash wrote:
               | They tested the quantisation on 3 different models.
               | 
               | They also show it has little to no effect relative to
               | fp16 on these models.
               | 
                | IMO that's enough. Comparison against smaller models is
                | much less useful because you can't use the same random
                | seeds. So you end up with a very subjective "this is
                | worse" based purely on the aesthetic preferences of one
                | person vs another. You already see this with Flux Schnell
                | vs. the larger Flux models.
        
               | refulgentis wrote:
                | I'm really confused; this looks like concern trolling,
                | because there's a live demo for exactly this A/B testing
                | that, IIRC, was near the top of the article, close enough
                | that it was the first link I clicked.
                | 
                | But you're quite persistent that they need to address
                | this, so it seems much more likely that they silently
                | added it after your original post, or that you didn't
                | click through; concern trolling would stay vaguer.
        
               | aaronblohowiak wrote:
               | >What I want to see is comparisons to NON-quantized
               | models
               | 
                | Isn't that the first image in the diagram, the 22GB model
                | that took 111 seconds?
        
         | boulos wrote:
         | As others have replied, this is reasonable general feedback,
         | but in this specific case the work was done carefully. Table 1
         | from the linked paper (https://arxiv.org/pdf/2411.05007)
         | includes a variety of metrics, while an entire appendix is
         | dedicated to quality comparisons.
         | 
         | By showing their work side-by-side with other quantization
         | schemes, you can also see a great example of the flavor of
         | different results you can get with these slight tweaks (e.g.,
         | ViDiT INT8) _and_ that their quantization does a much better
         | job in _reproducing_ the  "original" (Figure 15).
         | 
         | In this application, it's not strictly true that you _care_ to
         | have the same results, but this work does a pretty good job of
         | it.
        
           | djoldman wrote:
           | Agreed.
           | 
           | Once a model has been trained, I believe the main metrics
           | people care about are
           | 
           | 1. inference speed
           | 
           | 2. memory requirements
           | 
           | 3. quality of output.
           | 
           | There are usually tradeoffs here. Generally you get a lower
           | memory requirement (a good thing), sometimes faster inference
           | (a good thing), but usually a lower quality of output.
           | 
           | I don't think reproduction of original output is the typical
           | goal.
        
         | tbalsam wrote:
          | Did you... did you read the technical details? This is almost
          | all they talk about; it's what this method was created to get
          | around.
         | 
          | Take a look, it's good stuff! Basically a LoRA-style low-rank
          | branch to reconstruct outliers lost by quantization, helping
          | keep the performance of the original model.
        
       | atlex2 wrote:
        | Seriously, nobody thought to use SVD on these weight matrices
        | before?
        
         | liuliu wrote:
          | I did try, but in the wrong way (trying to SVD the quantization
          | error to recover quality, i.e. SVD(W - Q(W))). The lightbulb
          | moment in this paper is to do SVD on W first and then quantize
          | the remainder.
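          | 
          | Roughly, the decomposition looks like this (a simplified
          | sketch; the paper additionally migrates activation outliers
          | into W via smoothing and uses finer-grained scales than the
          | single per-tensor scale here):
          | 
          |   import torch
          | 
          |   def svdquant_decompose(W, rank=32, nbits=4):
          |       # Low-rank branch: keep the top-`rank` singular directions
          |       # of W in 16-bit, which soak up most of the outliers.
          |       U, S, Vh = torch.linalg.svd(W, full_matrices=False)
          |       L1 = U[:, :rank] * S[:rank]       # (out, r)
          |       L2 = Vh[:rank, :]                 # (r, in)
          |       R = W - L1 @ L2                   # residual with a smaller dynamic range
          |       # Naive symmetric int4 quantization of the residual.
          |       qmax = 2 ** (nbits - 1) - 1
          |       scale = R.abs().max() / qmax
          |       Rq = (R / scale).round().clamp(-qmax - 1, qmax)
          |       return L1, L2, Rq, scale          # W_hat = L1 @ L2 + Rq * scale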
        
       ___________________________________________________________________
       (page generated 2024-11-09 23:00 UTC)