[HN Gopher] Fine tune a 70B language model at home
___________________________________________________________________
Fine tune a 70B language model at home
Jeremy from Answer.AI here. This is our first project since
launching our new R&D lab at the start of this year. It's the #1
most requested thing I've been hearing from open source model
builders: the ability to use multiple GPUs with QLoRA training. So
that's why we decided to make it our first project. Huge thanks to
Tim Dettmers for helping us get started on this -- and of course
for creating QLoRA in the first place! Let me know if you have any
questions or thoughts.
Author : jph00
Score : 684 points
Date : 2024-03-07 22:38 UTC (1 day ago)
(HTM) web link (www.answer.ai)
(TXT) w3m dump (www.answer.ai)
| ricopags wrote:
| This is such exciting news! Huge thanks to you for your continued
| work in making sense of AI.
|
| I wonder if the recent BitNet 1.58 paper [the use of ternary
| weights in lieu of fp/int] might be an advancement that could
| further reduce the computation required for inference?
| jph00 wrote:
| Yes, along with the many other <4 bit quant methods recently
| developed -- there's been a wonderful boom in low-bit quant
| methods in the last 6 months, and we've got our own ideas for
| taking them further too. Along with QLoRA/FSDP, we're likely to
| see big advances in model training this year on consumer
| hardware.
| artninja1988 wrote:
| So, as I understand it, this is for finetuning a preexisting llm?
| So not actually training one from scratch. I guess that would be
| too much to ask for. Nonetheless, cheers to Jeremy and the gang
| for the work.
| jph00 wrote:
| For now, it's for finetuning.
|
| To what degree it might be possible to train a model from
| scratch using QLoRA is still an open question. The ReLoRA
| paper showed that it can work in some situations, but
| attempts to scale it up were unsuccessful. The recent DoRA
| paper might allow a "re-DoRA" approach to work. If so, that
| could be combined with quantization to do "re-QDoRA"!
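|
| A loose sketch of the ReLoRA-style idea (shapes, rank, and the
| loop here are illustrative, not taken from the paper): train a
| low-rank adapter for a while, merge it into the frozen base
| weights, then restart with a fresh adapter.
|     import torch
|
|     def merge_lora(W, A, B, scale=1.0):
|         # standard LoRA merge: W <- W + scale * B @ A
|         return W + scale * (B @ A)
|
|     d, r = 4096, 16
|     W = torch.randn(d, d)             # frozen (possibly quantized) base
|     for restart in range(3):          # each restart adds a new delta
|         A = torch.randn(r, d) * 0.01  # trainable
|         B = torch.zeros(d, r)         # zero-init so the delta starts at 0
|         # ... train A and B for some steps with W frozen ...
|         W = merge_lora(W, A, B)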
| qsi wrote:
| The headline and introduction on the linked page say "You can
| now train a 70b language model at home. We're releasing an
| open source system, based on FSDP and QLoRA, that can train a
| 70b model on two 24GB GPUs."
|
| How does "fine tuning" differ from "training?" Reading the
| linked article I had assumed I could create my own trained
| LLM at home with two 24GB GPUs.
| keremturgutlu wrote:
| You most definitely can; the main difference is that only a
| small fraction (~2%) of the parameters get updated during
| training. Say you start from a model like Llama-70B, which
| already knows English and has some world knowledge from its
| pretraining dataset. It might not be ideal for drastic
| domain shifts, such as adapting a model to learn new
| languages (which might require a new tokenizer and new model
| embeddings), but it still might be possible to some extent.
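|
| As a back-of-the-envelope sketch of why so few parameters are
| trainable (illustrative shapes, not Llama-70B's exact ones): a
| LoRA adapter on a d_out x d_in weight trains two thin matrices
| B (d_out x r) and A (r x d_in) instead of the full matrix.
|     d_in, d_out, r = 8192, 8192, 16
|
|     full_params = d_in * d_out        # frozen base weight
|     lora_params = r * (d_in + d_out)  # trainable adapter
|     print(lora_params / full_params)  # ~0.004 for this one layer
| Summed over all adapted layers, and with larger ranks or more
| target modules, typical setups still end up with only a few
| percent of the parameters trainable at most.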
| qsi wrote:
| Thank you for clarifying. I have been wanting to dip my
| toes into LLMs at home but obviously I have a steep
| learning curve ahead of me, and would need considerably
| beefier hardware!
| chasd00 wrote:
| It's steep but manageable, absolutely go for it. The more
| people who understand the tech the better.
| IanCal wrote:
| You can take an existing 70B model and train it to do a
| more specific task. You're teaching it the task but you're
| relying on a foundation model for the base understanding of
| the world/words/etc.
| qsi wrote:
| OK, that makes sense. Thank you!
| jph00 wrote:
| The article actually sneaks in a footnote that answers this
| (https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html#fn1):
| "Throughout this article "training" can refer to either
| pre-training, or fine-tuning".
|
| (Generally, we've told students at fast.ai since 2017 that
| they should almost never be starting from random weights --
| most of the time it's best to start with a pretrained model
| and fine-tune that, even if it's from a somewhat different
| domain to the problem you're working on.)
| Tomte wrote:
| Have you changed your mind on "The End of Finetuning"
| (https://www.latent.space/p/fastai) or did I simply
| misunderstand that?
|
| Oh, and thanks for quirky stuff like your APL video!
| jph00 wrote:
| The title of that podcast isn't something I actually said
| (IIRC). I commented in that interview that I feel we
| should not consider pre-training and fine-tuning to be as
| separate as we do now.
| Tomte wrote:
| So you're generally in favor of mixing training data
| without separating them in phases, but when I use
| pretrained weights (as you recommend instead of random
| weights) I generally do not have access to whatever the
| neural net was pretrained with by someone else, so I have
| to make do with my finetuning data, yes?
|
| Thank you!
| pama wrote:
| Yes.
| hantusk wrote:
| Digging into the low rank structure of the gradients, instead
| of the weights seems like a promising direction for training
| from scratch with less memory requirements:
| https://twitter.com/AnimaAnandkumar/status/17656138151468933...
| hantusk wrote:
| Simo linked some older papers with this same idea:
| https://twitter.com/cloneofsimo/status/1765796493955674286
| buildbot wrote:
| Lit-GPT is what I have been using to pretrain models at home:
| https://github.com/Lightning-AI/litgpt Using the openwebtext
| example, I can train a 700M param model to 2.6 loss in a few
| days on dual 4090s. Pretty awesome!
| Tepix wrote:
| Training a 70b model from scratch uses 80,000 GPU hours (4.6
| years if you have two of those GPUs).
|
| The electricity would cost more than 10,000 EUR in Germany,
| just for the GPUs.
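|
| Rough arithmetic behind those numbers, with assumed figures
| (about 450 W per GPU under load and roughly 0.35 EUR/kWh;
| adjust for your own setup):
|     gpu_hours = 80_000
|     num_gpus = 2
|     watts_per_gpu = 450                  # assumed draw under load
|     eur_per_kwh = 0.35                   # assumed German price
|
|     years = gpu_hours / num_gpus / 24 / 365        # ~4.6 years
|     energy_kwh = gpu_hours * watts_per_gpu / 1000  # ~36,000 kWh
|     cost_eur = energy_kwh * eur_per_kwh            # ~12,600 EUR
|     print(round(years, 1), round(energy_kwh), round(cost_eur))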
| buildbot wrote:
| Nice, I tried to use QLoRA+FSDP in the past with litgpt and
| obviously at that time it did not work. This is very useful!
| jph00 wrote:
| One thing I forgot to mention in the post which I think is kinda
| cool: at the NeurIPS Efficiency Challenge this year, where Tim
| Dettmers and I both did keynotes, every single top-ranked entry
| used QLoRA! The challenge was to create the most accurate model
| on a single GPU in 24 hours.
|
| I think that is a great example of how important and useful QLoRA
| is. Maybe we should run a dual-GPU challenge next time now that
| multi-GPU is working...
| 3abiton wrote:
| Are any of the NIPS resources available online?
| notpublic wrote:
| Tim Dettmers QLoRA https://nips.cc/virtual/2023/83963
|
| More here: https://nips.cc/virtual/2023/competition/66594
| pella wrote:
| > the ability to use multiple GPUs with QLoRA training.
|
| Thorough article!
|
| Question: What's your opinion on:
|
| - How viable will NVIDIA's consumer cards be in the long run?
|
| - Besides https://tinygrad.org, what other cost-effective future
| alternatives could there be?
| bugglebeetle wrote:
| Unsloth (mentioned in the Answer.AI post) is planning multi-GPU
| support in a future release.
| carbocation wrote:
| I wonder whether LoRAs could be useful for U-Net training.
| Especially thinking of CNN-based U-Net models with pre-trained
| encoders (but randomly initialized decoders). At least, it seems
| possible that normal weight updates on the decoder and LoRA
| training on the encoder could improve efficiency.
| jph00 wrote:
| Diffusion unet has an "extended" version nowadays that applies
| to the resnet part as well as the cross-attention:
| https://github.com/cloneofsimo/lora
| m3kw9 wrote:
| If they can continuously train it, it could be better than a
| large context, as this is how an AI OS would need to work when
| you have constant updates to your files.
| padolsey wrote:
| I don't think you'd be fine-tuning a whole model in such cases.
| That seems over the top, no? I assume you'd get sufficiently
| far with big context windows, vector search, RAG, etc.
| keeptrying wrote:
| If you are gonna be doing stuff like this I'm damn excited for
| answer.ai!
|
| It'll be the first time we'll have someone who knows AI create
| leverage to open source it.
|
| Way to go!
| chasd00 wrote:
| > It'll be the first time we'll have someone who knows AI
| create leverage to open source it.
|
| It can't be overstated how important this is. Thank you again.
| g42gregory wrote:
| This is brilliant. Thank you for doing this!
| yalok wrote:
| Have you guys looked at using sparsification? It would probably
| require true re-training of the foundation model to go to high
| sparsity ratios (say 90% of weights excluded), which could be
| done once on expensive GPUs -- but fine-tuning such sparse
| models would hopefully require less RAM.
|
| The trick to getting more benefit from a sparse approach is to
| use block sparsity (IIRC, Tim Dettmers used to work on this as
| well, a few years ago), but large block sizes (say 16x16) would
| require much longer retraining to recover the lost accuracy...
| jph00 wrote:
| Yes, sparsification is another useful approach for higher
| efficiency, although block sparse kernels are pretty complex to
| work with -- especially when combined with quantization and
| LoRA! Most of the sparsity papers I've seen use "structured"
| sparsity; i.e. removing layers, attention heads, and features.
| But the upside from this seems somewhat limited so far.
| yalok wrote:
| I'm not sure about structured sparsity, but for weight
| sparsity, in my experience going to around 50-70% excluded
| weights (even with block sparsity -- say 4x4) did not cause
| any noticeable degradation to training & quality at all. (The
| original paper on sparsity from LeCun suggests much higher
| sparsity ratios -- like 90% -- but for DNNs I didn't find
| those attainable if accuracy is important.)
|
| The block sparsity can really help with saving RAM - because
| you only need to keep a short array of indexes for the
| excluded weights. The trouble is the kernel mult functions
| become complex, so it's a bit of a trade-off between RAM and
| GPU cycles.
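|
| A toy illustration of that storage trade-off (a hand-rolled
| BSR-style layout, not a tuned kernel): keep only the non-zero
| 4x4 blocks plus their block indices, so memory scales with the
| number of surviving blocks.
|     import torch
|
|     def to_block_sparse(W, bs=4):
|         blocks, idx = [], []
|         for i in range(0, W.shape[0], bs):
|             for j in range(0, W.shape[1], bs):
|                 blk = W[i:i+bs, j:j+bs]
|                 if blk.abs().sum() > 0:      # keep non-empty blocks
|                     blocks.append(blk)
|                     idx.append((i // bs, j // bs))
|         return torch.stack(blocks), torch.tensor(idx)
|
|     W = torch.randn(16, 16)
|     mask = torch.rand(4, 4) < 0.5            # drop ~half the blocks
|     W = W * mask.repeat_interleave(4, 0).repeat_interleave(4, 1)
|     blocks, idx = to_block_sparse(W)
|     print(blocks.shape, idx.shape)           # kept blocks + indices
| The multiply then has to gather those blocks back together,
| which is where the kernel complexity mentioned above comes in.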
| AhtiK wrote:
| Has anyone seen an implementation of 'SpQR: A Sparse-Quantized
| Representation,' published in June 2023 by Tim Dettmers et al.?
| https://arxiv.org/abs/2306.03078
| AhtiK wrote:
| Found it from https://github.com/Vahe1994/SpQR Was somehow
| expecting it to be at
| https://github.com/TimDettmers/bitsandbytes. My bad.
| jamesblonde wrote:
| This is a fantastic breakthrough for those of us who fine-tune
| LLMs on limited hardware budgets.
|
| I was curious about the choice of FSDP over DeepSpeed. I have
| been using Axolotl for fine-tuning, and FSDP has been broken
| there, whilst DeepSpeed is rock solid. Why FSDP over DeepSpeed
| jph00?
| jph00 wrote:
| DeepSpeed has more features than FSDP, but it's much more
| complex to hack on -- FSDP is written directly in python using
| calls to the PyTorch library, whereas DeepSpeed is 20% C++ and
| 10% CUDA (according to the GitHub stats).
|
| We've found that FSDP works just as well for our needs, and we
| appreciated the increased "hackability".
|
| (Axolotl is terrific BTW. I hadn't heard of problems with it
| with FSDP before -- I'll see if that's something we can help
| with.)
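|
| (For anyone curious what "written directly in Python" looks
| like in practice, here's a minimal generic FSDP sketch -- a toy,
| not our actual training script -- launched with e.g. torchrun
| --nproc_per_node=2:
|     import torch
|     import torch.nn as nn
|     import torch.distributed as dist
|     from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
|
|     dist.init_process_group("nccl")        # one process per GPU
|     torch.cuda.set_device(dist.get_rank())
|
|     model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(),
|                           nn.Linear(4096, 1024)).cuda()
|     model = FSDP(model, use_orig_params=True)  # shard the params
|
|     opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
|     for _ in range(10):                    # dummy training loop
|         x = torch.randn(8, 1024, device="cuda")
|         loss = model(x).pow(2).mean()
|         loss.backward()
|         opt.step()
|         opt.zero_grad()
| Everything above is ordinary PyTorch, which is what makes it
| easy to modify.)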
| jph00 wrote:
| Good news -- axolotl has just merged support for FSDP/QLoRA
| training, thanks to a rapid collaboration between the axolotl
| and Answer.AI teams!
| bradfox2 wrote:
| There's a long GH issues thread with Teknium struggling with
| Mistral 7B and loss spikes. Easy to find by googling.
| itsgrimetime wrote:
| Would be cool to build an "LLM@home" project like folding@home or
| SETI@home (rip), where tons of folks could donate their GPUs and
| train something huge and FOSS. I don't know enough about how
| these models are trained though. Could it be chunked up and
| distributed in that way, then stitched/merged back together?
| fho wrote:
| https://stablehorde.net/ comes somewhat close.
| woctordho wrote:
| https://github.com/learning-at-home/hivemind is also relevant
| humansareok1 wrote:
| Always figured it would be too slow. Distributed training on
| clusters is usually done with 1+ gb/s interconnects.
| miohtama wrote:
| Golem has been building this since 2017
|
| https://www.golem.network/
|
| They also have an option to get paid in crypto for your GPU
| power.
|
| The challenge is that the AI software architectures are not
| made "to run over the Internet."
| lbj wrote:
| Can't believe they didn't name this Qolor
| tbenst wrote:
| Very interesting but hard to interpret until the performance
| numbers / benchmarks are available. I can already fine-tune a 70B
| language model at home using CPU + RAM, but it would be so slow
| as to be almost totally impractical (~20x slower than GPU). It
| would be great to see a comparison to eg 8 x A100 (available for
| $32/hr on AWS on-demand) and also CPU + RAM. Presumably it's
| somewhere in between, but hard to predict where!
| int_19h wrote:
| This is great, but one thing I really hoped would come sooner is
| fast training on Metal. As things are, you can get an M1/M2 Ultra
| (~800 GB/s memory bandwidth; for comparison, RTX 4090 is ~1050
| GB/s) Mac Studio with 128GB RAM for ~$3500. For large model
| inference, this is already way more affordable than stacking GPUs
| while being "fast enough", but training solutions are basically
| non-existent. I do wonder why; it feels like a low-hanging fruit.
| sqreept wrote:
| M1, M2, M3 still have very low number of GPU cores. Apple
| should release some better hardware to take advantage of their
| recently released MLX library.
| sbinnee wrote:
| At this moment it looks clear to me that Apple won't go that
| way. It's enough for them to focus on inference and actual
| applications, not the heavy training part. They have probably
| been training models on a cluster with non-Apple silicon
| and making them available for their chips only for inference.
| ttul wrote:
| Not to mention entirely outsourcing training workloads to
| specialist firms. Apple does a lot of secretive outsourcing
| of things you might think they would or should do in-house.
| This contrasts with Google and Meta who seem to like
| keeping everything in-house.
| kergonath wrote:
| It's true that their GPUs are slower than Nvidia's. But keep
| in mind that cores are really different and cannot be
| compared across architectures. You want more Gflops, not
| necessarily more cores.
| int_19h wrote:
| They do, but for inference at least, it's memory bandwidth
| that is the primary limiting factor for home LLMs right now,
| not raw compute.
| buildbot wrote:
| Compute limited - an m2 ultra has 27 tflops, a 4090 80+
| erichocean wrote:
| Memory limited - an m2 ultra has >150GiB, a 4090 24GiB
| lawlessone wrote:
| So why is nobody doing this?
| ttul wrote:
| My personal experience with Apple Silicon and machine
| learning in comparison with Nvidia is that the libraries
| are often missing various features on Apple, leading to
| headaches when trying to use various libraries and tools.
| Apple is working to bridge the gap and I am excited for
| the gap to be closed because the memory bandwidth on big
| M2 and M3 machines is monstrous.
| lawlessone wrote:
| sounds similar to how people have described game dev for
| Mac. The hardware is there. It just isn't supported.
| imhoguy wrote:
| Apple could single-handedly kill the consumer dGPU market if
| they released proper low-level APIs for their M1/2/3. I feel
| they have something huge coming down the pipe to shake up the
| "AI" market.
| yumraj wrote:
| So it should just take longer..
| AnthonyMouse wrote:
| If you don't care how long it takes you can get an old
| server with 128GB of RAM for a lot less than $3500.
| ErneX wrote:
| But that isn't GPU memory right? On the Mac it is.
| rthnbgrredf wrote:
| The issue here isn't specifically about the
| classification of memory, be it "unified memory," RAM, or
| VRAM. The primary concern is ensuring there's enough
| memory capacity for the models required for inference.
| The real question at hand is the Mac's value proposition
| in terms of inference speed, particularly for models as
| large as 70 billion parameters. Utilizing a 4090 GPU can
| facilitate real-time inference, which is the desired
| outcome for most users. In contrast, a Mac Studio offers
| close to real-time inference speeds, which might be
| disappointing for users expecting a real-time experience.
| Then, there's the option of CPU + RAM-based inference,
| which suits scenarios where immediate responses aren't
| crucial, allowing for batch processing of prompts and
| subsequent retrieval of responses. Considering the price
| points of both the Mac Studio and high-end GPUs are
| relatively comparable, it begs the question of the
| practicality and value of near real-time inference in
| specific use cases.
| abtinf wrote:
| Hello gpt.
| rthnbgrredf wrote:
| I'm not a gpt. But now you could say that this is exactly
| how a gpt would answer and we get stuck in a loop and
| there's no obvious way to prove that I'm not a gpt.
| pbhjpbhj wrote:
| 'Write me something profane?' That probably weeds out
| commercially available GPTs?
| rthnbgrredf wrote:
| I'm sorry, but I can't fulfill that request. Is there
| anything else I might assist you with? ;)
| int_19h wrote:
| https://chirper.ai/aligned/chirp/65e58aa9001d853c
| abtinf wrote:
| Interesting. Let's review the comment.
|
| > The issue here isn't specifically about the
| classification of memory, be it "unified memory," RAM, or
| VRAM. The primary concern is ensuring there's enough
| memory capacity for the models required for inference.
|
| The comment chain is about training, not inference.
|
| > The real question at hand is the Mac's value
| proposition in terms of inference speed, particularly for
| models as large as 70 billion parameters.
|
| Again, wrong topic.
|
| > Utilizing a 4090 GPU can facilitate real-time
| inference, which is the desired outcome for most users.
|
| Generic statement. Semantically empty. Typical LLM style.
|
| > In contrast, a Mac Studio offers close to real-time
| inference speeds, which might be disappointing for users
| expecting a real-time experience.
|
| Tautological generic statement. Semantically empty.
| Typical LLM style.
|
| > Then, there's the option of CPU + RAM-based inference,
| which suits scenarios where immediate responses aren't
| crucial, allowing for batch processing of prompts and
| subsequent retrieval of responses.
|
| Contradicts first sentence that "classification of
| memory" isn't important. Fails to recognize this the same
| category as previous statement. Subtle shift from first
| sentence that declared "primary concern is ... memory
| capacity", to focusing purely on performance. This kind
| of incoherent shift is common in LLM output.
|
| > Considering the price points of both the Mac Studio and
| high-end GPUs are relatively comparable, it begs the
| question of the practicality and value of near real-time
| inference in specific use cases.
|
| Completes shift from memory capacity to performance.
| Compares not really comparable things. "Specific use
| cases" is a tell-tale LLM marker. Semantically empty.
| Lalabadie wrote:
| Considering that the topic is approachability and energy
| efficiency, that Mac Studio will do reasonably fast
| inference while consuming <200W at full load.
|
| The speed is certainly not comparable to dedicated GPUs,
| but the power efficiency is ridiculous for a very usable
| speed and no hardware setup.
| Applejinx wrote:
| This, and then you get to have a Mac Studio.
|
| I have one, where I selected an M1 Ultra and 128G RAM to
| facilitate just this sort of thing. But in practice, I'm
| spending much more time using it to edit 4K video, and as
| a recording studio/to develop audio plugins on, and to
| livestream while doing these things.
|
| Turns out it's good at these things, and since I have the
| LLAMA 70b language model at home and can run it directly
| unquantized (not at blinding speed, of course, but it'll
| run just fine), I'm naturally interested in learning how
| to fine tune it :)
| int_19h wrote:
| Yep, I also got mine specifically for LLMs and ended up
| using it as a second desktop for other things; actually
| strongly considering making it my primary at this point.
|
| I still wouldn't recommend it to someone just looking for
| a powerful desktop, just because $3K is way overpriced
| for what you get (non-replaceable 1TB SSD is so
| _Apple_!). But it's certainly great if you already have
| it...
| int_19h wrote:
| "Real-time" is a very vague descriptor. I get 7-8 tok/s
| for 70b model inference on my M1 Mac - that's pretty
| real-time to me. Even Professor-155b runs "good enough"
| (~3 tok/s) for what I'd consider real-time chat in
| English.
| AnthonyMouse wrote:
| > But that isn't GPU memory right? On the Mac it is.
|
| They call it that but it's really LPDDR5, i.e. normal
| DRAM, using a wide memory bus. Which is the same thing
| servers do.
|
| The base M3, with "GPU memory", has 100GB/s, which is
| less than even a cheap desktop PC with dual channel
| DDR5-6400. The M3 Pro has 150GB/s. By comparison a five
| year old Epyc system has 8 channels of DDR4-3200 with
| more than 200GB/s per socket. The M3 Max has 300-400GB/s.
| Current generation servers have 12 channels of DDR5-4800
| with 460GB/s per socket, and support multi-socket
| systems.
|
| The studio has 800GB/s, which is almost as much as the
| modern dual socket system (for about the same price), but
| it's not obvious it has enough compute resources to
| actually use that.
| int_19h wrote:
| It's fast enough to do realtime (7 tok/s) chat with 120b
| models.
|
| And yes, of course it's not magic, and in principle
| there's no reason why a dedicated LLM-box with heaps of
| fast DDR5 couldn't cost less. But in practice, I'm not
| aware of any actual offerings in this space for
| comparable money that do not involve having to mess
| around with building things yourself. The beauty of Mac
| Studio is that you just plug it in, and it works.
| AnthonyMouse wrote:
| > It's fast enough to do realtime (7 tok/s) chat with
| 120b models.
|
| Please list quantization for benchmarks. I'm assuming
| that's not the full model because that would need 256GB
| and I don't see a Studio model with that much memory, but
| q8 doubles performance and q4 quadruples it (with
| corresponding loss of quality).
|
| > But in practice, I'm not aware of any actual offerings
| in this space for comparable money that do not involve
| having to mess around with building things yourself.
|
| You can just buy a complete server from a vendor or eBay,
| but this costs more because they'll try to constrain you
| to a particular configuration that includes things you
| don't need, or overcharge for RAM etc. Which is basically
| the same thing Apple does.
|
| Whereas you can buy the barebones machine and then put
| components in it, which takes like fifteen minutes but
| can save you a thousand bucks.
| delegate wrote:
| Maybe I've missed it in the article - but how long would a full
| training run take on 2 consumer GPUs (local or rented) ? Ballpark
| - hours, days... ?
| gardnr wrote:
| The author is discussing fine-tuning a base model. How long it
| takes really depends on the dataset, the method, and the
| hyperparameters. DPO, for example, can achieve some great
| results with a fraction of the steps of other methods.
|
| Just like with unsloth or axolotl, the people that use this
| will have to make compromises that give results in a reasonable
| amount of time.
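|
| For a rough feel, you can back out the wall-clock time from the
| step count; every number below is a made-up assumption, so
| measure seconds-per-step on your own hardware:
|     examples     = 50_000   # fine-tuning examples, assumed
|     epochs       = 1
|     micro_batch  = 2
|     grad_accum   = 8
|     sec_per_step = 12       # assumed for a 70b QLoRA step
|
|     steps = epochs * examples // (micro_batch * grad_accum)
|     hours = steps * sec_per_step / 3600
|     print(steps, round(hours, 1))   # 3125 steps, ~10.4 hours here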
| eurekin wrote:
| This might be the most interesting constructive approach in "Open
| Source" LLMs I've seen. Grounded, reasonable and inviting to
| replicate! I wish academia took that as a standard.
|
| Great job!
| carlossouza wrote:
| Answer.ai is truly open AI. :)
| rvz wrote:
| That's what was said about OpenAI and Mistral, before the VCs
| and investors came in.
|
| After that, the larger flagship AI models were then closed up
| again and used as a server-only offering.
| ericd wrote:
| I doubt it. Jeremy's been walking the walk for quite a
| while now when it comes to opening up access to AI,
| especially with his excellent, free fast.ai course. It
| seems pretty clear that his primary motivations are in
| helping others. (If you're in this thread, Jeremy, thanks
| for fast.ai, it helped me immensely in getting started in
| training models).
| 20wenty wrote:
| For the most part this post was easy to read, and I could feel
| the collective excitement of the team. I came away feeling like
| I'd learned something and ready to try it myself. The only time
| the post gets a little fuzzy is "...store the quantized
| parameters in a selectable data type, where that storage data
| type is the same data type as the "computation type" of the
| model". I assume "selectable datatype" is the float size of the
| quantization?
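|
| If my guess is right, I think it corresponds to something like
| this sketch (assuming a recent bitsandbytes build that exposes
| a quant_storage argument on Linear4bit): the packed 4-bit
| weights live in a tensor whose dtype you pick, and choosing the
| same dtype as the compute type (e.g. bfloat16) is what lets
| FSDP treat and shard them like any other parameter.
|     import torch
|     import bitsandbytes as bnb
|
|     layer = bnb.nn.Linear4bit(
|         4096, 4096, bias=False,
|         compute_dtype=torch.bfloat16,  # the "computation type"
|         quant_type="nf4",
|         quant_storage=torch.bfloat16,  # dtype of the packed 4-bit weights
|     )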
| hathym wrote:
| Imagine the potential of a Folding@Home-inspired project for AI
| development. What kind of powerful model could a community of
| gamers and GPU owners create?
| llmzero wrote:
| I liked that you link to renting dual 24GB GPUs for ~$0.60/hour,
| but how long would it take to fine-tune a 70b model using your
| system (4 bits for weights)?
|
| If I were a consumer I would be interested in the final price of
| fine tuning, for example a table with model size, training size,
| cost of training, and expected loss of quality with this
| technology.
|
| One obvious question: Can you apply your technology to the
| recent (-1,0,1) encoding? I think you will answer that the
| (-1,0,1) model is not available and you can't try it, but my
| question is whether, once/if that model is available, answer.ai
| will be able to use the same technology as in this post to fine
| tune a big model on two very small GPUs, and then I should ask
| for a new table with a cost/benefit analysis.
|
| Edited: I should add that I find this kind of work very useful
| for enabling individual users like me to compete in the market
| for LLM applications. This is great work, and along the lines
| of the book "From Zero to One" (not that I like or dislike the
| author), it solves the kind of problem that nobody is trying to
| solve.
|
| Edited: Now that I have a total of 23 points in HN, I will change
| my password to some random one, just to cure my desire to look
| for votes, and try to make some work, and again some tomorrow
| create a new presence in HN.
| jph00 wrote:
| As mentioned in the post, benchmarking results are coming in a
| later post. But in short: you can train an epoch of Alpaca in
| 24 hours or so, which is enough to get very significant change
| in model behavior.
| danielhanchen wrote:
| On how long: finetuning time is influenced by your dataset size
| (more = slower), sequence length (since attention is O(N^2)),
| data movement, etc., and most importantly how many steps you
| want to take. For QLoRA, some runs can do a few hundred steps,
| which can complete in minutes to 1 hour. Too many can overfit. So
| being able to fit it on consumer GPUs can be very cost
| effective.
|
| On the 1.58bit paper, from what I understand, this requires a
| total retraining from scratch. Hopefully the researchers will
| open source their weights :)
|
| On the technicals, weights are encoded in (-1, 0, 1), whilst
| QLoRA uses a 4bit dynamic mapping of 16 numbers. The only
| change required would be the torch.matmul(X, W) step, where
| it'll be torch.bitlinear_matmul(X, W). Before with QLoRA, one
| has to do torch.matmul(X, dequantize(W)). So one has to
| implement torch.bitlinear_matmul. The backward is
| torch.bitlinear_matmul(dY, W.T).
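|
| In sketch form (the dequantize and bitlinear_matmul helpers
| below are placeholders for the real kernels, not existing torch
| functions), the contrast looks something like:
|     import torch
|
|     def dequantize(W_4bit):
|         # placeholder: the real kernel unpacks NF4 codes to bf16
|         return W_4bit
|
|     def bitlinear_matmul(X, W_ternary):
|         # placeholder: the real kernel replaces multiplies with
|         # sign-flipped additions over {-1, 0, 1} weights
|         return X @ W_ternary.T.float()
|
|     def qlora_forward(X, W_4bit, A, B, scale=1.0):
|         return X @ dequantize(W_4bit).T + (X @ A.T) @ B.T * scale
|
|     def bitnet_forward(X, W_ternary, A, B, scale=1.0):
|         return bitlinear_matmul(X, W_ternary) + (X @ A.T) @ B.T * scale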
| miohtama wrote:
| What's the magic in 1.58bit vs. 4 bit that it makes it so
| much more efficient (claimed)?
| danielhanchen wrote:
| From what I understand, using (-1, 0, 1) removes
| multiplications on GPUs. I.e. assume you multiply activations
| [10, 20, 30] by a ternary weight matrix made of the vectors
| [-1, 0, 1], [0, 1, -1] and [1, 1, 0].
|
| Instead of doing 10*(-1) + 20*(0) + 30*(1), then 10*(0) +
| ..., since we know beforehand the weights are simply (-1,
| 0, 1), we can just flip the sign and do addition, or force
| the hardware to do addition: i.e. if (-1), subtract; if (0),
| add nothing; if (+1), add.
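|
| Spelled out as (slow, purely illustrative) code, each output is
| just a signed sum of activations, never a multiply:
|     import torch
|
|     def ternary_matvec(W, x):          # W entries are in {-1, 0, +1}
|         out = torch.zeros(W.shape[0])
|         for i in range(W.shape[0]):
|             for j in range(W.shape[1]):
|                 if W[i, j] == 1:
|                     out[i] += x[j]     # add
|                 elif W[i, j] == -1:
|                     out[i] -= x[j]     # subtract (sign flip)
|         return out
|
|     W = torch.tensor([[-1, 0, 1], [0, 1, -1], [1, 1, 0]])
|     x = torch.tensor([10., 20., 30.])
|     print(ternary_matvec(W, x))        # tensor([20., -10., 30.])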
|
| Floating point multiplication does addition of the
| exponents and multiplying of the mantissa. So just
| simplifying:
|
| Float16 has E=5, M=10. Ie around 5 + 10^2 space needed =
| 105.
|
| Bfloat16 has E=8, M=7. So 8 + 7^2 = 57 space.
|
| Float8(143) E=4, M=3. So 4 + 3^2 = 13 space.
|
| 1.58(16bit) E=5, M=10. Addition only, so shift E say 5 + 10
| addition = 15.
|
| 1.58(8bit) E=4, M=3. Addition only, so shift E say 4 + 3
| addition = 7.
|
| Obviously I'm simplifying, but with only additions, 1.58
| uses say 7 space, whilst FP8 uses 13 space, so in theory 2x
| more transistors can be crammed, ie 2x more FLOPs than FP8.
| nyrikki wrote:
| Really simple explanation is that for inference, feed
| forward networks are threshold circuits, and by their nature
| ANNs are binary output, outputting true and false (same as
| being a threshold circuit).
|
| So if you train your models with that in mind, your weights
| can be reduced to -1, 0, 1, reducing the space complexity.
|
| I don't think the costs in expressiveness are captured
| quite yet, but as perplexity doesn't care about
| correctness, if that is the metric that is important for
| you it will probably reduce memory requirements for
| inference.
| chessgecko wrote:
| also just to add, I think the 1.58 bit is mostly faster for
| inference, because training still has to multiply a lot of
| floating point gradients by integer activations, hold
| floating point weights/gradients for rounding, and deal with
| norms and stuff. could be wrong about that though
| swader999 wrote:
| I like how you think about social media.
| airstrike wrote:
| _> Now that I have a total of 23 points in HN, I will change my
| password to some random one, just to cure my desire to look for
| votes, and try to make some work, and again some tomorrow
| create a new presence in HN._
|
| If you use Stylus (or any similar browser extension), I
| actually wrote a style to hide points for that very reason,
| replacing karma and scores with `***`
|
| This is actually the second time I've seen someone mention this
| need, so I've made it into a gist and published it to
| userstyles, but here it is also, since it's pretty short:
|     @-moz-document domain("news.ycombinator.com") {
|         /* Hide karma and points on replies */
|         span.pagetop #karma,
|         span.comhead span.score {
|             visibility: hidden;
|             position: relative;
|             display: inline-block;
|             height: 10px !important;
|             overflow: hidden;
|         }
|         span.pagetop #karma { width: 0.8rem !important; }
|         span.comhead span.score { width: 0.8rem !important; }
|         span.pagetop #karma::before,
|         span.comhead span.score::before {
|             content: "***";
|             visibility: visible;
|             overflow: hidden;
|             opacity: 0.8;
|             font-family: Helvetica, Arial, sans-serif !important;
|         }
|     }
|
| https://gist.github.com/airstrike/62584e6ffb6104791c0ae48a8e...
|
| https://userstyles.world/style/15164/hackernews-hide-karma-a...
| vouaobrasil wrote:
| It would be great if we were a bit more respectful towards our
| natural resources. Using so much energy to play with language
| models is a waste of resources.
| firtoz wrote:
| The point in the article is to use less resources. So, yes?
| vouaobrasil wrote:
| People think that by making a system use fewer resources, the
| entire use of it on a societal level will be reduced.
| Unfortunately, we must watch out for more efficiency making
| more people use it, and potentially increasing the absolute
| quantity of energy being used.
| MacsHeadroom wrote:
| Energy usage is good, actually. Energy scarcity and dirty
| energy are bad. But energy usage is good.
|
| We should strive to use and produce orders of magnitude
| more (clean) energy.
| nraford wrote:
| It would be great if I had a bathtub full of ice cream as well,
| and if we all lived in a world overflowing with love, respect
| and joy for all living things. Until then, I'm happy that these
| kinds of incredible tools are (and increasingly will be) in
| more of our hands for close to free. Upwards and onwards!
| vouaobrasil wrote:
| Seems like with every passing year, we are going downwards,
| not upwards. Perhaps it only seems the other way around to
| those with the greatest addictions to technology, who will
| justify any development to satisfy their cravings.
| guiriduro wrote:
| Well, I for one am happy that less compute is being wasted
| on blockchain, and if the total BTUs and tonnes of CO2
| remain equal while the proportion allocated to AI goes up,
| that'll also be a good thing. Doing useful stuff, and
| becoming more efficient (eliminating high carbon wasteful
| human activities and replacing with AI compute using less
| overall carbon), is also a net win.
| nl wrote:
| They are using _gaming_ GPUs. If you want to complain about a
| waste of natural resources there seems to be a lot of people
| playing games...
| vouaobrasil wrote:
| Well, they serve the same function. Modern consumerist
| society removes almost all real autonomy from people and
| makes them do fairly meaningless tasks (most jobs). So, it's
| rather expected that we need to seek out greater and greater
| amusements (gaming and playing with ridiculous models) so
| we're kept in a somewhat happy state, instead of realizing
| the trick of the system that will one day come crashing down
| due to its unsustainability.
| yard2010 wrote:
| Video games are like the opposite of waste. This planet can
| go to hell if I can't enjoy art.
| Sohcahtoa82 wrote:
| There are some people that consider _all_ forms of
| entertainment that don't create some form of extrinsic
| value to be a waste of time, energy, and materials.
|
| I feel bad for them. They're going to be the ones that lay
| in their death bed thinking "I wish I had allowed myself to
| have more fun".
| llmzero wrote:
| What is a little contradictory is that designing a system to
| use fewer resources can increase the number of people fine
| tuning models, so that the final result can be a net global
| increase in the total energy use. A hypothetical goal could be
| to reuse fine tuning, that is, designing a knowledge graph in
| which you fine-tune from a previously fine-tuned model (like
| dynamic programming: save the result of previous computations).
| LoRA allows us to store the small matrices at low cost.
| exitb wrote:
| Running a powerful GPU at full load using coal-generated energy
| causes two orders of magnitude less emissions than flying on an
| airliner (per passenger). If you've ever flown anywhere in your
| life, I don't think you can climb on that high horse.
| gkbrk wrote:
| > Using so much energy to play with language models is a waste
| of resources.
|
| Why do you get to decide what's wasteful and what's useful?
|
| We have more energy than we could ever hope to use from the sun
| and from nuclear. The solution isn't telling people they're
| wasting precious energy that you would put to better use. Just
| build more nuclear reactors and more solar.
| vouaobrasil wrote:
| Why do you get to decide which habitats die and which live by
| using all this modern technology that relies on mining and
| kills them?
| Applejinx wrote:
| I mean, this is a fair point but right now you're not
| talking to a libertarian who believes the technology
| inevitably governs itself, to the destruction of all around
| it.
|
| You're talking to more of a civilization-type who believes
| you have to use regulation and, if necessary, state
| violence to stop types of mining that kill habitats,
| because the technology certainly isn't going to up and
| decide to do that. It's the job of society to figure this
| stuff out, arrive at such positions and defend them. There
| are plenty of good reasons to defend the protection of
| habitats, even for purely self-interested pragmatic
| reasons.
| infecto wrote:
| I hate this more recent doomer logic. Who is the arbiter of
| deciding what's a waste and not a waste? Why use such a basic
| and uncompelling narrative of telling others how they should
| live their lives? I try to be thoughtful about my purchases and
| to conserve my own resources, and I'm happy to talk about it,
| but telling people that "doing x is a waste of resources" is a
| fool's errand. Humanity has always progressed as a
| collective group; it won't slow down now even though some
| individuals might drop out of it. I don't know what the future
| holds, collectively we will continue to progress and I see the
| bright side of all the recent momentum in renewable energy and
| the focus on making systems more efficient.
|
| Not the first time this character has popped up here on HN.
|
| "I write full-time and try to convince people of the danger of
| AI and advanced technology."
| vouaobrasil wrote:
| You are probably right...it may not have been the best
| approach. Well, I get emotional sometimes about the rapid
| advancement of technology and I do think it is a mistake of
| humanity to do so.
| Cheer2171 wrote:
| I started reading your blog linked from your profile. I was
| disappointed that you used so much energy to play with language
| by writing up thoughts and hosting them on the internet. Seems
| like a waste of resources. Why do you think you have the right
| to burn all that carbon just so you can share your thoughts
| with the world? Why not just save electricity and take a walk
| through nature? That's how I think you should use your
| resources. I think I know better than you about how you should
| use energy. I think you should have to follow my ideology about
| energy use.
| vouaobrasil wrote:
| You are right in a way. I hope to one day give up the
| internet completely...
| Applejinx wrote:
| Careful, he'll have you disappear in a puff of logic. And
| then get killed at the next zebra crossing. Do you want
| that on your conscience? :)
| yard2010 wrote:
| Also, whoever is married, please for the love of god merge your
| facebook accounts. It takes up too much space on the internet.
| Applejinx wrote:
| Beats crypto in my opinion. I feel like there's a sliding scale
| for this stuff, and playing with language models is genuinely
| interesting, though it's harder to predict what the real
| benefits will be.
|
| I feel certain there will be great benefit, but not in the way
| AI hypesters expect there to be.
| Kelteseth wrote:
| Any plans on supporting AMD? In Germany, the price of a 7900XTX
| is HALF that of an NV 4090...
| slices wrote:
| take a look at recent posts from
| https://twitter.com/__tinygrad__ re: the state of AMD for AI
| work
| chompychop wrote:
| Does this support multimodal language models (E.g.: LLaVA)?
| pama wrote:
| Thank you for the repo and write up. What tools (if any) did you
| use for performance tuning once you achieved the main goal of
| being able to finetune the model?
| zerop wrote:
| Question - Can I use this to retrain an LLM (70B) weights on my
| own data? I am using RAG as of now for asking questions on my
| text, but always wonder if I could retrain an LLM on my own text.
| Thoughts?
| AnthusAI wrote:
| Fine tuning is generally not the best way to teach an LLM new
| knowledge. RAG is still more appropriate. Fine tuning is
| generally more effective for controlling the format of the
| responses but it's not going to teach the model a lot of new
| concepts. The model can learn how to handle new vocabulary
| through fine tuning but it's not a great way to teach the model
| new facts. Giving it access to a knowledge base is a better way
| to do that.
| jl6 wrote:
| Besides being a great result, the quality and clarity of the
| technical writing here is excellent.
| iandanforth wrote:
| This is great, however there were many opportunities to use the
| word 'nibble' in this post and they were all missed.
| chasd00 wrote:
| What's the best way for people to contribute to AI open source? I
| can't produce things like this for many reasons so how can I and
| others like me do our part to keep SOTA AI open?
| sophrocyne wrote:
| There is a ton you can do to help SOTA AI remain open.
|
| Join the community building the tools - Help with UI/UX,
| documentation, keeping up with the latest, and evangelizing
| whatever method the team building it has devised to keep it
| sustained.
|
| Being part of the community itself is more valuable than you
| realize.
| SamPatt wrote:
| Where are you finding this community?
| ativzzz wrote:
| Off the top of my head
|
| - try to implement techniques that are doable on home hardware
| like the one described in OP (requires some $$$ investment) and
| give feedback or contribute to documentation / guides
|
| - learn about different techniques and do educational writeups
| or documentation (like
| https://vickiboykis.com/2024/02/28/gguf-the-long-way-around/)
|
| - build a tool / library that wraps academic techniques and
| expose them more easily to end users (like A1111 or comfyUI for
| stable diffusion)
|
| Anything that can translate the high end research down to
| something a moderately technical user can use or learn from is
| a gigantic win
| hamilyon2 wrote:
| I am a random software engineer, but from what I've learned,
| high-quality open source datasets seem to be an enabler. There
| is a shortage of golden datasets for training and evaluation in
| every popular and niche area you can imagine.
| sieszpak wrote:
| 4x 3080???
| openquery wrote:
| This article is very well written and super informative. One
| thing I didn't understand is:
|
| > At Answer.AI our north star is making useful AI more
| accessible. $150,000 to create your own high-quality personalized
| model definitely doesn't count as accessible!
|
| Renting an A100 on RunPod is ~$1.89 / hour. So you'd need ~80,000
| A100 hours to train a useful AI model?
| humansareok1 wrote:
| In the post it explicitly says you can train on two 3090-level
| cards, which are significantly cheaper, and the headline
| literally says "Finetune", not "Pretrain".
| jncfhnb wrote:
| So... why do people want to fine tune LLMs at home? It seems very
| unlikely to provide value.
|
| * you're probably not going to succeed at injecting new knowledge
| in a way that feels satisfyingly top of mind to the bot
|
| * you're probably not going to create such a meaningfully new
| style that it warrants a Lora like in images
|
| What's an example use case?
| Solvency wrote:
| Illicit fan fiction. Whether it's image or text models.
|
| It's ALWAYS illicit fan fiction.
| CharlesW wrote:
| Consensual sex between Gilligan and the Professor is not a
| crime.
| jncfhnb wrote:
| I mean I've seen the expressive niches on image models of
| civitai, but do you really need custom fine tuned LLMs for
| text fanfiction?
|
| Like sure, you need something that is not the friendly
| question answerer; but do you really need such a broad
| population as in images to suit your needs? I'm guessing no?
| tracerbulletx wrote:
| Hmm why would someone on a forum called hacker news want to
| tinker and explore an exciting new technology. Who can say? One
| of life's great mysteries really.
| jncfhnb wrote:
| I'm curious what they're trying to do because I'm curious and
| I don't see it. You're the one being a dismissive dick here.
| SamPatt wrote:
| >So... why do people want to fine tune LLMs at home? It
| seems very unlikely to provide value.
|
| Asking the first question is fine, but your follow-up
| comment sounds more dismissive than curious.
|
| That's probably why the snarky response.
| jncfhnb wrote:
| I don't feel that's accurate given specific bullets
| explaining why I feel that way and asking why others feel
| differently but ymmv
| mttpgn wrote:
| I find that available LLMs have difficulty recalling instances
| in specific works by given authors. For example, if you ask
| GPT-4 "In which Philip K. Dick novel does the protagonist
| character consider converting to Judaism and moving to Israel?"
| it will respond with Dick's best known book _The Man in the
| High Castle_ and the character Frank Fink. The answer is
| incorrect. Israel does not exist in the world of that novel;
| furthermore, the character of Fink already is Jewish. The
| correct answer is Angel Archer in _The Transmigration of
| Timothy Archer_.
|
| I have considered the feasibility of fine-tuning an LLM on the
| writings of a specific author. The idea is that it could aid
| writing in this way: If I currently am researching a specific
| author across multiple of their books, I often will get a quote
| of theirs trapped in my head some length of time after reading
| it. If I have neglected to jot down (or even to highlight) the
| source of the quote, I could ask the model where the remembered
| passage came from and get back a higher-quality response.
| jncfhnb wrote:
| Eh, but fine tuning is a very awkward tool to solve those
| knowledge problems imo.
|
| Author style, maybe, I guess.
| curl-up wrote:
| Does anyone have sources, or experience, about fine tuning
| primarily to teach the model some factual data, especially when
| it comes to later "higher level" question answering?
|
| For example, giving the model a bunch of text (academic papers
| and such) about 19th century writers, then asking things like
| "Who were the main influences on writer X"?
|
| Obviously simple RAG-like approaches don't work, as such
| information is rarely available in the text as-is, and needs to
| be "extrapolated" to some extent. Long context models might work
| (just dumping everything into the prompt), but are way too
| expensive for my needs.
| armcat wrote:
| RAG approaches should work quite well for the examples you
| mentioned. It's a matter of how you approach the retrieval part
| - you can opt for a larger recall on retrieval, and leverage
| the large context window for the LLM to figure out the answer.
| Even if it's not "as-is", semantically if it's in there, it
| should be able to find it.
|
| Other things to try out is how you approach the question
| "expansion" part, for example using Hypothetical Document
| Embeddings (HyDE); or how you approach the filtering-out part,
| e.g. using "System 2 Attention",
| https://arxiv.org/abs/2311.11829.
| curl-up wrote:
| I tried most of such techniques, but the point is that this
| information really isn't in there directly, and to perform
| the question expansion, the model needs to know about the
| domain already.
|
| For example, imagine that one paper is about how author X was
| French in the early 19th c., and how they were one of the first
| ones to write about topic T. Another paper is about how
| author Y was inspired by the early 19th-c. French writers
| writing about T. However, this second article does not
| mention X at all. Asking about "who were the main influences
| on X" would not give you the second article.
|
| Of course, I could run "multiple-hop" RAG-like process, where
| the model keeps asking questions itself and so on in a loop,
| but this becomes extremely clumsy, and the models (even
| GPT-4) tend to get out of hand. It is also extremely slow, of
| course.
| jiwidi wrote:
| > home
|
| > two 24GB GPUs.
|
| geez
| Nouser76 wrote:
| Is there any framework/system that distributes the work across
| multiple GPUs on different computers over a network (LAN or WAN)?
| I'm not concerned much about latency or generation time, but
| would love to train or load up huge models and send jobs to run
| overnight.
| ericd wrote:
| This is the best news I've seen all month. I think one of the
| great near-term dangers of AI is the bulk of the economic benefit
| going mainly to relatively few companies. That risk seems
| substantially reduced if they have to compete with a great
| variety of models.
| Tostino wrote:
| Nice, I've been hoping this would be possible for a while. I'll
| have to do a new fine-tune of Inkbot on top of one of the 70b
| models.
|
| What are the max context lengths / batch sizes you can train at
| with this method for 2x24gb? What about 4x24gb?
| samstave wrote:
| OK this is going to come out as moronic because I don't have the
| proper vocab to phrase it:
|
| --
|
| Is it possible to 'array' tokenized workloads across providers of
| GPU?
|
| I want to farm-out my 'compute' across [things]
|
| More importantly can there be a marketplace for GPU resources
| that I can effectively point my local job at?
| Havoc wrote:
| This is great.
|
| I don't think local will be competitive in future IN
| GENERAL...but if I have a specific use case and I have a specific
| training dataset...local with specific training will murder the
| big commercial models.
| staticman2 wrote:
| If I wanted to use this software to finetune a 70b model on two
| 3090s to write fiction, what is the maximum sequence length that
| would be practical? I'm at the dataset collection stage, but I'm
| not sure whether to aim for bigger or smaller sequence lengths at
| the moment.
| JoelJacobson wrote:
| Do the two 4090 GPUs need to be on the same machine, or is it
| possible to somehow use two separate machines, each with its own
| 4090, and link them somehow via e.g. InfiniBand?
| Tepix wrote:
| Congratulations, fantastic contribution to open source AI. Why
| does the website headline say "train" instead of "finetune"?
| OOPMan wrote:
| Ah yes, 24gb top-of-the-line GPUs, I happen to have a whole
| warehouse full.
|
| /s
___________________________________________________________________
(page generated 2024-03-08 23:00 UTC)