[HN Gopher] Ask HN: Do LLMs get "better" with more processing po...
___________________________________________________________________
Ask HN: Do LLMs get "better" with more processing power and/or
time per request?
Do they make more (recursive) queries into their training data for
breadth and depth? Or does the code limit the algorithms by design
and/or by constraints other than the incompleteness of the encoded
semantics?
Author : frannyg
Score : 17 points
Date : 2024-02-25 21:08 UTC (1 hour ago)
| neximo64 wrote:
| Both
|
| It takes time to train them. More = better. Usually about 6
| months or so. More processing power can also let you cram more
| capability into the model.
| HeavyStorm wrote:
| The OP asks about request time (and, I imagine, processing
| power), not training.
| fykem wrote:
| More processing power does not make a model better. You can train
| models on CPUs and get the same result given the same model
| architecture and dataset. It'll just take longer to get those
| results.
|
| What makes models "good" is whether the dataset "fits" the model
| architecture properly and whether you have given it enough time
| (epochs) to reach a semi-accurate prediction ratio (let's say 90%
| accurate). For the image classification models I've done, around
| ~100 epochs on 10,000 items seems to be the best certain datasets
| will ever get. At some point continued training just leaves the
| model underfitting or overfitting, and no amount of further
| training/processing power will help improve it.
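|
| Roughly, the pattern looks like this (toy numbers standing in
| for a real training run, just to illustrate the point about
| extra epochs):
|
|   # Toy illustration: past some point, more epochs stop helping
|   # because validation loss turns back up (overfitting).
|   losses = [1.0 / (e + 1) + 0.002 * max(0, e - 50)
|             for e in range(200)]  # improves, then degrades
|   best, best_epoch, patience = float("inf"), 0, 0
|   for epoch, val_loss in enumerate(losses):
|       if val_loss < best:
|           best, best_epoch, patience = val_loss, epoch, 0
|       else:
|           patience += 1
|           if patience >= 5:  # early stop: 5 epochs without gain
|               break
|   print("stopped at epoch", epoch, "best was epoch", best_epoch)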
| HeavyStorm wrote:
| The OP asks "per request", not training time.
| chank wrote:
| The answer is still no, and still for the above reason. Compute
| resources only affect how fast it can answer, not the quality.
| HeavyStorm wrote:
| Without knowing the particulars of an implementation, it's hard
| to say. Some can refine results by running the model a few more
| say. Some can refine results by running the model a few more
| times, so yeah, better processing and/or more time would help,
| though probably not by much.
|
| Most models, however, don't, so there's no special benefit from
| better processing other than speed.
| jdsully wrote:
| For inference the common answer will be "no": you use the model
| you get, and it takes a constant amount of time to process.
|
| However, the truth is that inference platforms do take shortcuts
| that affect accuracy. E.g. llama.cpp will down-convert fp32
| intermediates to quantized 8-bit values so it can do the work
| using 8-bit integers. This trades the computation's accuracy for
| performance.
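|
| As a toy illustration of that trade-off (this is not llama.cpp's
| actual code, just the general shape of 8-bit quantization):
|
|   import numpy as np
|
|   # Symmetric 8-bit quantization of some fp32 intermediates.
|   x = np.random.randn(1024).astype(np.float32)
|   scale = np.abs(x).max() / 127.0          # map range onto int8
|   q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
|   x_hat = q.astype(np.float32) * scale     # dequantize to compare
|   # The rounding error below is the accuracy traded for speed.
|   print("max quantization error:", np.abs(x - x_hat).max())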
| PeterisP wrote:
| No, the standard LLM implementations currently used apply a fixed
| amount of computation during inference, which is chosen and
| "baked in" by the model architecture before training. They don't
| really have the option to "think a bit more" before giving the
| answer: generating each token involves the exact same number of
| matrix multiplications. They probably could theoretically be
| modified to do it, but we don't do that properly yet, even if
| some styles of prompts, e.g. "let's think step by step", kind of
| nudge the model in that direction.
|
| The same model will give the same result, and more processing
| power will simply enable you to get the inference done faster.
|
| On the other hand, more resources may enable (or be required for)
| a different, better model.
| og_kalu wrote:
| There's fixed compute per token, but more tokens = more compute.
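|
| Back-of-the-envelope version, using the rough rule of thumb of
| ~2 FLOPs per parameter per generated token (the exact number
| varies by architecture and implementation):
|
|   params = 7e9                   # e.g. a 7B-parameter model
|   flops_per_token = 2 * params   # rough approximation
|   for n_tokens in (10, 100, 1000):
|       tflops = flops_per_token * n_tokens / 1e12
|       print(n_tokens, "tokens -> ~", tflops, "TFLOPs total")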
| bluecoconut wrote:
| Directly answering your question requires making some assumptions
| about what you mean and also what "class" of models you are
| asking about. Unfortunately I don't think it's just a yes or no,
| since I can answer in both directions depending on the
| interpretation.
|
| [No] If you mean "during inference", then the answer is mostly no
| in my opinion, but it depends on what you are calling an "LLM" and
| "processing power", haha. This is the interpretation I think you
| are asking for though.
|
| [Yes] If you mean everything behind an endpoint is an LLM, e.g.
| that includes a RAG system, specialized prompting, or special
| search algorithms for decoding logits into tokens, then actually
| the answer is obviously a yes: those added things can increase
| skill/better-ness by using more processing power and increasing
| latency.
|
| If you mean the raw model itself, and purely inference, then
| there's sorta 2 classes of answers.
|
| [No] 1. On one side you have the standard LLM (just a gigantic
| transformer), and these run the same FLOPs of compute to predict
| logits for 1 token's output (at a fixed-size input), and don't
| really have a tunable parameter for "think harder" -> this is the
| "no" that I think your question is mostly asking about.
|
| [Yes] 2. Mixture-of-experts models, though they don't do advanced
| adaptive techniques, do sometimes have a "top-K" parameter (e.g.
| top-1 vs top-2 experts) which "enables" more blocks of weights to
| be used during inference, in which case you could make the
| argument that they're gaining skill by running more compute. That
| said, afaik, everyone seems to run inference with the same number
| of experts once set up and doesn't do dynamic selection.
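|
| A toy version of that routing, just to show why top-2 costs
| roughly twice the FLOPs of top-1 (not any real MoE
| implementation; real models route per token inside each layer):
|
|   import numpy as np
|
|   def moe_forward(x, experts, gate_w, top_k):
|       logits = x @ gate_w                # one gate score per expert
|       top = np.argsort(logits)[-top_k:]  # pick the top-k experts
|       w = np.exp(logits[top])
|       w = w / w.sum()                    # softmax over selected ones
|       # Each selected expert is an extra matmul, so compute
|       # scales with top_k.
|       return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))
|
|   rng = np.random.default_rng(0)
|   x = rng.standard_normal(16)
|   experts = [rng.standard_normal((16, 16)) for _ in range(8)]
|   gate_w = rng.standard_normal((16, 8))
|   print(moe_forward(x, experts, gate_w, top_k=1)[:3])
|   print(moe_forward(x, experts, gate_w, top_k=2)[:3])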
|
| [Yes] Another interpretation: broadly there's the question of
| "what factors matter the most" for LLM skill. If you include
| training compute as part of compute (amortize it or whatever),
| then, per the scaling-law papers, the 3 key things to keep in
| mind are [FLOPs, Parameters, Tokens of training data], and in
| these parameters there is seemingly power-law scaling of
| behavior, showing that if you can "increase these" then the
| resulting skill will also keep "improving". Hence, under this
| interpretation, "more processing power" (training) and "time per
| request" (bigger model / inference latency) are correlated with
| "better" LLMs.
|
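| In those papers the fitted curve usually looks something like
| loss ~= E + A / N^alpha + B / D^beta, where N is parameter count
| and D is training tokens. Toy constants below (not the published
| fit), just to show the shape of the power law:
|
|   def loss(n_params, n_tokens, E=1.7, A=400.0, B=400.0,
|            alpha=0.34, beta=0.28):
|       # irreducible loss + model-size term + data-size term
|       return E + A / n_params**alpha + B / n_tokens**beta
|
|   for n, d in [(1e9, 20e9), (10e9, 200e9), (70e9, 1.4e12)]:
|       print(f"{n:.0e} params, {d:.0e} tokens -> loss ~ "
|             f"{loss(n, d):.3f}")
|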
| [No] You mention this idea of "more recursive queries into their
| training data", and it's worth noting a trained model no longer
| has access to the training data. And in fact, the training data
| that gets sent to the model during training (e.g. when gradients
| are being computed and weights are being updated) is usually sent
| on some "schedule" (or sampling strategy), and isn't really
| something that is being adaptively controlled or dynamically
| "sampled" even during training. So the model doesn't have the
| ability to "look back" (unless it's a retrieval-style
| architecture or a RAG inference setup).
|
| [Yes] Another thing is the prompting strategy / decoding
| strategy, hinted at above. E.g. you can decode by just taking 1
| output, or you can take 10 outputs in parallel and rank them
| somehow (consensus ranking, or otherwise), and then yes, that can
| also improve results. (This was contentious when Gemini Ultra was
| released, because their benchmarks used slightly different
| prompting strategies than the GPT-4 ones, which made it even more
| opaque to determine the "better" score per cost as some meta-
| metric.) Some terms here are chain/tree/graph of thought, etc.
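|
| Sketch of the "sample several, take the consensus" idea (the
| sampler below is a made-up stand-in for whatever model call you
| would actually make):
|
|   import random
|   from collections import Counter
|
|   def consensus_answer(sample_answer, n=10):
|       # n independent samples -> roughly n times the compute
|       answers = [sample_answer() for _ in range(n)]
|       # majority vote as a crude "consensus ranking"
|       return Counter(answers).most_common(1)[0][0]
|
|   fake_model = lambda: random.choice(["42", "42", "42", "41"])
|   print(consensus_answer(fake_model, n=10))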
|
| [Yes (weak)] Next, there's another "concept" in your question
| about "more processing power leading to better results": you
| could argue "in-context learning" is itself more compute (it
| takes FLOPs to run the context tokens through the model, with N^2
| scaling, though caches help) - purely by "giving a model" more
| instructions at the beginning, you increase the compute and
| memory required, but also often "increase the skill" of the
| output tokens. So maybe in that regard, even a frozen model is a
| "yes": it does get smarter (with the right prompt / context).
|
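| Crude cost comparison for that point (pretending attention cost
| is just proportional to context_length^2 and ignoring caching
| and everything else):
|
|   for context_len in (100, 1000, 8000):
|       rel_cost = context_len ** 2
|       print(context_len, "context tokens ->", rel_cost,
|             "relative attention cost")
|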
| One interesting detail about current SotA models, even the
| Mixture-of-Experts style models, is that they're "static" in
| their weights and in the "flow" of activations along the "layer"
| direction. They only dynamically re-use weights in the
| "token"/"causal" ordering direction (the N^2 part). I've
| personally spent some time (~1 month in Nov last year) working on
| trying to make more advanced "adaptive models" that use switches
| like those in MoE-style networks, but route to "the same" QKV
| attention matrices, so that something like what you describe is
| possible (make the "number of layers" a dynamic property, and
| have the model learn to predict after 2 layers, 10 layers, or
| 5,000 layers, and see if "more time to think" can improve the
| results, do math with concepts, etc. -- but for there to be
| dynamic layers, the weights can't be "frozen in place" like they
| currently are) -- currently I have nothing good to show here
| though. One interesting finding (now that I'm rambling and just
| typing a lot) is that in a static model, you can "shuffle" the
| layers (e.g. swap layer 4's weights with layer 7's weights) and
| the resulting tokens seem roughly similar (likely because of the
| ResNet-style backbone). Only the first ~3 layers and last ~3
| layers seem "important to not permute". It kinda makes me
| interpret models as using the first few layers to get into some
| "universal" embedding space, operating in that space "without
| ordering in layer-order", and then "projecting back" to token
| space at the end (rather than staying in token space the whole
| way through). This is why I think it's possible to do more
| dynamic routing in the middle of networks, which I think is what
| you're implying when you say "do they make more recursive queries
| into their data". (I'm projecting, but when I imagine the idea of
| "self-reflection" or "thought" like that, inside of a model, I
| imagine it at this layer -- which, as far as I know, has not been
| shown/tested in any current LLM / transformer architecture.)
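|
| The shuffle itself is basically just an in-place swap of whole
| blocks, something like this (generic PyTorch sketch with a
| stand-in ModuleList, not the exact script; the attribute path to
| the transformer blocks depends on the model class you load):
|
|   import torch.nn as nn
|
|   def swap_layers(blocks: nn.ModuleList, i: int, j: int) -> None:
|       # swap two whole transformer blocks in place
|       blocks[i], blocks[j] = blocks[j], blocks[i]
|
|   # stand-in for a model's stack of 12 transformer blocks
|   blocks = nn.ModuleList(nn.Linear(8, 8) for _ in range(12))
|   before = list(blocks)
|   swap_layers(blocks, 4, 7)   # e.g. swap layer 4 with layer 7
|   assert blocks[4] is before[7] and blocks[7] is before[4]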
| robrenaud wrote:
| The inner-layer permutability is super interesting. Is that
| result published anywhere? It's consistent with the graph linked
| below, which seems to imply different layers are kind of working
| in very related latent spaces.
|
| If you skip to the graph that shows the attention + feed-forward
| displacements tending to align (after a 2d projection), is this
| something known/understood? Are the attention and feed-forward
| displacement vectors highly correlated and mostly pointing in the
| same direction?
|
| https://shyam.blog/posts/beyond-self-attention/
|
| Skip to the graph above this paragraph: "Again, the red arrow
| represents the input vector, each green arrow represents one
| block's self-attention output, each blue arrow represents one
| block's feed-forward network output. Arranged tip to tail,
| their endpoint represents the final output from the stack of 6
| blocks, depicted by the gray arrow."
| bluecoconut wrote:
| I haven't published it nor have I seen it published.
|
| I can copy paste some of my raw notes / outputs from poking
| around with a small model (Phi-1.5) into a gist though:
| https://gist.github.com/bluecoconut/6a080bd6dce57046a810787f...
| bluecoconut wrote:
| Those curves of "embedding displacement" are very
| interesting!
|
| Quickly scanning the blog led to this notebook, which shows how
| they're computed and shows other examples with similar behavior
| too:
| https://github.com/spather/transformer-experiments/blob/mast...
| tedivm wrote:
| The same model will not get better by having more processing
| power or time. However, that's not the full story.
|
| Larger models generally perform better than smaller models (this
| is a generalization, but a good enough one for now). The problem
| is that larger models are also slower.
|
| This ends up being a balancing act for model developers. They
| could get better results, but it may end up being a worse user
| experience. Model size can also limit where the model can be
| deployed.
| viraptor wrote:
| One caveat not mentioned yet is that you can get better responses
| through priming, few-shot examples, and chain of thought. That
| means if you start talking about a related problem/concept,
| mention some keywords, provide a few examples, and then ask the
| LLM to provide chain-of-thought reasoning, you will get a better
| answer. Those techniques do extend the runtime and processing
| power used in practice.
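|
| For example, a prompt in that style might look something like
| this (wording made up purely for illustration):
|
|   # Few-shot + chain-of-thought prompt: a worked example first,
|   # then the real question, then a nudge to reason step by step.
|   prompt = (
|       "Q: A train leaves at 3pm and arrives at 5:30pm. "
|       "How long is the trip?\n"
|       "A: Let's think step by step. 3pm to 5pm is 2 hours, "
|       "plus 30 minutes, so 2.5 hours.\n\n"
|       "Q: A movie starts at 7:15pm and ends at 9pm. "
|       "How long is it?\n"
|       "A: Let's think step by step."
|   )
|   print(prompt)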
| lolinder wrote:
| There's a misconception in the question that is important to
| address first: when an LLM is running inference it isn't querying
| its training data at all; it's just using a function that we
| created previously (the "model") to predict the next word in a
| block of text. That's it. All the magic of designing algorithms
| for LLMs comes _before_ the inference step, during the creation
| of the model.
|
| Building an LLM consists of defining its "architecture" (an
| enormous mathematical function that defines the model's _shape_)
| and then using a lot of trial and error to guess which
| "parameters" (constants that we plug into the function, like 'm'
| and 'b' in y=mx+b) will be most likely to produce text that
| resembles the training data.
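|
| To stretch the y=mx+b analogy into code (a toy example,
| obviously nothing like a real LLM):
|
|   def fit_line(xs, ys):
|       # "training": slow, looks at the data, picks m and b
|       n = len(xs)
|       sx, sy = sum(xs), sum(ys)
|       sxy = sum(x * y for x, y in zip(xs, ys))
|       sxx = sum(x * x for x in xs)
|       m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
|       b = (sy - m * sx) / n
|       return m, b
|
|   m, b = fit_line([1, 2, 3], [2, 4, 6])
|   predict = lambda x: m * x + b   # "inference": just evaluates
|   print(predict(10))              # never touches the data again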
|
| So, to your question: LLMs tend to perform better the more
| parameters they have, so larger models will tend to beat smaller
| models. Larger models _also_ require a lot of processing power
| and/or time per inferred token, so we do tend to see that better
| models take more processing power. But this is because _larger
| models_ tend to be better, not because throwing more compute at
| an existing model helps it produce better results.
___________________________________________________________________
(page generated 2024-02-25 23:00 UTC)