[HN Gopher] Differential Transformer
___________________________________________________________________
Differential Transformer
Author : weirdcat
Score : 401 points
Date : 2024-10-08 11:54 UTC (11 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| pikseladam wrote:
| Does this mean they solved the hallucination problem of
| transformers?
|
| edit: not fully, but it gives promising results. Quite an
| improvement, actually.
| lafreb wrote:
| The paper says that they've improved hallucination mitigation,
| but not really "solved" the issue.
| Rhapso wrote:
| "Hallucination" isn't really a problem that can be "fixed".
| It's just model error.
|
| The root problem is simply that the model doesn't capture
| reality, just an approximation. What we are incorrectly
| calling "hallucination" is just the best the model has to
| offer.
| dilap wrote:
| It can be fixed in theory if the model knows-what-it-knows,
| to avoid saying things it's uncertain about (this is what
| (some) humans do to reduce the frequency with which they say
| untrue things).
|
| There's some promising research using this idea, though I
| don't have it at hand.
| AnimalMuppet wrote:
| I'm pretty sure there's something I don't understand,
| but:
|
| Doesn't an LLM pick the "most probable next symbol" (or,
| depending on temperature, _one_ of the most probable next
| symbols)? To do that, doesn't it have to have some idea
| of what the probability is? Couldn't it then, if the
| probability falls below some threshold, say "I don't
| know" instead of giving what it knows is a low-
| probability answer?
| viraptor wrote:
| > Doesn't an LLM pick the "most probable next symbol"
|
| Yes, but that very rarely matters. (Almost never when
| it's brought up in discussions)
|
| > Couldn't it then, if the probability falls below some
| threshold, say "I don't know" instead of giving what it
| knows is a low-probability answer?
|
| A low probability doesn't necessarily mean something's
| incorrect. Responding to your question in French would
| also have very low probability, even if it's correct.
| There's also some nuance around what's classified as a
| hallucination... Maybe something in the training data did
| suggest that answer as correct.
|
| There are ideas similar to this one though. It's just a
| bit more complex than pure probabilities going down.
| https://arxiv.org/abs/2405.19648
| anon291 wrote:
| You need to separate out the LLM, which only produces a
| set of probabilities, from the system, which includes the
| LLM and the sampling methodology. Sampling is currently
| not very intelligent at all.
|
| The next bit of confusion is that the 'probability' isn't
| 'real'. It's not an actual probability but a weight that
| sums up to one, which is close enough to how probability
| works that we call it that. However, sometimes there are
| several good answers and so all the good answers get a
| lower probability because there are 5 of them. A fixed
| threshold is not a good idea in this case. Instead,
| smarter sampling methods are necessary. One possibility,
| when there does seem to be confusion, is to put a
| 'confusion marker' into the text, predict the next
| output, and train models to refine the answer as they go
| along. Not sure if any work has been done here, but this
| seems to go along with what you're interested in.
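| As a toy illustration of the thresholding idea discussed above (a
| hypothetical sketch with made-up numbers, not anything from the
| paper): a fixed cutoff on the top token probability can refuse
| exactly when probability mass is merely split across several good
| answers, while letting a confidently wrong answer through.
|
|     # Toy sketch: fixed-threshold "I don't know" sampling.
|     import numpy as np
|
|     def sample_with_threshold(probs, vocab, threshold=0.5, rng=None):
|         """Return a token, or a refusal if the top probability is low."""
|         rng = rng or np.random.default_rng(0)
|         if probs.max() < threshold:
|             return "<i-dont-know>"
|         return rng.choice(vocab, p=probs)
|
|     vocab = np.array(["Paris", "Lyon", "Nice", "Bordeaux", "Toulouse"])
|
|     # Confidently wrong is fine by this rule: one token hoards the mass.
|     confident = np.array([0.90, 0.04, 0.03, 0.02, 0.01])
|     # Five equally good continuations: each gets ~0.2, so the cutoff
|     # rejects an answer the model arguably "knows".
|     split = np.array([0.22, 0.21, 0.20, 0.19, 0.18])
|
|     print(sample_with_threshold(confident, vocab))  # a normal token
|     print(sample_with_threshold(split, vocab))      # "<i-dont-know>"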
| viraptor wrote:
| > However, sometimes there are several good answers and
| so all the good answers get a lower probability because
| there are 5 of them.
|
| That's the result after softmax. If you want to act on
| the raw results, you can still do that.
| skydhash wrote:
| You would need some kind of referential facts that you
| hold as true, then some introspection method to align
| sentences to those. If it can't be done, the output may
| be "I don't know". But even for programming languages
| (simplest useful languages), it would be hard to do.
| PaulHoule wrote:
| My guess is the problem is words with high probabilities
| that happen to be part of a wrong answer.
|
| For one thing, the probability of a word occurring is just
| the probability of the word occurring in a certain sample;
| it's not an indicator of truth. (e.g. the most
| problematic concept in philosophy, in that just
| introducing it undermines the truth, see "9/11 truther")
| It's also not sufficient to always pick a "true" word;
| rather, the truthfulness of a statement needs to be
| evaluated based on the statement as a whole.
|
| A word might have a low probability because it competes
| with a large number of alternatives that are equally
| likely which is not a reason to stop generation.
| darkPotato wrote:
| My understanding is that the hallucination is, out of all
| the possibilities, the most probable one (ignoring
| temperature). So the hallucination is the most probable
| sequence of tokens at that point. The model may be able
| to predict an "I don't have that information" given the
| right context. But ensuring that in general is an open
| question.
| dTal wrote:
| It doesn't really work like that.
|
| 1) The model outputs a ranked list of all tokens; the
| probabilities always sum to 1. Sometimes there is a clear
| "#1 candidate", very often there are a number of
| plausible candidates. This is just how language works -
| there are multiple ways to phrase things, and you can't
| have the model give up every time there is a choice of
| synonyms.
|
| 2) Probability of a token is not the same as probability
| of a fact. Consider a language model that knows the
| approximate population of Paris (2 million) but is not
| confident about the exact figure. Feed such a model the
| string "The exact population of Paris is" and it will
| begin with "2" but halfway through the number it will
| have a more or less arbitrary choice of 10 digits. "2.1I
| don't know" is neither a desirable answer, nor a
| plausible one from the model's perspective.
| ithkuil wrote:
| This may work when the next token is a key concept. But
| when it's a filler word, or part of one of many
| sequences of words that can convey the same meaning in
| different ways (synonyms not only at the word level but
| also at the sentence level), then it's harder to know whether
| the probability is low because the word is absolutely
| unlikely or because its likelihood is spread/shared
| among other truthful statements.
| atrus wrote:
| I don't think that fixes it, even in theory, since
| there's always _some_ uncertainty.
| hoosieree wrote:
| LLMs can't hallucinate. They generate the next most
| likely token in a sequence. Whether that sequence matches
| any kind of objective truth is orthogonal to how models
| work.
|
| I suppose depending on your point of view, LLMs either
| _can't_ hallucinate, or _that's all they can do_.
| CooCooCaCha wrote:
| Whenever someone takes issue with using the word
| "hallucinate" with LLMs I get the impression they're
| trying to convince me that hallucination is good.
|
| Why do you care so much about this particular issue? And
| why can't hallucination be something we can aim to
| improve?
| ToValueFunfetti wrote:
| >Whether that sequence matches any kind of objective
| truth is orthogonal to how models work.
|
| Empirically, this cannot be true. If it were, it would be
| statistically shocking how often models coincidentally
| say true things. The training does not perfectly align
| the model with truth, but 'orthogonal' is off by a
| minimum of 45 degrees.
| viraptor wrote:
| It matches the training data. Whether the training data
| matches truth (and whether it's correctly understood -
| sarcasm included) is a completely separate thing.
|
| > The training does not perfectly align the model with
| truth, but 'orthogonal'
|
| Nitpicky, but the more dimensions you have, the easier it
| is for almost everything to be orthogonal.
| (https://softwaredoug.com/blog/2022/12/26/surpries-at-hi-
| dime...) That's why averaging embeddings works.
| timcobb wrote:
| Isn't this the same thing that happens when you train a
| human on truths vs falsehoods?
| ToValueFunfetti wrote:
| I went to school to learn about the world and the
| overwhelming majority of that learning was from
| professors and textbooks. Whether the professors' beliefs
| and the textbooks' contents reflected the true properties
| of the world was a completely separate thing, entirely
| outside of my control. But I did come away with a better
| understanding of the world and few would say that
| education is orthogonal to that goal.
|
| If you add two vectors that don't have a truth component
| (ie. are orthogonal to the truth), the resulting vector
| should be no closer to the truth. If you start with
| random weights and perform some operation on them such
| that the new weights have a higher likelihood of
| producing true statements, the operation must not have
| been orthogonal to the truth. Am I wrong there?
| viraptor wrote:
| > But I did come away with a better understanding of the
| world and few would say that education is orthogonal to
| that goal.
|
| That's due to the reward function / environment. But even
| outside extremes like North Korea, lots of education
| environments value conformity over independent analysis.
| ToValueFunfetti wrote:
| Certainly an AI trained on North Korean data would emerge
| with some very suspect beliefs regarding Kim Jong-Un. My
| point is just that aligning something with training data
| is aligning it with truth, to the degree that the
| training data is true and regardless of why it is true.
| educate(me, truth) can hardly be called orthogonal to the
| truth, even if the 'educate' and 'me' terms do nothing to
| prevent educate(me, falsehood).
| visarga wrote:
| This reminds me it's easy to train similarity models,
| hard to train identity/equivalence prediction. Two
| strings can be similar in many ways, like "Address Line
| 1" and "Address Line 2" or "Position_X" and "Position_Y",
| yet distinct in meaning. That one character makes all the
| difference. On the other hand, "Vendor Name" is equivalent
| to "Seller Company" even though they are pretty
| different lexically.
|
| The dot product, which is at the core of attention, is
| good for similarity not identity. I think this is why
| models hallucinate - how can they tell the difference
| between "I have trained on this fact" and "Looks like
| something I trained on"?
| tucnak wrote:
| I'm led to believe this is mostly because "known unknowns"
| are not well-represented in the training datasets... I
| think, instead of bothering with refusals and enforcing a
| particular "voice" with excessive RL, they ought to focus
| more on identifying "gaps" in the datasets and feeding them
| back, perhaps they're already doing this with synthetic
| data / distillation.
| spencerchubb wrote:
| it's not "just" model error
|
| during pre-training, there is never an incentive for the
| model to say "I don't know" because it would be penalized.
| the model is incentivized to make an educated guess
|
| large transformer models are _really_ good at approximating
| their dataset. there is no data on the internet about what
| LLMs know. and even if there were such data, it would
| probably become obsolete soon
|
| that being said, maybe a big shift in the architecture
| could solve this. I hope!
| happypumpkin wrote:
| > it would probably become obsolete soon
|
| Suppose there are many times more posts about something
| one generation of LLMs can't do (arithmetic, tic-tac-toe,
| whatever), than posts about how the next generation of
| models _can_ do that task successfully. I think this is
| probably the case.
|
| While I doubt it will happen, it would be somewhat funny
| if training on that text caused a future model to claim
| it can't do something that it "should" be able to because
| it internalized that it was an LLM and "LLMs can't do X."
| spencerchubb wrote:
| also presumes that the LLM knows it is an LLM
| adwn wrote:
| System prompts sometimes contain the information that
| "it" is an LLM.
|
| Maybe in the future, those prompts will include
| motivational phrases, like "You can do it!" or "Believe
| in yourself, then you can achieve anything."
| Vecr wrote:
| They're generally fine tuned not to. I'm not sure how
| long that will hold though.
| spywaregorilla wrote:
| > during pre-training, there is never an incentive for
| the model to say "I don't know" because it would be
| penalized. the model is incentivized to make an educated
| guess
|
| The guess can be "I don't know". The base LLM would
| generally only say I don't know if it "knew" that it
| didn't know, which is not going to be very common. The
| tuned LLM would be the level responsible for trying to
| equate a lack of understanding to saying "I don't know"
| singularity2001 wrote:
| In another paper which popped up recently, they
| approximated uncertainty with entropy and inserted
| "wait!" tokens whenever entropy was high, simulating
| chain of thought within the system.
| watsonmusic wrote:
| that would be huge!
| HarHarVeryFunny wrote:
| I don't think there's any narrow definition of what
| "hallucination" means. It generally refers to the model giving
| non-factual answers in contexts that are meant to be factual,
| but not all causes of this are going to be fixable without very
| major changes.
|
| The fundamental issue is that most of the time LLMs are going
| to be combining statistics derived from many training samples
| when generating a single continuation, and there is just no
| guarantee that this will result in a semantically coherent
| response. Of course the model's depth of parsing and semantic
| analysis usually means that each generated word is highly
| plausible, but this isn't the same as being factually correct,
| especially so in these cases where the model is drawing on
| multiple sources to create a mashup response, which is the
| normal mode of operation.
| ExxKA wrote:
| Very interesting. Currently working on timeseries with
| Transformers. Let me know if anyone else out there is also
| reading it from that context.
| d3m0t3p wrote:
| Really cool. I'm a CS student majoring in AI, but I'm also
| interested in that domain; would you have any recommendations
| to get started?
| ExxKA wrote:
| Get a lot of data, and just dig in :) No better way to learn.
| magicalhippo wrote:
| _The visualization reveals that Transformer tends to allocate
| only a small proportion of attention scores to the correct
| answer, while disproportionately focusing on irrelevant context._
|
| _[...] Specifically, we partition the query and key vectors into
| two groups and compute two separate softmax attention maps. Then
| the result of subtracting these two maps is regarded as attention
| scores._
|
| _[...] The approach is analogous to noise-canceling headphones
| and differential amplifiers in electrical engineering, where the
| difference between two signals cancels out common-mode noise._
|
| Simple change, with seemingly decent improvements across the
| board.
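| For readers who want the quoted mechanism in code: a minimal
| single-head numpy sketch of differential attention as described
| above. Assumptions: one head and a fixed scalar lambda; the paper
| additionally uses multiple heads, a learned reparameterized
| lambda, and per-head normalization, all omitted here.
|
|     import numpy as np
|
|     def softmax(x, axis=-1):
|         e = np.exp(x - x.max(axis=axis, keepdims=True))
|         return e / e.sum(axis=axis, keepdims=True)
|
|     def diff_attention(X, Wq, Wk, Wv, lam=0.8):
|         # Project, then split queries/keys into two groups.
|         Q, K, V = X @ Wq, X @ Wk, X @ Wv
|         Q1, Q2 = np.split(Q, 2, axis=-1)
|         K1, K2 = np.split(K, 2, axis=-1)
|         d = Q1.shape[-1]
|         A1 = softmax(Q1 @ K1.T / np.sqrt(d))  # first attention map
|         A2 = softmax(Q2 @ K2.T / np.sqrt(d))  # second attention map
|         return (A1 - lam * A2) @ V            # difference acts as scores
|
|     rng = np.random.default_rng(0)
|     n, d_model = 5, 16
|     X = rng.normal(size=(n, d_model))
|     Wq, Wk, Wv = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(3))
|     print(diff_attention(X, Wq, Wk, Wv).shape)  # (5, 16)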
| msoad wrote:
| Like most things in this new world of Machine Learning, I'm
| really confused about why this works.
|
| The analogy to noise-cancelling headphones is helpful, but in that
| case we clearly know which is signal and which is noise. Here, if
| we knew that, why would we even bother with the noise cancelling?
| watsonmusic wrote:
| the model is supposed to learn this
| _hl_ wrote:
| Some of the "prior art" here is ladder networks and to some
| handwavy extent residual nets, both of which can be interpreted
| as training the model on reducing the error to its previous
| predictions as opposed to predicting the final result directly.
| I think some intuition for why it works has to do with changing
| the gradient descent landscape to be a bit friendlier towards
| learning in small baby steps, as you are now explicitly
| designing the network around the idea that it will start off
| making lots of errors in its predictions and then get better
| over time.
| HarHarVeryFunny wrote:
| I don't understand either. It seems the general idea is that
| they calculate attention twice, which due to random
| initialization might be expected to give two slightly different
| results. I'd have thought that what these two attention maps
| would have in common would be the signal, and where they would
| differ would be noise, so rather than subtracting them
| (resulting in all noise?!) what you really want is to add (so
| the common signal gets reinforced) and normalize.
| Carlseymanh wrote:
| I think there might be some commonalities with systems
| engineering, where you subtract the output from the input in
| order to get a control signal that steers the plant to the
| target values. I too fail to see how that would be supposed
| to work in practice.
| kelseyfrog wrote:
| The values between the groups are also going to diverge
| during training due to the structure of the DiffAttn
| equation.
|
| The analogy I can think of is when you're paying attention to
| a variety of things and you actively avoid concentrating on
| something because it will distract you. You don't give it
| zero attention, you give it negative attention.
| blackbear_ wrote:
| With a single softmax you cannot predict exactly 0, but only
| very small numbers. When you have a large number of values to
| add up, this "poisons" the output with a lot of irrelevant
| stuff (the noise mentioned in the paper).
|
| To make things worse, low attention values will have very low
| gradient, thus needing a lot of weight updates to undo that
| kind of mistake. On the other hand, subtracting the outputs of
| two softmaxes allows the model to predict a weight of exactly
| zero for some of the values, while keeping a reasonable
| gradient flowing through.
|
| So the model already knows what is noise, but a single softmax
| makes it harder to exclude it.
|
| Moreover, with a single softmax the output of all heads is
| forced to stay in the convex hull of the value vectors, whereas
| with this variant each head can choose its own lambda, thus
| shifting the "range" of the outputs outside the convex hull
| pre-determined by the values. This makes the model as a whole
| more expressive.
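| A tiny numeric check of the zero-weight point above (toy numbers,
| not from the paper): a single softmax keeps every irrelevant
| position strictly positive, while a lambda-scaled difference of
| two softmaxes can cancel them to (numerically) zero.
|
|     import numpy as np
|
|     def softmax(x):
|         e = np.exp(x - x.max())
|         return e / e.sum()
|
|     scores1 = np.array([4.0, 0.0, 0.0, 0.0])  # signal on position 0
|     scores2 = np.array([1.0, 0.0, 0.0, 0.0])  # weaker peak, same flat "noise"
|
|     a1, a2 = softmax(scores1), softmax(scores2)
|     print(a1)               # irrelevant positions get ~0.017 each, never 0
|     lam = a1[1] / a2[1]     # choose lambda so the flat positions cancel
|     print(a1 - lam * a2)    # ~0 on the irrelevant positions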
| freeqaz wrote:
| I'm able to follow most of what you're saying. It's unclear
| to me what "convex hull" means though.
|
| Also, where is each softmax happening here? For each
| attention head?
| blackbear_ wrote:
| The convex hull of a set of points is the region "between"
| those points. So the convex hull of three points (that do
| not lie on the same line) is a triangle with those three
| points as vertices. If you add a fourth point inside the
| triangle, the convex hull remains the same, but if you add
| it outside then the convex hull becomes the four-sided
| region with those points as vertices.
|
| In the context of standard transformer attention, each
| output lies in the convex hull ("somewhere between") the
| input values. With the modification of this paper, the
| input values can be scaled a little so that the output of
| different heads can be in different "regions" and thus do
| not interfere with each other (so yes to your third
| question, the two softmaxes are performed separately for
| each head).
| Majromax wrote:
| > It's unclear to me what "convex hull" means though.
|
| The convex hull (https://en.wikipedia.org/wiki/Convex_hull)
| of a set is the smallest convex shape that includes that
| set. Geometrically, it's what you'd get if you "shrink
| wrapped" the thing you're looking at: edges still protrude,
| but any indentations get smoothed over.
|
| In this context, the grandparent comment is pointing out
| that with a traditional transformer block, the resulting
| computed value for a token can never "stick out" past some
| weighted average of the values of attended-to tokens, but
| this differential attention formalism allows that result.
| pizza wrote:
| O_i = softmax(...) * V_i and softmax is between 0 and 1, so
| O_i = alpha * V_i for some alpha between 0 and 1 so that
| makes it convex, and it makes the O_i just a shrunken
| version of V_i. Whereas if you have the diff of softmaxes,
| you get O_i = (alpha - beta) * V_i, which can range from
| -V_i to +V_i, so its output could rescale /or/ flip V_i.
| And yes this is happening in every head in parallel, then
| they get summed.
| kridsdale3 wrote:
| By simply inputting your comment into 4o, with no other
| context about the paper, I was able to get a pretty good
| analysis of the dual-head concept's implications.
|
| https://chatgpt.com/share/67058973-ba94-8008-bed7-c7f9d08
| dc5...
| spwa4 wrote:
| Uh, this is extracting a LOT from very little data. I
| don't understand where it's coming from, but its
| explanation just keeps going into more and more detail
| ... that doesn't seem to follow from the data it's got.
|
| I just don't see how you could answer these questions
| without trying it out. And ChatGPT DEFINITELY isn't doing
| that.
|
| Plus the obvious question I'd pose is not in there.
| What's the difference in performance between this trick
| and just "softmax() - 0.5 * 2" ? That seems very
| relevant.
| robertsdionne wrote:
| It means one of these things:
| https://en.wikipedia.org/wiki/Simplex#Standard_simplex
| dartos wrote:
| > predict a weight of exactly zero for some of the values
|
| Wouldn't this be pretty unlikely, though?
| schopra909 wrote:
| Quite the opposite -- if you have a long sequence only a
| smattering of the words will influence the meaning of the
| current word. Everything else is "noise".
|
| Attention is really good at finding this smattering of
| words (ie assign most weight there). But it struggles to
| put exactly 0 on the other words.
| absoflutely wrote:
| why say lot word when few word do
| dartos wrote:
| Few word no do tho
| 1024core wrote:
| Phew!
| kridsdale3 wrote:
| U+1FAE5
| dartos wrote:
| I mean wouldn't it be unlikely that
|
| SoftmaxA[n] - SoftmaxB[n] is exactly 0?
|
| Even if 2 attention layers learn two different things, I
| would imagine the corresponding weights in each layer
| wouldn't exactly cancel each other out.
| nyrikki wrote:
| While I don't discount the value of this, can you expand on
| the meaning of your claim that it makes the model 'more
| expressive'?
|
| Everything I am seeing in this paper is related to reduced
| size and noise, which implies a reduction in expressiveness.
|
| The improvements in needle-in-a-haystack benchmarks,
| multi-hop questions on in-corpus data, and multi-shot in-
| context learning point to this.
|
| This is a wonderful thing if robustness is more important
| than generality, but it doesn't address trimming away
| activations that may be spurious in the general use case but
| may improve an individual domain specificity.
|
| Context would dramatically impact which tradeoffs are more
| desirable, and noise is probably never desirable. But the
| ability of this paper to enable smaller bit widths for
| inference points to a reduction in expressiveness.
|
| Perhaps I am too focused on generalization?
| blackbear_ wrote:
| What I meant is that by changing lambda each attention head
| is able to put its outputs in a subspace that is different
| than that of the other heads. This means that the outputs
| of different heads do not mingle with each other, and it's
| easier for the following layer to pick them apart. So I was
| thinking of increased expressiveness because the attention
| output can in principle cover a larger volume.
|
| Maybe expressiveness is not the right term, or not the main
| consequence. I could imagine that having different
| subspaces like that also introduces a degree of robustness
| to out-of-distribution inputs, as this would make it harder
| for the outputs of one attention head to shift towards the
| in-distribution outputs of another head, and thus for the
| following layer to confuse them.
| espadrine wrote:
| It is a neat approach, but one that comes with a tradeoff,
| IIUC: doubling the key heads.
|
| I wonder if a different approach without that issue exists.
| For instance, using max(0, exp(x)-1) instead of exp(x) in the
| softmax attention formula. That way when the query is
| orthogonal to the key (or worse), it does not contribute.
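| A small sketch of the parent's suggestion (the commenter's idea,
| not something from the paper): replacing exp(x) with
| max(0, exp(x) - 1) in the normalization means keys whose score is
| at or below zero contribute exactly nothing.
|
|     import numpy as np
|
|     def relu_exp_weights(scores):
|         num = np.maximum(0.0, np.exp(scores) - 1.0)
|         z = num.sum()
|         return num / z if z > 0 else num  # all-zero row if nothing matches
|
|     print(relu_exp_weights(np.array([2.0, 0.5, 0.0, -1.0])))
|     # -> weight on the first two entries, exactly 0 on the rest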
| smallnamespace wrote:
| > using max(0, exp(x)-1) instead of exp(x)
|
| Won't this cause the gradient to vanish on the left half,
| causing problems with training?
| x1000 wrote:
| Could you help explain how we would achieve an attention
| score of exactly 0, in practice? Here's my take:
|
| If we're subtracting one attention matrix from another, we'd
| end up with attention scores between -1 and 1, with a
| probability of effectively 0 for any single entry to exactly
| equal 0.
|
| What's more, the learnable parameter \lambda allows for
| negative values. This would allow the model to learn to
| actually add the attention scores, making a score of exactly
| 0 impossible.
| jszymborski wrote:
| Your comment brings up two interesting variants that could
| be interesting if your goal is to increase the sparsity of
| the attention:
|
| - Rectify the difference of the softmaxes: max(0, s(A1) -
| lambda s(A2)).
|
| - Apply the Heaviside function to the second softmax:
| softmax(A1) - lambda H(s(A1) - lambda s(A2)).
|
| The second one being a bit more drastic and maybe harder to
| train.
| phire wrote:
| Noise cancelling headphones are probably the wrong analogy
| here.
|
| The better example is the differential signalling used in
| professional audio and many digital signaling protocols like
| Ethernet, HDMI and USB.
|
| Instead of using one wire referenced to ground, they send the
| signal as the difference between two wires. Both wires end up
| carrying the same signal with inverted polarity. Because both
| wires are running next to each other, any external noise will be
| applied to both equally.
|
| The voltage will change, but the difference in voltage between
| both wires is untouched. And when you subtract the two voltages
| at the receiver end, any noise simply gets subtracted out.
| seamossfet wrote:
| I think when they bring up differential amplifiers they're
| referring more to the DSP technique of how headphone noise
| cancelling works, but the actual electrical properties of how
| a differential amplifier does that muddy the message a bit.
|
| It sort of feels closer to heterodyning and "demodulating"
| the signal encoded in the softmax. Those tiny little errors
| we're trying to denoise with this technique are almost closer
| to carrier waves (when encoded to softmax) than noise imo.
| This wouldn't get rid of noise in the training data or noise
| in the dimensionality of the key / value space. It's really
| only removing noise introduced by the process itself.
| seamossfet wrote:
| It sounds like they're just splitting the query / key space
| down the middle. We don't know which dimensions are encoded in
| each matrix, but they're assuming the "noise" introduced in one
| query / key space is equivalent to noise introduced in the
| other space.
|
| If that is the case, then the "signal" in this case would be
| the softmax that encodes the dimensions captured by the query /
| key space. Since the noise ideally is the same in both softmax
| encodings, subtracting them should "cancel out" the noise.
| WithinReason wrote:
| Don't look for an analogy, this just adds a new mathematical
| capability. It enables "negative attention", the network can
| say "I want to subtract the contribution of this token" in the
| attention calculation. Previously it could only reduce how much
| it adds.
|
| The simple way of doing this would be to just remove the
| softmax or use a sigmoid instead, but in practice a softmax
| works better it seems.
| mistercheph wrote:
| I think common mode filtering in balanced audio cables is a
| much better analogy than noise canceling headphones (and where
| this paper gets its name from I assume), you don't know what
| the noise is ahead of time, but if you take two samples with
| one positive and one negative, noise displaces both absolutely,
| which you can take advantage of to denoise the signal (find the
| differential mode).
|
| For example, if you are trying to send a +1V signal on one
| wire, and a -1V signal on the other and a +0.5V noise exists,
| one wire will have +1.5V and the other will have -0.5V,
|
| Take the difference and divide by 2:
|
| (+1.5V - -0.5V) / 2 = +1V or, if your setup is different (-0.5V
| - +1.5V) / 2 = -1V
| chessgecko wrote:
| My hypothesis for why this works is that it mitigates the
| downsides of rope.
|
| to eli5:
|
| rope is the modern strategy used to give information to the
| model about how far a query and a key are apart when doing
| attention. It's the best strategy we have now, but has a major
| downside, where it makes some connections between tokens that
| are far apart much stronger than you would like them to be.
| Xpos (https://arxiv.org/pdf/2212.10554) is another paper by
| microsoft tackling issues with rope and you can see figure 1 on
| page 4 to get a visual interpretation of the sinusoidal
| attention strength (you would like it to be smooth).
|
| I think a big reason differential transformers work so
| well, especially on long sequence stuff, is that when both q1
| and q2 don't match a token, the rope relative strength will
| still have the same value and the noise will cancel out,
| leaving intended matches, but at the cost of somewhat dampening
| the original value rope brought.
|
| Just a hypothesis though. It would be easy to test by running
| this experiment against a baseline where both use alibi
| attention (https://arxiv.org/pdf/2108.12409) which has a
| different set of tradeoffs this wouldn't mitigate, but still a
| really interesting result.
| watsonmusic wrote:
| The modification is simple and beautiful. And the improvements
| are quite significant.
| campers wrote:
| The tl;dr on high level performance improvements
|
| "The scaling curves indicate that Diff Transformer requires only
| about 65% of model size or training tokens needed by Transformer
| to achieve comparable language modeling performance."
|
| "Diff Transformer retains high performance even at reduced bit-
| widths, ranging from 16 bits to 6 bits. In comparison,
| Transformer's accuracy significantly drops with 6-bit
| quantization. The 4-bit Diff Transformer achieves comparable
| accuracy as the 6-bit Transformer, and outperforms the 4-bit
| Transformer by about 25% in accuracy."
| digdugdirk wrote:
| Is there any way to replicate this with existing models, or are
| we going to need to wait for models to be trained in this style?
|
| I'm imagining a smaller model examining the output tokens of a
| larger model and metaphorically slapping it on the wrist with a
| ruler if the output tokens start drifting off topic. Not quite
| the same, but an entertaining thought nonetheless.
| causal wrote:
| It's a different attention mechanism with a different map
| setup, so fundamentally a different type of model
| om8 wrote:
| Looks like it is a drop in replacement for attention, but
| models will need to be retrained for this one, yes.
| aDyslecticCrow wrote:
| It may not need to be entirely retrained. The value spans
| and input are the same, and no extra weights are needed.
| You may be able to tune an existing model with this
| attention mechanism and get some of the benefits.
|
| But overall... it's mainly a training change, so training
| is needed to make a difference.
| bionhoward wrote:
| Yes, I believe this is possible, you could clone weights of one
| or more existing models and fine tune them in groups with
| different random seeds for noise/drop to produce reasonable
| outputs under a differential transformer decoding scheme
| whereby tokens with disagreement receive more attention
| (surprisal analysis)
| patcon wrote:
| I wonder what is lost here. Surely there's a trade-off...
|
| I'm wondering if there's any effect of "creativity", or ability
| to interpolate between concepts. Hallucination and creativity
| feel very related to me. I understand hallucinating as simply
| being misaligned with the space humans feel appropriate to
| interpolate between
| watsonmusic wrote:
| Not all hallucinations are creativity. Imagine that for a RAG
| application, the model is supposed to follow the given
| documents.
| magicalhippo wrote:
| > Surely there's a trade-off...
|
| For one, speed and memory. They have twice as many Q and K
| weights in the attention blocks, leading to a ~10% reduction in
| throughput on their H100 (table 7 in appendix A).
| lennxa wrote:
| they mention similar performance to vanilla transformer with
| significantly reduced param count though
| karmasimida wrote:
| I mean, it doesn't necessarily need 2x QK to match the
| performance, in terms of accuracy, of a regular transformer,
| right?
| dartos wrote:
| > Hallucination and creativity feel very related to me.
|
| Why? I see them as just sampling errors.
|
| Sure a mistake can spark inspiration sometimes, but creativity
| is much more than mistakes.
|
| > I understand hallucinating as simply being misaligned with
| the space humans feel appropriate to interpolate between
|
| These language models are next-token predictors. The way the
| next token is predicted is by sampling a probability space
| outputted by the model.
|
| That sampling process can be non deterministic.
|
| Hallucinations are when that sampling results in tokens that
| come together to create a false or otherwise unintended
| statement.
|
| You can just as well think of everything a model outputs as a
| hallucination, but we train the model so that the hallucinations
| we want are more likely. Otherwise it just
| outputs meaningless noise.
|
| "Hallucinate" is really an awful word for what it's trying to
| describe.
| nextaccountic wrote:
| > Sure a mistake can spark inspiration sometimes, but
| creativity is much more than mistakes.
|
| It looks like creativity has many steps, but being able to
| come up with novel, unprompted stuff is important, as long as
| you are able to discard the bullshit earlier.
|
| "Hallucination" is only a problem if later layers (or
| additional networks) can't detect and remove it
| dartos wrote:
| > "Hallucination" is only a problem if later layers (or
| additional networks) can't detect and remove it
|
| Yeah I mean sure. Anything is only a problem if it goes
| undetected. The issue is that if you rely on statistical
| model, you'll always have hallucinations, so you can't
| filter statistical output with another statistical model if
| you need real guarantees.
|
| Many products don't need those guarantees though.
| thomastjeffery wrote:
| Hallucinate is an awful word _because of_ what it is trying
| to describe.
|
| Hallucination describes the same feature you just called "non
| deterministic sampling", but exclusively the cases that we
| don't like. It would be really convenient if we could
| actually draw that line, but _we can 't_. If non-determinism
| is a core feature, then that feature will be present in every
| case; including the ones we find desirable, and the ones we
| find undesirable.
| skybrian wrote:
| LLMs are too unpredictable for many practical uses, so I'd
| guess better predictability is better. Hopefully the change
| the paper proposes will help!
|
| But here's a case for the other side: sure, most mistakes are
| just errors, but evolution happens via "mistakes." Also,
| LLMs often deliberately add randomness at inference
| time.
| dartos wrote:
| > evolution happens via "mistakes."
|
| That's a nice slogan, but it's a gross oversimplification.
|
| In the natural world, you can say that mistakes in DNA
| replication leads to evolution, but that's discounting the
| entire process of natural selection.
|
| Same with creativity. Look at Picasso. He was a
| technically brilliant realistic painter at 15, but his work
| later in life evolved to be more abstract and weird. I
| don't think that was the result of mistakes, but rather
| intentionally breaking patterns he learned in his youth.
| skybrian wrote:
| To oversimplify, evolution is a generate-and-test process
| and the evaluation step is critical. Something needs to
| decide which variations are better. Often, with
| generative AI, it's people who judge the results. Still,
| generating interesting examples (the brainstorming phase)
| plays _some_ role in that.
|
| I don't know a whole lot about Picasso's art, but I
| imagine the way he evaluated his own work played an
| important role, in being able to see that sometimes
| creative accidents are interesting.
| slashdave wrote:
| > You can just as well think of everything a model outputs as
| a hallucination
|
| Exactly. Don't forget that an important factor in the success
| of GPT3 was RLHF, which is essentially training the model to
| produce "hallucinations" that are more acceptable on average
| to human trainers.
| pxdm wrote:
| What's the comparison with conventional attention using a more
| aggressive (lower temperature) softmax? I can imagine that for
| the multi-needle retrieval test this may also give a performance
| boost, although at some cost to other, more creative tasks.
| mota7 wrote:
| I had the same thought: Just eye-balling the graphs, the result
| of the subtraction looks very close to just reducing the
| temperature.
|
| They're effectively doing softmax with a fixed temperature, but
| it's unclear that this work is going to do better than just
| learning a per-head temperature parameter.
|
| c.f. https://arxiv.org/abs/2010.04245 which shows an
| improvement by learning per-head temperature.
|
| The other way to think about this is that it looks like a
| hacked-up kinda-sorta gated attention. If that's the case, then
| doing softmax(alpha * q_1 k_1^T - log_sigmoid(beta * q_2
| k_2^T)) might be better? (where alpha, beta are learned
| temperatures).
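| A rough sketch of the per-head learned temperature alternative the
| parent mentions (an illustrative assumption, not the paper's
| method): each head gets its own trainable temperature that scales
| the logits before the softmax.
|
|     import numpy as np
|
|     def softmax(x, axis=-1):
|         e = np.exp(x - x.max(axis=axis, keepdims=True))
|         return e / e.sum(axis=axis, keepdims=True)
|
|     def tempered_attention(Q, K, V, log_tau):
|         # Q, K, V: (heads, seq, d); log_tau: (heads,) learned per head
|         d = Q.shape[-1]
|         tau = np.exp(log_tau)[:, None, None]  # keep temperature positive
|         A = softmax(Q @ K.transpose(0, 2, 1) / (np.sqrt(d) * tau))
|         return A @ V
|
|     rng = np.random.default_rng(0)
|     Q, K, V = (rng.normal(size=(4, 6, 8)) for _ in range(3))
|     print(tempered_attention(Q, K, V, np.full(4, -0.5)).shape)  # (4, 6, 8)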
| nmacias wrote:
| AdderaLLM was _right there_
| vsroy wrote:
| Is the thing that's going on here that softmax can't push a value
| to 0, but by subtracting 2 softmax maps we can output 0s?
| pkoird wrote:
| Or negatives
| vsroy wrote:
| Follow-up question is: Isn't it extremely unlikely to output 0?
| pizza wrote:
| Was just going to mention that it seems that it should be
| possible to make a Flash Attention version of this algorithm and
| was pleasantly surprised to see they already included an
| implementation of one :)
| iandanforth wrote:
| The key bit I didn't understand at first was what happens if the
| two groups of attention learn the same thing: because their
| attention masks are subtracted from one another, if they both
| output similar values the attention across the board will drop to
| zero, and this will lead to high loss. So the only way to reduce
| loss is if they learn to attend to different things. One of the
| simplest strategies they could learn (and this paper claims that
| they do) is for one group to focus on relevant context and the
| other to focus on irrelevant context. Thus one group learns the
| noise and the other the signal (it's not this cut and dry but is
| a useful simplification for understanding IMO).
| dartos wrote:
| There's probably a small chance that they could both learn the
| same thing, but it's probably not likely enough to be a major
| issue.
| magicalhippo wrote:
| An interesting aspect is that they don't do a plain
| subtraction, but rather subtract a portion of the second
| softmax.
|
| This makes sense, if one considers that the two copies are
| identical then the softmax outputs would be identical and the
| difference is zero everywhere. However, by subtracting a scaled
| copy, the normalization of the difference seems to really boost
| the signal value(s) over the "noise", making the signal stand
| out compared to pre-normalization.
| testdfkjahdfh wrote:
| if two attentions A, B are identical, would (A - lambda * B)
| be just (1-lambda) * A, how does it "boost the signal
| value(s) over the "noise""?
| nextaccountic wrote:
| Maybe the loss function could penalize them learning the same
| thing?
| patcon wrote:
| > what happens if the two groups of attention learn the same
| thing
|
| I wonder if there's a metaphor here for our own experience and
| utility in "surprise".
|
| Like if one attention head is surprised by what another learns,
| up-weight it. But if they both find the same, assume it's not
| very surprising and down-weight it.
|
| Admittedly, "surprise" is something that has a big section in
| my knowledgebase[1][2][3] (both as a subjective feeling and an
| adaptive function of our minds, one of the most complex
| adaptive systems we know of).
|
| [1] https://plus.maths.org/content/information-surprise
|
| [2] https://blakeelias.name/papers/Multi-Agent-Cooperation-
| Intri...
|
| [3] https://complexity.simplecast.com/episodes/81/transcript
| dartos wrote:
| > By being less distracted by irrelevant context, Diff
| Transformer can mitigate hallucination in question answering and
| text summarization
|
| I'm very interested in this claim. I was under the impression
| that hallucination is unavoidable in these kinds of models. IIRC
| proof for that was trending on HN a couple weeks ago.
| moffkalast wrote:
| It's not possible to get rid of it entirely, but if you can get
| the model to bullshit only 0.1% of the time instead of 5% of
| the time it's a massive improvement.
|
| Most of it should be happening when there's no data to draw
| conclusions from. E.g. STT models make up words in silence,
| vision models find things in lens cap noise, LLMs make up
| explanations when they have no data to pull from.
|
| The real solution would be more along the lines of training
| models to specifically ignore these cases, or in the case of
| LLMs to just know when to say "I don't know".
| ErikBjare wrote:
| Mitigate, not completely fix.
| pshc wrote:
| More broadly I think hallucination is inevitable in pure text
| models. We need model architectures incorporating a stream of
| real-world ground truth such as a live video feed or
| embodiment.
| nowayno583 wrote:
| Does anyone understand why they are taking the difference between
| transformers instead of the sum? It seems to me that in a noise
| reducing solution we would be more interested in the sum, as
| random noise would cancel out and signal would be constructive.
|
| Of course, even if I'm right, proper training would account for
| that by inverting signs where appropriate. Still, it seems weird
| to present it as the difference, especially seeing as they
| compare this directly to noise cancelling headphones, where we
| sum both microphones inputs.
| aDyslecticCrow wrote:
| The noise isn't truly random; it's just a matrix of small
| values that shouldn't be taken into account. Subtracting them
| cancels them out.
|
| As pointed out by a different comment, it's actually the
| attention we are interested in that is cancelled out *if they
| are both equal*. This is what the paper mentions in its
| abstract;
|
| > promoting the emergence of sparse attention patterns
|
| In theory, it is quite clever, and their results seem to back
| it up.
| thegeomaster wrote:
| I suspect that plus vs minus is arbitrary in this case (as you
| said, due to being able to learn a simple negation during
| training), but they are presenting it in this way because it is
| more intuitive. Indeed, adding two sources that are noisy in
| the same way just doubles the noise, whereas subtracting
| cancels it out. It's how balanced audio cables work, for
| example.
|
| But with noise cancelling headphones, we don't sum anything
| directly---we emit an inverted sound, and to the human ear,
| this sounds like a subtraction of the two signals. (Audio from
| the audio source, and noise from the microphone.)
| nowayno583 wrote:
| Oh! It's been a good while since I've worked in noise
| cancelling. I didn't know current tech was at the point where
| we could do direct reproduction of the outside noise, instead
| of just using mic arrays! That's very cool, it used to be
| considered totally sci fi to do it fast enough in a small
| headset.
| singularity2001 wrote:
| Anyone remember siamese networks?
| aDyslecticCrow wrote:
| Very clever. I like this kind of nitty-gritty detail work, and
| the change is small enough to be adopted easily by others. Bravo!
|
| I'm a little concerned about the last sentence of the section
| introduction of "2 Differential Transformer". It mentions using
| improvements from previous papers, but in the grammatical
| context, it's unclear if this improvement is added to both the
| normal transformer and their diff transformer. This would
| otherwise sully the comparisons. It's the "main difference"
| wording in the previous sentence that raised a flag for me.
|
| Of course, a good-faith researcher would know this and may not
| feel the need to clarify. But you can never be too careful about
| some published research in this field.
| Chirono wrote:
| The two other changes they mention have been widely adopted,
| and are included in at least some of the models they benchmark
| against. It seems they list them for completeness as changes to
| the original transformer architecture.
| aDyslecticCrow wrote:
| Nicely spotted! Then, I really look forward to seeing this
| method tested by others! Epic stuff.
| vessenes wrote:
| Yes. This looks really, really good to me. Across-the-board
| improvements in training time, and perplexity improvements both
| per token trained and per model size. I'm reminded of MoE
| architectures, in that world we're choosing an optimal small
| model to process part or all of the inference job; I wonder if
| MoE got some of the same benefits from forcing the Transformer
| to distinguish between alternate possibilities.
|
| In any event, I'd imagine that this will get widely adopted if
| the numbers hold up; like I said, this seems to be basically no
| downside, and should be easy to replicate.
| x49asvk wrote:
| This concept is really interesting to me, I am very very new to
| transformers but would love to learn more about normal
| transformers and differential too. Can anyone suggest any
| resources?
| lucidrains wrote:
| does this not mean we should explore usage of talking heads
| (Shazeer et al) a bit more? https://arxiv.org/abs/2003.02436
| WithinReason wrote:
| _We empirically find that the setting lambda_init = 0.8 - 0.6 *
| exp(-0.3 * (l - 1)) works well in practice_
|
| I wonder about the story behind that formula...
| Kubuxu wrote:
| Hmm, 0.8 works well, but let's try setting lower layers to
| lower initial value. Let's say 0.2. Ok, I need a formula that
| will go between 0.2 and 0.8, slowly approaching 0.8. Starts
| fiddling with numbers for 20min, I guess this can work.
| kridsdale3 wrote:
| A whole lot of things are tuned optimally by rotating an analog
| dial until things look / sound right.
| stellalo wrote:
| Looks like this makes (at least initially in training) the
| "negative" attention term smaller in the early layers (smaller
| l) compared to later layers (larger l). Which I guess makes
| sense: you probably want to attend a little bit to everything
| before concluding that it's really a few spots you should look
| at.
|
| (Although it seems the authors do not discuss this choice
| anywhere in the paper?)
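| Plugging a few (1-indexed) layer indices into the quoted
| initialization makes the schedule concrete: it starts near 0.2 at
| the first layer and approaches 0.8 in deeper layers.
|
|     import math
|
|     for l in [1, 2, 4, 8, 16, 28]:
|         lam_init = 0.8 - 0.6 * math.exp(-0.3 * (l - 1))
|         print(l, round(lam_init, 3))
|     # -> 0.2, 0.356, 0.556, 0.727, 0.793, 0.8 (approximately)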
| WithinReason wrote:
| Hmmm, this could be expressed as 2 consecutive attentions in a
| residual branch:
|
| Simplified differential attention looks like:
|
|     (softmax(Q1 K1) - l * softmax(Q2 K2)) V
|
| You can factor this into:
|
|     x = softmax(Q1 K1) V
|     x += -l * softmax(Q2 K2) V
|
| which is like 2 subsequent regular attentions, added together and
| sharing V.
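| A quick numeric check of this factorization in its simplified form
| (scalar lambda, no normalization; names are illustrative):
|
|     import numpy as np
|
|     def softmax(x, axis=-1):
|         e = np.exp(x - x.max(axis=axis, keepdims=True))
|         return e / e.sum(axis=axis, keepdims=True)
|
|     rng = np.random.default_rng(0)
|     n, d = 5, 8
|     Q1, K1, Q2, K2 = (rng.normal(size=(n, d)) for _ in range(4))
|     V = rng.normal(size=(n, d))
|     lam = 0.8
|
|     combined = (softmax(Q1 @ K1.T) - lam * softmax(Q2 @ K2.T)) @ V
|
|     x = softmax(Q1 @ K1.T) @ V           # first "regular" attention
|     x += -lam * softmax(Q2 @ K2.T) @ V   # second one added in, sharing V
|
|     print(np.allclose(combined, x))      # True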
| kelseyfrog wrote:
| You could also extrapolate this into more than two terms by
| squinting your eyes and saying that l in {1, -1} is close
| enough to l_i in R^d with ||l_i|| = 1. No idea if it would result
| in better performance, but that's research babyyyy!
| miven wrote:
| Is there an intuitive reason why this ends up working this well
| compared to, say, applying some kind of thresholding to attention
| activations that are below average for a given head to filter
| that same attention noise out?
| islewis wrote:
| > Differential attention takes the difference between two softmax
| attention functions to eliminate attention noise
|
| If I understand correctly, this architecture trades twice as much
| attention memory in exchange for either a higher quality model,
| or fewer parameters at a similar quality.
|
| > According to the fitted curves, 6.8B-size DIFF Transformer
| achieves a validation loss comparable to 11B-size Transformer,
| requiring only 62.2% of parameters
|
| This raises a few questions for me:
|
| - Would having only 60% of the parameters negate the double space
| for attention, leaving a similar memory profile as a traditional
| transformer?
|
| - Does that tradeoff change noticeably between training and
| inference?
| entropicdrifter wrote:
| I think it _would_ negate the RAM savings, but it would also
| reduce the amount of storage needed at rest and possibly reduce
| initial start up times depending on storage speed and model
| size. So, possibly good for low-end models on consumer devices?
| _hl_ wrote:
| My understanding was that the extra parameters required for the
| second attention mechanism are _included_ in those 6.8B
| parameters (i.e. those are the total parameters of the model,
| not some made-up metric of would-be parameter count in a
| standard transformer). This makes the result doubly impressive!
|
| Here's the bit from the paper:
|
| > We set the number of heads h = d_model / (2d), where d is equal to
| the head dimension of Transformer. So we can align the
| parameter counts and computational complexity.
|
| In other words, they make up for it by having only half as many
| attention heads per layer.
| chessgecko wrote:
| I think they mitigated the extra memory/compute from this by
| using half the number of overall heads and doubling V and O.
| Without actually checking the math I think it should be
| equivalent in flops, not counting the extra (cheap) multiply by
| const and subtract.
| Kubuxu wrote:
| It would double the size of the KV cache, which can be
| significant (multi-GB) at larger context sizes.
| Imnimo wrote:
| I feel like I'm missing a key insight here. I understand the
| problem that regular softmax attention struggles to approach
| assigning zero attention to irrelevant stuff. And I get that
| having this subtraction formula makes it possible to assign
| exactly (or near) zero attention weight without having crazy
| outlier activations. But it seems like it also makes it very easy
| to have negative attention weight (which is equivalent to having
| positive attention weight on the negation of your value vectors).
| Intuitively, it just feels like a difficult balancing act to keep
| all the stuff you don't care about so close to zero.
|
| But Figure 1 clearly shows that it works, so I don't doubt that
| it is in fact possible. I'm just struggling to build a picture of
| how exactly the network accomplishes this.
| watsonmusic wrote:
| negative values can enhance the expressibility
| Jerrrrrrry wrote:
| doubt is the seed of reason
| Grosvenor wrote:
| Regular softmax (and attention) has an error in it.
|
| softmax should be exp(x_i) / (1 + sum_j exp(x_j))
|
| Notice the 1 added to the denominator.
|
| The difference is that at the negative limit, softmax can be 0
| instead of some epsilon. The same could be done by adding an
| extra zero value in x.
|
| Downside is, you have to retrain your model from scratch to fix
| this.
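| A minimal sketch of the "+1 in the denominator" variant described
| above (as the comment presents it; not part of the differential
| transformer paper): when every score is very negative, the weights
| can all shrink toward zero instead of being forced to sum to 1.
|
|     import numpy as np
|
|     def softmax(x):
|         e = np.exp(x - x.max())
|         return e / e.sum()
|
|     def softmax_plus_one(x):
|         # exp(x_i) / (1 + sum_j exp(x_j)), computed stably; equivalent
|         # to appending a constant 0 logit ("attend to nothing").
|         m = max(float(x.max()), 0.0)
|         e = np.exp(x - m)
|         return e / (np.exp(-m) + e.sum())
|
|     scores = np.array([-8.0, -9.0, -10.0])  # nothing worth attending to
|     print(softmax(scores))           # still sums to 1: [0.665, 0.245, 0.090]
|     print(softmax_plus_one(scores))  # all weights below ~4e-4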
| impossiblefork wrote:
| I've tried that in a small transformer that I trained from
| scratch and it didn't really make any difference. I also made
| a version where I made this trainable somehow, probably by
| replacing the 1 with a constant associated with the layer,
| and that didn't make any difference either.
|
| I didn't follow Miller's proposal quite as he wrote it though
| and I put the mechanism in all the layers rather than
| avoiding it at the end.
|
| My test doesn't absolutely rule out usefulness -- there are
| always different ways of applying something, but I saw no
| indication of it.
| Grosvenor wrote:
| I guess the next step is to see if you're getting those
| mega activations as he describes.
|
| A/B test the two models and compare?
|
| Would be interesting to see if these activations only show
| up on larger models, or they're some relation to model
| size.
| sigmoid10 wrote:
| >I'm just struggling to build a picture of how exactly the
| network accomplishes this.
|
| I mean, intuitively it would be trivial for the model to just
| optimise lambda to zero during training. Then you essentially
| have built a vanilla transformer with an overcomplicated
| parameter pruning mechanism. Pruning is already pretty well
| established in the literature as something that works
| surprisingly well for reducing parameter counts up to (hold on
| to your papers)... about 40%. In practice the model probably
| doesn't work exactly like that, but I wouldn't be surprised if
| it just approximates the normal transformer in the end anyways.
| machinelearning wrote:
| This is a good problem to solve but the approach is wrong imo.
|
| It has to be done in a hierarchical way to know what you attended
| to + full context.
|
| If the differential vector is being computed with the same input
| as the attention vector, how do you know how to modify the
| attention vector correctly?
| quantadev wrote:
| Doesn't everything just get tweaked in whatever direction the
| back-propagation derivative says and proportionally to that
| "slope"? In other words, simply by having back-propagation
| system in effect there's never any question about which way to
| adjust the weights, right?
| slashdave wrote:
| I don't get it. Arbitrary linear combinations are already
| accommodated via feed forward. What am I missing?
| michalsustr wrote:
| My hunch is that this effectively creates a differentiable
| minimax "search" "tree" that can be backpropagated through. Not
| a tree -- a dag really -- and not search, but learning. :)
| chessgecko wrote:
| I wonder how much of the value here is from canceling out the
| positional noise rope produces. I would love to see a table
| comparing an alibi version of this to an alibi baseline in
| addition to the rope models here.
|
| Crazy gains though congrats to the researchers
___________________________________________________________________
(page generated 2024-10-08 23:00 UTC)