[HN Gopher] Differential Transformer
       ___________________________________________________________________
        
       Differential Transformer
        
       Author : weirdcat
       Score  : 401 points
       Date   : 2024-10-08 11:54 UTC (11 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | pikseladam wrote:
       | Did this mean they solved the hallucination problem of
       | transformers?
       | 
        | edit: not fully, but it gives promising results. Quite an
        | improvement actually.
        
         | lafreb wrote:
         | The paper says that they've improved hallucination mitigation,
         | but not really "solved" the issue.
        
           | Rhapso wrote:
           | "Hallucination" isn't really a problem that can be "fixed".
            | It's just model error.
           | 
           | The root problem is simply that the model doesn't capture
           | reality, just an approximation. What we are incorrectly
           | calling "hallucination" is just the best the model has to
           | offer.
        
             | dilap wrote:
              | It can be fixed in theory if the model knows what it knows,
              | so it can avoid saying things it's uncertain about (this is
              | what (some) humans do to reduce the frequency with which
              | they say untrue things).
              | 
              | There's some promising research using this idea, though I
              | don't have it at hand.
        
               | AnimalMuppet wrote:
               | I'm pretty sure there's something I don't understand,
               | but:
               | 
               | Doesn't an LLM pick the "most probable next symbol" (or,
               | depending on temperature, _one_ of the most probable next
                | symbols)? To do that, doesn't it have to have some idea
               | of what the probability is? Couldn't it then, if the
               | probability falls below some threshold, say "I don't
               | know" instead of giving what it knows is a low-
               | probability answer?
        
               | viraptor wrote:
               | > Doesn't an LLM pick the "most probable next symbol"
               | 
               | Yes, but that very rarely matters. (Almost never when
               | it's brought up in discussions)
               | 
               | > Couldn't it then, if the probability falls below some
               | threshold, say "I don't know" instead of giving what it
               | knows is a low-probability answer?
               | 
               | A low probability doesn't necessarily mean something's
               | incorrect. Responding to your question in French would
               | also have very low probability, even if it's correct.
               | There's also some nuance around what's classified as a
               | hallucination... Maybe something in the training data did
               | suggest that answer as correct.
               | 
               | There are ideas similar to this one though. It's just a
               | bit more complex than pure probabilities going down.
               | https://arxiv.org/abs/2405.19648
        
               | anon291 wrote:
               | You need to separate out the LLM, which only produces a
               | set of probabilities, from the system, which includes the
               | LLM and the sampling methodology. Sampling is currently
               | not very intelligent at all.
               | 
               | The next bit of confusion is that the 'probability' isn't
               | 'real'. It's not an actual probability but a weight that
               | sums up to one, which is close enough to how probability
               | works that we call it that. However, sometimes there are
               | several good answers and so all the good answers get a
               | lower probability because there are 5 of them. A fixed
               | threshold is not a good idea in this case. Instead,
               | smarter sampling methods are necessary. One possibility
               | is that if we do have seeming confusion, to put a
               | 'confusion marker' into the text and predict the next
               | output and train models to refine the answer as they go
               | along. Not sure if any work has been done here, but this
                | seems to go along with what you're interested in.
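                | 
                | For example (toy numbers I made up, just to illustrate
                | the threshold problem):
                | 
                |     # five equally good continuations split the mass,
                |     # so no single one clears a fixed 0.5 threshold,
                |     # even though the model isn't actually unsure
                |     probs = {"Paris": 0.2, "paris": 0.2,
                |              "the city of Paris": 0.2,
                |              "Paris, France": 0.2, "It's Paris": 0.2}
                |     print(max(probs.values()) < 0.5)   # True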
        
               | viraptor wrote:
               | > However, sometimes there are several good answers and
               | so all the good answers get a lower probability because
               | there are 5 of them.
               | 
               | That's the result after softmax. If you want to act on
               | the raw results, you can still do that.
        
               | skydhash wrote:
               | You would need some kind of referential facts that you
               | hold as true, then some introspection method to align
               | sentences to those. if it can't be done, the output may
               | be "I don't know". But even for programming languages
               | (simplest useful languages), it would be hard to do.
        
               | PaulHoule wrote:
               | My guess is the problem is words with high probabilities
               | that happen to be part of a wrong answer.
               | 
                | For one thing, the probability of a word occurring is just
                | the probability of the word occurring in a certain sample;
                | it's not an indicator of truth. (Truth is e.g. the most
                | problematic concept in philosophy, in that just
                | introducing it undermines it, see "9/11 truther".)
                | It's also not sufficient to always pick a "true" word;
                | rather, the truthfulness of a statement needs to be
                | evaluated based on the statement as a whole.
               | 
               | A word might have a low probability because it competes
               | with a large number of alternatives that are equally
               | likely which is not a reason to stop generation.
        
               | darkPotato wrote:
               | My understanding is that the hallucination is, out of all
               | the possibilities, the most probable one (ignoring
               | temperature). So the hallucination is the most probable
               | sequence of tokens at that point. The model may be able
               | to predict an "I don't have that information" given the
               | right context. But ensuring that in general is an open
               | question.
        
               | dTal wrote:
               | It doesn't really work like that.
               | 
               | 1) The model outputs a ranked list of all tokens; the
               | probability always sums to 1. Sometimes there is a clear
               | "#1 candidate", very often there are a number of
               | plausible candidates. This is just how language works -
               | there are multiple ways to phrase things, and you can't
               | have the model give up every time there is a choice of
               | synonyms.
               | 
               | 2) Probability of a token is not the same as probability
               | of a fact. Consider a language model that knows the
               | approximate population of Paris (2 million) but is not
               | confident about the exact figure. Feed such a model the
               | string "The exact population of Paris is" and it will
               | begin with "2" but halfway through the number it will
                | have a more or less arbitrary choice of 10 digits. "2.1 I
               | don't know" is neither a desirable answer, nor a
               | plausible one from the model's perspective.
        
               | ithkuil wrote:
                | This may work when the next token is a key concept, but
                | when it's a filler word, or part of one of many
                | sequences of words that can convey the same meaning in
                | different ways (synonyms, not only at the word level but
                | also at the sentence level), then it's harder to know
                | whether the probability is low because the word is
                | absolutely unlikely, or because its likelihood is
                | spread/shared among other truthful statements.
        
               | atrus wrote:
               | I don't think that fixes it, even in theory, since
               | there's always _some_ uncertainty.
        
               | hoosieree wrote:
               | LLMs can't hallucinate. They generate the next most
               | likely token in a sequence. Whether that sequence matches
               | any kind of objective truth is orthogonal to how models
               | work.
               | 
               | I suppose depending on your point of view, LLMs either
                | _can't_ hallucinate, or _that's all they can do_.
        
               | CooCooCaCha wrote:
               | Whenever someone takes issue with using the word
               | "hallucinate" with LLMs I get the impression they're
               | trying to convince me that hallucination is good.
               | 
               | Why do you care so much about this particular issue? And
               | why can't hallucination be something we can aim to
               | improve?
        
               | ToValueFunfetti wrote:
               | >Whether that sequence matches any kind of objective
               | truth is orthogonal to how models work.
               | 
               | Empirically, this cannot be true. If it were, it would be
               | statistically shocking how often models coincidentally
               | say true things. The training does not perfectly align
               | the model with truth, but 'orthogonal' is off by a
               | minimum of 45 degrees.
        
               | viraptor wrote:
               | It matches the training data. Whether the training data
               | matches truth (and whether it's correctly understood -
               | sarcasm included) is a completely separate thing.
               | 
               | > The training does not perfectly align the model with
               | truth, but 'orthogonal'
               | 
               | Nitpicky, but the more dimensions you have, the easier it
               | is for almost everything to be orthogonal.
               | (https://softwaredoug.com/blog/2022/12/26/surpries-at-hi-
               | dime...) That's why averaging embeddings works.
        
               | timcobb wrote:
               | Isn't this the same thing that happens when you train a
               | human on truths vs falsehoods?
        
               | ToValueFunfetti wrote:
               | I went to school to learn about the world and the
               | overwhelming majority of that learning was from
               | professors and textbooks. Whether the professors' beliefs
               | and the textbooks' contents reflected the true properties
               | of the world was a completely separate thing, entirely
               | outside of my control. But I did come away with a better
               | understanding of the world and few would say that
               | education is orthogonal to that goal.
               | 
               | If you add two vectors that don't have a truth component
               | (ie. are orthogonal to the truth), the resulting vector
               | should be no closer to the truth. If you start with
               | random weights and perform some operation on them such
               | that the new weights have a higher likelihood of
               | producing true statements, the operation must not have
               | been orthogonal to the truth. Am I wrong there?
        
               | viraptor wrote:
               | > But I did come away with a better understanding of the
               | world and few would say that education is orthogonal to
               | that goal.
               | 
               | That's due to the reward function / environment. But even
               | outside extremes like North Korea, lots of education
               | environments value conformity over independent analysis.
        
               | ToValueFunfetti wrote:
               | Certainly an AI trained on North Korean data would emerge
               | with some very suspect beliefs regarding Kim Jong-Un. My
               | point is just that aligning something with training data
               | is aligning it with truth, to the degree that the
               | training data is true and regardless of why it is true.
               | educate(me, truth) can hardly be called orthogonal to the
               | truth, even if the 'educate' and 'me' terms do nothing to
               | prevent educate(me, falsehood).
        
               | visarga wrote:
               | This reminds me it's easy to train similarity models,
               | hard to train identity/equivalence prediction. Two
               | strings can be similar in many ways, like "Address Line
               | 1" and "Address Line 2" or "Position_X" and "Position_Y",
               | yet distinct in meaning. That one character makes all the
                | difference. On the other hand, "Vendor Name" is equivalent
                | to "Seller Company" even though they are pretty
               | different lexically.
               | 
               | The dot product, which is at the core of attention, is
                | good for similarity, not identity. I think this is why
                | models hallucinate - how can they tell the difference
               | between "I have trained on this fact" and "Looks like
               | something I trained on".
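                | 
                | A toy illustration (the embedding vectors are made up,
                | just to show the failure mode):
                | 
                |     import numpy as np
                | 
                |     # nearly identical strings get nearly parallel
                |     # vectors, so the dot product scores them as
                |     # near-identical despite the different meanings
                |     addr1 = np.array([0.90, 0.10, 0.40])
                |     addr2 = np.array([0.90, 0.10, 0.38])
                |     cos = addr1 @ addr2 / (np.linalg.norm(addr1)
                |                            * np.linalg.norm(addr2))
                |     print(cos)   # ~0.9998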
        
             | tucnak wrote:
             | I'm led to believe this is mostly because "known unknowns"
             | are not well-represented in the training datasets... I
             | think, instead of bothering with refusals and enforcing a
             | particular "voice" with excessive RL, they ought to focus
             | more on identifying "gaps" in the datasets and feeding them
             | back, perhaps they're already doing this with synthetic
             | data / distillation.
        
             | spencerchubb wrote:
             | it's not "just" model error
             | 
             | during pre-training, there is never an incentive for the
             | model to say "I don't know" because it would be penalized.
             | the model is incentivized to make an educated guess
             | 
             | large transformer models are _really_ good at approximating
             | their dataset. there is no data on the internet about what
             | LLMs know. and even if there were such data, it would
             | probably become obsolete soon
             | 
             | that being said, maybe a big shift in the architecture
             | could solve this. I hope!
        
               | happypumpkin wrote:
               | > it would probably become obsolete soon
               | 
               | Suppose there are many times more posts about something
               | one generation of LLMs can't do (arithmetic, tic-tac-toe,
               | whatever), than posts about how the next generation of
               | models _can_ do that task successfully. I think this is
               | probably the case.
               | 
               | While I doubt it will happen, it would be somewhat funny
               | if training on that text caused a future model to claim
               | it can't do something that it "should" be able to because
               | it internalized that it was an LLM and "LLMs can't do X."
        
               | spencerchubb wrote:
               | also presumes that the LLM knows it is an LLM
        
               | adwn wrote:
               | System prompts sometimes contain the information that
               | "it" is an LLM.
               | 
               | Maybe in the future, those prompts will include
               | motivational phrases, like "You can do it!" or "Believe
               | in yourself, then you can achieve anything."
        
               | Vecr wrote:
               | They're generally fine tuned not to. I'm not sure how
               | long that will hold though.
        
               | spywaregorilla wrote:
               | > during pre-training, there is never an incentive for
               | the model to say "I don't know" because it would be
               | penalized. the model is incentivized to make an educated
               | guess
               | 
               | The guess can be "I don't know". The base LLM would
               | generally only say I don't know if it "knew" that it
               | didn't know, which is not going to be very common. The
                | tuned LLM would be the one responsible for trying to
                | map a lack of understanding to saying "I don't know".
        
               | singularity2001 wrote:
                | In another paper which popped up recently, they
                | approximated uncertainty with entropy and inserted
                | "wait!" tokens whenever entropy was high, simulating
               | chain of thought within the system.
        
         | watsonmusic wrote:
         | that would be huge!
        
         | HarHarVeryFunny wrote:
         | I don't think there's any narrow definition of what
         | "hallucination" means. It generally refers to the model giving
         | non-factual answers in contexts that are meant to be factual,
         | but not all causes of this are going to be fixable without very
         | major changes.
         | 
         | The fundamental issue is that most of the time LLMs are going
         | to be combining statistics derived from many training samples
         | when generating a single continuation, and there is just no
         | guarantee that this will result in a semantically coherent
         | response. Of course the model's depth of parsing and semantic
         | analysis usually means that each generated word is highly
         | plausible, but this isn't the same as being factually correct,
         | especially so in these cases where the model is drawing on
         | multiple sources to create a mashup response, which is the
         | normal mode of operation.
        
       | ExxKA wrote:
       | Very interesting. Currently working on timeseries with
       | Transformers. Let me know if anyone else out there is also
       | reading it from that context.
        
         | d3m0t3p wrote:
          | Really cool. I'm a CS student majoring in AI, but I'm also
          | interested in that domain. Would you have any recommendations
          | to get started?
        
           | ExxKA wrote:
           | Get a lot of data, and just dig in :) No better way to learn.
        
       | magicalhippo wrote:
       | _The visualization reveals that Transformer tends to allocate
       | only a small proportion of attention scores to the correct
       | answer, while disproportionately focusing on irrelevant context._
       | 
       |  _[...] Specifically, we partition the query and key vectors into
       | two groups and compute two separate softmax attention maps. Then
       | the result of subtracting these two maps is regarded as attention
       | scores._
       | 
       |  _[...] The approach is analogous to noise-canceling headphones
       | and differential amplifiers in electrical engineering, where the
       | difference between two signals cancels out common-mode noise._
       | 
       | Simple change, with seemingly decent improvements across the
       | board.
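        | 
        | Roughly, in code (a toy sketch of the quoted mechanism with
        | made-up names; it ignores multi-head handling, masking and the
        | normalization details in the paper):
        | 
        |     import torch
        |     import torch.nn.functional as F
        | 
        |     def diff_attention(q1, k1, q2, k2, v, lam=0.5):
        |         # q1/q2, k1/k2: the two groups of query/key
        |         # projections, shape (seq, d); v: values (seq, d_v)
        |         d = q1.shape[-1]
        |         a1 = F.softmax(q1 @ k1.transpose(-1, -2) / d**0.5, dim=-1)
        |         a2 = F.softmax(q2 @ k2.transpose(-1, -2) / d**0.5, dim=-1)
        |         # the difference of the two maps is used as attention
        |         return (a1 - lam * a2) @ v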
        
       | msoad wrote:
       | Like most things in this new world of Machine Learning, I'm
        | really confused about why this works.
        | 
        | The analogy to noise-cancelling headphones is helpful, but in that
        | case we clearly know which is signal and which is noise. Here, if
        | we knew, why would we even bother doing the noise-cancelling work?
        
         | watsonmusic wrote:
         | the model is supposed to learn this
        
         | _hl_ wrote:
         | Some of the "prior art" here is ladder networks and to some
         | handwavy extent residual nets, both of which can be interpreted
         | as training the model on reducing the error to its previous
         | predictions as opposed to predicting the final result directly.
         | I think some intuition for why it works has to do with changing
         | the gradient descent landscape to be a bit friendlier towards
         | learning in small baby steps, as you are now explicitly
         | designing the network around the idea that it will start off
         | making lots of errors in its predictions and then get better
         | over time.
        
         | HarHarVeryFunny wrote:
         | I don't understand either. It seems the general idea is that
         | they calculate attention twice, which due to random
         | initialization might be expected to give two slightly different
         | results. I'd have thought that what these two attention maps
         | would have in common would be the signal, and where they would
         | differ would be noise, so rather than subtracting them
         | (resulting in all noise?!) what you really want is to add (so
         | the common signal gets reinforced) and normalize.
        
           | Carlseymanh wrote:
            | I think there might be some commonalities with control
            | systems engineering, where you subtract the output from the
            | input in order to get a control signal that steers the plant
            | to the target values. I too fail to see how that would be
            | supposed to work in practice.
        
           | kelseyfrog wrote:
           | The values between the groups are also going to diverge
           | during training due to the structure of the DiffAttn
           | equation.
           | 
           | The analogy I can think of is when you're paying attention to
           | a variety of things and you actively avoid concentrating on
           | something because it will distract you. You don't give it
           | zero attention, you give it negative attention.
        
         | blackbear_ wrote:
         | With a single softmax you cannot predict exactly 0, but only
         | very small numbers. When you have a large number of values to
         | add up, this "poisons" the output with a lot of irrelevant
         | stuff (the noise mentioned in the paper).
         | 
         | To make things worse, low attention values will have very low
          | gradient, thus needing a lot of weight updates to undo those
          | kinds of mistakes. On the other hand, subtracting the outputs of
          | two softmaxes allows the model to predict a weight of exactly
         | zero for some of the values, while keeping a reasonable
         | gradient flowing through.
         | 
         | So the model already knows what is noise, but a single softmax
         | makes it harder to exclude it.
         | 
         | Moreover, with a single softmax the output of all heads is
         | forced to stay in the convex hull of the value vectors, whereas
         | with this variant each head can choose its own lambda, thus
         | shifting the "range" of the outputs outside the convex hull
         | pre-determined by the values. This makes the model as a whole
         | more expressive.
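          | 
          | A tiny numeric illustration of the first point (toy scores,
          | not from the paper):
          | 
          |     import torch
          |     import torch.nn.functional as F
          | 
          |     a = F.softmax(torch.tensor([4.0, 1.0, 0.0]), dim=0)
          |     b = F.softmax(torch.tensor([0.0, 1.0, 4.0]), dim=0)
          |     print(a)            # every entry stays above zero
          |     print(a - 0.5 * b)  # entries can hit zero or go negative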
        
           | freeqaz wrote:
           | I'm able to follow most of what you're saying. It's unclear
           | to me what "convex hull" means though.
           | 
           | Also, where is each softmax happening here? For each
           | attention head?
        
             | blackbear_ wrote:
             | The convex hull of a set of points is the region "between"
             | those points. So the convex hull of three points (that do
             | not lie on the same line) is a triangle with those three
             | points as vertices. If you add a fourth point inside the
             | triangle, the convex hull remains the same, but if you add
             | it outside then the convex hull becomes the four-sided
             | region with those points as vertices.
             | 
             | In the context of standard transformer attention, each
              | output lies in the convex hull of ("somewhere between") the
             | input values. With the modification of this paper, the
             | input values can be scaled a little so that the output of
             | different heads can be in different "regions" and thus do
             | not interfere with each other (so yes to your third
             | question, the two softmaxes are performed separately for
             | each head).
        
             | Majromax wrote:
             | > It's unclear to me what "convex hull" means though.
             | 
             | The convex hull (https://en.wikipedia.org/wiki/Convex_hull)
             | of a set is the smallest convex shape that includes that
             | set. Geometrically, it's what you'd get if you "shrink
             | wrapped" the thing you're looking at: edges still protrude,
             | but any indentations get smoothed over.
             | 
             | In this context, the grandparent comment is pointing out
             | that with a traditional transformer block, the resulting
             | computed value for a token can never "stick out" past some
             | weighted average of the values of attended-to tokens, but
             | this differential attention formalism allows that result.
        
             | pizza wrote:
             | O_i = softmax(...) * V_i and softmax is between 0 and 1, so
             | O_i = alpha * V_i for some alpha between 0 and 1 so that
             | makes it convex, and it makes the O_i just a shrunken
             | version of V_i. Whereas if you have the diff of softmaxes,
             | you get O_i = (alpha - beta) * V_i, which can range from
             | -V_i to +V_i, so its output could rescale /or/ flip V_i.
             | And yes this is happening in every head in parallel, then
             | they get summed.
        
               | kridsdale3 wrote:
                | By simply inputting your comment into 4o, with no other
               | context about the paper, I was able to get a pretty good
               | analysis of the dual-head concept's implications.
               | 
               | https://chatgpt.com/share/67058973-ba94-8008-bed7-c7f9d08
               | dc5...
        
               | spwa4 wrote:
               | Uh, this is extracting a LOT from very little data. I
                | don't understand where it's coming from, but its
                | explanation just keeps going into more and more detail
               | ... that doesn't seem to follow from the data it's got.
               | 
               | I just don't see how you could answer these questions
                | without trying it out. And ChatGPT DEFINITELY isn't doing
               | that.
               | 
               | Plus the obvious question I'd pose is not in there.
               | What's the difference in performance between this trick
               | and just "softmax() - 0.5 * 2" ? That seems very
               | relevant.
        
             | robertsdionne wrote:
             | It means one of these things:
             | https://en.wikipedia.org/wiki/Simplex#Standard_simplex
        
           | dartos wrote:
           | > predict a weight of exactly zero for some of the values
           | 
           | Wouldn't this be pretty unlikely, though?
        
             | schopra909 wrote:
             | Quite the opposite -- if you have a long sequence only a
             | smattering of the words will influence the meaning of the
             | current word. Everything else is "noise".
             | 
             | Attention is really good at finding this smattering of
             | words (ie assign most weight there). But it struggles to
             | put exactly 0 on the other words.
        
               | absoflutely wrote:
               | why say lot word when few word do
        
               | dartos wrote:
               | Few word no do tho
        
               | 1024core wrote:
               | Phew!
        
               | kridsdale3 wrote:
               | U+1FAE5
        
               | dartos wrote:
               | I mean wouldn't it be unlikely that
               | 
               | SoftmaxA[n] - SoftmaxB[n] is exactly 0?
               | 
               | Even if 2 attention layers learn two different things, I
               | would imagine the corresponding weights in each layer
               | wouldn't exactly cancel each other out.
        
           | nyrikki wrote:
           | While I don't discount the value of this, can you expand on
           | the meaning of your claim that it makes the model 'more
            | expressive'?
           | 
           | Everything I am seeing in this paper is related to reduced
           | size and noise, which implies a reduction in expressiveness.
           | 
            | The improvements in needle-in-a-haystack benchmarks, multi-
            | hop questions over in-corpus data, and multi-shot in-
            | context learning point to this.
           | 
           | This is a wonderful thing if robustness is more important
           | than generality, but it doesn't address trimming away
            | activations that may be spurious in the general use case but
            | may improve specificity in an individual domain.
            | 
            | Context would dramatically impact which tradeoffs are more
            | desirable, and noise is probably never desirable. But the
            | ability of this paper to enable smaller bit widths for
            | inference points to a reduction in expressiveness.
           | 
           | Perhaps I am too focused on generalization?
        
             | blackbear_ wrote:
             | What I meant is that by changing lambda each attention head
             | is able to put its outputs in a subspace that is different
             | than that of the other heads. This means that the outputs
             | of different heads do not mingle with each other, and it's
             | easier for the following layer to pick them apart. So I was
              | thinking of increased expressiveness because the attention
             | output can in principle cover a larger volume.
             | 
             | Maybe expressiveness is not the right term, or not the main
             | consequence. I could imagine that having different
             | subspaces like that also introduces a degree of robustness
             | to out-of-distribution inputs, as this would make it harder
             | for the outputs of one attention head to shift towards the
             | in-distribution outputs of another head, and thus for the
             | following layer to confuse them.
        
           | espadrine wrote:
           | It is a neat approach, but one that comes with a tradeoff,
           | IIUC: doubling the key heads.
           | 
           | I wonder if a different approach without that issue exists.
           | For instance, using max(0, exp(x)-1) instead of exp(x) in the
           | softmax attention formula. That way when the query is
           | orthogonal to the key (or worse), it does not contribute.
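            | 
            | Something like this, as a rough sketch (names are mine):
            | 
            |     import torch
            | 
            |     def relu_exp_attention_weights(scores, eps=1e-9):
            |         # scores: raw q.k^T / sqrt(d) logits
            |         # max(0, exp(x) - 1) instead of exp(x): a query
            |         # that is orthogonal to a key (or worse)
            |         # contributes exactly nothing
            |         w = torch.clamp(torch.exp(scores) - 1.0, min=0.0)
            |         return w / w.sum(dim=-1, keepdim=True).clamp(min=eps)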
        
             | smallnamespace wrote:
             | > using max(0, exp(x)-1) instead of exp(x)
             | 
             | Won't this cause the gradient to vanish on the left half,
             | causing problems with training?
        
           | x1000 wrote:
           | Could you help explain how we would achieve an attention
           | score of exactly 0, in practice? Here's my take:
           | 
           | If we're subtracting one attention matrix from another, we'd
           | end up with attention scores between -1 and 1, with a
           | probability of effectively 0 for any single entry to exactly
           | equal 0.
           | 
           | What's more, the learnable parameter \lambda allows for
           | negative values. This would allow the model to learn to
           | actually add the attention scores, making a score of exactly
           | 0 impossible.
        
             | jszymborski wrote:
             | Your comment brings up two interesting variants that could
             | be interesting if your goal is to increase the sparsity of
             | the attention:
             | 
              | - Rectify the difference of the softmaxes: max(0, s(A1) -
              | lambda * s(A2)).
              | 
              | - Apply the Heaviside function to the second softmax:
              | softmax(A1) - lambda * H(s(A1) - lambda * s(A2)).
             | 
             | The second one being a bit more drastic and maybe harder to
             | train.
        
         | phire wrote:
         | Noise cancelling headphones are probably the wrong analogy
         | here.
         | 
         | The better example is the differential signalling used in
         | professional audio and many digital signaling protocols like
         | Ethernet, HDMI and USB.
         | 
          | Instead of using one wire referenced to ground, they send the
          | signal as the difference between two wires. Both wires end up
          | carrying the same signal with inverted polarity. Because both
          | wires run next to each other, any external noise will be
          | applied to both equally.
         | 
         | The voltage will change, but the difference in voltage between
         | both wires is untouched. And when you subtract the two voltages
         | at the receiver end, any noise simply gets subtracted out.
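          | 
          | In toy numbers:
          | 
          |     signal = 1.0                  # what we want to send
          |     noise = 0.37                  # same noise on both wires
          |     wire_p = +signal + noise      # non-inverted leg
          |     wire_n = -signal + noise      # inverted leg
          |     print((wire_p - wire_n) / 2)  # 1.0, the noise cancels out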
        
           | seamossfet wrote:
           | I think when they bring up differential amplifiers they're
            | referring more to the DSP technique behind how headphone noise
            | cancelling works, but the actual electrical properties of how
            | a differential amplifier does that muddy the message a bit.
           | 
           | It sort of feels closer to heterodyning and "demodulating"
           | the signal encoded in the softmax. Those tiny little errors
           | we're trying to denoise with this technique are almost closer
           | to carrier waves (when encoded to softmax) than noise imo.
           | This wouldn't get rid of noise in the training data or noise
           | in the dimensionality of the key / value space. It's really
           | only removing noise introduced by the process itself.
        
         | seamossfet wrote:
         | It sounds like they're just splitting the query / key space
         | down the middle. We don't know which dimensions are encoded in
         | each matrix, but they're assuming the "noise" introduced in one
         | query / key space is equivalent to noise introduced in the
         | other space.
         | 
         | If that is the case, then the "signal" in this case would be
         | the softmax that encodes the dimensions captured by the query /
         | key space. Since the noise ideally is the same in both softmax
         | encodings, subtracting them should "cancel out" the noise.
        
         | WithinReason wrote:
         | Don't look for an analogy, this just adds a new mathematical
         | capability. It enables "negative attention", the network can
         | say "I want to subtract the contribution of this token" in the
         | attention calculation. Previously it could only reduce how much
         | it adds.
         | 
         | The simple way of doing this would be to just remove the
         | softmax or use a sigmoid instead, but in practice a softmax
         | works better it seems.
        
         | mistercheph wrote:
         | I think common mode filtering in balanced audio cables is a
         | much better analogy than noise canceling headphones (and where
         | this paper gets its name from I assume), you don't know what
         | the noise is ahead of time, but if you take two samples with
         | one positive and one negative, noise displaces both absolutely,
         | which you can take advantage of to denoise the signal (find the
         | differential mode).
         | 
         | For example, if you are trying to send a +1V signal on one
         | wire, and a -1V signal on the other and a +0.5V noise exists,
         | one wire will have +1.5V and the other will have -0.5V,
         | 
         | Take the difference and divide by 2:
         | 
         | (+1.5V - -0.5V) / 2 = +1V or, if your setup is different (-0.5V
         | - +1.5V) / 2 = -1V
        
         | chessgecko wrote:
          | My hypothesis for why this works is that it mitigates the
          | downsides of RoPE.
         | 
         | to eli5:
         | 
          | RoPE is the modern strategy used to give information to the
         | model about how far a query and a key are apart when doing
         | attention. It's the best strategy we have now, but has a major
         | downside, where it makes some connections between tokens that
         | are far apart much stronger than you would like them to be.
         | Xpos (https://arxiv.org/pdf/2212.10554) is another paper by
          | Microsoft tackling issues with RoPE, and you can see figure 1 on
         | page 4 to get a visual interpretation of the sinusoidal
         | attention strength (you would like it to be smooth).
         | 
          | I think a big reason differential transformers work so
          | well, especially on long-sequence stuff, is that when both q1
          | and q2 don't match a token, the RoPE relative strength will
          | still have the same value and the noise will cancel out,
          | leaving intended matches, but at the cost of somewhat dampening
          | the original value RoPE brought.
         | 
         | Just a hypothesis though. It would be easy to test by running
         | this experiment against a baseline where both use alibi
         | attention (https://arxiv.org/pdf/2108.12409) which has a
         | different set of tradeoffs this wouldn't mitigate, but still a
         | really interesting result.
        
       | watsonmusic wrote:
       | The modification is simple and beautiful. And the improvements
       | are quite significant.
        
       | campers wrote:
       | The tl;dr on high level performance improvements
       | 
       | "The scaling curves indicate that Diff Transformer requires only
       | about 65% of model size or training tokens needed by Transformer
       | to achieve comparable language modeling performance."
       | 
       | "Diff Transformer retains high performance even at reduced bit-
       | widths, ranging from 16 bits to 6 bits. In comparison,
       | Transformer's accuracy significantly drops with 6-bit
       | quantization. The 4-bit Diff Transformer achieves comparable
       | accuracy as the 6-bit Transformer, and outperforms the 4-bit
       | Transformer by about 25% in accuracy."
        
       | digdugdirk wrote:
       | Is there any way to replicate this with existing models, or are
       | we going to need to wait for models to be trained in this style?
       | 
       | I'm imagining a smaller model examining the output tokens of a
       | larger model and metaphorically slapping it on the wrist with a
       | ruler if the output tokens start drifting off topic. Not quite
       | the same, but an entertaining thought nonetheless.
        
         | causal wrote:
         | It's a different attention mechanism with a different map
         | setup, so fundamentally a different type of model
        
           | om8 wrote:
           | Looks like it is a drop in replacement for attention, but
           | models will need to be retrained for this one, yes.
        
             | aDyslecticCrow wrote:
             | It may not need to be entirely retrained. The value spans
             | and input are the same, and no extra weights are needed.
             | You may be able to tune an existing model with this
             | attention mechanism and get some of the benefits.
             | 
             | But overall... it's mainly a training change, so training
             | is needed to make a difference.
        
         | bionhoward wrote:
          | Yes, I believe this is possible: you could clone the weights of
          | one or more existing models and fine-tune them in groups with
          | different random seeds for noise/dropout to produce reasonable
          | outputs under a differential-transformer decoding scheme
          | whereby tokens with disagreement receive more attention
          | (surprisal analysis).
        
       | patcon wrote:
       | I wonder what is lost here. Surely there's a trade-off...
       | 
       | I'm wondering if there's any effect of "creativity", or ability
       | to interpolate between concepts. Hallucination and creativity
       | feel very related to me. I understand hallucinating as simply
       | being misaligned with the space humans feel appropriate to
       | interpolate between
        
         | watsonmusic wrote:
          | Not all hallucinations are creativity. Imagine that for a RAG
          | application, the model is supposed to follow the given
          | documents.
        
         | magicalhippo wrote:
         | > Surely there's a trade-off...
         | 
         | For one, speed and memory. They have twice as many Q and K
         | weights in the attention blocks, leading to a ~10% reduction in
         | throughput on their H100 (table 7 in appendix A).
        
           | lennxa wrote:
           | they mention similar performance to vanilla transformer with
           | significantly reduced param count though
        
           | karmasimida wrote:
          | I mean, it doesn't necessarily need 2x QK to match the
          | performance, in terms of accuracy, of a regular transformer,
          | right?
        
         | dartos wrote:
         | > Hallucination and creativity feel very related to me.
         | 
         | Why? I see them as just sampling errors.
         | 
         | Sure a mistake can spark inspiration sometimes, but creativity
         | is much more than mistakes.
         | 
         | > I understand hallucinating as simply being misaligned with
         | the space humans feel appropriate to interpolate between
         | 
         | These language models are next-token predictors. The way the
         | next token is predicted is by sampling a probability space
         | outputted by the model.
         | 
         | That sampling process can be non deterministic.
         | 
         | Hallucinations are when that sampling results in tokens that
         | come together to create a false or otherwise unintended
         | statement.
         | 
         | You can just as well think of everything a model outputs as a
          | hallucination, but we train the model so that the space of what
          | we want it to hallucinate is more likely. Otherwise it just
         | outputs meaningless noise.
         | 
         | "Hallucinate" is really an awful word for what it's trying to
         | describe.
        
           | nextaccountic wrote:
           | > Sure a mistake can spark inspiration sometimes, but
           | creativity is much more than mistakes.
           | 
            | It looks like creativity has many steps, but being able to
            | come up with novel, unprompted stuff is important, as long as
            | you are able to discard the bullshit early.
           | 
           | "Hallucination" is only a problem if later layers (or
           | additional networks) can't detect and remove it
        
             | dartos wrote:
             | > "Hallucination" is only a problem if later layers (or
             | additional networks) can't detect and remove it
             | 
             | Yeah I mean sure. Anything is only a problem if it goes
              | undetected. The issue is that if you rely on a statistical
             | model, you'll always have hallucinations, so you can't
             | filter statistical output with another statistical model if
             | you need real guarantees.
             | 
             | Many products don't need those guarantees though.
        
           | thomastjeffery wrote:
           | Hallucinate is an awful word _because of_ what it is trying
           | to describe.
           | 
           | Hallucination describes the same feature you just called "non
           | deterministic sampling", but exclusively the cases that we
           | don't like. It would be really convenient if we could
            | actually draw that line, but _we can't_. If non-determinism
           | is a core feature, then that feature will be present in every
           | case; including the ones we find desirable, and the ones we
           | find undesirable.
        
           | skybrian wrote:
            | LLMs are too unpredictable for many practical uses, so I'd
           | guess better predictability is better. Hopefully the change
           | the paper proposes will help!
           | 
           | But here's a case for the other side: sure, most mistakes are
           | just errors, but evolution happens via "mistakes." Also,
            | LLMs often deliberately add randomness at inference
           | time.
        
             | dartos wrote:
             | > evolution happens via "mistakes."
             | 
             | That's a nice slogan, but it's a gross oversimplification.
             | 
             | In the natural world, you can say that mistakes in DNA
             | replication leads to evolution, but that's discounting the
             | entire process of natural selection.
             | 
                | Same with creativity. Look at Picasso. He was a
                | technically brilliant realist painter at 15, but his work
             | later in life evolved to be more abstract and weird. I
             | don't think that was the result of mistakes, but rather
             | intentionally breaking patterns he learned in his youth.
        
               | skybrian wrote:
               | To oversimplify, evolution is a generate-and-test process
               | and the evaluation step is critical. Something needs to
               | decide which variations are better. Often, with
               | generative AI, it's people who judge the results. Still,
               | generating interesting examples (the brainstorming phase)
               | plays _some_ role in that.
               | 
               | I don't know a whole lot about Picasso's art, but I
               | imagine the way he evaluated his own work played an
               | important role, in being able to see that sometimes
               | creative accidents are interesting.
        
           | slashdave wrote:
           | > You can just as well think of everything a model outputs as
           | a hallucination
           | 
           | Exactly. Don't forget that an important factor in the success
           | of GPT3 was RLHF, which is essentially training the model to
           | produce "hallucinations" that are more acceptable on average
           | to human trainers.
        
       | pxdm wrote:
       | What's the comparison with conventional attention using a more
       | aggressive (lower temperature) softmax? I can imagine that for
       | the multi-needle retrieval test this may also give a performance
        | boost, although at some cost to other, more creative tasks.
        
         | mota7 wrote:
         | I had the same thought: Just eye-balling the graphs, the result
         | of the subtraction looks very close to just reducing the
         | temperature.
         | 
         | They're effectively doing softmax with a fixed temperature, but
         | it's unclear that this work is going to do better than just
         | learning a per-head temperature parameter.
         | 
         | c.f. https://arxiv.org/abs/2010.04245 which shows an
         | improvement by learning per-head temperature.
         | 
         | The other way to think about this is that it looks like a
         | hacked-up kinda-sorta gated attention. If that's the case, then
          | doing softmax(alpha * q_1 k_1^T - log_sigmoid(beta * q_2
          | k_2^T)) might be better? (where alpha, beta are learned
         | temperatures).
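          | 
          | i.e. something roughly like this (a sketch of my own, with
          | made-up names):
          | 
          |     import torch
          |     import torch.nn.functional as F
          | 
          |     def gated_attn_weights(q1, k1, q2, k2, alpha, beta):
          |         # subtract a log-sigmoid "gate" inside a single
          |         # softmax, instead of subtracting two softmax maps
          |         s1 = alpha * (q1 @ k1.transpose(-1, -2))
          |         s2 = F.logsigmoid(beta * (q2 @ k2.transpose(-1, -2)))
          |         return F.softmax(s1 - s2, dim=-1)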
        
       | nmacias wrote:
       | AdderaLLM was _right there_
        
       | vsroy wrote:
       | Is the thing that's going on here that softmax can't push a value
       | to 0, but by subtracting 2 softmax maps we can output 0s?
        
         | pkoird wrote:
         | Or negatives
        
         | vsroy wrote:
         | Follow-up question is: Isn't it extremely unlikely to output 0?
        
       | pizza wrote:
       | Was just going to mention that it seems that it should be
       | possible to make a Flash Attention version of this algorithm and
       | was pleasantly surprised to see they already included an
       | implementation of one :)
        
       | iandanforth wrote:
       | The key bit I didn't understand at first was what happens if the
        | two groups of attention learn the same thing. Because their
        | attention masks are subtracted from one another, if they both
        | output similar values the attention across the board will drop to
        | zero, and this will lead to high loss. So the only way to reduce
       | loss is if they learn to attend to different things. One of the
       | simplest strategies they could learn (and this paper claims that
       | they do) is for one group to focus on relevant context and the
       | other to focus on irrelevant context. Thus one group learns the
       | noise and the other the signal (it's not this cut and dry but is
       | a useful simplification for understanding IMO).
        
         | dartos wrote:
         | There's probably a small chance that they could both learn the
         | same thing, but it's probably not likely enough to be a major
         | issue.
        
         | magicalhippo wrote:
         | An interesting aspect is that they don't do a plain
         | subtraction, but rather subtract a portion of the second
         | softmax.
         | 
          | This makes sense: if the two copies were identical, then the
          | softmax outputs would be identical and the difference would be
          | zero everywhere. However, by subtracting a scaled
         | copy, the normalization of the difference seems to really boost
         | the signal value(s) over the "noise", making the signal stand
         | out compared to pre-normalization.
        
           | testdfkjahdfh wrote:
            | If two attention maps A and B are identical, wouldn't (A -
            | lambda * B) just be (1 - lambda) * A? How does that "boost the
            | signal value(s) over the 'noise'"?
        
         | nextaccountic wrote:
         | Maybe the loss function could penalize them learning the same
         | thing?
        
         | patcon wrote:
         | > what happens if the two groups of attention learn the same
         | thing
         | 
         | I wonder if there's a metaphor here for our own experience and
         | utility in "surprise".
         | 
         | Like if one attention head is surprised by what another learns,
         | up-weight it. But if they both find the same, assume it's not
         | very surprising and down-weight it.
         | 
          | Admittedly, "surprise" is something that has a big section in
          | my knowledgebase[1][2][3] (both as a subjective feeling and as
          | an adaptive function of our minds, one of the most complex
          | adaptive systems we know of).
         | 
         | [1] https://plus.maths.org/content/information-surprise
         | 
         | [2] https://blakeelias.name/papers/Multi-Agent-Cooperation-
         | Intri...
         | 
         | [3] https://complexity.simplecast.com/episodes/81/transcript
        
       | dartos wrote:
       | > By being less distracted by irrelevant context, Diff
       | Transformer can mitigate hallucination in question answering and
       | text summarization
       | 
       | I'm very interested in this claim. I was under the impression
       | that hallucination is unavoidable in these kinds of models. IIRC
        | a proof of that was trending on HN a couple of weeks ago.
        
         | moffkalast wrote:
         | It's not possible to get rid of it entirely, but if you can get
         | the model to bullshit only 0.1% of the time instead of 5% of
         | the time it's a massive improvement.
         | 
         | Most of it should be happening when there's no data to draw
         | conclusions from. E.g. STT models make up words in silence,
         | vision models find things in lens cap noise, LLMs make up
         | explanations when they have no data to pull from.
         | 
         | The real solution would be more along the lines of training
         | models to specifically ignore these cases, or in the case of
         | LLMs to just know when to say "I don't know".
        
         | ErikBjare wrote:
         | Mitigate, not completely fix.
        
         | pshc wrote:
         | More broadly I think hallucination is inevitable in pure text
         | models. We need model architectures incorporating a stream of
         | real-world ground truth such as a live video feed or
         | embodiment.
        
       | nowayno583 wrote:
       | Does anyone understand why they are taking the difference between
       | transformers instead of the sum? It seems to me that in a noise
       | reducing solution we would be more interested in the sum, as
       | random noise would cancel out and signal would be constructive.
       | 
        | Of course, even if I'm right, proper training would account for
        | that by inverting signs where appropriate. Still, it seems weird
        | to present it as the difference, especially seeing as they
        | compare this directly to noise-cancelling headphones, where we
        | sum both microphones' inputs.
        
         | aDyslecticCrow wrote:
         | The noise isn't truly random; it's just a matrix of small
         | values that shouldn't be taken into account. Subtracting them
         | cancels them out.
         | 
         | As pointed out by a different comment, it's actually the
         | attention we are interested in that is cancelled out *if they
         | are both equal*. This is what the paper mentions in its
          | abstract:
         | 
         | > promoting the emergence of sparse attention patterns
         | 
         | In theory, it is quite clever, and their results seem to back
         | it up.
        
         | thegeomaster wrote:
         | I suspect that plus vs minus is arbitrary in this case (as you
         | said, due to being able to learn a simple negation during
         | training), but they are presenting it in this way because it is
         | more intuitive. Indeed, adding two sources that are noisy in
         | the same way just doubles the noise, whereas subtracting
         | cancels it out. It's how balanced audio cables work, for
         | example.
         | 
         | But with noise cancelling headphones, we don't sum anything
         | directly---we emit an inverted sound, and to the human ear,
         | this sounds like a subtraction of the two signals. (Audio from
         | the audio source, and noise from the microphone.)
        
           | nowayno583 wrote:
           | Oh! It's been a good while since I've worked in noise
           | cancelling. I didn't know current tech was at the point where
           | we could do direct reproduction of the outside noise, instead
           | of just using mic arrays! That's very cool, it used to be
           | considered totally sci fi to do it fast enough in a small
           | headset.
        
       | singularity2001 wrote:
       | Anyone remember siamese networks?
        
       | aDyslecticCrow wrote:
       | Very clever. I like this kind of nitty-gritty detail work, and
       | the change is small enough to be adopted easily by others. Bravo!
       | 
       | I'm a little concerned about the last sentence of the
       | introduction to section "2 Differential Transformer". It mentions
       | using improvements from previous papers, but from the grammatical
       | context it's unclear whether those improvements are applied to
       | both the normal transformer and their diff transformer; if not,
       | that would sully the comparisons. It's the "main difference"
       | wording in the previous sentence that raised a flag for me.
       | 
       | Of course, a good-faith researcher would know this and may not
       | feel the need to clarify. But you can never be too careful about
       | some published research in this field.
        
         | Chirono wrote:
         | The two other changes they mention have been widely adopted,
         | and are included in at least some of the models they benchmark
         | against. It seems they list them for completeness as changes to
         | the original transformer architecture.
        
           | aDyslecticCrow wrote:
           | Nicely spotted! Then, I really look forward to seeing this
           | method tested by others! Epic stuff.
        
         | vessenes wrote:
         | Yes. This looks really, really good to me. Across-the-board
         | improvements in training time, and perplexity improvements both
         | per token trained and per model size. I'm reminded of MoE
         | architectures: in that world we're choosing an optimal small
         | model to process part or all of the inference job; I wonder if
         | MoE got some of the same benefits from forcing the Transformer
         | to distinguish between alternate possibilities.
         | 
         | In any event, I'd imagine that this will get widely adopted if
         | the numbers hold up; like I said, this seems to have basically
         | no downside, and should be easy to replicate.
        
       | x49asvk wrote:
       | This concept is really interesting to me. I'm very, very new to
       | transformers but would love to learn more about both normal and
       | differential transformers. Can anyone suggest any resources?
        
       | lucidrains wrote:
       | does this not mean we should explore usage of talking heads
       | (Shazeer et al) a bit more? https://arxiv.org/abs/2003.02436
        
       | WithinReason wrote:
       | _We empirically find that the setting λ_init = 0.8 - 0.6 *
       | exp(-0.3 * (l - 1)) works well in practice_
       | 
       | I wonder about the story behind that formula...
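       | 
       | For what it's worth, just evaluating it (my own quick check, not
       | a table from the paper) shows it ramps from 0.2 in the first
       | layer toward 0.8 in the deeper ones:
       | 
       |     import math
       | 
       |     def lambda_init(l):  # l is the 1-based layer index
       |         return 0.8 - 0.6 * math.exp(-0.3 * (l - 1))
       | 
       |     for l in (1, 2, 4, 8, 16, 28):
       |         print(l, round(lambda_init(l), 3))
       |     # 1 0.2, 2 0.356, 4 0.556, 8 0.727, 16 0.793, 28 0.8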
        
         | Kubuxu wrote:
         | Hmm, 0.8 works well, but let's try setting the lower layers to
         | a lower initial value. Let's say 0.2. OK, I need a formula that
         | goes from 0.2 to 0.8, slowly approaching 0.8. _Fiddles with
         | numbers for 20 min._ I guess this can work.
        
         | kridsdale3 wrote:
         | A whole lot of things are tuned optimally by rotating an analog
         | dial until things look / sound right.
        
         | stellalo wrote:
         | Looks like this makes (at least initially in training) the
         | "negative" attention term smaller in the early layers (smaller
         | l) compared to later layers (larger l). Which I guess makes
         | sense: you probably want to attend a little bit to everything
         | before concluding that it's really a few spots you should look
         | at.
         | 
         | (Although it seems the authors do not discuss this choice
         | anywhere in the paper?)
        
       | WithinReason wrote:
       | Hmmm, this could be expressed as 2 consecutive attentions in a
       | residual branch.
       | 
       | Simplified differential attention looks like:
       | 
       |     (softmax(Q1 K1) - λ softmax(Q2 K2)) V
       | 
       | You can factor this into:
       | 
       |     x = softmax(Q1 K1) V
       |     x += -λ softmax(Q2 K2) V
       | 
       | which is like 2 subsequent regular attentions, added together and
       | sharing V.
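       | 
       | A minimal single-head sketch of that factoring (my own numpy,
       | leaving out causal masking, multi-head packing and the per-head
       | normalisation described in the paper):
       | 
       |     import numpy as np
       | 
       |     def softmax(x):
       |         e = np.exp(x - x.max(axis=-1, keepdims=True))
       |         return e / e.sum(axis=-1, keepdims=True)
       | 
       |     def diff_attention(Q1, K1, Q2, K2, V, lam=0.8):
       |         d = Q1.shape[-1]
       |         a1 = softmax(Q1 @ K1.T / np.sqrt(d))  # first attention map
       |         a2 = softmax(Q2 @ K2.T / np.sqrt(d))  # second attention map
       |         x = a1 @ V                  # regular attention...
       |         x += -lam * (a2 @ V)        # ...plus a second one sharing V
       |         return x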
        
         | kelseyfrog wrote:
         | You could also extrapolate this into more than two terms by
         | squinting your eyes and saying that λ ∈ {1, -1} is close enough
         | to λ_i ∈ R^d with ||λ_i|| = 1. No idea if it would result in
         | better performance, but that's research babyyyy!
        
       | miven wrote:
       | Is there an intuitive reason why this ends up working this well
       | compared to, say, applying some kind of thresholding to attention
       | activations that are below average for a given head to filter
       | that same attention noise out?
        
       | islewis wrote:
       | > Differential attention takes the difference between two softmax
       | attention functions to eliminate attention noise
       | 
       | If I understand correctly, this architecture trades twice as much
       | attention memory for either a higher-quality model or fewer
       | parameters at a similar quality.
       | 
       | > According to the fitted curves, 6.8B-size DIFF Transformer
       | achieves a validation loss comparable to 11B-size Transformer,
       | requiring only 62.2% of parameters
       | 
       | This raises a few questions for me:
       | 
       | - Would having only 60% of the parameters negate the double space
       | for attention, leaving a similar memory profile as a traditional
       | transformer?
       | 
       | - Does that tradeoff change noticeably between training and
       | inference?
        
         | entropicdrifter wrote:
         | I think it _would_ negate the RAM savings, but it would also
         | reduce the amount of storage needed at rest and possibly reduce
         | initial start up times depending on storage speed and model
         | size. So, possibly good for low-end models on consumer devices?
        
         | _hl_ wrote:
         | My understanding was that the extra parameters required for the
         | second attention mechanism are _included_ in those 6.8B
         | parameters (i.e. those are the total parameters of the model,
         | not some made-up metric of would-be parameter count in a
         | standard transformer). This makes the result doubly impressive!
         | 
         | Here's the bit from the paper:
         | 
         | > We set the number of heads h = d_model/2d, where d is equal
         | to the head dimension of Transformer. So we can align the
         | parameter counts and computational complexity.
         | 
         | In other words, they make up for it by having only half as many
         | attention heads per layer.
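         | 
         | A back-of-the-envelope check of that bookkeeping (my own
         | arithmetic with an illustrative d_model, not numbers from the
         | paper, and going by the thread's reading that V is made twice
         | as wide per head):
         | 
         |     d_model, d_head = 1024, 64  # d_head is the Transformer's d
         | 
         |     # standard: h = d_model / d heads, one (Q, K) pair each
         |     h_std = d_model // d_head
         |     qk_std = h_std * 2 * d_model * d_head
         |     v_std = h_std * d_model * d_head
         | 
         |     # DIFF: h = d_model / (2d) heads, two (Q, K) pairs each,
         |     # and a V that is twice as wide
         |     h_diff = d_model // (2 * d_head)
         |     qk_diff = h_diff * 4 * d_model * d_head
         |     v_diff = h_diff * d_model * (2 * d_head)
         | 
         |     print(qk_std == qk_diff, v_std == v_diff)  # True True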
        
         | chessgecko wrote:
         | I think they mitigated the extra memory/compute from this by
         | using half the number of overall heads and doubling V and O.
         | Without actually checking the math I think it should be
         | equivalent in flops, not counting the extra (cheap) multiply by
         | const and subtract.
        
         | Kubuxu wrote:
         | It would double the size of the KV cache, which can be
         | significant (multi-GB) at larger context sizes.
        
       | Imnimo wrote:
       | I feel like I'm missing a key insight here. I understand the
       | problem that regular softmax attention struggles to approach
       | assigning zero attention to irrelevant stuff. And I get that
       | having this subtraction formula makes it possible to assign
       | exactly (or near) zero attention weight without having crazy
       | outlier activations. But it seems like it also makes it very easy
       | to have negative attention weight (which is equivalent to having
       | positive attention weight on the negation of your value vectors).
       | Intuitively, it just feels like a difficult balancing act to keep
       | all the stuff you don't care about so close to zero.
       | 
       | But Figure 1 clearly shows that it works, so I don't doubt that
       | it is in fact possible. I'm just struggling to build a picture of
       | how exactly the network accomplishes this.
        
         | watsonmusic wrote:
         | negative values can enhance the expressibility
        
           | Jerrrrrrry wrote:
           | doubt is the seed of reason
        
         | Grosvenor wrote:
         | Regular softmax (and attention) has an error in it.
         | 
         | softmax should be exp(x_i) / (1 + Σ_j exp(x_j))
         | 
         | Notice the 1 added to the denominator.
         | 
         | The difference is that at the negative limit softmax can reach
         | 0 instead of some epsilon. The same effect could be had by
         | appending an extra zero-valued logit to x.
         | 
         | Downside is, you have to retrain your model from scratch to fix
         | this.
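         | 
         | A few lines showing the variant (my own sketch of what is often
         | called "softmax_1", not code from either paper):
         | 
         |     import numpy as np
         | 
         |     def softmax1(x):
         |         # exp(x_i) / (1 + sum_j exp(x_j)), shifted for stability
         |         m = np.maximum(x.max(axis=-1, keepdims=True), 0.0)
         |         e = np.exp(x - m)
         |         return e / (np.exp(-m) + e.sum(axis=-1, keepdims=True))
         | 
         |     # when every key is irrelevant, the weights can all sink
         |     # toward zero instead of being forced to sum to 1
         |     print(softmax1(np.array([-9.0, -9.0, -9.0])).sum())  # ~0.0004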
        
           | impossiblefork wrote:
           | I've tried that in a small transformer that I trained from
           | scratch and it didn't really make any difference. I also made
           | a version where I made this trainable somehow, probably by
           | replacing the 1 with a constant associated with the layer,
           | and that didn't make any difference either.
           | 
           | I didn't follow Miller's proposal quite as he wrote it though
           | and I put the mechanism in all the layers rather than
           | avoiding it at the end.
           | 
           | My test doesn't absolutely rule out usefulness -- there are
           | always different ways of applying something, but I saw no
           | indication of it.
        
             | Grosvenor wrote:
             | I guess the next step is to see if you're getting those
             | mega activations as he describes.
             | 
             | A/B test the two models and compare?
             | 
             | Would be interesting to see if these activations only show
             | up on larger models, or whether they bear some relation to
             | model size.
        
         | sigmoid10 wrote:
         | >I'm just struggling to build a picture of how exactly the
         | network accomplishes this.
         | 
         | I mean, intuitively it would be trivial for the model to just
         | optimise lambda to zero during training. Then you essentially
         | have built a vanilla transformer with an overcomplicated
         | parameter pruning mechanism. Pruning is already pretty well
         | established in the literature as something that works
         | surprisingly well for reducing parameter counts by up to (hold
         | on to your papers)... about 40%. In practice the model probably
         | doesn't work exactly like that, but I wouldn't be surprised if
         | it just approximates the normal transformer in the end anyway.
        
       | machinelearning wrote:
       | This is a good problem to solve, but the approach is wrong imo.
       | 
       | It has to be done in a hierarchical way, so you know what you
       | attended to + the full context.
       | 
       | If the differential vector is being computed from the same input
       | as the attention vector, how do you know how to modify the
       | attention vector correctly?
        
         | quantadev wrote:
         | Doesn't everything just get tweaked in whatever direction the
         | back-propagation derivative says and proportionally to that
         | "slope"? In other words, simply by having back-propagation
         | system in effect there's never any question about which way to
         | adjust the weights, right?
        
       | slashdave wrote:
       | I don't get it. Arbitrary linear combinations are already
       | accommodated via feed forward. What am I missing?
        
         | michalsustr wrote:
         | My hunch is that this effectively creates a differentiable
         | minimax "search" "tree" that can be backpropagated through. Not
         | a tree -- a dag really -- and not search, but learning. :)
        
       | chessgecko wrote:
       | I wonder how much of the value here is from canceling out the
       | positional noise RoPE produces. I would love to see a table
       | comparing an ALiBi version of this to an ALiBi baseline, in
       | addition to the RoPE models here.
       | 
       | Crazy gains though, congrats to the researchers.
        
       ___________________________________________________________________
       (page generated 2024-10-08 23:00 UTC)