[HN Gopher] Attention Is Off By One
       ___________________________________________________________________
        
       Attention Is Off By One
        
       Author : elbasti
       Score  : 458 points
       Date   : 2023-07-24 17:33 UTC (5 hours ago)
        
 (HTM) web link (www.evanmiller.org)
 (TXT) w3m dump (www.evanmiller.org)
        
       | janalsncm wrote:
       | I follow the argument but the proof of the pudding is in the
       | eating. I don't know what "battles" the author lost to PyTorch
       | lately but a good test would be to modify one of the smaller
       | models (maybe nanogpt) and swap out all of the softmax calls for
       | his quiet softmax.
       | 
       | I didn't see anything relevant on alternatives to softmax, since
       | TFA is specifically questioning softmax in a multihead attention
       | context.
       | 
       | Ultimately, neural networks are arbitrary function approximators.
       | It doesn't necessarily have to be "right" internally to fit the
       | data. But if this new softmax allows transformers to learn more,
       | that's great.
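        | 
        | For concreteness, a minimal sketch of the swap (the helper name
        | softmax_one here is illustrative, not from the post):
        | 
        |     import torch
        | 
        |     def softmax_one(x, dim=-1):
        |         # "Quiet" softmax: exp(x_i) / (1 + sum_j exp(x_j)).
        |         # The extra 1 in the denominator lets every output go to
        |         # zero when all logits are very negative.
        |         e = torch.exp(x)
        |         return e / (1.0 + e.sum(dim=dim, keepdim=True))
        | 
        | In nanoGPT that would roughly mean replacing the F.softmax call
        | on the attention scores with softmax_one and retraining.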
        
         | LoganDark wrote:
         | > a good test would be to modify one of the smaller models
         | (maybe nanogpt) and swap out all of the softmax calls for his
         | quiet softmax.
         | 
         | You'd have to train the model with the quiet softmax before
         | inferencing with it would work.
        
       | phkahler wrote:
       | A couple thoughts. 1) An alternative might be to have an extra
       | NULL output where the attention can be diverted. This might be
       | what existing models are using commas for, but make it explicit.
       | 2) What he proposes has a similar effect on the other weights
       | without explicitly having the NULL present. In this light it
       | should work, but does it have the advantage he thinks?
        
       | leecarraher wrote:
        | The author says to add a unity vector to the context, I presume
        | of each layer, so as not to mess with gradient calculations. But
        | most modern DL frameworks compute the gradient for you (I know
        | this is true for JAX and PyTorch). Is it that a hand-coded
        | gradient for a well-known DL architecture like the transformer
        | is faster than letting the framework autodiff it?
       | 
        | Otherwise, I fear some of the 'magic' of transformer networks is
        | that this amplification effect allows them to encode/memorize
        | some results verbatim, and what we are often seeing is a heavily
        | tuned internet regurgitator. It is similar to the rise of RNNs
        | with attention, which supposedly allowed them to focus on some
        | things and ignore others but really was often just overfitting,
        | and the overfitting yielded more interesting results than going
        | without it.
        
         | bearzoo wrote:
          | I think the author is talking about having to fix the extra
          | vector in V to be zeros and making sure not to compute/apply
          | gradients to it.
        
       | seydor wrote:
        | Can I please ask why lim_{x -> -inf} softmax(x) = 1/k?
        
         | hawkice wrote:
         | It's splitting the probability across all x_i equally, even
         | when they're all massively negatively weighted.
         | 
         | This change would have all the probability going into the null
         | option, in that case, basically.
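          | 
          | Concretely (a tiny numeric check, with k = 4 logits pushed very
          | negative):
          | 
          |     import torch
          |     x = torch.full((4,), -50.0)      # all scores very low
          |     print(torch.softmax(x, dim=0))   # four 0.25s, i.e. 1/k
          |     e = torch.exp(x)
          |     print(e / (1 + e.sum()))         # all ~0 with the +1 trick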
        
       | xyproto wrote:
       | > I'm thinking those numbers will make for a handsome table in a
       | soon-to-be influential arXiV paper, either when those Qualcomm AI
       | researchers step off the plane from Italy, or someone in an LLM
       | hacker channel figures out biblatex, whichever happens first.
       | 
       | :D
        
       | sunleash wrote:
       | I don't see any results, it'd be more impactful and convincing if
       | there were numbers supplementing the theory. It's not that hard
       | to finetune existing LM on a small data and verify that it works.
       | 
       | I am, however, of the similar opinion that there could be better
       | attention formulations. A paper from 2020
       | https://arxiv.org/abs/2005.09561 helped a lot in one of the
       | transformers model I trained (not a vanilla LM but a specialised
       | multi-modal graph problem).
       | 
       | It proposes normalised attention which if I'm not wrong should
       | help with the quantisation problem too.
        
       | blueblimp wrote:
       | The proposed replacement definitely makes more sense (and I've
       | always found the absence of a "failed query" to be puzzling in
       | standard attention), but, in deep learning, things that make more
       | sense don't always actually get better results. So I'm curious
       | whether this has been tried and carefully evaluated.
        
         | jadbox wrote:
          | It would be an amusing find if "Black Swan mega-activations"
          | actually, if unintentionally, made the model smarter...
        
       | keskival wrote:
        | I thought everyone knew that softmax (and specifically the exp
        | functions in it) is poison. I have always worked around them,
       | for example by using large epsilons (approaching one actually),
       | and using low-order polynomial approximations for the exp
       | functions.
       | 
       | I thought everyone does that, because you don't need to work long
       | with these models to get NaNs, and when you check why you see
       | it's because of the exp functions. Then you fix it. Apparently
       | people don't.
       | 
       | It's not like the neural models care if you approximate
       | functions. They couldn't care less actually.
        
         | sebzim4500 wrote:
         | It is pretty easy to avoid NaNs when working with softmax, you
         | certainly don't need any epsilons. Just subtract the largest
         | value from everything, and you will have no rounding problems
         | or catastrophic cancellation.
         | 
         | Clearly softmax is not too bad, if it is used extensively in
         | all the most powerful models.
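          | 
          | For reference, the standard trick looks roughly like this
          | (softmax is shift-invariant, so the result is unchanged):
          | 
          |     import torch
          | 
          |     def stable_softmax(x, dim=-1):
          |         # Subtract the per-row max so the largest exponent is
          |         # exp(0) = 1: no overflow, identical probabilities.
          |         x = x - x.max(dim=dim, keepdim=True).values
          |         e = torch.exp(x)
          |         return e / e.sum(dim=dim, keepdim=True)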
        
       | jrochkind1 wrote:
       | While not about AI or the algorithm mentioned, on the subject of
       | little errors that you can't convince anyone are errors....
       | 
       | In 2011, I wanted to copy the reddit ranking algorithm in a
       | project of my own, so I went to source code to look at it... the
       | algorithm in the source code I found wasn't doing anything at all
       | sensible with negative-sum voted posts.
       | 
       | I thought I discovered the error, some terms swapped in the
       | simple equation, the sign for positive/negative was misapplied.
       | 
       | I blogged it, and [posted it to reddit](https://www.reddit.com/r/
       | programming/comments/td4tz/reddits_...), only to have MANY
       | people, including reddit employees, tell me I am definitely
       | definitely wrong, and the algorithm was working as intended. And
       | that I was in fact not the first to notice what I thought I
       | noticed, and point it out, and be told by everyone I was wrong.
       | 
       | OK, I didn't really understand what was going on, I couldn't make
       | sense of the algorithm if it wasn't wrong, but so be it. I
       | updated my blog post to say that people smarter than me said
       | there was no error in the reddit algorithm, all I can say is this
       | variation makes more sense to me.
       | 
       | Then, three years later in 2014, a commit was made to the reddit
       | source code with _exactly the correction I (and others before me)
       | had suggested_ all along. The one that everyone piled on to tell
       | me how dare I have the temerity to suggest reddit source code is
       | wrong.
       | 
       | https://github.com/reddit-archive/reddit/commit/50d35de04b92...
       | 
        | ¯\_(ツ)_/¯
       | 
       | Open source means there are lots of eyes that can find bugs, but
       | sometimes they can't convince anyone they've found a bug. (And of
       | course, then reddit close-sourced their code in 2017).
       | 
        | I never did end up using the ranking feature that I had wanted
        | to copy from reddit in my own project. I didn't end up adding
        | "vote" features to the app.
        
         | refulgentis wrote:
         | I work at a FAANG and it was absolutely astonishing to find out
         | how often this happens.
         | 
          | You can make a long, impactful career just by being "the guy
          | who adds log statements throughout the codebase and reasons
          | through it"; doing this at even a simplistic level has always
          | turned up an astonishing fix to some long-standing issue.
         | 
         | n.b. It also attracts a ton of political fun. People's first
         | order reaction is denial, and it only gets worse from there.
         | Absolutely no one except 1-2 colleagues will see it as "oh we
         | should fix that", and at least one person will make sure your
         | boss' boss' boss is CCd on an email with a nice version of "no
         | he's just insufficiently concerned about {concurrency, memory
          | management, take your pick}". Just wait it out quietly when
          | that happens; do not engage or complain. If nothing happens and
          | you're never asked about it by leadership, but your peers ask,
          | make plans to move on to another team.
        
           | jrochkind1 wrote:
           | A long impactful career, or a career of horrible frustration
           | and alienation as everyone gets mad at you for pointing out
           | their bugs? (or, from their point of view, making trouble
           | insisting that something is a bug which isn't and is causing
           | no problems)
        
           | sidfthec wrote:
           | What FAANG have you seen this at?
           | 
           | I've been at big tech companies for most of my career and
           | I've never seen anyone deny the existence of a technical bug.
           | I've seen plenty of teams mark a bug as lower priority and
           | never fix it because other things are higher priority. But
           | _denying that the bug exists_ , especially after a detailed
           | explanation? That doesn't resonate with my experiences.
        
             | com2kid wrote:
             | I've told this story before!
             | 
              | It used to be that writing the outputs from the C/C++
              | preprocessor (.i files) to disk took _forever_ (5+ minutes
              | IIRC) with Microsoft's compilers. I asked one of the lead
             | compiler developers why, and he waved me away saying it was
             | just really complicated. Around that time a bunch of tools
             | existed for GCC that worked with .i files, but none existed
             | in the Microsoft ecosystem likely because writing .i files
             | was so slow.
             | 
             | I was on the compiler test team at the time and we did lots
             | of stuff with .i files, our tests were distributed across a
             | large cluster of test machines (see my post about that
             | https://meanderingthoughts.hashnode.dev/how-microsoft-
             | tested...) so it wasn't a big deal, but it still annoyed
             | me.
             | 
             | One day I decided to find out what was going on, so I
             | loaded up process monitor while outputting a .i file and
             | watched what was happening. Much to my surprise, only 1
             | byte was being written at a time! No wonder writes were
             | taking forever.
             | 
             | A quick dive into the source code revealed a comment above
             | the file write call that read to the effect
             | 
             | // to work around a bug in windows 98
             | 
             | So anyway I opened a bug against the compiler saying we
             | should probably fix that. :)
        
               | sidfthec wrote:
               | But that's not the type of story that's being claimed
               | from the person I responded to.
               | 
               | Of course the lead developer waved you off. You wondered
               | why things took forever, and the lead developer knew it
               | was a complicated system and figured it wasn't worth
               | their time investigating. It happened to be incorrect,
               | but the lead developer wasn't in denial. They just
               | filtered the issue out because they can't afford to go
               | down every rabbit-hole they come across. I'm sure once
               | you found the actual bug, it was later fixed.
               | 
               | The person I was responding to seems to think a large
               | number of people are in denial when a bug is filed
               | against them. That doesn't make sense, and isn't
               | something I see. It'd be as if when you pointed out the
               | actual bug, the lead developer continued to say it wasn't
               | actually a bug (which is of course ridiculous and I bet
               | didn't happen).
        
         | madrox wrote:
         | When I was an intern at Yahoo working on OAuth back in 2008
         | (2007? It was long ago and I'm old) I had the pleasure of
         | implementing an internal tool for generating OAuth 1.0 URLs,
         | which meant encoding a lot of things in query parameters. My
         | tool did not generate URLs which were compatible with Yahoo's
         | implementation (certain parameters effectively should be
         | encoded twice, which my tool did). The implementing engineer
         | insisted my tool was wrong, cited my status as a lowly intern,
         | and even pulled out the OAuth spec and bent over backwards to
         | say how his implementation was correct and I'm clearly reading
         | it wrong. It literally took bringing in Eran Hammer-Lahav to
         | weigh in on the topic to say I was correct, at which point the
         | engineer agreed that of course that was correct. I got zero
         | acknowledgment or apology for the days of ad hominem attacks
         | against me.
         | 
         | I did learn an important lesson that more senior people are not
         | always right, and as someone who's usually more senior than my
         | colleagues now I try to remember it daily.
        
       | ersiees wrote:
        | This trick "they found" is part of the standard torch
        | implementation of multi-head attention, where it is called
        | add_zero_attn. It adds a zero to the logits, resulting in an
        | extra one in the denominator since e^0 = 1:
        | https://pytorch.org/docs/stable/generated/torch.nn.Multihead...
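        | 
        | A minimal usage sketch of that flag (shapes use the default
        | seq-first layout; the sizes are arbitrary):
        | 
        |     import torch
        |     from torch import nn
        | 
        |     # add_zero_attn=True appends an all-zero key/value slot,
        |     # i.e. an extra logit of 0, so a head can attend to nothing.
        |     mha = nn.MultiheadAttention(embed_dim=64, num_heads=8,
        |                                 add_zero_attn=True)
        |     x = torch.randn(10, 2, 64)   # (seq, batch, embed)
        |     out, attn_weights = mha(x, x, x)
        |     print(out.shape)             # torch.Size([10, 2, 64])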
        
         | civilized wrote:
         | It's an option which is set to false by default. Does that mean
         | people have tried it and it's not usually helpful...?
        
           | mlyle wrote:
           | Yes.
        
       | bertil wrote:
       | If you hesitate to read it, let me say that the post denounces
       | "kurtotic barbarities." If that expression alone doesn't convince
       | you to read it, you might not be in the intended audience.
        
       | simbolit wrote:
       | I don't understand this well enough to say if it is correct, but
       | I do understand it well enough to say it is important if correct.
        
         | gremlinsinc wrote:
         | I don't know half of you half as well as I should like; and I
         | like less than half of you half as well as you deserve
        
           | simbolit wrote:
           | I don't understand your comment, but I gather that you and
           | others didn't like mine, which is noted, I will try to do
           | better.
        
             | the_af wrote:
             | The comment you're replying to is made by Bilbo Baggins
             | during his birthday, near the beginning of "The Lord of the
             | Rings".
             | 
             | As to what the commenter above meant I can only guess, but
             | it should be noted that Bilbo's audience reacts with
             | puzzlement, unable to parse his words.
        
       | make3 wrote:
       | Please, put empirical numbers with proposals like these.
       | Transformers have had a billion "improvements" suggested to them
       | through the years.
        
       | gerdusvz wrote:
       | now if only we could teach humans to also not annotate when they
       | have nothing to add
        
       | lscharen wrote:
        | This is similar to the (old) trick of adding a Uniform
        | distribution component to a Mixture of Gaussians model. It
        | doesn't really change the math wrt parameter optimization and
        | probability evaluation, but provides a place to capture
        | "background" or "unimportant" data points and improves the
        | model's robustness to outliers.
       | 
       | The motivation follows from the same problem the author points
       | out in the original softmax formulation that it always "forces a
       | choice" when it may be more useful to put a "Not Applicable"
       | option into the model itself.
       | 
       | https://link.springer.com/article/10.1007/s10260-021-00578-2
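        | 
        | A toy 1-D sketch of the idea (made-up parameters, not the
        | paper's formulation), where the outlier's responsibility lands
        | almost entirely on the uniform component:
        | 
        |     import numpy as np
        |     from scipy.stats import norm
        | 
        |     x = np.array([0.1, -0.3, 8.5])   # 8.5 is an outlier
        |     # Two Gaussian components plus a uniform "background" on
        |     # [-10, 10] with mixing weight 0.10.
        |     dens = [0.45 * norm.pdf(x, 0.0, 1.0),
        |             0.45 * norm.pdf(x, 1.0, 1.0),
        |             np.full_like(x, 0.10 / 20.0)]
        |     dens = np.stack(dens)
        |     resp = dens / dens.sum(axis=0)   # per-point responsibilities
        |     print(resp.round(3))             # outlier -> background comp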
        
       | tylerneylon wrote:
       | 1. Summary
       | 
       | The author is suggesting that we add 1 to the denominator of the
       | softmax that is used within attention mechanisms (not the final
       | output softmax).
       | 
       | The softmax inside an attention unit allows it to see key/query
       | matches as probabilities; those probabilities support a
       | continuous-valued version of a key-value lookup (instead of 1/0
       | output of a lookup, we get weights where a high weight = the
       | desired key-value lookup).
       | 
       | Adding 1 to the denominator would change an attention unit by no
       | longer working with a true probability vector of weights, but
       | rather working with weights that add up to less than 1. The
       | motivation is that the network can learn to provide high weights
       | so that the adjusted softmax is very close to a probability
       | vector; and it has a new option to provide all-low weights which
       | give all-low output weights, meaning it can opt out of having
       | high confidence in anything.
       | 
       | (switching to opinion mode)
       | 
       | 2. How can we tell if this is good?
       | 
       | 2a. We should just try it out: Train an LLM with this, see if it
       | works.
       | 
       | 2b. There are two reasons I suspect it won't make a big
       | difference.
       | 
       | First, if an attention node has low confidence, it can already
       | assign similar scores pre-softmax. Then we get what looks like a
       | uniform distribution as output. Then we're basically taking an
       | average of a bunch of vectors (vs a weighted average that is more
       | like choosing one of them). Statistically, we expect that
       | averaged vector to be close to zero. In other words, the node
       | already has a way to effectively opt-out by providing a near-zero
       | output vector.
       | 
       | Second, in a transformer, each attention unit has many other
       | learned weights that can support the ability to opt out. Both the
       | V matrix and the feed-forward layer after the attention unit give
       | that module a way to provide low values to the activation
       | function after the feed-forward layer, which would result in a
       | value as small as you like -- again, a way to opt out.
       | 
       | 3. I appreciate the non-academic tone of the article and the
       | willingness to play around with fundamental ideas. Although I'm
       | not totally convinced by the note, I'd love to read more stuff
       | like this.
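        | 
        | A quick sanity check of the averaging intuition in 2b, with
        | random vectors standing in for the value rows (just to show the
        | scale, not a real attention layer):
        | 
        |     import torch
        |     torch.manual_seed(0)
        |     V = torch.randn(512, 64)          # 512 value vectors
        |     w = torch.full((512,), 1 / 512)   # flat attention weights
        |     print(V.norm(dim=1).mean())       # ~8: typical row norm
        |     print((w @ V).norm())             # ~0.35: near-zero average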
        
       | sbszllr wrote:
       | OP is right in that his change would make the softmax in the
       | attention output zero if it "has nothing to add" (QuietAttention,
       | as he said).
       | 
       | Buuut, it's missing the forest for the trees. The goal of the
       | last step of attention (ref., Fig. 2, left in
       | https://arxiv.org/abs/1706.03762) is not to add/say anything (as
       | the author is saying) but to compute the relationship between the
       | tokens (QK^T) and V -- in layman terms, simplifying, which tokens
       | are related to each other. The softmax is there because it gives
       | a representation that is nicer to work with, it gives
       | probabilities, instead of unscaled matrix multiplication.
       | 
       | TLDR; author isn't wrong but he isn't right, practically
       | speaking, either.
        
         | lostmsu wrote:
         | What's wrong with unscaled matrix multiplication? Softmax has
         | some kind of intuition in the context, but why not layer norm
         | or something else instead (if anything is needed at all)?
        
           | sbszllr wrote:
           | The family of sigmoid functions has nice gradient properties
           | with theoretical backing. Good starting read:
           | https://stats.stackexchange.com/questions/162988/why-
           | sigmoid...
        
       | whimsicalism wrote:
       | > their existence is contrary to everything we thought we knew
       | about neural networks prior to building ones that worked so well
       | 
       | Why is this true? Because their existence implies some sort of
       | preferred basis that aligns with the dims of the neural network,
       | which is surprising?
       | 
       | It's not obvious why their existence is so contrary to what we
       | knew.
        
       | fwlr wrote:
       | The author identifies a real problem and poses a simple solution.
       | It passes all my crank tests (why did no one come up with this
       | before? Because the author is intimately familiar with the
       | softmax function from work outside of ML, and plausibly nobody
       | who's investigating these issues is remotely as familiar, so
       | despite researchers narrowing the issue down to "something to do
       | with softmax", they don't have a deep enough understanding of
       | softmax to see what's wrong).
       | 
       | If the author is reading any of these comments, though, I would
       | urge them to expand on their claim that "I'm 99.44% sure that it
       | will resolve the outlier feedback loop". As it stands, that's the
       | only explanation we get of how the outliers might be related to
       | softmax!
        
         | sedael wrote:
         | >why did no one come up with this before
         | 
         | So it turns out someone did. Specifically google did. This
         | _exact_ same idea has been in flaxformers since at least
         | November 2021.
         | 
         | https://github.com/google/flaxformer/blame/ee62754ebe5a5eeb1...
         | 
         | Specifically to save people a click it says:
         | 
         | > """Softmax function with an additional virtual logit equal to
         | zero.                 For compatibility with some previously
         | trained models.            This is equivalent to adding one to
         | the denominator.       In the context of attention, it allows
         | you to attend to nothing.
         | 
          | And it creates the _exact_ same modified softmax as this essay.
          | I suppose only time will tell why it was ignored publicly
          | before: maybe it doesn't do much, maybe it just fell through
          | the cracks, maybe google just didn't push it, who knows.
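          | 
          | The two formulations are easy to check against each other (a
          | quick numeric sketch, not the flaxformer code itself):
          | 
          |     import torch
          |     x = torch.randn(5)
          |     # (a) ordinary softmax with an appended virtual 0 logit,
          |     #     keeping only the weights of the real logits
          |     a = torch.softmax(torch.cat([x, torch.zeros(1)]), 0)[:5]
          |     # (b) exp(x) / (1 + sum(exp(x)))
          |     b = torch.exp(x) / (1 + torch.exp(x).sum())
          |     print(torch.allclose(a, b))   # True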
        
           | toxik wrote:
           | Or maybe it doesn't really do anything to improve
           | performance.
        
           | littlestymaar wrote:
           | > I suppose only time will tell why it was ignored publicly
           | before, maybe it doesn't do much, maybe it just fell through
           | the cracks, maybe google just didnt push it, who knows
           | 
            | Maybe quantization wasn't as hot back then as it is now?
        
             | jablongo wrote:
             | Yea the benefit is not going to come in terms of
             | performance for a given model, but in terms of ability to
             | be efficiently quantized.
        
         | Legend2440 wrote:
         | Yeah, but it lacks the most important test: results. He hasn't
         | actually tried it, he just thinks it will work.
         | 
         | For such a simple change to the softmax it wouldn't take long
         | to verify. It's really embarrassing to not do that before
         | publishing.
        
           | refulgentis wrote:
           | It's not embarrassing at all.
           | 
           | I think there might be some curse of the auto-didact here,
           | hinging on the meaning of publish: it would be embarrassing
           | if he was capital-P publishing, as in a scientific paper.
           | 
           | The blog goes to great lengths to point out it is _not_
           | capital-P publishing.
        
           | furyofantares wrote:
           | It's a blog post. And it includes a call for help in testing
           | the idea.
        
           | joebiden2 wrote:
           | You seem to really disregard the positions of this author.
           | They seem to have invested substantial efforts in that
           | specific area of research.
           | 
            | To validate the author's idea, you would need to train an
            | LLM from scratch. If the author is right, you would get
            | similar results to the current generation of LLMs, but with
            | (a lot) less space required for the intermediate layers.
            | 
            | The cost of doing that is still measured in kilo- to mega-
            | dollars, so why is it wrong to put the idea in the open for
            | others to criticize or adopt?
        
             | Legend2440 wrote:
             | You don't need to train a ChatGPT-sized LLM, a toy nanoGPT
             | would have been enough. You can train those on a consumer
             | GPU in an afternoon.
             | 
             | And yes I do disregard his research effort. There are
             | hundreds of well-justified and well-researched "clever
             | tricks" for improving Transformers, and almost all of them
             | don't work. I'll believe it when I see the results.
        
               | knewter wrote:
               | Google used it in flaxformers since 2021 apparently
        
               | renewiltord wrote:
               | Do you know of handy testing steps? I suppose I could ask
               | ChatGPT, but if someone has a validated "here, this is
               | how you do it" I have a 3090 that I can do it on, but I'm
               | not keen to debug anything here.
        
         | visarga wrote:
          | I interpreted it as cracking a joke about miscalibrated
          | probabilities in softmax: it tends to be 99.9% sure, or 0.1%,
          | but rarely in between.
        
         | tel wrote:
         | > why did no one come up with this before? Because the author
         | is intimately familiar with the softmax function from work
         | outside of ML, and plausibly nobody who's investigating these
         | issues is remotely as familiar
         | 
         | I doubt that is true. Softmax is extremely well understood
         | within the ML community. It's a very common trick, these
         | properties are well-known as well. It feels very unlikely that
         | nobody has thought of this before. That said, it's also
         | plausible that the current softmax convention was chosen by
         | accident and the author is right to identify this drawback.
        
         | Majromax wrote:
         | > why did no one come up with this before?
         | 
         | And because the effects of the problem are subtle. Supposing
         | the diagnosis is correct, full-precision LLMs still avoid the
         | issue through large attention weights given to meaningless
         | tokens to give harmless attention outputs. The problem only
         | matters when quantizing weights, and quantized performance
         | isn't really the goal of recent cutting-edge LLM development.
        
       | Agingcoder wrote:
        | It's interesting. It looks like if you're trying to improve the
        | accuracy/perplexity of the model and using fp32 it doesn't make a
        | difference, but if you want to quantize it/make it compressible,
        | a modified softmax makes a huge difference (this is what I
        | understand from the Qualcomm paper). Different goals, different
        | findings?
        
       | ks2048 wrote:
       | Interesting read. As others have said, it will be much more
       | convincing with some experimental numbers.
       | 
       | I'm confused what his goal is though:
       | 
        | I could imagine some theoretical reason to add a 1 there, but he
        | starts by saying this can lead to smaller, more compactable
        | models. Is he talking about the size of the compressed weights?
        | Or pruning to a smaller model? Or being more resistant to
        | quantization?
        | 
        | Parts of the essay seemed to throw me off track, because I'm not
        | sure if they are relevant at all to the proposal (e.g. the size
        | of the initial embedding and how many bits it would take to
        | store the vocab size, etc.).
        
         | feoren wrote:
          | He says this in the article: if you ever want to jam a multi-
          | trillion-parameter model into a phone app or a Raspberry Pi,
          | you _must_ quantize. I've seen some quantization go from
          | doubles to bytes (64 bits to 8) per weight, reducing the RAM
          | requirement by 8x. A simple quantization (I'm sure there are
          | much better ones) is to rescale everything by your number
          | range, round to the nearest 1/255th, then multiply by 255. So
          | your resolution is (max-min)/255. You also store the min and
          | max so you can reverse it, of course. Say you're trying to
          | quantize these sets of numbers:
         | 
         | 1. { -1.4, 0.8, 2.7, 7.3 } : With a range of 8.7, you have a
         | resolution of 0.034. This set quantizes to { 0, 64, 120, 255 }.
         | 
         | 2. { -1400, 800, 2700, 7300 } : Resolution 34.1, quantizing to
         | the same as the above { 0, 64, 120, 255 }.
         | 
         | 3. { -0.008, -0.001, 0.009, 0.019 } : resolution 0.000106. This
         | set quantizes to { 0, 66, 161, 255 }.
         | 
          | 4. { -1.4, 0.8, 2.7, 7329 } : Resolution 28.7. This set
          | quantizes to { 0, 0, 0, 255 }. Oops -- we can no longer tell
          | most of our weights apart.
         | 
         | You can see how this quantization works really well when all
         | the numbers are close together, regardless of their absolute
         | scale. Major outliers completely mess up the entire system. You
         | can make more and more complicated quantization algorithms, but
         | those will always come with tradeoffs. The best option would be
         | to tame your weights so that they are again close together.
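          | 
          | In code, the naive min/max scheme above is just this (a small
          | sketch reproducing the last two sets):
          | 
          |     import numpy as np
          | 
          |     def quantize(xs):
          |         lo, hi = xs.min(), xs.max()
          |         q = np.round((xs - lo) / (hi - lo) * 255)
          |         return q.astype(np.uint8), lo, hi  # keep lo/hi to undo
          | 
          |     print(quantize(np.array([-0.008, -0.001, 0.009, 0.019]))[0])
          |     # [  0  66 161 255]
          |     print(quantize(np.array([-1.4, 0.8, 2.7, 7329.0]))[0])
          |     # [  0   0   0 255]  <- one outlier erases the rest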
        
       | atorodius wrote:
        | My hot take is that if you don't do the trick, you basically get
        | a mean of all vectors in the value matrix when all x are very
        | small, which the next sequence of linear layers will then
        | probably be able to interpret the same way as if you had done
        | the +1 trick and produced a 0?
        
       | naillo wrote:
        | Reading this I'm mostly thankful that real brain power and the
        | broader programming community are seriously taking a close look
        | at all these things. I barely feel the need to try to compete for
        | insight; it finally feels healthily analyzed from every
        | perspective.
        
       | obiefernandez wrote:
       | This seems very important if accurate.
        
       | [deleted]
        
       | nborwankar wrote:
       | Shouldn't this be called Regularized SoftMax? Adding 1 in the
       | denominator looks a lot like a regularization in other ML
       | contexts.
        
       | chessgecko wrote:
        | I ran an experiment like this and in my setting it didn't help.
        | Not saying there may not have been a bug or something, but I
        | think attending over the current position sort of solves this
        | problem. I.e., when it should not speak it just emits the
        | current position's value.
       | 
       | edit to add details in case anyone is interested
       | 
        | I didn't add one to the softmax denom. I added a learned
        | parameter (the attention sink) that would be appended to the
        | beginning of QK but would be removed after softmax, so when
        | multiplying by V the totals wouldn't sum to one. I tried variants
        | that included looking at the current pos and not, and also
        | variants that used an FFN to generate the sink per position
        | instead of a learned param. In my setting neither approach
        | really made much of a difference. But I also had a bunch of
        | other weird stuff in there too, so it may be worth trying again.
        
         | abeppu wrote:
          | When you say it didn't help, can you clarify what you're
          | measuring? In the context of this post, I think both the
          | performance on your task and the number of outlier weights
          | (and their magnitude) are important.
        
           | chessgecko wrote:
           | I was just looking at doing this in pretraining, so I was
           | looking at pretraining losses. The difference was within the
           | range of usual noise so I didn't keep trying.
        
             | waynecochran wrote:
             | The question concerns outliers ... how did the change
             | manage them?
        
             | lucidrains wrote:
             | this is fixing a different issue, not the one you are
             | measuring.
        
               | chessgecko wrote:
               | It wasn't really the goal of my experiment to fix this
               | issue for sure, I was trying to see if you could improve
               | attention by decoupling the key used by a position for
               | itself and for future tokens.
               | 
                | Open to being wrong here, but wouldn't it be functionally
                | similar to adding a constant to the softmax denom? The
                | function could sort of learn a specific position whose
                | sink and q multiply to one, and then removing it before
                | multiplying with v would be exactly identical?
        
         | gwern wrote:
         | He's advertising it as fixing the spiking outliers. Did your
         | variant have those outliers beforehand?
        
           | chessgecko wrote:
           | I guess yeah I was mostly responding to
           | 
           |  _Now it's possible that softmax should be replaced
           | wholesale, but it's worked pretty well for the most part,
           | except for this one wee little bug that prevents attention
           | heads from saying nothing. So I propose a very small tweak on
           | which I am willing to stake all future Internet claims to
           | being correct. The tweak is so small, yet so obvious, and
           | it's been sitting here under everyone's noses ever since
           | attention was invented (2014)._
           | 
            | I didn't test for outliers, but I don't think this will lead
            | to a large improvement in attention overall or fix a lurking
            | bug.
        
             | zackangelo wrote:
             | He's not trying or claiming to improve attention. He's
             | trying to reduce outliers to improve the ability to
             | quantize the parameters.
        
               | chessgecko wrote:
               | He refers all over the blog post to an "error" in
               | attention. specifically says
               | 
               |  _The problem with using softmax is that it forces each
               | attention head to make an annotation, even if it has no
               | information to add to the output vector. Using softmax to
               | choose among discrete alternatives is great; using it for
               | optional annotation (i.e. as input into addition) is,
               | like, not cool, man._
               | 
                | I'm saying it uses the current position to do this, and
                | that if it were a significant error I would expect fixing
                | it to improve the training loss. I sort of interpreted
                | the blog post as being a bit more positive on the idea
                | than just being about improving the quantization.
        
         | [deleted]
        
       | cs702 wrote:
        | TL;DR: The author proposes that instead of using the Softmax
        | function in each head,
        | 
        |     Softmax(x_i) = exp(x_i) / sum_j(exp(x_j)),
        | 
        | we should use instead what the author calls the Softmax_1
        | function,
        | 
        |     Softmax_1(x_i) = exp(x_i) / (1 + sum_j(exp(x_j))),
       | 
       | which would make it possible for each transformer head's
       | attention probabilities to be zero, i.e., attend to nothing, by
       | computing x_i's with values well below zero.
       | 
        | Giving each transformer head _the ability to ignore all tokens_
        | surely can't hurt, but it remains to be seen if it will actually
        | improve transformer performance.
        
         | rrobukef wrote:
          | I also saw that the author distinguishes the internal versus
          | the output softmax. I think he'd apply his modification only to
          | the internal softmax and let the external one force an output.
        
           | cs702 wrote:
           | Yes, it makes sense to apply this only to the Softmax we use
           | to compute attention. It makes no sense to apply it to the
           | output Softmax, which must compute a probability
           | _distribution_ over the vocabulary.
        
         | mcbuilder wrote:
          | Activation sparsity and packing sparse matrices will surely be
          | important, so that is one kind of performance. However the
          | other, perplexity, needs a good demonstration. It might require
          | a big model, but even a 30B model you can fine-tune nowadays on
          | a big cloud GPU box.
        
       | szundi wrote:
        | It would be fun if, sometime in the future, someone finds a bug
        | like this, merges the PR, and BAM! Singularity.
        
       | theGnuMe wrote:
       | Isn't this why you have the <BOS> token, or the <cls> token?
        
       | mellosouls wrote:
        | Unless he gives a good reason why he has not demonstrated his
        | claim (e.g. "This effect only presents at a scale beyond my
        | resources"), the thesis seems severely weakened by the lack of
        | effort to prove it in a toy version.
       | 
       | He just says he doesn't want to spend any more time on it, which
       | is unlikely to convince or motivate anybody else that he has
       | discovered something important.
        
         | refulgentis wrote:
         | It got tons of people really excited.
         | 
         | I don't know what to say past that, but it's worth reflecting
         | on.
        
       | jxf wrote:
       | The author's use of "kurtotic barbarities" to describe this
       | situation is absolutely my new favorite phrase. English is a
       | beautiful language in which to express frustrations.
        
       | tsurba wrote:
       | In the text they say you need to cram all information needed to
       | predict the next token into a single 6KB word embedding, but
       | isn't that wrong?
       | 
        | Rather, isn't the autoregressively predicted single next token a
        | combination (based on attention) of all the 6KB word tokens in
        | the attention window?
        | 
        | So the size of the memory that all the information for next-
        | token prediction needs to be "crammed into" is more like
        | window_size * 6KB, right?
        
       | Imnimo wrote:
       | >The problem with using softmax is that it forces each attention
       | head to make an annotation, even if it has no information to add
       | to the output vector. Using softmax to choose among discrete
       | alternatives is great; using it for optional annotation (i.e. as
       | input into addition) is, like, not cool, man. The problem here is
       | exacerbated with multi-head attention, as a specialized head is
       | more likely to want to "pass" than a general-purpose one. These
       | attention heads are needlessly noisy, a deafening democracy where
       | abstention is disallowed.
       | 
        | Can't the MLP that processes the concatenated outputs of the
        | attention heads handle this? I don't understand why it should be
        | critical that a head be allowed to put something close to zero in
        | its segment of the concatenated vector if it's immediately going
        | to get projected by an MLP anyway.
        
         | marcyb5st wrote:
          | But then you are wasting some of the model's capacity on
          | learning to ignore some of that information, so I think it
          | wouldn't hurt. However, if I followed the reasoning correctly,
          | the biggest win is reducing the range of the weights rather
          | than improving performance.
         | 
         | > _This is what's been happening in LLMs - for reasons that are
         | only partially understood, Transformer models contain these
         | outlier weights and are emitting Black Swan mega-activations
         | that are much, much, much larger, like orders of magnitude
         | larger, than their peers ..._
         | 
         | meaning that once quantized you can either have a finer
         | quantization since the range of possible values is smaller or
         | you can pick a coarser strategy that saves bits for each
         | weight.
        
           | Imnimo wrote:
           | Right, I get the goal of removing the outlier activations,
           | but I just don't understand why outlier activations are a
           | consequence of the model trying to "pass". The story from the
           | linked paper earlier in the post
           | (https://arxiv.org/pdf/2306.12929.pdf) is that the model is
           | doing the following:
           | 
           | -Learn a near-zero representation for some otherwise low-
           | importance token, like delimiters or whitespace.
           | 
           | -When a head wants to "pass", emit an outlier activation to
           | attend to that token nearly-exclusively.
           | 
           | But I'm surprised the model can't just use its existing tools
           | (the post-concat projection layer and the following MLP
           | block) to achieve the same thing. And if the answer is that
           | it could do that, but tends to learn to use the outlier
           | activation trick instead, will giving it a new tool that
           | still allows the use of outlier activations be sufficient?
        
       | orasis wrote:
       | "the seemingly innocent exponentiator that no one thought capable
       | of such kurtotic barbarities."
       | 
       | This writing brought a happy tear to my eye.
        
       | politician wrote:
       | This makes sense. One tweak for the press: I think it would be an
       | improvement to call it OptionalAttention rather than
       | QuietAttention since the goal is to permit an attention head to
       | opt-out.
       | 
        | You might attract more, ahem, attention if it were immediately
        | apparent from the name alone what this attention head does that
        | the current one does not. There's also the small matter of
        | distinguishing the internal vs output softmax functions.
        
       | alsodumb wrote:
        | I might be missing something obvious, but I am not sure why
        | everyone in the comments thinks it's a big deal. I've seen this
        | trick in practice multiple times.
       | 
       | For example, see this snippet from an old Google repo:
       | https://github.com/google/flaxformer/blob/ee62754ebe5a5eeb11...
        
         | alevskaya wrote:
         | Yeah we used to use this in our older models years ago... I
         | don't recall the details exactly, but I don't think it ever did
         | very much.
         | 
         | I certainly don't think it will help at all with stability.
         | Things like Q/K layernorm are better tricks for softmax
         | stability when scaling: https://arxiv.org/pdf/2302.05442.pdf
        
           | ggerganov wrote:
           | > I don't recall the details exactly, but I don't think it
           | ever did very much.
           | 
            | How would you have known whether the trick actually reduces
            | the outliers in the weights? Even if the transformer quality
            | does not improve overall, having fewer outliers as a result
            | is very beneficial for more accurate quantization of the
            | data.
        
             | danielmarkbruce wrote:
             | Are you asking "why would you have bothered to look at"?
             | 
             | The "how" is pretty straightforward.
        
         | PartiallyTyped wrote:
         | The argument / reasoning is a bit dubious.
         | 
         | Technically softmax is not implemented as presented but through
         | exp(x_i-max(x)), and summing over it in the denom. But maybe I
         | am missing something.
         | 
          | Furthermore, the residuals are used exactly because the
          | networks can't learn the identity function, but they can learn
          | zero, at which point the residual is `f(x): x+g(x)` with
          | `g: x ~> 0` (i.e. approximately 0).
          | 
          | It is also the case that `f(x): x+g(x)` makes it easier for
          | gradients to flow through.
        
           | Piezoid wrote:
            | Implementations usually replace the 1 in the denominator
            | with exp(-max(x)) for this reason.
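            | 
            | In other words, a small sketch of the numerically stable form
            | of the +1 softmax:
            | 
            |     import torch
            | 
            |     def softmax_one_stable(x, dim=-1):
            |         m = x.max(dim=dim, keepdim=True).values
            |         e = torch.exp(x - m)
            |         # exp(-m) is the virtual 0 logit after the shift
            |         return e / (torch.exp(-m) + e.sum(dim, keepdim=True))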
        
           | mrfox321 wrote:
           | You are misreading things.
           | 
           | Regardless of numerical stability tricks (e.g. exp(x_i-
           | max(x))), you are still simply normalizing the logits such
           | that the probabilities sum to 1.
           | 
           | The blog adds an additional hidden logit (equal to 0) to
           | allow for softmax(x) = 0 when x -> -inf.
        
             | PartiallyTyped wrote:
              | How can `x -> -inf` occur in the first place when nearly
              | everything is within [-2, 2], we're doing a dot product,
              | and before that there's normalization too?
        
         | zorgmonkey wrote:
         | If popular models are still making this mistake then it still
         | seems noteworthy and making a blog post or paper to increase
         | awareness definitely seems worthwhile. Also multiple
         | independent discovery of good ideas is quite common.
        
       | jmount wrote:
       | The "missing 1" is a waste-category that is implicitly re-scaled.
       | 
        | The explicit-1 formulation is used in binary softmax, and the
        | implicit (unseen) 1 is used in multinomial softmax. I suspect
        | this is the old "notation B looks silly by notation A's
        | standards."
        
       | mlsu wrote:
       | I don't really understand the subject matter enough, so I
       | apologize in advance for the meta-comment...
       | 
       | The author mentions that he would maybe have written this as a
       | scientific paper:
       | 
       | > I tried writing a serious-looking research paper about the bug
       | and my proposed fix, but I lost a series of pitched battles
       | against Pytorch and biblatex, so I figured I'd just write a blog
       | post instead. (History is written by the winners; blogs are
       | written by...)
       | 
       | Honestly, thank god he didn't. This paper is so much more
       | readable and approachable than what gets published in "serious"
       | journals. The tone is self-effacing, it does not have an "ego"
       | the way scientific papers tend to have. If all science read like
       | this, and if we were "allowed" to cite research that reads like
       | this, I think we would be much better off. This reads like a
       | conversational, approachable textbook, not like an impenetrable
       | wall.
       | 
       | Is it because I don't understand attention at a PhD level that I
       | hold this opinion? Maybe. Could he be writing like this because
       | he's a layman and utterly wrong about the topic, unlike those
       | Serious Science Authors? Maybe, I don't know.
       | 
       | But my god, wouldn't it be nice to be allowed to write like this?
        
         | Waterluvian wrote:
         | To finish the author's analogy:
         | 
         | Blog posts are written by those who arrive first.
         | 
         | In a weird way my mental model is: blog posts are the recon
         | team discovering a new idea. They might have errors. They might
         | be incomplete. Maybe they're outright wrong. Stakes are lower
         | as it took less effort to get there and less loss if a position
         | is abandoned.
         | 
         | Then papers are authored, often much later, and they're the
         | regulars coming in to fortify a newly captured idea. They
         | provide (or at least are supposed to) rigor to the idea. A
         | fortification of a position that we decide is worth holding.
         | 
         | Yeah, this analogy is probably sloppy. But in my brain there's
         | an eternal conflict against ignorance as we keep advancing into
         | the unknown.
        
         | chessgecko wrote:
          | I think maybe it's because he didn't have experimental results
          | showing that it worked. Not a knock against the author; there
          | are just so many things that seem like good ideas but don't
          | end up working well in practice, so a paper like this without
          | results is hard to value.
        
           | mlsu wrote:
           | Yes, definitely. If he tried to have it published, the lack
           | of experimental results would definitely be a glaring error.
           | 
           | But this is still scientific communication. It's really nice
           | that it's legible!
           | 
           | > Even though softmax1 is facially quite boring, I'm 99.44%
           | sure that it will resolve the outlier feedback loop that's
           | making quantization the subject of cascades of research. If
           | you want to run some experiments and prove me right, DM me on
           | Twitter and we'll get a paper going.
           | 
           | I'm guessing that in the stodgy world of science, a
           | communication like this might happen over lunch at a
           | conference, limited to a small clique of researchers who are
           | zealously guarding their next paper. Who could blame them,
           | publish or perish!
           | 
           | But someone will probably test this theory out (after my
           | read, it will probably happen in llama.cpp with preliminary
           | results on GPT-2 by next week) and achieve results, and it
           | will happen quickly and legibly to the outside world, because
           | this was published openly and without all of the pretension
           | that formal science (tm) has. If it works, it works. Stuff
           | like this is the soul of the internet. Sharing knowledge and
           | making it legible for all.
        
             | [deleted]
        
           | WithinReason wrote:
           | Then again, if you don't have access to giant compute
           | clusters you can't test this, so it's either a blog post or
           | nothing. I believe the outlier problem that this solves only
           | appears for very large models.
        
             | janalsncm wrote:
             | That isn't true at all. Train a smaller model on a smaller
             | dataset. You can even train on your laptop. It's definitely
             | feasible. This is just a proof of concept, it doesn't need
             | to beat state of the art.
        
               | WithinReason wrote:
               | Maybe I edited my comment too late.
        
               | janalsncm wrote:
               | > I believe the outlier problem that this solves only
               | appears for very large models.
               | 
               | Any reason to believe this? The author never mentioned
               | it, and I can't think of any other _a priori_ reason why
               | it should be true.
        
               | WithinReason wrote:
               | See figure 1:
               | 
               | https://arxiv.org/pdf/2208.07339.pdf
               | 
               | Outliers appear at model size 6.7B and are not present at
               | 2.7B
        
               | janalsncm wrote:
               | Sure, emergent properties can arise as parameters
               | increase. Everyone knows that. That's a much less
               | specific claim than to say that the benefit of modifying
               | softmax can only arise as an emergent property after N
               | parameters, and therefore the benefit can only be
               | evaluated on models above a certain size. To my
               | understanding the author of TFA isn't suggesting the same
               | issue as the one in your linked paper.
        
               | WithinReason wrote:
               | The second heading in the TFA is "It's All About
               | Outliers"
        
               | PoignardAzur wrote:
               | 6.7B isn't "needs a datacenter" scale.
        
               | WithinReason wrote:
                | It's in the million-dollar range. XLNet, which is a 1.3B
                | model, cost $245,000 to train, for example.
        
         | Legend2440 wrote:
         | Counterargument: this blogpost is worthless. You get all the
         | way to the end and then find out he hasn't actually tried it,
         | not even on a toy model. It's just a neat idea he thinks will
         | work.
        
           | ambrozk wrote:
           | Why would that make it worthless?
        
             | PoignardAzur wrote:
             | Among other reasons, because the decoder-only version of
             | the original transformer architecture has proven _weirdly_
             | resistant to these kinds of hacks and clever optimizations.
             | 
             | Ideas like sparse attention, tree attention, residual
             | attention, etc, all sound good on paper, but when
             | researchers try to reproduce them they either find no
             | results or results that don't scale. Even AliBi is turning
             | out to be less powerful than scaled-down positional
             | embeddings. It's almost a bitter lesson on its own: you
             | can't beat the original transformer.
             | 
             | Optimizations that _do_ stick around tend to be the ones
             | that preserve the original algorithm but help with caching
             | or memory accesses.
        
             | 6gvONxR4sf7o wrote:
             | Because there are a thousand ideas a minute in this field
             | that meet the "it's worth trying" bar but don't actually
             | pan out to make any difference. It's the equivalent of a
             | blogpost that says "if someone else turned my idea into a
             | business, it would be a billion dollar business. But I
             | won't bother."
        
             | Legend2440 wrote:
             | Because until he tries it, who knows if it works?
             | 
             | There are a thousand papers out there making minor tweaks
             | to the transformer architecture. 99% of them are also
             | worthless and forgotten.
        
               | debugnik wrote:
               | > Because until he tries it, who knows if it works?
               | 
               | That's precisely why he shared it, though: so that
               | someone willing to train a model with this tweak can
               | try it.
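
       A minimal sketch of what "trying it" could look like, assuming
       PyTorch and a nanoGPT-style attention block. The formula is the
       softmax_1 the post proposes, exp(x_i) / (1 + sum_j exp(x_j)), but
       the function name, the max-shift for numerical stability, and the
       swap shown in the comment are illustrative rather than code taken
       from the post:

           import torch

           def quiet_softmax(scores, dim=-1):
               # softmax_1: exp(x_i) / (1 + sum_j exp(x_j)).
               # Shifting by m = max(0, max_j x_j) keeps the exponentials
               # from overflowing; after the shift the implicit "+1" term
               # becomes exp(-m).
               m = scores.max(dim=dim, keepdim=True).values.clamp(min=0)
               exp_scores = torch.exp(scores - m)
               return exp_scores / (
                   torch.exp(-m) + exp_scores.sum(dim=dim, keepdim=True)
               )

           # In a nanoGPT-style attention block the swap would be roughly
           #   att = F.softmax(att, dim=-1)  ->  att = quiet_softmax(att, dim=-1)
           # after which the model has to be trained (or at least fine-tuned)
           # with the change in place before comparing against a baseline.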
        
               | [deleted]
        
           | janalsncm wrote:
           | I wouldn't quite say its value is zero. It's worth something,
           | but a lot less than if it had been shown to work empirically.
           | 
           | Explainers and their folksy, imprecise tone are good for
           | things we already know are true. I'm skeptical on things
           | which are unproven.
        
             | [deleted]
        
         | Method-X wrote:
         | I can see AI being used to make scientific papers more
         | approachable like this.
        
         | TigeriusKirk wrote:
         | Are most AI papers even published beyond arxiv anyway?
        
         | Der_Einzige wrote:
         | This is why folks like gwern have their own research published
         | this way, i.e. his analysis of GPT-3: https://gwern.net/gpt-3
         | 
         | We call him an "independent AI researcher" because his google
         | scholar is "bland" compared to many academics who play the
         | academia game -
         | https://scholar.google.com/citations?user=yk1QMowAAAAJ&hl=en
        
         | _Microft wrote:
         | > This paper
         | 
         | It's not a paper. It's an idea that sounds plausible, presented
         | in a highly entertaining form.
        
         | doliveira wrote:
         | Nah, scientific papers are supposed to be precise and
         | technical. This reads like those quite frequent suggestions
         | here of switching all equations in papers to plain English or
         | code: it honestly comes from a place of ignorance, and I say
         | that as basically a layman myself.
         | 
         | What should be encouraged is for academics to blog about their
         | research as well. It would even help when recruiting and
         | onboarding new members. Right now the sociological and
         | economic incentives don't promote this at all.
        
           | karaterobot wrote:
           | The writing quality of academic papers is very poor,
           | whatever their intended characteristics, and we deserve
           | better.
           | 
           | I'm skeptical that the only way for them to be precise and
           | technical is to make them impenetrable. I think there is a
           | culture of academic writing (many different cultures, really)
           | that has adopted a voice and writing style which became a
           | parody of itself over time.
           | 
           | Here's a trivial example: You frequently see papers use the
           | passive voice, something a middle school English teacher
           | would mark with a red pen. _500 participants were asked_,
           | vs. _we asked 500 participants_. In what sense is the former
           | more precise and technical? It's not. It does not convey any
           | additional meaning. People use it to sound objective and
           | distant, even when they really aren't.
           | 
           | Realistically, academic writers usually don't even think
           | about it as much as that. They're just copying the tone of
           | other papers, because there is a culture and it enforces
           | certain behaviors on its members irrespective of the value.
        
           | baq wrote:
           | Leslie Lamport definitely doesn't share your opinion. A known
           | fact about the Paxos paper is that there are no dumbed down
           | summaries worth reading because the proper thing is so
           | approachable. I'm not sure it's true that you only have to
           | sound smart when you've got nothing to say, but it certainly
           | feels like it could be the case.
        
           | coldtea wrote:
           | > _Nah, scientific papers are supposed to be precise and
           | technical._
           | 
           | Those precise, technical parts are also more often than not
           | tedious, badly explained, oft-skipped, error-prone, and
           | hardly ever read carefully, even during peer review of the
           | paper that contains them. That's how mistakes stay unnoticed
           | for decades in influential papers with tons of citations.
           | 
           | In essence, a paper's tone and language are often more
           | formality, academic tradition, ritual, and padding for
           | publication purposes than anything serving a real purpose.
        
           | guluarte wrote:
           | Not always: ReLU is a fucking line, yet most papers write
           | stuff in the most complicated way to sound smart.
        
           | aqsalose wrote:
           | "it honestly comes from a place of ignorance, and I say that
           | as basically a layman myself"
           | 
           | Here is an added complication: succinct technical
           | communication can be efficient when communicating to peers
           | who work in exactly the same domain, on similar problems as
           | you, and want to digest your main ideas quickly.
           | 
           | On the other hand, for any particular paper, the size of the
           | audience to whom it is directly relevant and addressed to can
           | be small. The size of the audience who got to reading it
           | anyway may be _vast_. (Maybe I am reading your paper because
           | someone cited a method paper that in lieu of a proof or
           | explanation writes just two words and citation to your paper.
           | Maybe I am a freshly minted new student reading it for my
           | first seminar. Maybe I am from a neighboring field and trying
           | to understand what is happening in yours. Maybe I tried to
           | find what people have already done with particular idea I
           | just had and search engine gave your paper. And so on.)
           | 
           | During my (admittedly lackluster) academic career I recall
           | spending much more time trying to read and understand papers
           | that were not addressed to me than papers that were and where
           | I enjoyed the succinct style that avoids details and presents
           | the results. (Maybe it is just an idiosyncratic trust issue
           | on my part, because I am often skeptical of stated results
           | and their interpretation, finding the methods more
           | interesting). But that is not all.
           | 
           | I also noticed that genuine misunderstandings coming from
           | "brief" communication of technical "details" were quite
           | common; two different researchers would state they "applied
           | method X to avoid Y/seek Z[citation]" in almost exactly the
           | same words, where X, Y and Z were complicated technical
           | terms, yet the authors would hold quite different opinions
           | about what those words meant, what the intended reading was,
           | and how and why X should be implemented.
           | 
           | In conclusion, I think many a scientific field would benefit
           | from a style where authors were expected to clearly explain
           | what they did and why (as clearly as possible).
        
           | lofatdairy wrote:
           | I agree with everything you say. Papers really are a bit too
           | hard to read sometimes, but I'd argue it's often not because
           | of an overly technical tone so much as writers cutting out a
           | lot of background material for brevity and assumed
           | familiarity.
           | 
           | >What should be encouraged is for academics to blog about
           | their research as well. It would even help when recruiting
           | and onboarding new members. Right now the sociological and
           | economical incentives don't promote this at all.
           | 
           | I will add onto this that a lot of journals have been pushing
           | for video abstracts and "plain English" abstracts. For the
           | most part I don't see these too often but when they're there
           | they're appreciated, and I vaguely recall that someone found
           | that citations go up when they're used (specifically plain
           | English, I don't think anything has been on video abstracts).
           | 
           | There are a lot of good blogs for computational academic
           | subjects (ml, bioinformatics, comp neuro, etc) but I see less
           | for bio and non-software engineering. Math and physics seems
           | to have some really notable blogs, but beyond what gets
           | posted to HN and linked further on those blogs, I can't
           | comment.
        
           | r3trohack3r wrote:
           | There was this sociologist who had written a paper for us all
           | to read ahead of time. I started to read the damn thing, and
           | my eyes were coming out: I couldn't make head nor tail of it!
           | I figured it was because I hadn't read any of the books on
           | the list. I had this uneasy feeling of "I'm not adequate,"
           | until finally I said to myself "I'm gonna stop, and read one
           | sentence slowly so I can figure out what the hell it means."
           | So I stopped-at random-and read the next sentence very
           | carefully. I can't remember it precisely, but it was very
           | close to this: "The individual member of the social community
           | often receives his information via visual, symbolic
           | channels." I went back and forth over it, and translated. You
           | know what it means? "People read." Then I went over the next
           | sentence, and realised that I could translate that one also.
           | Then it became a kind of empty business: "Sometimes people
           | read; sometimes people listen to the radio," and so on, but
           | written in such a fancy way that I couldn't understand it at
           | first, and when I finally deciphered it, there was nothing to
           | it.
           | 
           | -- Feynman
           | 
           | I disagree. After going through quite a few research papers
           | in my time, I've found the best are the ones that are direct
           | and to the point. I've spent hours or days trying to unravel
           | many papers, only to realize the concepts were
           | straightforward, not very novel, and there wasn't much real
           | substance to the paper.
           | 
           | Meanwhile, some of the most impactful papers I've read are
           | direct and to the point: Kademlia, Bitcoin, BitTorrent,
           | DynamoDB, Firecracker, etc.
           | 
           | It seems like, when you have something of substance to say,
           | you say it. When you don't you overcompensate by falling back
           | on building an intricate puzzle of jargon and convoluted
           | equations in an attempt to make what you're saying sound far
           | more important than it really is.
           | 
           | As LLMs get better, I look forward to the day when every
           | journal has a standard LLM filter you're required to apply
           | to your paper that unravels all of this nonsense and
           | rewrites it in a more straightforward way, if not to publish
           | directly, then just for the editors to verify there isn't a
           | simpler way to convey your ideas. I suspect that if we had
           | an ELI5 filter for most journal articles, we'd discover that
           | a majority of the words that get published have very little
           | substance at all.
        
             | dekhn wrote:
             | I hadn't seen that Feynman quote before, but I discovered
             | then when reading Donna Harraway's books (Cyborg Manifesto,
             | Modest_Witness@Second_Millennium.FemaleMan(c)Meets_OncoMous
             | e, Primate Visions).
             | 
             | The criticism was: "Haraway's work has been criticized for
             | being 'methodologically vague'[39] and using noticeably
             | opaque language that is 'sometimes concealing in an
             | apparently deliberate way'"
        
               | coldtea wrote:
               | > _Haraway's work has been criticized for being
               | "methodologically vague"[39] and using noticeably opaque
               | language that is "sometimes concealing in an apparently
               | deliberate way_
               | 
               | So you're saying that "Her work is basically handwaving
               | and bullshitting".
        
               | dekhn wrote:
               | Yes, but also, wrapping the handwaving and bullshitting
               | in a layer of obfuscation:
               | 
               | "Michel Foucault's biopolitics is a faccid premonition of
               | cyborg politics, a very open field. By the late twentieth
               | century, our time, a mythic time, we are all chimeras,
               | theorized and fabricated hybrids of machine and organism
               | --in short, cyborgs. The cyborg is our ontology; it gives
               | us our politics. The cyborg is a condensed image of both
               | imagination and material reality, the two joined centers
               | structuring any possibility of historical transformation.
               | In the traditions of "Western" science and politics--the
               | tradition of racist, male-dominant capitalism; the
               | tradition of progress; the tradition of the appropriation
               | of nature as resource for the productions of culture; the
               | tradition of reproduction of the self from the reflections
               | of the other--the relation between organism and machine
               | has been a border war"
               | 
               | (donna was woke before woke was a thing)
        
             | lamontcg wrote:
             | > It seems like, when you have something of substance to
             | say, you say it.
             | 
             | And this blog post probably could be condensed into 1/4 of
             | its size or less with a less conversational/bloggy tone.
        
               | coldtea wrote:
               | There are words that are added to drive the point in
               | multiple ways, ease into it, and make the text more
               | engaging.
               | 
               | And there are words that are added to add empty padding,
               | keep up academic pretenses, and appear smart.
               | 
               | The post could have been condensed, but it would lose the
               | former, not the latter.
        
             | cratermoon wrote:
             | I believe Feynman understood that he was oversimplifying,
             | and I believe he was able to do so because his reason for
             | reading the paper was not the same as the reason another
             | sociologist might have. Thus a sentence like, "The
             | individual member of the social community often receives
             | his information via visual, symbolic channels", does, to a
             | non-expert, mean "people read", but to another sociologist
             | or a researcher in related fields, phrases like
             | "individual member", "social community", and "visual,
             | symbolic channels" would be _terms of art_. That means an
             | expert in the field could read "social community" and it
             | would mean, cognitively, an entire set of concepts in the
             | field.
             | 
             | In short, jargon matters. People here can talk about
             | functional, procedural, and object-oriented programming
             | because each of the three words has more than just the
             | dictionary meaning - to those of us in the field. In the
             | same way we can talk about linear algebra and know it
             | doesn't mean "algebra on lines".
             | 
             | Yes, it's _possible_ to write scientifically without jargon
             | and wordiness, but it's a lot of effort and takes much
             | more space to say "a group who follow a social structure
             | within a society (culture, norms, values, status). They may
             | work together to organise social life within a particular
             | place, or they may be bound by a sense of belonging
             | sustained across time and space"[1]
             | 
             | 1 https://othersociologist.com/2013/11/20/sociology-of-
             | communi...
        
               | PoignardAzur wrote:
               | Well, maybe, but you can rationalize arbitrary amounts of
               | pointless jargon that way.
               | 
               | Besides, in the example Feynman gives, the simple
               | sentence is actually _shorter_. Maybe that shorter
               | sentence loses some information that the jargon carried,
               | but Occam's razor suggests the writer was just trying to
               | sound smarter.
        
             | Vervious wrote:
             | Systems research papers do not represent all research
             | papers out there, not even in computer science.
             | 
             | In cryptography, certainly a paper with formal definitions
             | and proofs can be much more valuable than a corresponding
             | blog post. It's a field where formalism is desired, if not
             | necessary. Otherwise you can't check other people's
             | "proofs", or even know what model you're working in.
             | 
             | I think, since people haven't come up with better
             | formalisms, sometimes it's quite obtuse, which gets
             | mistaken for "academic writing", when really it's a best
             | effort to formalize.
        
               | renonce wrote:
               | Requiring formalism does not preclude attaching an
               | informal but intuitive description of the formal
               | definition or proof. Unless the authors don't understand
               | very clearly what they are talking about, or they want to
               | prevent others from understanding their concepts too
               | easily, I don't see why there is a reason for the authors
               | not to attach an ELI5 in addition to formalism.
        
               | Vervious wrote:
               | Sure. But it's an ELI5 "in addition to formalism", not
               | "in lieu of formalism". In theory conferences like STOC
               | or FOCS, the first section of the paper often comprises
               | such an overview.
               | 
               | Certainly some papers are better written than others. But
               | sometimes a blog post cannot replace a paper, unless it
               | also goes into the depth and detail that formalism
               | requires. (Then it becomes a 30 page blog post, where
               | most people don't read past the intro.)
        
               | acchow wrote:
               | The complaint about research papers is that almost all of
               | them omit the ELI5 and provide _only_ the formalism.
               | 
               | You can have both and weave them together into a
               | digestible narrative. I see Physics textbooks sometimes
               | written this way.
        
               | smallnamespace wrote:
               | Papers are mostly read by other researchers, where the
               | added background is actively bad because it obscures the
               | real meat of the paper to the main audience.
               | 
               | If you just wanted a digestible intro then you would
               | usually buy a textbook.
               | 
               | I think the argument that _every_ research paper ought
               | to be a mashup of a textbook + the actual research is a
               | bit silly from a "people should specialize at what
               | they're good at" standpoint.
               | 
               | Put in another context, I also don't want every recipe to
               | reintroduce what it means to "fry" or "braise" or
               | "marinate". We have Google for that.
        
               | [deleted]
        
           | mlsu wrote:
           | Well, I'm not so sure. It seems to me that someone could
           | perfectly well devise an experiment based on this (another
           | poster chastised me for saying paper, so) blog post.
           | 
           | Equations are perfectly clear. I was able to follow his
           | reasoning perfectly well.
           | 
           | I cannot say the same for so many papers (tm) that I've read.
           | Mostly in a similarly computational (though non-
           | deeplearning) applied math domain.
        
         | [deleted]
        
         | pessimizer wrote:
         | > The tone is self-effacing, it does not have an "ego" the way
         | scientific papers tend to have.
         | 
         | I can't imagine judging scientific papers based on whether the
         | author might be looking down on me, or thinks he knows better
         | than me.
         | 
         | > if we were "allowed" to cite research that reads like this
         | 
         | Maybe you're looking down on _yourself?_ You can cite anything
         | you want to cite.
        
           | caddemon wrote:
           | Well if you yourself are trying to publish in a scientific
           | venue you can't always cite exactly what you want to cite.
           | Though it's probably uncommon for a peer reviewer to ask for
           | a specific citation to be removed, the review process
           | absolutely does affect the references list, and expectations
           | about this process affect it doubly so.
        
         | baby wrote:
         | There isn't much difference between a blog and a whitepaper,
         | beyond the fact that people tend to write blogs more casually
         | and whitepapers more seriously (and some academics will only
         | accept things that look more serious).
         | 
         | But a good writer can write great articles in whatever format
         | they wish.
        
         | nico wrote:
         | It would be amazing if academia started replacing papers with
         | videos + code
         | 
         | I want to see: an explainer of the
         | science/ideas/experiments/hypotheses
         | 
         | And instructions on how to reproduce the experiments/results
         | 
         | Some YouTubers are going in this direction
        
           | janalsncm wrote:
           | +1 to including code with your paper. It improves
           | reproducibility and transparency. There's even a well-known
           | website dedicated to this purpose.
           | 
           | For the rest of it I don't care. As long as researchers
           | understand what's going on, that's what matters.
        
       | sebzim4500 wrote:
       | Don't transformers typically have a <bot> token at the beginning
       | of the prompt? This seems equivalent to letting the network
       | attend to this token, and produce a zero value if that's what it
       | wants.
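
       A quick numerical check of that equivalence (a toy sketch in
       PyTorch; it assumes the extra slot's attention logit is fixed at 0
       and its value vector is all zeros, whereas a real start token has a
       learned key and value, so this only shows the arithmetic lines up):

           import torch
           import torch.nn.functional as F

           torch.manual_seed(0)
           T, d = 4, 8
           scores = torch.randn(T, T)   # attention logits for one head
           v = torch.randn(T, d)        # value vectors

           # (a) "quiet" softmax: an implicit extra logit of 0 that only
           #     appears in the denominator
           e = torch.exp(scores)
           out_a = (e / (1 + e.sum(-1, keepdim=True))) @ v

           # (b) ordinary softmax over an explicit extra slot whose logit
           #     is 0 and whose value vector is all zeros
           scores_b = torch.cat([scores, torch.zeros(T, 1)], dim=-1)
           v_b = torch.cat([v, torch.zeros(1, d)], dim=0)
           out_b = F.softmax(scores_b, dim=-1) @ v_b

           print(torch.allclose(out_a, out_b, atol=1e-6))   # True

       In other words, the proposal bakes in a slot with logit 0 and value
       0, which the network otherwise would have to learn to emulate with
       a dedicated token.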
        
         | refulgentis wrote:
         | not a token, and not the transformers, but yes, commercial chat
         | models are fine-tuned on text transcripts containing dialogues.
         | (i believe llama-2 was as well)
        
           | sebzim4500 wrote:
           | Are you sure? I have never seen an LLM that did not have a
           | special token for start of text, I'm certain that llama had
           | one and I don't remember anywhere in the llama-2 paper where
           | they said they removed it.
        
             | refulgentis wrote:
             | tl;dr: you're right
             | 
             | it's messy though, bear with me for the full explanation:
             | 
             | - your initial post says "<bot>" token, which looked like a
             | mix of "chatbot" and ChatML, used by OpenAI
             | 
             | - there is a BOS token, which acts as you described
             | 
             | - I averaged my attention over your post and the initial
             | reply, which answers as if you were using "<bot>" in the
             | misunderstood way
             | 
             | - when I go back and read your post, I realize the chatbot
             | interpretation doesn't quite make sense, since you're
             | referring to much more technical aspects than general "how
             | do I AI", i.e. you understand <X> as a way to denote
             | special tokens, not necessarily an XML tag
        
         | sp332 wrote:
         | Chat-tuned ones do, but the base models don't. For example,
         | Llama doesn't, but Alpaca has "### Instruction:", "### Input:",
         | and "### Response:".
        
           | int_19h wrote:
           | Base LLaMA still has dedicated tokens for beginning/end of
           | string. What you're describing is the instruction format,
           | which is separate.
        
             | sp332 wrote:
             | Oh, I had misunderstood something.
        
       | gwern wrote:
       | This reminds me of the normalization bug in StyleGAN. It had this
       | obvious visual artifact of a 'blob' which would appear in
       | otherwise photorealistic images, which was puzzling because it
       | was _so_ obvious: how did the Discriminator not squash it? It
       | turned out to be a flaw in the normalization of the AdaIN style
       | layers, IIRC, where the Generator was pumping up numbers and
       | doing weird things to force through information.
        
       | firebirdn99 wrote:
       | This is right below the "Have Attention Spans Been Declining? -
       | Yes, 65%" post, lol brilliant. In general, human decreasing, AI
       | increasing- attention.
        
         | gwern wrote:
         | "In this post, I prove that attention spans have actually
         | declined by 64%, contrary to widely-publicized reports of
         | 65%..."
        
         | neilv wrote:
         | For posterity:
         | 
         |   1. Have attention spans been declining? (slimemoldtimemold.com)
         |      338 points by janandonly 4 hours ago | flag | hide | 254 comments
         |   2. Attention Is Off By One (evanmiller.org)
         |      400 points by elbasti 4 hours ago | flag | hide | 129 comments
         | 
         | Note that the #1 post is probably there because the title
         | earlier had the provocative "Yes, 65%" appended to it. So even
         | more numerical.
        
       ___________________________________________________________________
       (page generated 2023-07-24 23:00 UTC)