[HN Gopher] Attention Is Off By One
___________________________________________________________________
Attention Is Off By One
Author : elbasti
Score : 458 points
Date : 2023-07-24 17:33 UTC (5 hours ago)
(HTM) web link (www.evanmiller.org)
(TXT) w3m dump (www.evanmiller.org)
| janalsncm wrote:
| I follow the argument but the proof of the pudding is in the
| eating. I don't know what "battles" the author lost to PyTorch
| lately but a good test would be to modify one of the smaller
| models (maybe nanogpt) and swap out all of the softmax calls for
| his quiet softmax.
|
| I didn't see anything relevant on alternatives to softmax, since
| TFA is specifically questioning softmax in a multihead attention
| context.
|
| Ultimately, neural networks are arbitrary function approximators.
| It doesn't necessarily have to be "right" internally to fit the
| data. But if this new softmax allows transformers to learn more,
| that's great.
| LoganDark wrote:
| > a good test would be to modify one of the smaller models
| (maybe nanogpt) and swap out all of the softmax calls for his
| quiet softmax.
|
| You'd have to train the model with the quiet softmax before
| inferencing with it would work.
| phkahler wrote:
| A couple thoughts. 1) An alternative might be to have an extra
| NULL output where the attention can be diverted. This might be
| what existing models are using commas for, but make it explicit.
| 2) What he proposes has a similar effect on the other weights
| without explicitly having the NULL present. In this light it
| should work, but does it have the advantage he thinks?
| leecarraher wrote:
| The author says to add a unity vector to the context, I presume
| of each layer, so as not to mess with gradient calculations. But
| most modern DL frameworks compute the gradient for you (I know
| this is true for JAX and PyTorch). Is it maybe that a hand-coded
| gradient for a well-known DL architecture like the transformer
| is faster than letting the framework autodiff it?
|
| Otherwise, I fear some of the 'magic' of transformer networks is
| that this amplification effect allows them to encode/memorize
| some results verbatim, and we are often seeing a heavily tuned
| internet regurgitator. Similar to the rise of RNNs with
| attention, which supposedly allowed them to focus on some things
| and ignore others but really was often just overfitting, and
| yielded more interesting results with the overfitting than
| without.
| bearzoo wrote:
| I think the author is talking about having to fix the extra
| vector in V to be zeros and making sure to not compute/apply
| gradients to it
| seydor wrote:
| Can I please ask why lim_{x -> -inf} softmax(x) = 1/k?
| hawkice wrote:
| It's splitting the probability across all x_i equally, even
| when they're all massively negatively weighted.
|
| This change would have all the probability going into the null
| option, in that case, basically.
| xyproto wrote:
| > I'm thinking those numbers will make for a handsome table in a
| soon-to-be influential arXiV paper, either when those Qualcomm AI
| researchers step off the plane from Italy, or someone in an LLM
| hacker channel figures out biblatex, whichever happens first.
|
| :D
| sunleash wrote:
| I don't see any results, it'd be more impactful and convincing if
| there were numbers supplementing the theory. It's not that hard
| to finetune existing LM on a small data and verify that it works.
|
| I am, however, of a similar opinion that there could be better
| attention formulations. A paper from 2020
| https://arxiv.org/abs/2005.09561 helped a lot in one of the
| transformer models I trained (not a vanilla LM but a specialised
| multi-modal graph problem).
|
| It proposes normalised attention which, if I'm not wrong, should
| help with the quantisation problem too.
| blueblimp wrote:
| The proposed replacement definitely makes more sense (and I've
| always found the absence of a "failed query" to be puzzling in
| standard attention), but, in deep learning, things that make more
| sense don't always actually get better results. So I'm curious
| whether this has been tried and carefully evaluated.
| jadbox wrote:
| It would be an amusing find if "Black Swan mega-activations"
| actually, if unintentionally, made the model smarter...
| keskival wrote:
| I thought everyone knew that softmax (and specifically exp
| functions in it) are poison. I have always worked around them,
| for example by using large epsilons (approaching one actually),
| and using low-order polynomial approximations for the exp
| functions.
|
| I thought everyone does that, because you don't need to work long
| with these models to get NaNs, and when you check why you see
| it's because of the exp functions. Then you fix it. Apparently
| people don't.
|
| It's not like the neural models care if you approximate
| functions. They couldn't care less actually.
| sebzim4500 wrote:
| It is pretty easy to avoid NaNs when working with softmax, you
| certainly don't need any epsilons. Just subtract the largest
| value from everything, and you will have no rounding problems
| or catastrophic cancellation.
|
| Clearly softmax is not too bad, if it is used extensively in
| all the most powerful models.
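|
| For anyone following along, a quick sketch of that max-subtraction
| trick (plain NumPy, nothing beyond what's described above):
|
|     import numpy as np
|
|     def stable_softmax(x):
|         # subtracting the max changes nothing mathematically,
|         # but keeps exp() from overflowing on large logits
|         z = x - np.max(x)
|         e = np.exp(z)
|         return e / e.sum()
|
|     # fine even for logits whose exp() would overflow on its own
|     print(stable_softmax(np.array([1000.0, 1001.0, 1002.0])))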
| jrochkind1 wrote:
| While not about AI or the algorithm mentioned, on the subject of
| little errors that you can't convince anyone are errors....
|
| In 2011, I wanted to copy the reddit ranking algorithm in a
| project of my own, so I went to source code to look at it... the
| algorithm in the source code I found wasn't doing anything at all
| sensible with negative-sum voted posts.
|
| I thought I discovered the error, some terms swapped in the
| simple equation, the sign for positive/negative was misapplied.
|
| I blogged it, and [posted it to reddit](https://www.reddit.com/r/
| programming/comments/td4tz/reddits_...), only to have MANY
| people, including reddit employees, tell me I am definitely
| definitely wrong, and the algorithm was working as intended. And
| that I was in fact not the first to notice what I thought I
| noticed, and point it out, and be told by everyone I was wrong.
|
| OK, I didn't really understand what was going on, I couldn't make
| sense of the algorithm if it wasn't wrong, but so be it. I
| updated my blog post to say that people smarter than me said
| there was no error in the reddit algorithm, all I can say is this
| variation makes more sense to me.
|
| Then, three years later in 2014, a commit was made to the reddit
| source code with _exactly the correction I (and others before me)
| had suggested_ all along. The one that everyone piled on to tell
| me how dare I have the temerity to suggest reddit source code is
| wrong.
|
| https://github.com/reddit-archive/reddit/commit/50d35de04b92...
|
| ¯\_(ツ)_/¯
|
| Open source means there are lots of eyes that can find bugs, but
| sometimes they can't convince anyone they've found a bug. (And of
| course, then reddit close-sourced their code in 2017).
|
| I never did end up using the ranking feature in my own project,
| the one I had wanted to copy from reddit. I didn't end up adding
| "vote" features to the app.
| refulgentis wrote:
| I work at a FAANG and it was absolutely astonishing to find out
| how often this happens.
|
| You can make a long, impactful career by just being "the guy
| who adds log statements throughout the codebase and reasons
| through it", doing this at even a simplistic level has always
| shown me an astonishing fix to some long-standing issue.
|
| n.b. It also attracts a ton of political fun. People's first
| order reaction is denial, and it only gets worse from there.
| Absolutely no one except 1-2 colleagues will see it as "oh we
| should fix that", and at least one person will make sure your
| boss' boss' boss is CCd on an email with a nice version of "no
| he's just insufficiently concerned about {concurrency, memory
| management, take your pick}" Just wait it out quietly when that
| happens, do not engage or complain. If nothing happens and
| you're never asked about it by leadership, but your peers ask,
| make plans to move onto another team.
| jrochkind1 wrote:
| A long impactful career, or a career of horrible frustration
| and alienation as everyone gets mad at you for pointing out
| their bugs? (or, from their point of view, making trouble
| insisting that something is a bug which isn't and is causing
| no problems)
| sidfthec wrote:
| What FAANG have you seen this at?
|
| I've been at big tech companies for most of my career and
| I've never seen anyone deny the existence of a technical bug.
| I've seen plenty of teams mark a bug as lower priority and
| never fix it because other things are higher priority. But
| _denying that the bug exists_, especially after a detailed
| explanation? That doesn't resonate with my experiences.
| com2kid wrote:
| I've told this story before!
|
| It used to be that writing the outputs from the C/C++
| preprocessor (.i files) to disk took _forever_ (5+ minutes
| IIRC) with Microsoft's compilers. I asked one of the lead
| compiler developers why, and he waved me away saying it was
| just really complicated. Around that time a bunch of tools
| existed for GCC that worked with .i files, but none existed
| in the Microsoft ecosystem likely because writing .i files
| was so slow.
|
| I was on the compiler test team at the time and we did lots
| of stuff with .i files, our tests were distributed across a
| large cluster of test machines (see my post about that
| https://meanderingthoughts.hashnode.dev/how-microsoft-
| tested...) so it wasn't a big deal, but it still annoyed
| me.
|
| One day I decided to find out what was going on, so I
| loaded up process monitor while outputting a .i file and
| watched what was happening. Much to my surprise, only 1
| byte was being written at a time! No wonder writes were
| taking forever.
|
| A quick dive into the source code revealed a comment above
| the file write call that read to the effect
|
| // to work around a bug in windows 98
|
| So anyway I opened a bug against the compiler saying we
| should probably fix that. :)
| sidfthec wrote:
| But that's not the type of story that's being claimed
| from the person I responded to.
|
| Of course the lead developer waved you off. You wondered
| why things took forever, and the lead developer knew it
| was a complicated system and figured it wasn't worth
| their time investigating. It happened to be incorrect,
| but the lead developer wasn't in denial. They just
| filtered the issue out because they can't afford to go
| down every rabbit-hole they come across. I'm sure once
| you found the actual bug, it was later fixed.
|
| The person I was responding to seems to think a large
| number of people are in denial when a bug is filed
| against them. That doesn't make sense, and isn't
| something I see. It'd be as if when you pointed out the
| actual bug, the lead developer continued to say it wasn't
| actually a bug (which is of course ridiculous and I bet
| didn't happen).
| madrox wrote:
| When I was an intern at Yahoo working on OAuth back in 2008
| (2007? It was long ago and I'm old) I had the pleasure of
| implementing an internal tool for generating OAuth 1.0 URLs,
| which meant encoding a lot of things in query parameters. My
| tool did not generate URLs which were compatible with Yahoo's
| implementation (certain parameters effectively should be
| encoded twice, which my tool did). The implementing engineer
| insisted my tool was wrong, cited my status as a lowly intern,
| and even pulled out the OAuth spec and bent over backwards to
| say how his implementation was correct and I'm clearly reading
| it wrong. It literally took bringing in Eran Hammer-Lahav to
| weigh in on the topic to say I was correct, at which point the
| engineer agreed that of course that was correct. I got zero
| acknowledgment or apology for the days of ad hominem attacks
| against me.
|
| I did learn an important lesson that more senior people are not
| always right, and as someone who's usually more senior than my
| colleagues now I try to remember it daily.
| ersiees wrote:
| This trick "they found" is part of the standard torch
| implementation of multi-head attention, namely the add_zero_attn
| flag. They add a zero to the logits, resulting in a one in the
| denominator since e^0 = 1:
| https://pytorch.org/docs/stable/generated/torch.nn.Multihead...
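|
| For reference, a toy usage sketch of that flag (the module and the
| add_zero_attn argument are real; the sizes here are made up):
|
|     import torch
|     import torch.nn as nn
|
|     # add_zero_attn=True appends an all-zero key/value slot, which
|     # contributes e^0 = 1 to the softmax denominator -- the same
|     # "attend to nothing" escape hatch discussed in the article.
|     mha = nn.MultiheadAttention(embed_dim=64, num_heads=4,
|                                 add_zero_attn=True, batch_first=True)
|
|     x = torch.randn(2, 10, 64)  # (batch, seq, embed)
|     out, weights = mha(x, x, x)
|     # the zero slot shows up as an extra column in the returned
|     # weights, so the weights over real tokens can sum to < 1
|     print(out.shape, weights.shape)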
| civilized wrote:
| It's an option which is set to false by default. Does that mean
| people have tried it and it's not usually helpful...?
| mlyle wrote:
| Yes.
| bertil wrote:
| If you hesitate to read it, let me say that the post denounces
| "kurtotic barbarities." If that expression alone doesn't convince
| you to read it, you might not be in the intended audience.
| simbolit wrote:
| I don't understand this well enough to say if it is correct, but
| I do understand it well enough to say it is important if correct.
| gremlinsinc wrote:
| I don't know half of you half as well as I should like; and I
| like less than half of you half as well as you deserve
| simbolit wrote:
| I don't understand your comment, but I gather that you and
| others didn't like mine, which is noted, I will try to do
| better.
| the_af wrote:
| The comment you're replying to is made by Bilbo Baggins
| during his birthday, near the beginning of "The Lord of the
| Rings".
|
| As to what the commenter above meant I can only guess, but
| it should be noted that Bilbo's audience reacts with
| puzzlement, unable to parse his words.
| make3 wrote:
| Please, put empirical numbers with proposals like these.
| Transformers have had a billion "improvements" suggested to them
| through the years.
| gerdusvz wrote:
| now if only we could teach humans to also not annotate when they
| have nothing to add
| lscharen wrote:
| This is similar to the (old) trick of adding a Uniform
| distribution component to a Mixture of Gaussians model. It
| doesn't really change the math wrt parameter optimization and
| probability evaluation, but provides a place to capture
| "background" or "unimportant" data points and improve the model
| robustness to outliers.
|
| The motivation follows from the same problem the author points
| out in the original softmax formulation that it always "forces a
| choice" when it may be more useful to put a "Not Applicable"
| option into the model itself.
|
| https://link.springer.com/article/10.1007/s10260-021-00578-2
| tylerneylon wrote:
| 1. Summary
|
| The author is suggesting that we add 1 to the denominator of the
| softmax that is used within attention mechanisms (not the final
| output softmax).
|
| The softmax inside an attention unit allows it to see key/query
| matches as probabilities; those probabilities support a
| continuous-valued version of a key-value lookup (instead of 1/0
| output of a lookup, we get weights where a high weight = the
| desired key-value lookup).
|
| Adding 1 to the denominator would change an attention unit by no
| longer working with a true probability vector of weights, but
| rather working with weights that add up to less than 1. The
| motivation is that the network can learn to provide high weights
| so that the adjusted softmax is very close to a probability
| vector; and it has a new option to provide all-low weights which
| give all-low output weights, meaning it can opt out of having
| high confidence in anything.
|
| (switching to opinion mode)
|
| 2. How can we tell if this is good?
|
| 2a. We should just try it out: Train an LLM with this, see if it
| works.
|
| 2b. There are two reasons I suspect it won't make a big
| difference.
|
| First, if an attention node has low confidence, it can already
| assign similar scores pre-softmax. Then we get what looks like a
| uniform distribution as output. Then we're basically taking an
| average of a bunch of vectors (vs a weighted average that is more
| like choosing one of them). Statistically, we expect that
| averaged vector to be close to zero. In other words, the node
| already has a way to effectively opt-out by providing a near-zero
| output vector.
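|
| A quick numerical illustration of that point (random value vectors,
| nothing model-specific):
|
|     import torch
|
|     torch.manual_seed(0)
|     v = torch.randn(512, 64)                  # 512 value vectors, dim 64
|     uniform = torch.full((512,), 1 / 512)     # "low confidence" weights
|     peaked = torch.zeros(512); peaked[0] = 1  # "high confidence" weights
|
|     print((uniform @ v).norm())  # small: averaging washes vectors out
|     print((peaked @ v).norm())   # ~ the norm of one value vector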
|
| Second, in a transformer, each attention unit has many other
| learned weights that can support the ability to opt out. Both the
| V matrix and the feed-forward layer after the attention unit give
| that module a way to provide low values to the activation
| function after the feed-forward layer, which would result in a
| value as small as you like -- again, a way to opt out.
|
| 3. I appreciate the non-academic tone of the article and the
| willingness to play around with fundamental ideas. Although I'm
| not totally convinced by the note, I'd love to read more stuff
| like this.
| sbszllr wrote:
| OP is right in that his change would make the softmax in the
| attention output zero if it "has nothing to add" (QuietAttention,
| as he said).
|
| Buuut, it's missing the forest for the trees. The goal of the
| last step of attention (ref., Fig. 2, left in
| https://arxiv.org/abs/1706.03762) is not to add/say anything (as
| the author is saying) but to compute the relationship between the
| tokens (QK^T) and V -- in layman terms, simplifying, which tokens
| are related to each other. The softmax is there because it gives
| a representation that is nicer to work with: probabilities,
| instead of an unscaled matrix product.
|
| TLDR; author isn't wrong but he isn't right, practically
| speaking, either.
| lostmsu wrote:
| What's wrong with unscaled matrix multiplication? Softmax has
| some kind of intuition in the context, but why not layer norm
| or something else instead (if anything is needed at all)?
| sbszllr wrote:
| The family of sigmoid functions has nice gradient properties
| with theoretical backing. Good starting read:
| https://stats.stackexchange.com/questions/162988/why-
| sigmoid...
| whimsicalism wrote:
| > their existence is contrary to everything we thought we knew
| about neural networks prior to building ones that worked so well
|
| Why is this true? Because their existence implies some sort of
| preferred basis that aligns with the dims of the neural network,
| which is surprising?
|
| It's not obvious why their existence is so contrary to what we
| knew.
| fwlr wrote:
| The author identifies a real problem and poses a simple solution.
| It passes all my crank tests (why did no one come up with this
| before? Because the author is intimately familiar with the
| softmax function from work outside of ML, and plausibly nobody
| who's investigating these issues is remotely as familiar, so
| despite researchers narrowing the issue down to "something to do
| with softmax", they don't have a deep enough understanding of
| softmax to see what's wrong).
|
| If the author is reading any of these comments, though, I would
| urge them to expand on their claim that "I'm 99.44% sure that it
| will resolve the outlier feedback loop". As it stands, that's the
| only explanation we get of how the outliers might be related to
| softmax!
| sedael wrote:
| >why did no one come up with this before
|
| So it turns out someone did. Specifically google did. This
| _exact_ same idea has been in flaxformers since at least
| November 2021.
|
| https://github.com/google/flaxformer/blame/ee62754ebe5a5eeb1...
|
| Specifically to save people a click it says:
|
| > """Softmax function with an additional virtual logit equal to
| zero. For compatibility with some previously
| trained models. This is equivalent to adding one to
| the denominator. In the context of attention, it allows
| you to attend to nothing.
|
| And it creates the _exact_ same modified softmax as this essay. I
| suppose only time will tell why it was ignored publicly before:
| maybe it doesn't do much, maybe it just fell through the
| cracks, maybe Google just didn't push it, who knows.
| toxik wrote:
| Or maybe it doesn't really do anything to improve
| performance.
| littlestymaar wrote:
| > I suppose only time will tell why it was ignored publicly
| before: maybe it doesn't do much, maybe it just fell through
| the cracks, maybe Google just didn't push it, who knows
|
| Maybe quantization wasn't as hot back then as it is now?
| jablongo wrote:
| Yea the benefit is not going to come in terms of
| performance for a given model, but in terms of ability to
| be efficiently quantized.
| Legend2440 wrote:
| Yeah, but it lacks the most important test: results. He hasn't
| actually tried it, he just thinks it will work.
|
| For such a simple change to the softmax it wouldn't take long
| to verify. It's really embarrassing to not do that before
| publishing.
| refulgentis wrote:
| It's not embarrassing at all.
|
| I think there might be some curse of the auto-didact here,
| hinging on the meaning of publish: it would be embarrassing
| if he was capital-P publishing, as in a scientific paper.
|
| The blog goes to great lengths to point out it is _not_
| capital-P publishing.
| furyofantares wrote:
| It's a blog post. And it includes a call for help in testing
| the idea.
| joebiden2 wrote:
| You seem to really disregard the positions of this author.
| They seem to have invested substantial efforts in that
| specific area of research.
|
| To validate the idea the author has, it would be required to
| train a LLM from zero. If the author is right, you would get
| similar results to the current generation of LLMs, but with
| (a lot) less space required for the intermediate layers.
|
| The cost to achieve that is still measured in kilo- to mega-
| dollars, so why is it wrong to put that idea in the open to be
| substantially criticized or adopted?
| Legend2440 wrote:
| You don't need to train a ChatGPT-sized LLM, a toy nanoGPT
| would have been enough. You can train those on a consumer
| GPU in an afternoon.
|
| And yes I do disregard his research effort. There are
| hundreds of well-justified and well-researched "clever
| tricks" for improving Transformers, and almost all of them
| don't work. I'll believe it when I see the results.
| knewter wrote:
| Google used it in flaxformers since 2021 apparently
| renewiltord wrote:
| Do you know of handy testing steps? I suppose I could ask
| ChatGPT, but if someone has a validated "here, this is
| how you do it" I have a 3090 that I can do it on, but I'm
| not keen to debug anything here.
| visarga wrote:
| I interpreted it as cracking a joke about miscalibrated probs
| in softmax, it tends to be 99.9% sure, or 0.1%, but little in-
| between.
| tel wrote:
| > why did no one come up with this before? Because the author
| is intimately familiar with the softmax function from work
| outside of ML, and plausibly nobody who's investigating these
| issues is remotely as familiar
|
| I doubt that is true. Softmax is extremely well understood
| within the ML community. It's a very common trick, these
| properties are well-known as well. It feels very unlikely that
| nobody has thought of this before. That said, it's also
| plausible that the current softmax convention was chosen by
| accident and the author is right to identify this drawback.
| Majromax wrote:
| > why did no one come up with this before?
|
| And because the effects of the problem are subtle. Supposing
| the diagnosis is correct, full-precision LLMs still avoid the
| issue through large attention weights given to meaningless
| tokens to give harmless attention outputs. The problem only
| matters when quantizing weights, and quantized performance
| isn't really the goal of recent cutting-edge LLM development.
| Agingcoder wrote:
| It's interesting. It looks like if you're trying to improve the
| accuracy/perplexity of the model and using fp32 it doesn't make a
| difference, but if you want to quantize it/make it compressible,
| a modified softmax makes a huge difference (this is what I
| understand from the Qualcomm paper). Different goals, different
| findings?
| ks2048 wrote:
| Interesting read. As others have said, it will be much more
| convincing with some experimental numbers.
|
| I'm confused what his goal is though:
|
| I could imagine some theoretical reason to add a 1 there, but he
| starts by saying this can lead to smaller, more compactable
| models. Is he talking about the size of the compressed weights?
| Or pruning to a smaller model? Or being more resistant to
| quantization?
|
| Parts of the essay seemed to throw me off track, because I'm not
| sure if they are relevant at all to the proposal (e.g. the size
| of the initial embedding and how many bits it would take to
| store the vocab size, etc.).
| feoren wrote:
| He says this in the article: if you ever want to jam a multi-
| trillion-parameter model into a phone app or a Raspberry Pi,
| you _must_ quantize. I 've seen some quantization go from
| doubles to bytes (64 bits to 8) per weight, reducing the RAM
| requirement by 8x. A simple quantization (I'm sure there are
| much better ones) is to round everything the nearest 1/255th of
| your number range, then multiply by 255. So your resolution is
| (max-min)/255. You also store the min and max so you can
| reverse it, of course. Say you're trying to quantize these sets
| of numbers:
|
| 1. { -1.4, 0.8, 2.7, 7.3 } : With a range of 8.7, you have a
| resolution of 0.034. This set quantizes to { 0, 64, 120, 255 }.
|
| 2. { -1400, 800, 2700, 7300 } : Resolution 34.1, quantizing to
| the same as the above { 0, 64, 120, 255 }.
|
| 3. { -0.008, -0.001, 0.009, 0.019 } : resolution 0.000106. This
| set quantizes to { 0, 66, 161, 255 }.
|
| 4. { -1.4, 0.8, 2.7, 7329 } : Resolution 28.7. This set
| quantizes to {0, 0, 0, 255 }. Oops -- we can no longer tell
| most of our weights apart.
|
| You can see how this quantization works really well when all
| the numbers are close together, regardless of their absolute
| scale. Major outliers completely mess up the entire system. You
| can make more and more complicated quantization algorithms, but
| those will always come with tradeoffs. The best option would be
| to tame your weights so that they are again close together.
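|
| A tiny sketch of that scheme, using the same numbers (an
| illustrative affine 8-bit quantizer, not any particular library's):
|
|     import numpy as np
|
|     def quantize(x):
|         lo, hi = x.min(), x.max()
|         scale = (hi - lo) / 255.0   # the "resolution" above
|         q = np.round((x - lo) / scale).astype(np.uint8)
|         return q, lo, scale
|
|     def dequantize(q, lo, scale):
|         return q * scale + lo
|
|     print(quantize(np.array([-1.4, 0.8, 2.7, 7.3]))[0])
|     # -> [  0  64 120 255]
|     print(quantize(np.array([-1.4, 0.8, 2.7, 7329.0]))[0])
|     # -> [  0   0   0 255]  (the outlier erased the other weights)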
| atorodius wrote:
| My hot take is that if you don't do the trick, you basically get
| a mean of all vectors in the value matrix if all x are very
| small. Which then probably the next sequence of linear layers
| will be able to interpret the same way as if you do the +1 trick
| and produce a 0?
| naillo wrote:
| Reading this I'm mostly thankful that real brain power and the
| general smart programming community are seriously taking a close
| look at all these things. I barely feel the need to try to
| compete for insight gathering; it finally feels very healthily
| analyzed from every perspective.
| obiefernandez wrote:
| This seems very important if accurate.
| [deleted]
| nborwankar wrote:
| Shouldn't this be called Regularized SoftMax? Adding 1 in the
| denominator looks a lot like a regularization in other ML
| contexts.
| chessgecko wrote:
| I ran an experiment like this and in my setting it didn't help.
| Not saying there may not have been a bug or something, but I
| think attending over the current position sort of solves this
| problem. I.e., when it should not speak, it just emits the
| current pos value.
|
| edit to add details in case anyone is interested
|
| I didn't add one to the softmax denom. I added a learned
| parameter (the attention sink) that would be appended to the
| beginning of QK but would be removed after softmax, so when
| multiplying by V the totals wouldn't sum to one. I tried variants
| that included looking at the current pos and not, and also
| variants that used an FFN to predict the sink per position
| instead of a learned param. In my setting neither
| approach really made much of a difference. But I also had a bunch
| of other weird stuff in there too, so it may be worth trying
| again.
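|
| For concreteness, a rough single-head sketch of that setup (sketch
| only; names made up, masking and the FFN variant omitted):
|
|     import torch
|     import torch.nn.functional as F
|
|     def sink_attention(q, k, v, sink_logit):
|         # q, k, v: (seq, d); sink_logit: learned scalar parameter
|         scores = q @ k.T / q.shape[-1] ** 0.5
|         sink = sink_logit.expand(scores.shape[0], 1)
|         probs = F.softmax(torch.cat([sink, scores], dim=-1), dim=-1)
|         probs = probs[:, 1:]          # drop the sink after softmax
|         return probs @ v              # rows now sum to <= 1
|
|     q = k = v = torch.randn(5, 16)
|     sink = torch.nn.Parameter(torch.zeros(1))
|     out = sink_attention(q, k, v, sink)
|
| With the sink logit fixed at zero this reduces to the
| +1-in-the-denominator trick.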
| abeppu wrote:
| When you say it didn't help, can you clarify what you're
| measuring? In the context of this post, I think both the
| performance on your task and the number of outlier weights (and
| their magnitude) are important.
| chessgecko wrote:
| I was just looking at doing this in pretraining, so I was
| looking at pretraining losses. The difference was within the
| range of usual noise so I didn't keep trying.
| waynecochran wrote:
| The question concerns outliers ... how did the change
| manage them?
| lucidrains wrote:
| this is fixing a different issue, not the one you are
| measuring.
| chessgecko wrote:
| It wasn't really the goal of my experiment to fix this
| issue for sure, I was trying to see if you could improve
| attention by decoupling the key used by a position for
| itself and for future tokens.
|
| Open to being wrong here, but wouldn't it be functionally
| similar to adding a constant to the softmax denom? The
| function could sort of learn a specific position to have the
| sink and q multiply to one, and then removing it before
| multiplying with v would be exactly identical?
| gwern wrote:
| He's advertising it as fixing the spiking outliers. Did your
| variant have those outliers beforehand?
| chessgecko wrote:
| I guess yeah I was mostly responding to
|
| _Now it's possible that softmax should be replaced
| wholesale, but it's worked pretty well for the most part,
| except for this one wee little bug that prevents attention
| heads from saying nothing. So I propose a very small tweak on
| which I am willing to stake all future Internet claims to
| being correct. The tweak is so small, yet so obvious, and
| it's been sitting here under everyone's noses ever since
| attention was invented (2014)._
|
| I didn't test for outliers, but I don't think this will lead
| to a large improvement in attention overall or fix a lurking
| bug.
| zackangelo wrote:
| He's not trying or claiming to improve attention. He's
| trying to reduce outliers to improve the ability to
| quantize the parameters.
| chessgecko wrote:
| He refers all over the blog post to an "error" in
| attention. Specifically he says
|
| _The problem with using softmax is that it forces each
| attention head to make an annotation, even if it has no
| information to add to the output vector. Using softmax to
| choose among discrete alternatives is great; using it for
| optional annotation (i.e. as input into addition) is,
| like, not cool, man._
|
| I'm saying it uses the current position to do this, and
| that if it were a significant error I would expect fixing
| it to improve the training loss. I sort of interpreted the
| blog post as being a bit more positive on the idea than
| just being about improving the quantization.
| [deleted]
| cs702 wrote:
| TL;DR: The author proposes that instead of using the Softmax
| function in each head,
|
|       Softmax(x_i) = exp(x_i) / sum_j(exp(x_j)),
|
| we should instead use what the author calls the Softmax_1
| function,
|
|       Softmax_1(x_i) = exp(x_i) / (1 + sum_j(exp(x_j))),
|
| which would make it possible for each transformer head's
| attention probabilities to be zero, i.e., attend to nothing, by
| computing x_i's with values well below zero.
|
| Giving each transformer head _the ability to ignore all tokens_
| surely can't hurt, but it remains to be seen if it will actually
| improve transformer performance.
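|
| In code the difference is a single extra term in the denominator
| (a quick sketch, not the author's implementation; the max-shift is
| just the usual stability trick):
|
|     import torch
|
|     def softmax(x, dim=-1):
|         e = torch.exp(x - x.max(dim=dim, keepdim=True).values)
|         return e / e.sum(dim=dim, keepdim=True)
|
|     def softmax_1(x, dim=-1):
|         # extra implicit logit of 0, i.e. an extra exp(0) = 1 in the
|         # denominator, shifted along with the rest for stability
|         m = x.max(dim=dim, keepdim=True).values.clamp(min=0)
|         e = torch.exp(x - m)
|         return e / (torch.exp(-m) + e.sum(dim=dim, keepdim=True))
|
|     x = torch.full((4,), -10.0)
|     print(softmax(x).sum())    # 1.0 -- must attend to something
|     print(softmax_1(x).sum())  # ~0.0002 -- can attend to almost nothing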
| rrobukef wrote:
| I also saw the author distinguished internal versus output
| softmax. I think he'd apply his modification only to the internal
| softmax and let the external one force an output.
| cs702 wrote:
| Yes, it makes sense to apply this only to the Softmax we use
| to compute attention. It makes no sense to apply it to the
| output Softmax, which must compute a probability
| _distribution_ over the vocabulary.
| mcbuilder wrote:
| Activation sparsity and packing sparse matrices will surely be
| important, so there is one kind of performance. However the
| other, perplexity, needs a good demonstration. It might require
| a big model, but even a 30B model you can fine-tune nowadays on
| a big cloud GPU box.
| szundi wrote:
| It would be fun if, once in the future, someone finds a bug like
| this, merges the PR, and BAM! Singularity.
| theGnuMe wrote:
| Isn't this why you have the <BOS> token, or the <cls> token?
| mellosouls wrote:
| Unless he gives a good reason why he has not demonstrated his
| claim (e.g. "This effect only presents at a scale beyond my
| resources"), the thesis seems severely weakened by the lack of
| effort to prove it in a toy version.
|
| He just says he doesn't want to spend any more time on it, which
| is unlikely to convince or motivate anybody else that he has
| discovered something important.
| refulgentis wrote:
| It got tons of people really excited.
|
| I don't know what to say past that, but it's worth reflecting
| on.
| jxf wrote:
| The author's use of "kurtotic barbarities" to describe this
| situation is absolutely my new favorite phrase. English is a
| beautiful language in which to express frustrations.
| tsurba wrote:
| In the text they say you need to cram all information needed to
| predict the next token into a single 6KB word embedding, but
| isn't that wrong?
|
| Rather, isn't the autoregressively predicted single next token a
| combination (based on attention) of all the 6KB word embeddings
| in the attention window?
|
| So the size of memory where all information for next token
| prediction needs to be "crammed into" is more like
| window_size*6KB, right?
| Imnimo wrote:
| >The problem with using softmax is that it forces each attention
| head to make an annotation, even if it has no information to add
| to the output vector. Using softmax to choose among discrete
| alternatives is great; using it for optional annotation (i.e. as
| input into addition) is, like, not cool, man. The problem here is
| exacerbated with multi-head attention, as a specialized head is
| more likely to want to "pass" than a general-purpose one. These
| attention heads are needlessly noisy, a deafening democracy where
| abstention is disallowed.
|
| Can't the MLP that processes the concatenated outputs of the
| attention heads handle this? I don't understand why it should be
| critical that a head be allowed to put something close to zero in
| its segment of the concatenated vector if it's immediately going
| to get projected by an MLP anyway.
| marcyb5st wrote:
| But you are wasting some of the model's capacity to learn to
| ignore some of that information. I think it wouldn't hurt.
| However, if I followed the reasoning correctly, I think the
| biggest win is reducing the range of the weights rather than
| improving performance.
|
| > _This is what's been happening in LLMs - for reasons that are
| only partially understood, Transformer models contain these
| outlier weights and are emitting Black Swan mega-activations
| that are much, much, much larger, like orders of magnitude
| larger, than their peers ..._
|
| meaning that once quantized you can either have a finer
| quantization since the range of possible values is smaller or
| you can pick a coarser strategy that saves bits for each
| weight.
| Imnimo wrote:
| Right, I get the goal of removing the outlier activations,
| but I just don't understand why outlier activations are a
| consequence of the model trying to "pass". The story from the
| linked paper earlier in the post
| (https://arxiv.org/pdf/2306.12929.pdf) is that the model is
| doing the following:
|
| -Learn a near-zero representation for some otherwise low-
| importance token, like delimiters or whitespace.
|
| -When a head wants to "pass", emit an outlier activation to
| attend to that token nearly-exclusively.
|
| But I'm surprised the model can't just use its existing tools
| (the post-concat projection layer and the following MLP
| block) to achieve the same thing. And if the answer is that
| it could do that, but tends to learn to use the outlier
| activation trick instead, will giving it a new tool that
| still allows the use of outlier activations be sufficient?
| orasis wrote:
| "the seemingly innocent exponentiator that no one thought capable
| of such kurtotic barbarities."
|
| This writing brought a happy tear to my eye.
| politician wrote:
| This makes sense. One tweak for the press: I think it would be an
| improvement to call it OptionalAttention rather than
| QuietAttention since the goal is to permit an attention head to
| opt-out.
|
| You might attract more, ahem, attention if it was immediately
| apparent from the name only what this attention head does that
| the current one does not. There's also that small matter of
| distinguishing the internal vs output softmax functions.
| alsodumb wrote:
| I might be missing something obvious, but I am not sure why
| everyone in the comments thinks it's a big deal. I've seen this
| trick in practice multiple times.
|
| For example, see this snippet from an old Google repo:
| https://github.com/google/flaxformer/blob/ee62754ebe5a5eeb11...
| alevskaya wrote:
| Yeah we used to use this in our older models years ago... I
| don't recall the details exactly, but I don't think it ever did
| very much.
|
| I certainly don't think it will help at all with stability.
| Things like Q/K layernorm are better tricks for softmax
| stability when scaling: https://arxiv.org/pdf/2302.05442.pdf
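|
| For anyone curious, the Q/K-norm idea is roughly just this (a
| sketch of the idea; the paper's exact setup may differ):
|
|     import torch
|     import torch.nn.functional as F
|
|     def qk_norm_attention(q, k, v):
|         # normalizing queries and keys bounds their dot products,
|         # which keeps the softmax logits from blowing up at scale
|         q = F.layer_norm(q, q.shape[-1:])
|         k = F.layer_norm(k, k.shape[-1:])
|         scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
|         return F.softmax(scores, dim=-1) @ v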
| ggerganov wrote:
| > I don't recall the details exactly, but I don't think it
| ever did very much.
|
| How would you have known if the trick actually reduces the
| outliers in the weights? Even if the transformer quality does
| not improve overall, having fewer outliers as a result is very
| beneficial for more accurate quantization of the data.
| danielmarkbruce wrote:
| Are you asking "why would you have bothered to look at"?
|
| The "how" is pretty straightforward.
| PartiallyTyped wrote:
| The argument / reasoning is a bit dubious.
|
| Technically softmax is not implemented as presented but through
| exp(x_i-max(x)), and summing over it in the denom. But maybe I
| am missing something.
|
| Furthermore, the residuals are used exactly because the
| networks can't learn the identity function, but they can learn
| zero, at which point the residual is `f(x) = x + g(x)` with
| `g(x) ~= 0` (i.e. approximately 0).
|
| It is also the case that `f(x) = x + g(x)` makes it easier for
| gradients to flow through.
| Piezoid wrote:
| Implementations usually replace the 1 in the
| denominator with exp(-max(x)) for this reason.
| mrfox321 wrote:
| You are misreading things.
|
| Regardless of numerical stability tricks (e.g. exp(x_i-
| max(x))), you are still simply normalizing the logits such
| that the probabilities sum to 1.
|
| The blog adds an additional hidden logit (equal to 0) to
| allow for softmax(x) = 0 when x -> -inf.
| PartiallyTyped wrote:
| How can `x -> -inf` occur in the first place when nearly
| everything is within [-2, 2] and we're doing a dot product,
| plus there's normalization before that too?
| zorgmonkey wrote:
| If popular models are still making this mistake then it still
| seems noteworthy and making a blog post or paper to increase
| awareness definitely seems worthwhile. Also multiple
| independent discovery of good ideas is quite common.
| jmount wrote:
| The "missing 1" is a waste-category that is implicitly re-scaled.
|
| The explicit 1 formulation is used in binary softmax, and the
| implicit (not seen 1) is used in multinomial softmax. I suspect
| this is the old "notation B looks silly in terms of notation A's
| standards."
| mlsu wrote:
| I don't really understand the subject matter enough, so I
| apologize in advance for the meta-comment...
|
| The author mentions that he would maybe have written this as a
| scientific paper:
|
| > I tried writing a serious-looking research paper about the bug
| and my proposed fix, but I lost a series of pitched battles
| against Pytorch and biblatex, so I figured I'd just write a blog
| post instead. (History is written by the winners; blogs are
| written by...)
|
| Honestly, thank god he didn't. This paper is so much more
| readable and approachable than what gets published in "serious"
| journals. The tone is self-effacing, it does not have an "ego"
| the way scientific papers tend to have. If all science read like
| this, and if we were "allowed" to cite research that reads like
| this, I think we would be much better off. This reads like a
| conversational, approachable textbook, not like an impenetrable
| wall.
|
| Is it because I don't understand attention at a PhD level that I
| hold this opinion? Maybe. Could he be writing like this because
| he's a layman and utterly wrong about the topic, unlike those
| Serious Science Authors? Maybe, I don't know.
|
| But my god, wouldn't it be nice to be allowed to write like this?
| Waterluvian wrote:
| To finish the author's analogy:
|
| Blog posts are written by those who arrive first.
|
| In a weird way my mental model is: blog posts are the recon
| team discovering a new idea. They might have errors. They might
| be incomplete. Maybe they're outright wrong. Stakes are lower
| as it took less effort to get there and less loss if a position
| is abandoned.
|
| Then papers are authored, often much later, and they're the
| regulars coming in to fortify a newly captured idea. They
| provide (or at least are supposed to) rigor to the idea. A
| fortification of a position that we decide is worth holding.
|
| Yeah, this analogy is probably sloppy. But in my brain there's
| an eternal conflict against ignorance as we keep advancing into
| the unknown.
| chessgecko wrote:
| I think maybe it's because he didn't have experimental results
| that show that it worked. Not a knock against the author, there
| are just so many things that seem like good ideas that don't
| end up working well in practice, a paper like this without
| results is hard to value.
| mlsu wrote:
| Yes, definitely. If he tried to have it published, the lack
| of experimental results would definitely be a glaring error.
|
| But this is still scientific communication. It's really nice
| that it's legible!
|
| > Even though softmax1 is facially quite boring, I'm 99.44%
| sure that it will resolve the outlier feedback loop that's
| making quantization the subject of cascades of research. If
| you want to run some experiments and prove me right, DM me on
| Twitter and we'll get a paper going.
|
| I'm guessing that in the stodgy world of science, a
| communication like this might happen over lunch at a
| conference, limited to a small clique of researchers who are
| zealously guarding their next paper. Who could blame them,
| publish or perish!
|
| But someone will probably test this theory out (after my
| read, it will probably happen in llama.cpp with preliminary
| results on GPT-2 by next week) and achieve results, and it
| will happen quickly and legibly to the outside world, because
| this was published openly and without all of the pretension
| that formal science (tm) has. If it works, it works. Stuff
| like this is the soul of the internet. Sharing knowledge and
| making it legible for all.
| [deleted]
| WithinReason wrote:
| Then again, if you don't have access to giant compute
| clusters you can't test this, so it's either a blog post or
| nothing. I believe the outlier problem that this solves only
| appears for very large models.
| janalsncm wrote:
| That isn't true at all. Train a smaller model on a smaller
| dataset. You can even train on your laptop. It's definitely
| feasible. This is just a proof of concept, it doesn't need
| to beat state of the art.
| WithinReason wrote:
| Maybe I edited my comment too late.
| janalsncm wrote:
| > I believe the outlier problem that this solves only
| appears for very large models.
|
| Any reason to believe this? The author never mentioned
| it, and I can't think of any other _a priori_ reason why
| it should be true.
| WithinReason wrote:
| See figure 1:
|
| https://arxiv.org/pdf/2208.07339.pdf
|
| Outliers appear at model size 6.7B and are not present at
| 2.7B
| janalsncm wrote:
| Sure, emergent properties can arise as parameters
| increase. Everyone knows that. That's a much less
| specific claim than to say that the benefit of modifying
| softmax can only arise as an emergent property after N
| parameters, and therefore the benefit can only be
| evaluated on models above a certain size. To my
| understanding the author of TFA isn't suggesting the same
| issue as the one in your linked paper.
| WithinReason wrote:
| The second heading in the TFA is "It's All About
| Outliers"
| PoignardAzur wrote:
| 6.7B isn't "needs a datacenter" scale.
| WithinReason wrote:
| It's in the million-dollar range. XLNet, which is a 1.3B
| model, cost $245,000 to train, for example.
| Legend2440 wrote:
| Counterargument: this blogpost is worthless. You get all the
| way to the end and then find out he hasn't actually tried it,
| not even on a toy model. It's just a neat idea he thinks will
| work.
| ambrozk wrote:
| Why would that make it worthless?
| PoignardAzur wrote:
| Among other reasons, because the decoder-only version of
| the original transformer architecture has proven _weirdly_
| resistant to these kinds of hacks and clever optimizations.
|
| Ideas like sparse attention, tree attention, residual
| attention, etc, all sound good on paper, but when
| researchers try to reproduce them they either find no
| results or results that don't scale. Even AliBi is turning
| out to be less powerful than scaled-down positional
| embeddings. It's almost a bitter lesson on its own: you
| can't beat the original transformer.
|
| Optimizations that _do_ stick around tend to be the ones
| that preserve the original algorithm but help with caching
| or memory accesses.
| 6gvONxR4sf7o wrote:
| Because there are a thousand ideas a minute in this field
| that meet the "it's worth trying" bar but don't actually
| pan out to make any difference. It's the equivalent of a
| blogpost that says "if someone else turned my idea into a
| business, it would be a billion dollar business. But I
| won't bother."
| Legend2440 wrote:
| Because until he tries it, who knows if it works?
|
| There are a thousand papers out there making minor tweaks
| to the transformer architecture. 99% of them are also
| worthless and forgotten.
| debugnik wrote:
| > Because until he tries it, who knows if it works?
|
| That's precisely what he shared this for, though. So
| someone willing to train a model with this tweak tries
| it.
| [deleted]
| janalsncm wrote:
| I wouldn't quite say its value is zero. It's worth something,
| but a lot less than if it had been shown to work empirically.
|
| Explainers and their folksy, imprecise tone are good for
| things we already know are true. I'm skeptical on things
| which are unproven.
| [deleted]
| Method-X wrote:
| I can see AI being used to make scientific papers more
| approachable like this.
| TigeriusKirk wrote:
| Are most AI papers even published beyond arxiv anyway?
| Der_Einzige wrote:
| This is why folks like gwern have their own research published
| this way, i.e. his analysis of GPT-3: https://gwern.net/gpt-3
|
| We call him an "independent AI researcher" because his google
| scholar is "bland" compared to many academics who play the
| academia game -
| https://scholar.google.com/citations?user=yk1QMowAAAAJ&hl=en
| _Microft wrote:
| > This paper
|
| It's not a paper. It's an idea that sounds plausible, presented
| in a highly entertaining form.
| doliveira wrote:
| Nah, scientific papers are supposed to be precise and
| technical. This reads like those quite frequent suggestions
| here of switching all equations in papers to plain English or
| code: it honestly comes from a place of ignorance, and I say
| that as basically a layman myself.
|
| What should be encouraged is for academics to blog about their
| research as well. It would even help when recruiting and
| onboarding new members. Right now the sociological and
| economical incentives don't promote this at all.
| karaterobot wrote:
| The writing quality of academic papers is very poor, whatever
| their intended characteristics are, and we deserve better.
|
| I'm skeptical that the only way for them to be precise and
| technical is to make them impenetrable. I think there is a
| culture of academic writing (many different cultures, really)
| that has adopted a voice and writing style which became a
| parody of itself over time.
|
| Here's a trivial example: You frequently see papers use the
| passive voice, something a middle school English teacher
| would mark with a red pen. _500 participants were asked_,
| vs. _we asked 500 participants_. In what sense is the former
| more precise and technical? It's not. It does not convey any
| additional meaning. People use it to sound objective and
| distant, even when they really aren't.
|
| Realistically, academic writers usually don't even think
| about it as much as that. They're just copying the tone of
| other papers, because there is a culture and it enforces
| certain behaviors on its members irrespective of the value.
| baq wrote:
| Leslie Lamport definitely doesn't share your opinion. A known
| fact about the Paxos paper is that there are no dumbed down
| summaries worth reading because the proper thing is so
| approachable. Not sure if you only have to sound smart if
| you've got nothing to say but certainly feels like it could
| be the case.
| coldtea wrote:
| > _Nah, scientific papers are supposed to be precise and
| technical._
|
| They're also more often than not tedious, badly explained,
| and oft-skipped, error prone, and hardly ever read carefully,
| even during peer review for the paper that contains them.
| That's how mistakes stay unnoticed for decades in influential
| papers with tons of citations.
|
| In essence, a paper's tone and language are often more about
| formality, academic tradition, ritual, and padding for
| publication purposes than about serving a real purpose.
| guluarte wrote:
| not always, ReLu is a fucking line, most papers write stuff
| in the most complicated way to sound smart.
| aqsalose wrote:
| "it honestly comes from a place of ignorance, and I say that
| as basically a layman myself"
|
| Here is an added complication: succinct technical
| communication can be efficient when communicating to peers
| who work on exactly the same domain and similar problems as you,
| and want to digest your main ideas quickly.
|
| On the other hand, for any particular paper, the size of the
| audience to whom it is directly relevant and addressed to can
| be small. The size of the audience who got to reading it
| anyway may be _vast_. (Maybe I am reading your paper because
| someone cited a method paper that in lieu of a proof or
| explanation writes just two words and a citation to your paper.
| Maybe I am a freshly minted new student reading it for my
| first seminar. Maybe I am from a neighboring field and trying
| to understand what is happening in yours. Maybe I tried to
| find what people have already done with particular idea I
| just had and search engine gave your paper. And so on.)
|
| During my (admittedly lackluster) academic career I recall
| spending much more time trying to read and understand papers
| that were not addressed to me than papers that were and where
| I enjoyed the succinct style that avoids details and presents
| the results. (Maybe it is just an idiosyncratic trust issue
| on my part, because I am often skeptical of stated results
| and their interpretation, finding the methods more
| interesting). But that is not all.
|
| I also noticed that genuine misunderstandings coming from
| "brief" communication of technical "details" were quite
| common; two different researchers would state they "applied
| method X to avoid Y/seek Z[citation]" in exactly so many and
| almost exactly the same words, where X, Y and Z were complicated
| technical terms, yet the authors would have quite different
| opinions about what the meaning of those words was and what would
| be the intended reading and how and why X should be
| implemented.
|
| In conclusion, I think many a scientific field would benefit
| from a style where authors were expected to clearly explain
| what they did and why (as clearly as possible).
| lofatdairy wrote:
| I agree with everything you say. Papers really are a
| bit too hard to read sometimes, but I'd argue it's often not
| due to an overly technical tone so much as writers cutting out a
| lot of background material for brevity and assumed
| familiarity.
|
| >What should be encouraged is for academics to blog about
| their research as well. It would even help when recruiting
| and onboarding new members. Right now the sociological and
| economical incentives don't promote this at all.
|
| I will add onto this that a lot of journals have been pushing
| for video abstracts and "plain English" abstracts. For the
| most part I don't see these too often but when they're there
| they're appreciated, and I vaguely recall that someone found
| that citations go up when they're used (specifically plain
| English, I don't think anything has been on video abstracts).
|
| There are a lot of good blogs for computational academic
| subjects (ml, bioinformatics, comp neuro, etc) but I see less
| for bio and non-software engineering. Math and physics seems
| to have some really notable blogs, but beyond what gets
| posted to HN and linked further on those blogs, I can't
| comment.
| r3trohack3r wrote:
| There was this sociologist who had written a paper for us all
| to read ahead of time. I started to read the damn thing, and
| my eyes were coming out: I couldn't make head nor tail of it!
| I figured it was because I hadn't read any of the books on
| the list. I had this uneasy feeling of "I'm not adequate,"
| until finally I said to myself "I'm gonna stop, and read one
| sentence slowly so I can figure out what the hell it means."
| So I stopped-at random-and read the next sentence very
| carefully. I can't remember it precisely, but it was very
| close to this: "The individual member of the social community
| often receives his information via visual, symbolic
| channels." I went back and forth over it, and translated. You
| know what it means? "People read." Then I
| went over the next sentence, and realised that I could
| translate that one also. Then it became a kind of empty
| business: "Sometimes people read; sometimes people listen to
| the radio," and so on, but written in such a fancy way that I
| couldn't understand it at first, and when I finally
| deciphered it, there was nothing to it. -- Feynman
|
| I disagree. After going through quite a few research papers
| in my time, I've found the best are the ones that are direct
| and to the point. Many papers I've spent many hours/days
| trying to unravel just to realize the concepts were
| straightforward, not very novel, and there wasn't much of
| real substance to the paper.
|
| Meanwhile, some of the most impactful papers I've read are
| direct and to the point: Kademlia, Bitcoin, BitTorrent,
| DynamoDB, Firecracker, etc.
|
| It seems like, when you have something of substance to say,
| you say it. When you don't you overcompensate by falling back
| on building an intricate puzzle of jargon and convoluted
| equations in an attempt to make what you're saying sound far
| more important than it really is.
|
| As LLMs get better, I look forward to the day where every
| journal has a standard LLM filter you're required to apply to
| your paper that unravels all of this nonsense and rewrites it in
| a more straightforward way, if not to directly publish then
| just for the editors to verify there isn't a simpler way to
| convey your ideas. I suspect that if we had an ELI5 filter
| for most journal articles, we'd discover that a majority of
| the words that get published have very little substance at
| all.
| dekhn wrote:
| I hadn't seen that Feynman quote before, but I discovered the
| same thing when reading Donna Haraway's books (Cyborg Manifesto,
| Modest_Witness@Second_Millennium.FemaleMan(c)Meets_OncoMouse,
| Primate Visions).
|
| The criticism was: "Haraway's work has been criticized for
| being 'methodologically vague'[39] and using noticeably
| opaque language that is 'sometimes concealing in an
| apparently deliberate way'"
| coldtea wrote:
| > _Haraway's work has been criticized for being
| "methodologically vague"[39] and using noticeably opaque
| language that is "sometimes concealing in an apparently
| deliberate way_
|
| So you're saying that "Her work is basically handwaving
| and bullshitting".
| dekhn wrote:
| Yes, but also, wrapping the handwaving and bullshitting
| in a layer of obfuscation:
|
| "Michel Foucault's biopolitics is a faccid premonition of
| cyborg politics, a very open feld. By the late twentieth
| century, our time, a mythic time, we are all chimeras,
| theorized and fabricated hybrids of machine and organism
| --in short, cyborgs. The cyborg is our ontology; it gives
| us our politics. The cyborg is a condensed image of both
| imagination and material reality, the two joined centers
| structuring any possibility of historical transformation.
| In the traditions of "Western" science and politics--the
| tradition of racist, male-dominant capitalism; the
| tradition of progress; the tradition of the appropriation
| of nature as resource for the productions of culture; the
| tradition of reproduction of the self from the reflections
| of the other--the relation between organism and machine
| has been a border war"
|
| (donna was woke before woke was a thing)
| lamontcg wrote:
| > It seems like, when you have something of substance to
| say, you say it.
|
| And this blog post probably could be condensed into 1/4 of
| its size or less with a less conversational/bloggy tone.
| coldtea wrote:
| There are words that are added to drive the point in
| multiple ways, ease into it, and make the text more
| engaging.
|
| And there are words that are added to add empty padding,
| keep up academic pretenses, and appear smart.
|
| The post could have been condensed, but it would lose the
| former, not the latter.
| cratermoon wrote:
| I believe Feynman understood that he was oversimplifying,
| and I believe he was able to do so because his reason for
| reading the paper was not the same as the reason another
| sociologist might have. Thus a sentence like, "The
| individual member of the social community often receives
| his information via visual, symbolic channels", does, to a
| non-expert, mean "people read", but to another sociologist
| or a researcher in related fields, phrases like "individual
| member", "social community", and "visual, symbolic
| channels" would be _terms of art_. That means an expert in
| the field could read "social community" and it would mean,
| cognitively, an entire set of concepts in the field.
|
| In short, jargon matters. People here can talk about
| functional, procedural, and object-oriented programming
| because each of the three words has more than just the
| dictionary meaning - to those of us in the field. In the
| same way we can talk about linear algebra and know it
| doesn't mean "algebra on lines".
|
| Yes, it's _possible_ to write scientifically without jargon
| and wordiness, but it's a lot of effort and takes much
| more space to say "a group who follow a social structure
| within a society (culture, norms, values, status). They may
| work together to organise social life within a particular
| place, or they may be bound by a sense of belonging
| sustained across time and space"[1]
|
| 1 https://othersociologist.com/2013/11/20/sociology-of-
| communi...
| PoignardAzur wrote:
| Well, maybe, but you can rationalize arbitrary amounts of
| pointless jargon that way.
|
| Besides, in the example Feynman gives, the simple sentence
| is actually _shorter_. Maybe that shorter sentence loses
| some information that the jargon carried, but Occam's
| razor suggests the writer was just trying to sound
| smarter.
| Vervious wrote:
| Systems research papers do not represent all research
| papers out there, not even in computer science.
|
| In cryptography, certainly a paper with formal definitions
| and proofs can be much more valuable than a corresponding
| blog post. It's a field where formalism is desired, if not
| necessary. Otherwise you can't check other people's
| "proofs", or even know what model you're working in.
|
| I think that, since people haven't come up with better
| formalisms, the writing is sometimes quite obtuse, which
| gets mistaken for "academic writing" when really it's a
| best effort to formalize.
| renonce wrote:
| Requiring formalism does not preclude attaching an
| informal but intuitive description of the formal
| definition or proof. Unless the authors don't understand
| very clearly what they are talking about, or they want to
| prevent others from understanding their concepts too
| easily, I don't see any reason for the authors not to
| attach an ELI5 in addition to the formalism.
| Vervious wrote:
| Sure. But it's an ELI5 "in addition to formalism", not
| "in lieu of formalism". In theory conferences like STOC
| or FOCS, the first section of the paper often comprises
| such an overview.
|
| Certainly some papers are better written than others. But
| sometimes a blog post cannot replace a paper, unless it
| also goes into the depth and detail that formalism
| requires. (Then it becomes a 30 page blog post, where
| most people don't read past the intro.)
| acchow wrote:
| The complaint about research papers is that almost all of
| them omit the ELI5 and provide _only_ the formalism.
|
| You can have both and weave them together into a
| digestible narrative. I see Physics textbooks sometimes
| written this way.
| smallnamespace wrote:
| Papers are mostly read by other researchers, where the
| added background is actively bad because it obscures the
| real meat of the paper to the main audience.
|
| If you just wanted a digestible intro then you would
| usually buy a textbook.
|
| I find the argument that _every_ research paper ought to
| be a mashup of a textbook + the actual research to be a
| bit silly from a "people should specialize at what
| they're good at" standpoint.
|
| Put in another context, I also don't want every recipe to
| reintroduce what it means to "fry" or "braise" or
| "marinate". We have Google for that.
| [deleted]
| mlsu wrote:
| Well, I'm not so sure. It seems to me that someone could
| perfectly well devise an experiment based off of this
| (another poster chastised me for saying paper, so) blog post.
|
| The equations are perfectly clear. I was able to follow his
| reasoning perfectly well.
|
| I cannot say the same for so many papers (tm) that I've
| read, mostly in a similarly computational (though
| non-deep-learning) applied math domain.
| [deleted]
| pessimizer wrote:
| > The tone is self-effacing, it does not have an "ego" the way
| scientific papers tend to have.
|
| I can't imagine judging scientific papers based on whether the
| author might be looking down on me, or thinks he knows better
| than me.
|
| > if we were "allowed" to cite research that reads like this
|
| Maybe you're looking down on _yourself?_ You can cite anything
| you want to cite.
| caddemon wrote:
| Well if you yourself are trying to publish in a scientific
| venue you can't always cite exactly what you want to cite.
| Though it's probably uncommon for a peer reviewer to ask for
| a specific citation to be removed, the review process
| absolutely does affect the references list, and expectations
| about this process affect it doubly so.
| baby wrote:
| There isn't much difference between a blog and a whitepaper,
| other than that people tend to write blogs more casually and
| whitepapers more seriously (and some academics will even
| only accept things that look more serious).
|
| But a good writer can write great articles in whatever format
| they wish.
| nico wrote:
| It would be amazing if academia started replacing papers with
| videos + code
|
| I want to see: an explainer of the
| science/ideas/experiments/hypothesis
|
| And instructions on how to reproduce the experiments/results
|
| Some YouTubers are going in this direction
| janalsncm wrote:
| +1 to including code with your paper. It improves
| reproducibility and transparency. There's even a well-known
| website dedicated to this purpose.
|
| For the rest of it I don't care. As long as researchers
| understand what's going on, that's what matters.
| sebzim4500 wrote:
| Don't transformers typically have a <bot> token at the beginning
| of the prompt? This seems equivalent to letting the network
| attend to this token, and produce a zero value if that's what it
| wants.
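|
| A minimal sketch of the equivalence I have in mind (all names
| here are just illustrative): if an extra slot has its logit
| fixed at 0 and its value vector fixed at zero, ordinary
| softmax attention over it matches the post's "quiet" softmax,
| exp(x_i)/(1 + sum_j exp(x_j)). A learned BOS token differs in
| that its value vector isn't forced to zero, so the network
| would have to learn to use it that way.
|
|   import torch
|
|   def softmax_one(scores, dim=-1):
|       # exp(x_i) / (1 + sum_j exp(x_j)), shifted by
|       # max(0, max_j x_j) for numerical stability
|       m = scores.max(dim=dim, keepdim=True).values
|       m = torch.clamp(m, min=0)
|       e = torch.exp(scores - m)
|       return e / (torch.exp(-m) + e.sum(dim, keepdim=True))
|
|   scores = torch.randn(5, 5)   # attention logits, one head
|   values = torch.randn(5, 8)   # value vectors
|
|   # quiet-softmax attention
|   out_quiet = softmax_one(scores) @ values
|
|   # ordinary softmax with an explicit "null" slot:
|   # logit fixed at 0, value vector fixed at all zeros
|   scores_pad = torch.cat([scores, torch.zeros(5, 1)], -1)
|   values_pad = torch.cat([values, torch.zeros(1, 8)], 0)
|   out_null = torch.softmax(scores_pad, -1) @ values_pad
|
|   print(torch.allclose(out_quiet, out_null))  # True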
| refulgentis wrote:
| not a token, and not the transformers, but yes, commercial chat
| models are fine-tuned on text transcripts containing dialogues.
| (i believe llama-2 was as well)
| sebzim4500 wrote:
| Are you sure? I have never seen an LLM that did not have a
| special token for start of text, I'm certain that llama had
| one and I don't remember anywhere in the llama-2 paper where
| they said they removed it.
| refulgentis wrote:
| tl;dr: you're right
|
| it's messy though, bear with me for the full explanation:
|
| - your initial post says "<bot>" token, which looked like a
| mix of "chatbot" and ChatML, used by OpenAI
|
| - there is a BOS (beginning-of-sequence) token, which acts
| as you described
|
| - I averaged my attention over your post and the initial
| reply, and answered as if you were using "<bot>" in the
| misunderstood way
|
| - when I go back and read your post, I realize the chatbot
| interpretation doesn't quite make sense, since you're
| referring to much more technical aspects than general "how
| do I AI", i.e. you understand <X> as a way to denote
| special tokens, not necessarily an XML tag
| sp332 wrote:
| Chat-tuned ones do, but the base models don't. For example,
| Llama doesn't, but Alpaca has "### Instruction:", "### Input:",
| and "### Response:".
| int_19h wrote:
| Base LLaMA still has dedicated tokens for beginning/end of
| string. What you're describing is the instruction format,
| which is separate.
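|
| Roughly, as a sketch (the exact strings are from memory, so
| treat them as illustrative): the BOS/EOS markers are
| dedicated entries in the tokenizer's vocabulary, while an
| instruction format like Alpaca's is ordinary text that the
| fine-tuning data happens to contain.
|
|   # special tokens: dedicated ids in the vocabulary
|   # (LLaMA's SentencePiece tokenizer uses "<s>" and "</s>")
|   BOS, EOS = "<s>", "</s>"
|
|   # instruction format: plain text the model was fine-tuned
|   # on, e.g. the Alpaca-style markers mentioned above
|   prompt = (
|       "### Instruction:\nSummarize the article.\n\n"
|       "### Response:\n"
|   )
|
|   # conceptually, what the base model sees is just
|   # BOS + ordinary tokens (and EOS when the text ends)
|   model_input = BOS + prompt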
| sp332 wrote:
| Oh, I had misunderstood something.
| gwern wrote:
| This reminds me of the normalization bug in StyleGAN. It had this
| obvious visual artifact of a 'blob' which would appear in
| otherwise photorealistic images, which was puzzling: it was _so_
| obvious, how did the Discriminator not squash it? It turned out
| to be a flaw in the normalization of the AdaIN style layers,
| IIRC, where the Generator was pumping up numbers and doing weird
| things to force information through.
| firebirdn99 wrote:
| This is right below the "Have Attention Spans Been Declining? -
| Yes, 65%" post, lol brilliant. In general, human decreasing, AI
| increasing- attention.
| gwern wrote:
| "In this post, I prove that attention spans have actually
| declined by 64%, contrary to widely-publicized reports of
| 65%..."
| neilv wrote:
| For posterity:
|
|   1. Have attention spans been declining?
|      (slimemoldtimemold.com) 338 points by janandonly
|      4 hours ago | flag | hide | 254 comments
|
|   2. Attention Is Off By One (evanmiller.org) 400 points
|      by elbasti 4 hours ago | flag | hide | 129 comments
|
| Note that the #1 post is probably there because the title
| earlier had the provocative "Yes, 65%" appended to it. So even
| more numerical.
___________________________________________________________________
(page generated 2023-07-24 23:00 UTC)