[HN Gopher] Understanding and coding the self-attention mechanis...
___________________________________________________________________
Understanding and coding the self-attention mechanism of large
language models
Author : mariuz
Score : 52 points
Date : 2023-02-10 18:04 UTC (4 hours ago)
(HTM) web link (sebastianraschka.com)
(TXT) w3m dump (sebastianraschka.com)
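
The linked article walks through coding self-attention from scratch.
As a point of reference, a minimal sketch of single-head scaled
dot-product self-attention in PyTorch might look like the following;
the class name, dimensions, and overall structure are illustrative
assumptions rather than the article's own code.

    import torch
    import torch.nn.functional as F

    class SelfAttention(torch.nn.Module):
        # Minimal single-head scaled dot-product self-attention
        # (illustrative sketch, unmasked).
        def __init__(self, d_in, d_k):
            super().__init__()
            # learned projections for queries, keys, and values
            self.W_q = torch.nn.Linear(d_in, d_k, bias=False)
            self.W_k = torch.nn.Linear(d_in, d_k, bias=False)
            self.W_v = torch.nn.Linear(d_in, d_k, bias=False)

        def forward(self, x):                   # x: (batch, seq_len, d_in)
            q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
            # scores compare each token's query against every token's key
            scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
            weights = F.softmax(scores, dim=-1)  # each row sums to 1
            return weights @ v                   # weighted sum of values

    x = torch.randn(1, 5, 16)                # 5 tokens, 16 features each
    out = SelfAttention(d_in=16, d_k=24)(x)  # -> shape (1, 5, 24)
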
| hprotagonist wrote:
| https://arxiv.org/abs/2105.02723
|
| _The strong performance of vision transformers on image
| classification and other vision tasks is often attributed to the
| design of their multi-head attention layers. However, the extent
| to which attention is responsible for this strong performance
| remains unclear.
|
| In this short report, we ask: is the attention layer even
| necessary?
|
| Specifically, we replace the attention layer in a vision
| transformer with a feed-forward layer applied over the patch
| dimension. The resulting architecture is simply a series of feed-
| forward layers applied over the patch and feature dimensions in
| an alternating fashion. In experiments on ImageNet, this
| architecture performs surprisingly well: a ViT/DeiT-base-sized
 | model obtains 74.9% top-1 accuracy, compared to 77.9% and
 | 79.9% for ViT and DeiT respectively.
|
| These results indicate that aspects of vision transformers other
| than attention, such as the patch embedding, may be more
| responsible for their strong performance than previously thought.
| We hope these results prompt the community to spend more time
| trying to understand why our current models are as effective as
| they are._
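
The quoted abstract replaces the attention layer with a feed-forward
layer applied over the patch dimension, alternating with the usual
feed-forward layer over the feature dimension. A rough PyTorch sketch
of one such block, assuming a fixed number of patches, might look like
this (the class name, hidden sizes, and normalization placement are
illustrative guesses, not the paper's code):

    import torch

    class FeedForwardMixerBlock(torch.nn.Module):
        # One block: feed-forward over patches, then over features
        # (illustrative sketch).
        def __init__(self, num_patches, d_model, hidden=256):
            super().__init__()
            # mixing across the patch dimension -- this replaces attention
            self.patch_ff = torch.nn.Sequential(
                torch.nn.Linear(num_patches, hidden), torch.nn.GELU(),
                torch.nn.Linear(hidden, num_patches))
            # the usual per-patch feed-forward over the feature dimension
            self.feature_ff = torch.nn.Sequential(
                torch.nn.Linear(d_model, hidden), torch.nn.GELU(),
                torch.nn.Linear(hidden, d_model))
            self.norm1 = torch.nn.LayerNorm(d_model)
            self.norm2 = torch.nn.LayerNorm(d_model)

        def forward(self, x):           # x: (batch, num_patches, d_model)
            # transpose so the linear layer mixes information across patches
            x = x + self.patch_ff(self.norm1(x).transpose(1, 2)).transpose(1, 2)
            # mix information across features within each patch
            x = x + self.feature_ff(self.norm2(x))
            return x

    x = torch.randn(2, 196, 768)        # e.g. 14x14 patches, ViT-base width
    out = FeedForwardMixerBlock(num_patches=196, d_model=768)(x)  # same shape
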
| [deleted]
| lostmsu wrote:
| That's a pretty huge drop in accuracy.
| thomasahle wrote:
| ViT gives you 90% top-1 accuracy on ImageNet:
| https://paperswithcode.com/sota/image-classification-on-imag...
 | I don't know where they get the 77.9% number from. 75% is
 | pretty bad, similar to the 2015 VGG net, as the authors
 | themselves admit.
| thomasahle wrote:
| Nevermind, I guess it's "ImageNet-1K trained models" on which
| ViT gets 79.9% and the 90% is only when pretraining with
| ImageNet-22K.
|
| There are other non-attention based networks that get 90% too
| though: https://arxiv.org/pdf/2212.11696v3.pdf
| mirker wrote:
| Isn't this very similar to Karpathy's nanoGPT?
| hummus_bae wrote:
| [dead]
___________________________________________________________________
(page generated 2023-02-10 23:00 UTC)