[HN Gopher] Memorizing Transformers
___________________________________________________________________
Memorizing Transformers
Author : silencedogood3
Score : 127 points
Date : 2022-05-20 14:58 UTC (8 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| tipsytoad wrote:
| Could there be any merit in training this on a common-sense
| dataset such as Cyc?
|
| https://www.lesswrong.com/tag/cyc
| ipsum2 wrote:
| Probably not; most common facts (e.g. a sand cat is a type of
| feline) are already known by transformers. Maybe some obscure ones.
| jerpint wrote:
| The basic idea is to keep a cache of the (key, value) pairs for all
| the previously seen tokens, updated over time. The transformer can
| decide to do self-attention (and ignore the cache) or focus on
| elements from the cache (enabling it to attend to previously seen
| tokens). They mainly apply this to large documents; I'd be very
| curious to see a follow-up on time-dependent tasks like videos.
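|
| Roughly, in NumPy (a minimal sketch of my reading of the paper, not
| its actual code; the gate is a plain scalar here, where the paper
| learns it per head):
|
|     import numpy as np
|
|     def softmax(x):
|         x = x - x.max()
|         e = np.exp(x)
|         return e / e.sum()
|
|     def knn_augmented_attention(q, local_k, local_v, mem_k, mem_v,
|                                 top_k=4, gate=0.5):
|         """One head, one query token: mix attention over the local
|         segment with attention over retrieved memories."""
|         d = q.shape[-1]
|         # ordinary self-attention over the current segment
|         local_out = softmax(local_k @ q / np.sqrt(d)) @ local_v
|         # recall the top_k most similar cached (key, value) pairs
|         # (exact search here; large memories need approximate kNN)
|         idx = np.argsort(mem_k @ q)[-top_k:]
|         mem_out = softmax(mem_k[idx] @ q / np.sqrt(d)) @ mem_v[idx]
|         # the gate decides how much to trust memory vs. local context
|         return gate * mem_out + (1 - gate) * local_out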
| lucidrains wrote:
| I have an implementation of this over at
| https://github.com/lucidrains/memorizing-transformers-pytorc...,
| for any researcher exploring retrieval and memory with attention
| networks
| silencedogood3 wrote:
| Neat! Can you explain what the kNN is doing? I can't quite
| follow the paper.
| visarga wrote:
| It's a sparse attention scheme. They store and reuse activations,
| thus "memorising" the past without any extra training. To keep the
| sequence short enough to fit into memory, they only recall the k
| most similar memories from a much larger log.
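|
| Something like this, schematically (a toy sketch with made-up
| names; the real system does approximate kNN over a far bigger
| store):
|
|     import numpy as np
|
|     memory_k, memory_v = [], []   # grows as the document streams by
|
|     def update_memory(seg_k, seg_v, max_size=65536):
|         """Log this segment's keys/values; no gradients flow here."""
|         memory_k.extend(seg_k)
|         memory_v.extend(seg_v)
|         # keep only the most recent max_size entries
|         del memory_k[:-max_size], memory_v[:-max_size]
|
|     def recall(q, top_k=32):
|         """Return the top_k cached (key, value) pairs most similar
|         to the query, instead of attending over the whole log."""
|         K, V = np.stack(memory_k), np.stack(memory_v)
|         idx = np.argsort(K @ q)[-top_k:]
|         return K[idx], V[idx]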
| knrz wrote:
| Dude, your repos are great, and marvellous code quality too for
| cutting-edge papers. Keep it up!
| lucidrains wrote:
| hey thanks! :^) hope someone makes the next big discovery
| with them
| shallichange wrote:
| Top of my head: Rodimus, Bumblebee, Ratchet, Optimus Prime,
| Laserbeak, Megatron, Astrotrain, Jazz
| jameshart wrote:
| The 'ethics' section seems surprisingly cursory and lacking in
| references.
|
| "The ability to memorize large databases of facts could have
| potential ramifications for society, especially if those
| databases include sensitive personal information or copyrighted
| works. However, one advantage of using an external memory is that
| the memory can be easily cleared of all such information"
|
| That's it? Just 'may have ramifications'?
|
| No concern that this enables 'Tay'-like failure modes where a
| system can be manipulated through input into generating
| particular output?
|
| Or even just grappling with whether adding 'memory of
| experiences' to a language model might open the door to creating
| a system that has beliefs, or opinions...? And whether there
| might be some ethical concerns with just wiping that out?
| [deleted]
| kettleballroll wrote:
| The ethics section is a tacked-on thing required by some large ML
| conferences. It's essentially a PR stunt. No ML researcher I know
| cares about it, or devotes more than the five minutes it takes to
| write some platitudes. There are simply no incentives to write it
| properly. And quite frankly, I don't think there should be. We are
| educated, paid and motivated to push the boundaries of research,
| not to think about all potential fallout (which, let's face it,
| would usually require a whole additional paper for most meaningful
| contributions). I don't really see how we could change this.
|
| Tldr: as a general rule you can ignore the ethics section of ML
| papers.
| enchiridion wrote:
| Yep, this is correct.
| 6gvONxR4sf7o wrote:
| > We are educated, paid and motivated to push the boundaries
| of research, not to think about all potential fallout
|
| That's the whole problem that led to the introduction of
| these sections.
| belval wrote:
| That's debatable: would an "ethics" section on the original
| deepfake paper have changed anything?
|
| ML research isn't as inaccessible as genetics research; if
| there's something idiotic that people can do with DL, they
| will eventually do it. Acting as if having people add a
| paragraph to their paper where they "reflect" on the
| consequences will change anything only shows how disconnected
| you are from reality.
|
| Research is research; there shouldn't be any "forbidden
| knowledge". We have laws for a reason.
| YeGoblynQueenne wrote:
| >> Tldr: as a general rule you can ignore the ethics section
| of ML papers.
|
| More generally still, you can ignore the ethics of ML
| researchers, pretty much for the same reasons that you can
| ignore the Great Turnip of Justice in the sky.
| changoplatanero wrote:
| My feeling is that those topics would be best addressed in a
| separate paper by authors who have more of a background in
| ethics.
| visarga wrote:
| > No concern that this enables 'Tay' like failure modes where a
| system can be manipulated through input into generating
| particular output?
|
| Isn't that the core idea behind prompting and few-shot learning
| for large language models?
| ipsum2 wrote:
| That'd be a waste of space. Most transformer models raise the
| same ethical concerns, which have been addressed in countless
| other papers. Why copy-paste the same essays into every minor
| tweak of transformers?
| refulgentis wrote:
| I'm not sure it's scientific or helpful to suggest that a
| program might develop "beliefs" or "opinions", or that
| terminating the program amounts to "wiping [someone] out".
| mountainriver wrote:
| Love it! It seems like a lot of the ideas from reinforcement
| learning are making their way into transformer land and NLP.
| blackbear_ wrote:
| > On benchmarks including code and mathematics, we find that the
| model is capable of making use of newly defined functions and
| theorems during test time.
|
| Train on test, improved performance on test. Wow.
| spullara wrote:
| It isn't being trained on the test set. The whole point of
| memory is that you can change it at will and don't need to train
| on new information you've never seen before.
| visarga wrote:
| > Wow.
|
| Transformers are very limited in the size of their attention
| window; they can take a few thousand tokens at most. But your
| data might not fit into that window, and you also don't want to
| have to fine-tune the model. This paper offers a solution.
| axg11 wrote:
| See also RETRO, a type of retrieval transformer: [0], [1], [2]
|
| [0] - https://www.deepmind.com/publications/improving-language-
| mod...
|
| [1] - https://jalammar.github.io/illustrated-retrieval-
| transformer...
|
| [2] - https://arsham.substack.com/p/retrieval-transformers-for-
| med...
| 6gvONxR4sf7o wrote:
| External memory with pretrained models (or more generally,
| external, not-necessarily-differentiable memory) is one of the
| most exciting areas of ML right now. It opens models up to
| external resources like fact stores and databases.
| silencedogood3 wrote:
| Can you explain what the big deal is? I'm still in the early
| learning stages.
| 6gvONxR4sf7o wrote:
| As an example: if you want to encode all of the data in
| Wikipedia with embeddings and train a model to answer
| questions with that information, historically that would mean
| a model that encodes all of Wikipedia, encodes the question,
| uses all of encoded Wikipedia to decode an answer, then does
| backprop through all of that and updates the weights. Then it
| re-encodes all of Wikipedia with the new weights and goes
| through it all over again at each training step, somehow also
| holding all of that in GPU memory. Meaning you basically
| couldn't do it that way.
|
| Today, we're seeing big models that can encode all of
| Wikipedia in useful ways. If the encodings are "good enough",
| then you can encode all of Wikipedia once, before training
| another model that just has to encode a question, use the
| encoded Wikipedia to decode an answer, and do backprop through
| just the answer and question. If Wikipedia changes in the
| meantime, you can probably just update your database of
| encoded articles and your learned QA model will incorporate
| the new information.
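|
| Schematically (purely illustrative names and toy data on my
| part, not any real library's API):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     D = 64
|
|     # Stand-in for a frozen document encoder: the corpus is
|     # embedded once, up front, and never re-encoded in training.
|     corpus_texts = ["article one", "article two", "article three"]
|     corpus_emb = rng.normal(size=(len(corpus_texts), D))
|
|     # The only trainable piece: a small question encoder (here
|     # just a linear map; a real system would use a network).
|     W = rng.normal(size=(D, D)) * 0.01
|
|     def encode_question(features):
|         # features: (D,) vector standing in for the question
|         return W @ features
|
|     def retrieve(q_emb, top_k=2):
|         # nearest neighbours against the static, precomputed index
|         scores = corpus_emb @ q_emb
|         return np.argsort(scores)[-top_k:]
|
|     # Backprop only ever touches W. If the corpus changes, just
|     # recompute the affected rows of corpus_emb.
|     q = encode_question(rng.normal(size=D))
|     print([corpus_texts[i] for i in retrieve(q)])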
| amelius wrote:
| Replace Wikipedia with the internet, and you can replace
| Google Search with some (hopefully) soon-to-be-discovered
| algorithm based on these principles. Exciting times.
___________________________________________________________________
(page generated 2022-05-20 23:00 UTC)