[HN Gopher] Memorizing Transformers
       ___________________________________________________________________
        
       Memorizing Transformers
        
       Author : silencedogood3
       Score  : 127 points
       Date   : 2022-05-20 14:58 UTC (8 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | tipsytoad wrote:
        | Could there be any merit in training this on a common-sense
        | dataset such as Cyc?
       | 
       | https://www.lesswrong.com/tag/cyc
        
         | ipsum2 wrote:
          | Probably not; most common facts (a sand cat is a type of
          | feline) are already known by transformers. Maybe some obscure
          | ones.
        
       | jerpint wrote:
        | The basic idea is to keep a cache of the keys and values of all
        | previously seen tokens, updated over time. The transformer can
        | decide to do plain self-attention (and ignore the cache) or to
        | focus on elements retrieved from the cache (enabling it to attend
        | to previously seen tokens). They mainly apply this to long
        | documents; I'd be very curious to see a follow-up on
        | time-dependent tasks like videos.
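        | 
        | A minimal sketch of that choice, assuming a single head and a
        | learned scalar gate (my simplification, not the authors' exact
        | code):
        | 
        |     import torch
        |     import torch.nn.functional as F
        | 
        |     def gated_attention(q, lk, lv, mk, mv, gate):
        |         # q: (n, d) queries for the current segment
        |         # lk, lv: (n, d) local keys/values (current window)
        |         # mk, mv: (m, d) keys/values retrieved from the cache
        |         d = q.shape[-1]
        |         # ordinary self-attention over the local context
        |         local = F.softmax(q @ lk.T / d**0.5, dim=-1) @ lv
        |         # attention over the retrieved memories
        |         mem = F.softmax(q @ mk.T / d**0.5, dim=-1) @ mv
        |         # a learned gate decides how much to use the cache
        |         g = torch.sigmoid(gate)
        |         return g * mem + (1 - g) * local
        | 
        |     q = torch.randn(8, 64)
        |     out = gated_attention(q,
        |                           torch.randn(8, 64),
        |                           torch.randn(8, 64),
        |                           torch.randn(32, 64),
        |                           torch.randn(32, 64),
        |                           torch.tensor(0.0))
        |     print(out.shape)  # torch.Size([8, 64])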
        
       | lucidrains wrote:
        | I have an implementation of this over at
        | https://github.com/lucidrains/memorizing-transformers-pytorc...,
        | for any researcher exploring retrieval and memory with attention
        | networks.
        
         | silencedogood3 wrote:
          | Neat! Can you explain what the kNN is doing? I can't quite
          | follow the paper.
        
           | visarga wrote:
            | It's a sparse attention scheme. They store and reuse
            | activations, thus "memorising" the past without needing any
            | further training. To keep the sequence short enough to fit
            | into memory, they recall only the k most similar memories
            | from a much larger log.
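            | 
            | Roughly, per attention layer, something like this (a toy
            | sketch with exact top-k; the real thing would use an
            | approximate nearest-neighbour index):
            | 
            |     import torch
            | 
            |     class KVMemory:
            |         """Append-only log of past key/value pairs."""
            |         def __init__(self):
            |             self.keys, self.values = [], []
            | 
            |         def add(self, k, v):     # k, v: (n, d)
            |             self.keys.append(k)
            |             self.values.append(v)
            | 
            |         def knn(self, q, k=32):
            |             # q: (n, d); return the k most similar
            |             # memories for every query
            |             K = torch.cat(self.keys)    # (m, d)
            |             V = torch.cat(self.values)
            |             sims = q @ K.T              # dot-product sim
            |             idx = sims.topk(k, dim=-1).indices
            |             return K[idx], V[idx]       # (n, k, d) each
            | 
            |     mem = KVMemory()
            |     mem.add(torch.randn(1024, 64),
            |             torch.randn(1024, 64))
            |     top_k, top_v = mem.knn(torch.randn(8, 64))
            |     print(top_k.shape)  # torch.Size([8, 32, 64])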
        
         | knrz wrote:
          | Dude, your repos are great; marvellous code quality too for
          | cutting-edge papers. Keep it up!
        
           | lucidrains wrote:
           | hey thanks! :^) hope someone makes the next big discovery
           | with them
        
       | shallichange wrote:
        | Off the top of my head: Rodimus, Bumblebee, Ratchet, Optimus
        | Prime, Laserbeak, Megatron, Astrotrain, Jazz.
        
       | jameshart wrote:
       | The 'ethics' section seems surprisingly cursory and lacking in
       | references.
       | 
       | "The ability to memorize large databases of facts could have
       | potential ramifications for society, especially if those
       | databases include sensitive personal information or copyrighted
       | works. However, one advantage of using an external memory is that
       | the memory can be easily cleared of all such information"
       | 
       | That's it? Just 'may have ramifications'?
       | 
       | No concern that this enables 'Tay'-like failure modes where a
       | system can be manipulated through input into generating
       | particular output?
       | 
        | Or even just grappling with whether adding 'memory of
        | experiences' to a language model might open the door to creating
        | a system that has beliefs or opinions... and whether there might
        | be some ethical concerns with just wiping those out?
        
         | [deleted]
        
         | kettleballroll wrote:
          | The ethics section is a tacked-on thing required by some large
          | ML conferences. It's essentially a PR stunt. No ML researcher I
          | know cares about it, or devotes more than the five minutes it
          | takes to write some platitudes. There are simply no incentives
          | to write it properly. And quite frankly, I don't think there
          | should be. We are educated, paid, and motivated to push the
          | boundaries of research, not to think through all potential
          | fallout (which, let's face it, would usually require a whole
          | additional paper for most meaningful contributions). I don't
          | really see how we could change this.
          | 
          | TL;DR: as a general rule, you can ignore the ethics section of
          | ML papers.
        
           | enchiridion wrote:
           | Yep, this is correct.
        
           | 6gvONxR4sf7o wrote:
           | > We are educated, paid and motivated to push the boundaries
           | of research, not to think about all potential fallout
           | 
           | That's the whole problem that led to the introduction of
           | these sections.
        
             | belval wrote:
              | That's debatable; would an "ethics" section on the original
              | deepfake paper have changed anything?
              | 
              | ML research isn't as inaccessible as genetics research; if
              | there's something idiotic that people can do with DL, they
              | will eventually do it. Acting as if having people add a
              | paragraph to their paper where they "reflect" on the
              | consequences will change anything only shows how
              | disconnected you are from reality.
              | 
              | Research is research; there shouldn't be any "forbidden
              | knowledge". We have laws for a reason.
        
           | YeGoblynQueenne wrote:
            | >> TL;DR: as a general rule, you can ignore the ethics
            | section of ML papers.
            | 
            | More generally still, you can ignore the ethics of ML
            | researchers, pretty much for the same reasons that you can
            | ignore the Great Turnip of Justice in the sky.
        
         | changoplatanero wrote:
         | My feeling is that those topics would be best addressed in a
         | separate paper by authors who have more of a background in
         | ethics.
        
         | visarga wrote:
          | > No concern that this enables 'Tay'-like failure modes where
          | a system can be manipulated through input into generating
          | particular output?
          | 
          | Isn't that the core idea of prompting and few-shot learning
          | for large language models?
        
         | ipsum2 wrote:
          | That'd be a waste of space. Most transformer models have the
          | same ethical concerns, which have been addressed in countless
          | other papers. Why bother copy-pasting the same essays into
          | every minor tweak of transformers?
        
         | refulgentis wrote:
          | I'm not sure it's scientific or helpful to include the risk
          | that a program develops "beliefs" or "opinions", or that
          | terminating the program is "wiping [someone] out".
        
       | mountainriver wrote:
        | Love it! It seems like a lot of the ideas from reinforcement
        | learning are making their way into transformer land and NLP.
        
       | blackbear_ wrote:
       | > On benchmarks including code and mathematics, we find that the
       | model is capable of making use of newly defined functions and
       | theorems during test time.
       | 
       | Train on test, improved performance on test. Wow.
        
         | spullara wrote:
          | It isn't being trained on the test set. That's kind of the
          | point of memory: you can change the memory at will and don't
          | need to train on new information you've never seen before.
        
         | visarga wrote:
         | > Wow.
         | 
          | Transformers are very limited in the size of the attention
          | window: they can take at most a few thousand tokens. Your data
          | might not fit into that window, and you also don't want to
          | have to fine-tune the model on it. This paper offers a
          | solution.
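          | 
          | Schematically, the long input is fed in a window at a time,
          | and what scrolls out of the window lands in the cache instead
          | of being thrown away (a toy sketch; the window size and the
          | random "encoder" are just placeholders):
          | 
          |     import torch
          | 
          |     WINDOW, DIM = 512, 64
          | 
          |     def encode(segment):
          |         # placeholder for a transformer layer producing
          |         # per-token keys and values
          |         n = len(segment)
          |         return torch.randn(n, DIM), torch.randn(n, DIM)
          | 
          |     document = list(range(100_000))  # >> any window
          |     mem_keys, mem_values = [], []
          | 
          |     for start in range(0, len(document), WINDOW):
          |         segment = document[start:start + WINDOW]
          |         # this segment can retrieve from all cached tokens
          |         visible = sum(k.shape[0] for k in mem_keys)
          |         k, v = encode(segment)
          |         mem_keys.append(k)
          |         mem_values.append(v)
          | 
          |     print(visible, "tokens reachable via the cache")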
        
       | axg11 wrote:
       | See also RETRO, a type of retrieval transformer: [0], [1], [2]
       | 
       | [0] - https://www.deepmind.com/publications/improving-language-
       | mod...
       | 
       | [1] - https://jalammar.github.io/illustrated-retrieval-
       | transformer...
       | 
       | [2] - https://arsham.substack.com/p/retrieval-transformers-for-
       | med...
        
       | 6gvONxR4sf7o wrote:
       | External memory with pretrained models (or more generally,
       | external not-necessarily-differentiable memory) is one of the
       | most exciting areas of ML right now. It opens up models to
       | external things like facts and databases.
        
         | silencedogood3 wrote:
         | Can you explain what the big deal is? I'm still in the early
         | learning stages.
        
           | 6gvONxR4sf7o wrote:
            | As an example, say you want to encode all of the data in
            | Wikipedia with embeddings and train a model to answer
            | questions with that information. Historically, that would
            | mean a model that encodes all of Wikipedia, encodes the
            | question, uses all of encoded Wikipedia to decode an answer,
            | then does backprop through all of that and updates the
            | weights. Then it re-encodes all of Wikipedia with the new
            | weights and does the whole thing over again at each training
            | step, while somehow holding all of that in GPU memory.
            | Meaning you basically couldn't do it that way.
            | 
            | Today, we're seeing big models that can encode all of
            | Wikipedia in useful ways. If the encodings are "good enough",
            | you can encode all of Wikipedia once, then train another
            | model that only has to encode a question, use the encoded
            | Wikipedia to decode an answer, and backprop through just the
            | answer and the question. If Wikipedia changes in the
            | meantime, you can probably just update your database of
            | encodings and your learned QA model will be able to
            | incorporate that new information.
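            | 
            | A rough sketch of that "encode once, train cheaply" pattern
            | (toy encoders, sizes, and objective; not any particular
            | paper's setup):
            | 
            |     import torch
            |     import torch.nn as nn
            | 
            |     DIM = 128
            | 
            |     # 1) encode the corpus ONCE; it never gets gradients
            |     with torch.no_grad():
            |         corpus = torch.randn(100_000, DIM)  # passages
            | 
            |     # 2) only a small question encoder is trained
            |     q_enc = nn.Linear(DIM, DIM)
            |     opt = torch.optim.Adam(q_enc.parameters())
            | 
            |     def retrieve(question_vec, top_k=5):
            |         q = q_enc(question_vec)
            |         scores = corpus @ q    # vs every passage
            |         top = scores.topk(top_k)
            |         # grads flow through q, not the corpus
            |         return top.values, top.indices
            | 
            |     # toy training step
            |     values, idx = retrieve(torch.randn(DIM))
            |     loss = -values.sum()       # stand-in objective
            |     loss.backward()
            |     opt.step()
            | 
            | If Wikipedia changes, you re-encode only the affected
            | passages and overwrite the corresponding rows of the corpus
            | tensor; the trained question encoder doesn't change.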
        
             | amelius wrote:
              | Replace Wikipedia with the internet, and you can replace
              | Google Search with some (hopefully) soon-to-be-discovered
              | algorithm based on these principles. Exciting times.
        
       ___________________________________________________________________
       (page generated 2022-05-20 23:00 UTC)