[HN Gopher] LoMA: Lossless Compressed Memory Attention
       ___________________________________________________________________
        
       LoMA: Lossless Compressed Memory Attention
        
       Author : PaulHoule
       Score  : 62 points
       Date   : 2024-01-27 17:37 UTC (5 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | buildbot wrote:
       | Very interesting, but on my initial skim through, I don't
       | understand how this technique is lossless? Reproduced from the
       | methods section:
       | 
       | 1. Select a sequence of t_c tokens that the model has already
       | generated or completed predicting as the reading area.
       |
       | 2. Insert t '<m>' tokens at once after the reading area to serve
       | as the memory area.
       |
       | 3. The model performs a single inference on the memory area, but
       | discards the model's output, retaining only the KV pairs from
       | each layer.
       |
       | 4. Discard the reading area, and the model continues generating
       | text from after the memory area.
       | 
       | Isn't the memory area a lossy compression of the reading area?
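       |
       | A rough sketch of the cache bookkeeping those four steps
       | describe, assuming a plain per-layer (K, V) cache and a
       | compression ratio of t_c / t_m; the shapes, the ratio of 4, and
       | the names are illustrative guesses rather than the paper's code,
       | and the tensors are random stand-ins, not real model activations:
       |
       |   import torch
       |
       |   torch.manual_seed(0)
       |   n_layers, n_heads, head_dim = 2, 4, 16
       |   t_c = 64             # reading-area length (step 1)
       |   ratio = 4            # assumed compression ratio
       |   t_m = t_c // ratio   # number of '<m>' memory tokens (step 2)
       |
       |   def fake_kv(seq_len):
       |       # stand-in for the K/V a layer would cache for seq_len tokens
       |       return (torch.randn(n_heads, seq_len, head_dim),
       |               torch.randn(n_heads, seq_len, head_dim))
       |
       |   # per-layer KV cache holding the reading area
       |   reading_cache = [fake_kv(t_c) for _ in range(n_layers)]
       |
       |   # steps 2-3: one forward pass over the t_m memory tokens; the
       |   # logits are thrown away and only the memory area's KV pairs
       |   # are kept (random here; produced by the fine-tuned model in
       |   # the paper)
       |   memory_cache = [fake_kv(t_m) for _ in range(n_layers)]
       |
       |   # step 4: drop the reading area entirely; later tokens attend
       |   # to a cache that is 1/ratio the size
       |   cache_after = memory_cache
       |   assert all(k.shape[1] == t_m for k, _ in cache_after)
       |
       | The shrink itself is mechanical; whether those t_m KV pairs
       | really preserve everything in the t_c discarded tokens is
       | exactly the question.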
        
         | godelski wrote:
         | The paper is very confusing and should have a title change. In
         | their results (4.2.1) they say
         | 
         | > The observation that L_Repeat converges rapidly to a value
         | close to zero under smaller compression ratios is significant.
         | It demonstrates that the method is highly effective in
         | compressing information losslessly into memory tokens
         | 
         | So I'm also unconvinced
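         |
         | As a toy illustration of why that wording bothers me: if
         | L_Repeat is a per-token cross-entropy over re-predicting the
         | reading area from the memory tokens (my reading of 4.2.1, not
         | something I've confirmed against their code), then a value
         | close to zero still allows tokens that decode back
         | incorrectly:
         |
         |   import torch
         |   import torch.nn.functional as F
         |
         |   torch.manual_seed(0)
         |   vocab, n = 8, 20
         |   targets = torch.randint(vocab, (n,))   # "reading area" tokens
         |
         |   # confident and correct everywhere except the last position,
         |   # where a wrong token narrowly wins the argmax
         |   logits = torch.full((n, vocab), -4.0)
         |   logits[torch.arange(n), targets] = 4.0
         |   logits[-1, (targets[-1] + 1) % vocab] = 4.5
         |
         |   loss = F.cross_entropy(logits, targets)   # L_Repeat-style mean
         |   exact = bool((logits.argmax(-1) == targets).all())
         |   print(f"mean CE: {loss:.3f}, exact reconstruction: {exact}")
         |   # -> mean CE around 0.05, yet exact reconstruction is False
         |
         | "Close to zero" is a statement about average loss, not a
         | guarantee of exact recovery, so calling it lossless needs more
         | than that plot.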
        
       | godelski wrote:
       | Cool, but poorly written. Why spend half of page 4 on standard
       | attention (which should be assumed knowledge for a reader at
       | this point), but then not explain equation 10? What is L? Why
       | is there an identity matrix? I don't need equations 6-9, but I
       | sure do need more information on 10-14. I hope there's code.
       | 
       | What a weird line in Figure 2
       | 
       | > Note: L_LM represents L_LM, and L_Repeat represents L_Repeat,
       | and Loss represents L.
       | 
       | Tautology is not helpful here.
       | 
       | And what's everyone's aversion to log plots? Figure 3 is
       | unreadable as is, but would be perfectly legible on a log scale.
       | 
       | And where's the Appendix? It's referenced in the paper but
       | nowhere to be found. I'm also unconvinced it's lossless.
        
         | Solvency wrote:
         | They spend that much time reiterating well-trodden, established
         | information because that's the easiest material to write at
         | length about. All of the novel parts are suspiciously vague,
         | with the details left to the imagination, because the author is
         | fudging things.
        
       ___________________________________________________________________
       (page generated 2024-01-27 23:00 UTC)