[HN Gopher] Let's Think Dot by Dot: Hidden Computation in Transf...
___________________________________________________________________
Let's Think Dot by Dot: Hidden Computation in Transformer Language
Models
Author : Jimmc414
Score : 67 points
Date : 2024-04-27 19:28 UTC (3 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| diziet wrote:
| This is a surprising result to me, given that (in my mind) the
| method simply does a few more forward passes, without encoding
| or transferring meaningful state between passes.
| sdenton4 wrote:
| You get hidden states at every layer of the network, for every
| token. That's extra state the model can attend to when running
| in autoregressive 'generate the next token' mode.
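| A toy numpy sketch of that point (illustrative only, nothing to
| do with the paper's code): appending filler tokens gives the
| final position extra key/value vectors -- i.e. extra per-layer
| hidden state -- to attend over. All dimensions are made up.
|
|   import numpy as np
|
|   def causal_self_attention(x):
|       # x: (seq_len, d_model); identity projections keep it tiny
|       d = x.shape[-1]
|       scores = x @ x.T / np.sqrt(d)
|       mask = np.tril(np.ones_like(scores))      # causal mask
|       scores = np.where(mask == 1, scores, -1e9)
|       w = np.exp(scores - scores.max(-1, keepdims=True))
|       w /= w.sum(-1, keepdims=True)
|       return w @ x                              # mix of values
|
|   rng = np.random.default_rng(0)
|   prompt = rng.normal(size=(8, 16))    # 8 "real" tokens
|   fillers = rng.normal(size=(20, 16))  # 20 filler "." tokens
|
|   short = causal_self_attention(prompt)
|   long = causal_self_attention(np.vstack([prompt, fillers]))
|
|   # The last position now mixes over 28 positions instead of 8,
|   # so there are 20 extra hidden-state vectors to read from.
|   print(short.shape, long.shape)       # (8, 16) (28, 16)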
| ehsanu1 wrote:
| How much extra state and computation is it per token exactly?
| Can we account for the improvement in just those terms?
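| Back-of-envelope, with made-up GPT-3-scale numbers (the ~2 *
| params FLOPs-per-token figure is the usual scaling-law rule of
| thumb, not something from this paper):
|
|   n_params = 175e9   # parameters
|   n_layers = 96      # transformer blocks
|   d_model = 12288    # hidden size
|
|   # Extra forward compute bought by one filler token:
|   flops_per_filler = 2 * n_params                # ~3.5e11 FLOPs
|
|   # Extra state it leaves behind for later tokens to attend to:
|   kv_cache_per_filler = 2 * n_layers * d_model   # keys + values
|
|   print(f"{flops_per_filler:.1e} FLOPs, "
|         f"{kv_cache_per_filler} cached floats per filler token")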
| ehsanu1 wrote:
| I've only read the abstract, but I also find this strange. I
| wonder if this just taps into computational paths that are
| already available when the relevant tokens sit further apart,
| because the positional encodings were trained on such spacings.
| If so, that makes the reasoning/modeling powers of LLMs even
| more impressive and inscrutable.
| dist-epoch wrote:
| You can transfer some state through the dots alone. The dot
| count could mean "the first n ideas don't work, so analyze the
| (n+1)-th one; if that's also bad, emit another dot."
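| A toy sketch of that protocol (purely illustrative, not how the
| models in the paper are trained): the number of dots emitted so
| far indexes which hypothesis to try next.
|
|   hypotheses = ["guess A", "guess B", "guess C", "guess D"]
|
|   def works(h):
|       # stand-in for "does this idea actually pan out?"
|       return h == "guess C"
|
|   output = ""
|   for h in hypotheses:
|       if works(h):
|           output += h    # commit to the answer
|           break
|       output += "."      # reject it: one more dot of "state"
|
|   print(output)          # "..guess C" -- two rejected ideas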
| pyinstallwoes wrote:
| Can't anything be compressed into one word by comparison?
| rgbrgb wrote:
| I found a nice Twitter-thread walkthrough of this paper by the
| first author here:
| https://twitter.com/jacob_pfau/status/1783951795238441449
___________________________________________________________________
(page generated 2024-04-27 23:00 UTC)