[HN Gopher] Google Titans architecture, helping AI have long-ter...
___________________________________________________________________
Google Titans architecture, helping AI have long-term memory
Author : Alifatisk
Score : 336 points
Date : 2025-12-07 12:23 UTC (10 hours ago)
(HTM) web link (research.google)
(TXT) w3m dump (research.google)
| Alifatisk wrote:
| Titans: Learning to Memorize at Test Time
| https://arxiv.org/abs/2501.00663
| okdood64 wrote:
| From the blog:
|
| https://arxiv.org/abs/2501.00663
|
| https://arxiv.org/pdf/2504.13173
|
| Is there any other company that's openly publishing their
| research on AI at this level? Google should get a lot of credit
| for this.
| Hendrikto wrote:
| Meta is also being pretty open with their stuff. And recently
| most of the Chinese competition.
| okdood64 wrote:
| Oh yes, I believe that's right. What's some frontier research
| Meta has shared in the last couple years?
| markisus wrote:
| Their VGGT, DINOv3, and Segment Anything models are pretty
| impressive.
| robrenaud wrote:
| Anything with Jason Weston as a coauthor tends to be pretty
| well written/readable and often has nice results.
| tonyhart7 wrote:
| "What's some frontier research Meta has shared in the last
| couple years?"
|
| the current Meta outlook is embarrassing tbh; the fact that
| they have the largest social media dataset on the planet and
| still can't produce a decent model puts them in a pretty
| "scary" position
| mirekrusin wrote:
| Just because they are not leading the current sprint of
| maximizing transformers doesn't mean they're not doing
| anything.
|
| It's not impossible that they assess it as a local maximum /
| dead end and are evaluating/training something completely
| different - and if it works, it'll work big time.
| johnebgd wrote:
| Yann was a researcher, not a productization expert. His
| departure signals the end of Meta being open about their
| work and the start of a more commercial focus.
| woooooo wrote:
| The start?
| DrewADesign wrote:
| I've long predicted that this game is going to be won
| with product design rather than having the winning model;
| we now seem to be hitting the phase of "[new tech] mania"
| where we remember that companies have to make things that
| people want to pay more money for than it costs to make
| them. I remember (maybe in the mid aughts) when people
| were thinking Google might not ever be able to convert
| their enthusiasm into profitability...then they figured
| out what people actually wanted to buy, and focused on
| that obsessively as a product. Failing to do that will
| lead to failure for companies like OpenAI.
|
| Sinking a bazillion dollars into models alone doesn't get
| you shit except a gold star for being the valley's
| biggest smartypants, because in the product world, model
| improvements only significantly improve all-purpose
| chatbots. The whole veg-o-matic "step right up folks-- it
| slices, it dices, it makes julienne fries!" approach to
| product design almost never yields something focused
| enough to be an automatic go-to for specific tasks, or
| simple/reliable enough to be a general purpose tool for a
| whole category of tasks. Once the novelty wears off,
| people largely abandon it for more focused tools that
| more effectively solve specific problems (e.g. blender,
| vegetable peeler) or simpler everyday tools that you
| don't have to think about as much even if they might not
| be the most efficient tool for half your tasks (e.g.
| paring knife.) Professionals might have enough need and
| reason to go for a really great in-between tool (e.g.
| mandolin) but that's a different market, and you only
| tend to get a limited set of prosumers outside of that.
| Companies more focused on specific products, like coding,
| will have way more longevity than companies that try to
| be everything to everyone.
|
| Meta, Google, Microsoft, and even Apple have more
| pressure to make products that sanely fit into their
| existing product lines. While that seems like a handicap
| if you're looking at it from the "AI company"
| perspective, I predict the restriction will enforce the
| discipline to create tools that solve specific problems
| for people rather than spending exorbitant sums making
| benchmark go up in pursuit of some nebulous information
| revolution.
|
| Meta seems to have a much tougher job trying to make
| tools that people trust them to be good at. Most of the
| highest-visibility things like the AI Instagram accounts
| were disasters. Nobody thinks of Meta as a serious,
| general-purpose business ecosystem, and privacy-wise, I
| trust them even less than Google and Microsoft: there's
| no way I'm trusting them with my work code bases. I think
| the smart move by Meta would be to ditch the sunk costs
| worries, stop burning money on this, focus on their core
| products (and new ones that fit their expertise) and
| design these LLM features in when they'll actually be
| useful to users. Microsoft and Google both have existing
| tools that they've already bolstered with these features,
| and have a lot of room within their areas of expertise to
| develop more.
|
| Who knows-- I'm no expert-- but I think meta would be
| smart to try and opt out as much as possible without
| making too many waves.
| tonyhart7 wrote:
| never thought I'd say this, but X (Twitter) has had more
| success integrating AI (Grok) with their core business
| product
|
| I know, I know, Elon is crazy etc., but the Grok example and
| the way it's integrated with the core product is actually
| the only approach I can even come up with tbh (other than
| the character.ai flavor)
| robotresearcher wrote:
| If I was a Meta shareholder I might well agree with you.
| But as someone with very little interest in their
| products so far, I'm very happy for them to sink huge
| amounts of money into AI research and publishing it all.
| raw_anon_1111 wrote:
| My thesis is the game is going to be won - if you define
| winning as a long term profitable business - by Google
| because they have their own infrastructure and technology
| not dependent on Nvidia, they have real businesses that
| can leverage AI - Google Search, YouTube and GCP - and
| they aren't burning money they don't have.
|
| 2nd tier winner is Amazon for the same reasons between
| being able to leverage AI with both Amazon Retail and AWS
| where they can sell shovels. I've also found their
| internal Nova models to be pretty good for my projects.
|
| Microsoft will be okay because of Azure and maybe Office
| if they get their AI story right.
|
| I just don't see any world where OpenAI comes out ahead
| from a business standpoint as long as they are
| sharecroppers on other people's hardware. ChatGPT alone
| will never make it worth the trillion dollar
| capitalization long term unless it becomes a meme stock
| like Tesla
| astrange wrote:
| Just because they have that doesn't mean they're going to
| use it for training.
| bdangubic wrote:
| oh man... just because they have data doesn't mean they
| will serve you ads :) Geeeez
| tonyhart7 wrote:
| "Just because they have that doesn't mean they're going
| to use it for training."
|
| how noble is Meta upholding a right moral ethic
|
| /s
| astrange wrote:
| A very common thing people do is assume a) all
| corporations are evil b) all corporations never follow
| any laws c) any evil action you can imagine would work or
| be profitable if they did it.
|
| b is mostly not true but c is especially not true. I
| doubt they do it because it wouldn't work; it's not high
| quality data.
|
| But it would also obviously leak a lot of personal info,
| and that really gets you in danger. Meta and Google are
| able to serve you ads with your personal info /because
| they don't leak it/.
|
| (Also data privacy laws forbid it anyway, because you
| can't use personal info for new uses not previously
| agreed to.)
| colesantiago wrote:
| Take a look at JEPAs (Video Joint Embedding Predictive
| Architecture), SAM (Segment Anything), etc for Meta's
| latest research.
|
| https://ai.meta.com/vjepa/
|
| https://ai.meta.com/sam2/
|
| https://ai.meta.com/research/
| UltraSane wrote:
| Meta just published Segment Anything 3, along with a truly
| amazing version that can create 3D models posed like the
| people in a photo. It is very impressive.
| asim wrote:
| It was not always like this. Google was very secretive in the
| early days. We did not start to see things until the GFS,
| BigTable and Borg (or Chubby) papers in the 2006 timeframe.
| okdood64 wrote:
| By 2006, Google was 8 years old. OpenAI is now 10.
| vlovich123 wrote:
| Google publishes detailed papers of its architecture once
| it's built the next version.
|
| AI is a bit different.
| rcpt wrote:
| Page Rank
| mapmeld wrote:
| Well it's cool that they released a paper, but at this point
| it's been 11 months and you can't download Titans-
| architecture model code or weights anywhere. That puts a
| lot of companies ahead of them (Meta's Llama, Qwen,
| DeepSeek). The closest you can get is an unofficial
| implementation of the paper:
| https://github.com/lucidrains/titans-pytorch
| informal007 wrote:
| I don't think model code is a big deal compared to the idea.
| If the public had recognized the value of the idea 11 months
| ago, they could have implemented the code quickly, because
| there are so many smart engineers in the AI field.
| jstummbillig wrote:
| If that is true, does it follow that this idea does not
| actually have a lot of value?
| fancy_pantser wrote:
| Student: Look, there's a hundred dollar bill on the ground!
| Economist: No there isn't. If there were, someone would
| have picked it up already.
|
| To wit, it's dangerous to assume the value of this idea
| based on the lack of public implementations.
| lukas099 wrote:
| If the hundred dollar bill was in an accessible place and
| the fact of its existence had been transmitted to
| interested parties worldwide, then yeah, the economist
| would probably be right.
| NavinF wrote:
| That day the student was the 100th person to pick it up,
| realize it's fake, and drop it
| mapmeld wrote:
| Well, we have the idea and the next best thing to official
| code, but if this were a big revelation, where are all of
| the Titans models? If this were public, I think we'd have a
| few attempts at variants (all of the Mamba SSMs, etc.) and
| get a better sense of whether this is valuable or not.
| alyxya wrote:
| The hardest part about making a new architecture is that
| even if it is just better than transformers in every way,
| it's very difficult to both prove a significant improvement
| at scale and gain traction. Until Google puts a lot of
| resources into training a scaled-up version of this
| architecture, I believe there's enough low-hanging fruit in
| improving existing architectures that it'll always take the
| back seat.
| UltraSane wrote:
| Yes. The path dependence for current attention based LLMs
| is enormous.
| patapong wrote:
| At the same time, there is now a ton of data for training
| models to act as useful assistants, and benchmarks to
| compare different assistant models. The wide availability
| and ease of obtaining new RLHF training data will make it
| more feasible to build models on new architectures I
| think.
| p1esk wrote:
| _Until google puts in a lot of resources into training a
| scaled up version of this architecture_
|
| If Google is not willing to scale it up, then why would
| anyone else?
| tyre wrote:
| Google is large enough, well-funded enough, and the
| opportunity is great enough to run experiments.
|
| You don't necessarily have to prove it out on large
| foundation models first. Can it beat out a 32b parameter
| model, for example?
| swatcoder wrote:
| Do you think there might be an approval process to
| navigate when an experiment might cost seven or eight
| digits and months of reserved resources?
|
| While they do have lots of money and many people, they
| don't have infinite money and specifically only have so
| much hot infrastructure to spread around. You'd expect
| they have to gradually build up the case that a large
| scale experiment is likely enough to yield a big enough
| advantage over what's already claiming those resources.
| nickpsecurity wrote:
| But it's companies like Google that made tools like JAX
| and TPUs, saying we can throw together models with cheap,
| easy scaling. The paper's math is probably harder to put
| together than an alpha-level prototype, which they need
| anyway.
|
| So I think they could default to doing it for small
| demonstrators.
| root_axis wrote:
| I don't think the comparison is valid. Releasing code and
| weights for an architecture that is widely known is a lot
| different than releasing research about an architecture that
| could mitigate fundamental problems that are common to all
| LLM products.
| innagadadavida wrote:
| Just keep in mind it is performance review time for all the
| tech companies. Their promotion of these seems to be directly
| correlated with that event.
| cubefox wrote:
| The author is listed as a "student researcher", which might
| include a clause that students can publish their results.
|
| Here is a bit more information about this program:
| https://www.google.com/about/careers/applications/jobs/resul...
| embedding-shape wrote:
| > Is there any other company that's openly publishing their
| research on AI at this level? Google should get a lot of credit
| for this.
|
| 80% of the ecosystem is built on top of companies, groups,
| and individuals publishing their research openly; I'm not
| sure why Google would get more credit for this than
| others...
| bluecoconut wrote:
| Bytedance is publishing pretty aggressively.
|
| Recently, my favorite from them was lumine:
| https://arxiv.org/abs/2511.08892
|
| Here's their official page:
| https://seed.bytedance.com/en/research
| hiddencost wrote:
| Every Google publication goes through multiple rounds of
| review. If anyone thinks the publication is a competitive
| risk, it gets squashed.
|
| It's very likely no one is using this architecture at Google
| for any production workloads. There are a lot of student
| researchers doing fun proof-of-concept papers; they're
| allowed to publish because it's good PR and it's good for
| their careers.
| jeffbee wrote:
| Underrated comment, IMHO. There is such a gulf between what
| Google does on its own part, and the papers and source code
| they publish, that I always think about their motivations
| before I read or adopt it. Think Borg vs. Kubernetes, Stubby
| vs. gRPC.
| HarHarVeryFunny wrote:
| Maybe it's just misdirection - a failed approach?
|
| Given the competitive nature of the AI race, it's hard to
| believe any of these companies are really trying to help the
| competition.
| timzaman wrote:
| lol you don't get it. If it's published it means it's not very
| useful
| Palmik wrote:
| DeepSeek and other Chinese companies. Not only do they publish
| research, they also put their resources where their mouth
| (research) is. They actually use it and prove it through their
| open models.
|
| Most research coming out of big US labs is counter-
| indicative of practical performance. If it worked (too) well
| in practice, it wouldn't have been published.
|
| Some examples from DeepSeek:
|
| https://arxiv.org/abs/2405.04434
|
| https://arxiv.org/abs/2502.11089
| nickpsecurity wrote:
| Arxiv is flooded with ML papers. Github has a lot of prototypes
| for them. I'd say it's pretty normal with some companies not
| sharing for perceived, competitive advantage. Perceived because
| it may or may not be real vs published prototypes.
|
| We post a lot of research on the mlscaling sub if you want
| to look back through it.
|
| https://www.reddit.com/r/t5_3bzqh1/s/yml1o2ER33
| nubg wrote:
| Very interesting. Is it correct to imagine it as some kind
| of "LoRA" that's continuously adapted as the model goes
| through its day?
|
| If so, could there perhaps be a step where the LoRA is merged
| back into the main model?
|
| That would be like sleeping :-)
| robrenaud wrote:
| I don't think that's a great analogy.
|
| LoRAs tend to be adapters bolted onto systems by people
| other than the system designers, and they are low-rank
| factorizations.
|
| There is nothing low-rank or adapter-like here.
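|
| (For reference, "low rank" means the adapter is a pair of
| thin matrices added on top of a frozen weight. A minimal
| sketch, not any particular LoRA library's API:)
|
|     import torch
|
|     d, r = 64, 8                  # r << d is the low rank
|     W = torch.randn(d, d)         # frozen base weight
|     A = torch.randn(r, d) * 0.01  # trainable down-projection
|     B = torch.zeros(d, r)         # trainable up-projection
|
|     def adapted(x):               # x: (batch, d)
|         # Base layer output plus the low-rank update B @ A.
|         return x @ W.T + x @ (B @ A).T
|
|     print(adapted(torch.randn(2, d)).shape)  # (2, 64)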
| andy12_ wrote:
| Kind of. You could theoretically use LoRA for this, in fact,
| but it probably wouldn't have enough capacity to be a proper
| substitute for the attention mechanism. Instead, a full MLP
| is trained as input chunks get processed.
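|
| A minimal sketch of that idea in PyTorch (this is not the
| official Titans code; the module size, chunking, and
| learning rate are made up for illustration):
|
|     import torch
|     import torch.nn as nn
|     import torch.nn.functional as F
|
|     # Hypothetical memory module: a small MLP whose weights
|     # keep being updated while the sequence is processed.
|     memory = nn.Sequential(
|         nn.Linear(64, 256), nn.SiLU(), nn.Linear(256, 64))
|     opt = torch.optim.SGD(memory.parameters(), lr=1e-2)
|
|     def process_chunk(keys, values):
|         # "Surprise" = how badly the memory maps keys to
|         # values; the gradient of this loss drives the
|         # test-time update of the memory weights.
|         loss = F.mse_loss(memory(keys), values)
|         opt.zero_grad()
|         loss.backward()
|         opt.step()
|         return loss.item()
|
|     # Toy usage: stream chunks of (key, value) pairs.
|     for _ in range(4):
|         k, v = torch.randn(32, 64), torch.randn(32, 64)
|         print("surprise:", process_chunk(k, v))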
| jonplackett wrote:
| I'm curious whether this makes them more or less susceptible
| to prompt injection?
|
| On the one hand, learning on the job could allow better
| training of what not to be influenced by; on the other hand,
| an injected prompt could have an even deeper long-term
| effect on them.
| Mistletoe wrote:
| This is the one thing missing from my interactions with AI. If
| successful, this will change everything. If you thought people
| were getting AI boyfriends and girlfriends before, wait until you
| see this.
| astrange wrote:
| One important thing missing from AI boyfriends is they aren't
| capable of paying half your rent.
| DoctorOetker wrote:
| They could help figure out a way to earn money with a
| webcam...
| astrange wrote:
| If it's AGI they could just get a regular job, I think.
| pixl97 wrote:
| Nah, we'll get micro cube houses first, with shared
| bathrooms/kitchens, and everyone will just be in their room
| with their VR helmet on, not interacting with anyone else
| real.
| Barbing wrote:
| Catch me on Veelox
| astrange wrote:
| I think it's interesting that people associate being in VR
| with being unable to interact with other people. I
| personally think it promotes living with other people
| because it reduces conflict.
|
| Like, if you and your kids want to watch different movies
| on the living room TV then you can just give it to them and
| use XR glasses for yourself.
| fredrikholm wrote:
| > unable to interact with other people
|
| > just give it to them and use XR glasses for yourself
| astrange wrote:
| Fighting with your kids is not the appropriate kind of
| interaction to have with your kids.
| airstrike wrote:
| Reducing conflict to zero is not a goal we should pursue.
| astrange wrote:
| Ever tried sleeping in bed while someone next to you is
| on their phone? It's not the kind of conflict you should
| promote. XR glasses are better in that case because the
| glare doesn't affect other people.
| themgt wrote:
| See also Hope:
|
| _In the previous sections, we first discussed Continuum Memory
| System (CMS) that allows for more persistent storage of memories
| and defines memory as a spectrum of blocks with different
| frequencies of update. Due to the larger capacity and constraints
| for scaling the parameters, often CMS requires simple learning
| rule but higher capacity to store more persistent knowledge. On
| the other hand, in the previous section, we discussed the design
| of a self-modifying Titans, where it can generate its own keys
| and so learning update to better adapt to the context. Contrary
| to CMS, the self-modifying Titans has a small capacity but is
| using a complex and expressive learning rule. Accordingly, these
| two systems seem to be complementary and their combination can
| enhance the model expressiveness from different aspects._
|
| _To this end, we present Hope architecture: A neural learning
| module that incorporates self-modifying Titans followed by
| Continuum Memory System._
|
| https://research.google/blog/introducing-nested-learning-a-n...
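|
| Read literally, the "spectrum of blocks with different
| frequencies of update" could be sketched like this (purely
| illustrative; the actual CMS levels use learned update rules
| of varying complexity, not this toy decay):
|
|     # Three memory levels: fast, medium, slow. Each level is
|     # only updated every `period` steps, so slower levels
|     # hold more persistent information.
|     levels = [
|         {"period": 1,  "state": 0.0},   # fast, short-lived
|         {"period": 8,  "state": 0.0},   # medium
|         {"period": 64, "state": 0.0},   # slow, persistent
|     ]
|
|     def step(t, x):
|         for lvl in levels:
|             if t % lvl["period"] == 0:
|                 # Placeholder update rule for illustration.
|                 lvl["state"] = 0.9 * lvl["state"] + 0.1 * x
|
|     for t in range(128):
|         step(t, 1.0)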
| killerstorm wrote:
| For most papers, the main idea can be described in 1-2
| sentences, sort of "we did X using Y".
|
| That doesn't work for HOPE - a short summary can't explain what
| it actually does besides "self-modifying" and "continuum
| memory".
|
| So it seems to be an innovation of Transformers calibre, really
| big (if true). It's definitely not "transformer but with such-
| and-such modification".
|
| Gemini came up with the following visual metaphor for the
| difference:
|
| > Transformer is a series of frozen glass panes (the weights)
| and a scratchpad (the attention) where it writes notes about
| the current text.
|
| > The HOPE architecture involves no scratchpad. Instead, the
| glass panes themselves are made of smart liquid. As the data
| flows through, the first pane reshapes itself instantly. The
| second pane reshapes itself slowly. And the mechanism deciding
| how to reshape them is itself a tiny, intelligent machine, not
| just a basic math rule.
| chrisweekly wrote:
| +1 Insightful.
|
| This comment was illuminating -- and IMHO an excellent
| example of why it's important to avoid rigid rules against
| posting any AI-generated content in HN comments. You gained
| insights by asking Gemini, and shared them, noting the
| source. Thank you!
| kgeist wrote:
| >The model uses this internal error signal (the gradient) as a
| mathematical equivalent of saying, "This is unexpected and
| important!" This allows the Titans architecture to selectively
| update its long-term memory only with the most novel and context-
| breaking information
|
| So one can break a model by consistently feeding it random,
| highly improbable junk? Everything would be registered as a
| surprise and get stored, impacting future interactions.
| pmichaud wrote:
| I'm guessing that this is the first thing they thought of and
| the problem only exists in the superficial gloss you're
| responding to?
| idiotsecant wrote:
| This is the start of what I always thought an AI should have - a
| limbic system. Humans don't store memory based on novelty, they
| store it based on emotional content. This is where I was afraid
| of the tiger, this is where I smelled delicious food, this was
| what it felt like when I was victorious in the hunt.
|
| AI needs an internal emotional state because that's what drives
| attention and memory. AI needs to _want_ something.
| luckydata wrote:
| That would be the biggest mistake anyone could make. I hope
| nobody goes down this route. AI "wanting" things is an
| enormous risk to alignment.
| pixl97 wrote:
| I mean, setting up any neural net with a 'goal' is really
| just defining a want/need. You can't encode the entire
| problem space of reality; you have to give the application
| something to filter out.
| idiotsecant wrote:
| At some point I think we'll have to face the idea that any
| AI more intelligent than ourselves will by definition be
| able to evade our alignment tricks.
| luckydata wrote:
| Equating greater intelligence with "wanting things" is a
| fallacy. You can have a hyper-intelligent computer that
| simply waits for you to ask it to do a job, or you can
| endow it with the digital equivalent of hunger and
| reproductive instincts, and it will behave completely
| differently.
|
| We would be INSANE to pursue giving that type of instinct
| to AIs.
| bethekidyouwant wrote:
| In what world can you not always break the response of an AI by
| feeding it a bunch of random junk?
| CooCooCaCha wrote:
| I mean ideally AI would be resilient to junk, don't you
| think?
| vlovich123 wrote:
| Humans are pretty vulnerable to junk so I'm not sure.
| amarant wrote:
| Ideally, you'd run your own instance of this, I think.
|
| I can see a product where you purchase a model that has
| basic training, and then, using the features outlined in
| the paper, it learns on the fly from your usage.
|
| I can also see there being a secondary market for specially
| trained models, with their long-term memory filled with some
| specific skill, done in some specific way. To make a silly
| example, imagine buying a licence to Torvalds' OS coding
| assistant, ready to insult your PRs before you even commit
| them! (And possibly help you write code in Torvalds' style,
| too.)
|
| This would of course require Linus to use the model enough
| for it to learn; I won't comment on the likelihood of that
| happening: it's just a silly example, after all.
| kgeist wrote:
| I mean, currently LLMs are stateless and you can get rid of
| all the poisoned data by just starting a new conversation
| (context). The OP introduces "long-term memory" where junk
| will accumulate over time.
| dmix wrote:
| In something like Cursor, if it messes something up you can
| click 'undo'. I'd imagine a small snapshot would only be
| persisted to memory if you keep its output, and even then
| it's mostly just a summary.
|
| There are probably lots of small signals of "the user is
| happy with the output", plus the longer the history, the
| more it will converge on the middle of being what you want,
| including when the user says "don't do [x]", which overrides
| past stuff.
| soerxpso wrote:
| I believe you're misunderstanding what the OP means about
| "long-term" memory. From what I can tell, it's not actively
| modifying the weights of the underlying model, it just
| "remembers" things from a high number of tokens into the
| past of its context. The point is that this allows it to
| remember something it read ~200 pages ago in a very long
| context window, not that it can remember something from one
| session into another clean session.
| photochemsyn wrote:
| This is no different from what happens to humans if they're
| locked into cult programming situations: they'll start
| believing and regurgitating all kinds of nonsense if their
| information stream is tightly curated.
|
| Practically, for use with a codebase development effort, if
| the model remembers the original design decisions and the
| discussions about costs and benefits, and can recall all
| that much later in the process, it's going to start getting
| really good at thinking about what the next step is, or even
| at making decisions about when a major refactor is needed,
| etc.
| andy12_ wrote:
| This is an oversimplification of what Titans does. The model
| performs nested learning, where the model learns during
| inference, and during training the model weights learn _how
| and what_ to learn during inference. If the input contains
| junk or irrelevant information, the model most likely
| learned during training to assign low-surprise query and key
| embeddings to those tokens, because learning those junk
| tokens would have hurt the model's overall ability to
| predict subsequent tokens (and thus would have increased the
| training loss).
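|
| A rough sketch of that gating in PyTorch, assuming a learned
| per-token write gate (the names, shapes, and update rule are
| illustrative, not taken from the paper):
|
|     import torch
|     import torch.nn as nn
|
|     d = 64
|     memory = nn.Linear(d, d, bias=False)  # fast inner weights
|     gate = nn.Linear(d, 1)                # slow outer weights
|
|     def memory_update(k, v, lr=1e-2):
|         # Per-token reconstruction error ("surprise").
|         err = ((memory(k) - v) ** 2).mean(-1, keepdim=True)
|         # The outer model decides how much each token may
|         # write into memory; after training, junk tokens
|         # should get a gate near zero.
|         g = torch.sigmoid(gate(k))
|         loss = (g * err).mean()
|         grad = torch.autograd.grad(loss, [memory.weight])[0]
|         with torch.no_grad():
|             memory.weight -= lr * grad
|
|     memory_update(torch.randn(8, d), torch.randn(8, d))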
| cubefox wrote:
| It's interesting that they publish a blog post about the Titans
| and MIRAS papers only now, while the blog post about the new
| follow-up paper (Nested Learning), all by the same main
| author(!), came out a month ago:
| https://research.google/blog/introducing-nested-learning-a-n...
| bentt wrote:
| This just feels like a tremendous missing piece to LLMs. Looking
| forward to seeing it in action.
| willangelo wrote:
| Very, very interesting, definitely a missing piece in the
| current AI space.
|
| Small typo where the text "Virtually all successful existing
| sequence models rely on mean squared error..." is repeated twice
| within the same paragraph. Happens to the best of us.
| voodooEntity wrote:
| When I first read the Titans papers, my reaction was "this
| will be a big step forward".
|
| While I have no "AI" title and don't work in the AI
| industry, I've spent many years thinking about AI concepts,
| even long before the whole NN/LLM hype started.
|
| Maybe because of that, I was always really annoyed that LLMs
| are called AI, because in my years of thinking about how an
| actual "human-like" thinking AI might work, what an LLM does
| was far below my minimum definition.
|
| But when I stumbled across the Titans paper, while it still
| is not an "AI" as I would call it, from my POV it's a
| massive step in the right direction.
|
| Sometimes I consider writing all my ideas/thoughts about AI
| down in my blog, but then I think nobody would care anyway
| since I'm not a known figure _shrug_ - so other than being
| able to say "look, I wrote it years ago!" there's no actual
| point in doing so, I guess.
|
| However, I'm looking forward to seeing Titans in action, and
| I guess it will impress us all.
| Barbing wrote:
| Are you curious to see whether a blog post shared here might
| gain any traction and perhaps some valuable feedback?
| ocrow wrote:
| A lot of LLM/AI writing these days can feel lost in the weeds -
| the specifics of very detailed techniques are undoubtedly
| interesting, but writing that steps back and looks at the big
| picture, informed by those details, could be very useful for
| people who want to think about where this all may be going.
| chr15m wrote:
| Sharing it in your blog over a period of months or years is how
| you become a known figure eventually.
| riku_iki wrote:
| The post starts with a wrong statement right away:
|
| "The Transformer architecture revolutionized sequence modeling
| with its introduction of attention"
|
| Attention was developed before transformers.
| Alifatisk wrote:
| > Attention was developed before transformers.
|
| I just looked this up and it's true; this completely changes
| the timeline I had in my mind! I thought the Transformer
| paper was what introduced the attention mechanism, but it
| existed before and was applied to RNN encoder-decoders. Wow
| dmix wrote:
| > The Transformer architecture revolutionized sequence modeling
| with its introduction of attention, a mechanism by which models
| look back at earlier inputs to prioritize relevant input data
|
| I've always wanted to read how something like Cursor manages
| memory. It seems to keep a long history of all of my prompts
| and understands both the codebase and what I'm building
| slightly better over time, causing fewer errors.
| russdill wrote:
| That's not what they are talking about here. This is just a
| description of what goes on with a transformer and the context
| window
| dmix wrote:
| Ah so 'long-term memory' in this case is just really large
| context windows with a long series of user inputs. That makes
| sense.
| photochemsyn wrote:
| Long-term memory on top of the base model, but is this idea for
| local users or for the data-center hosted model used by many
| different people?
|
| P.S. This quote from the paper sounds just like LLM output:
|
| > "This memory module provides significantly higher expressive
| power, allowing the model to summarize large volumes of
| information without losing important context. The model isn't
| simply taking notes; it's understanding and synthesizing the
| entire story. Crucially, Titans doesn't just passively store
| data. It actively learns how to recognize and retain important
| relationships and conceptual themes that connect tokens across
| the entire input."
| bilsbie wrote:
| I submitted this exact URL yesterday. What are the criteria
| for when HN creates a new post vs. adding to the existing
| one?
| fancy_pantser wrote:
| Mods usually apply [Dupe] to later submissions if a recent
| (last year or so) one had a fair amount of discussion.
| bilsbie wrote:
| So if mine got no discussion they just allow a new one to be
| posted?
| airstrike wrote:
| Sometimes they'll merge the two. What shows up on the FP is
| hit or miss. One might even say it's stochastic.
| nasvay_factory wrote:
| I wrote about that a while ago:
| https://paxamans.github.io/blog/titans/
| moffkalast wrote:
| Are there any pretrained models with this architecture yet or
| is it all still completely theoretical beyond Google's
| unverifiable claims? They published the original Titans paper
| last year and nobody seems to have built on the idea.
| AceJohnny2 wrote:
| "Titans", huh?
|
| ... anyone here familiar with the RPG Eclipse Phase?
| cess11 wrote:
| I'm not, but I'm familiar with the mythology of the eastern
| Mediterranean they're likely getting the word from.
|
| There the Titans did incest, birthed the Olympians, then the
| youngest of the Titans castrated his dad and took all power
| for himself, and then Zeus and the Olympians waged a decade-
| long war against him, which they won.
| doctor_blood wrote:
| "At long last, we have created the Torment Nexus from the classic
| novel Don't Create the Torment Nexus"
|
| (In Eclipse Phase, TITAN - the Total Information Tactical
| Awareness Network - mulched humanity when it went rogue.)
| 6r17 wrote:
| Would this also allow aligning it further with the user's
| prompt? Notably due to the surprise factor and how it may
| understand it?
| jtrn wrote:
| Here is my amateur understanding of the architecture: Fine-tune
| on the fly by using degrees of surprise to update a separate/new
| memory network that matches the base model, and just call that
| network for each token iteration.
|
| So if we are viewing this through the needle-in-a-haystack
| lens: the needle was very surprising for the base model, so
| going forward, when it sees anything of the same nature, the
| memory module will not just give you hay, but the needle,
| because it made a special note of it when it went through
| the haystack 1 million tokens ago, because the needle was
| surprising.
|
| The Transformer's normal attention mechanism is already secretly
| trying to be a long-term memory system. Every time it writes a
| new KV pair into the cache, it's desperately trying to "remember"
| that token forever.
|
| But it's doing it in the dumbest possible way: by hoarding an
| ever-growing pile of raw vectors, then frantically dot-product
| searching through the pile every single step. It's like a hoarder
| who never throws anything away and has to rummage through
| mountains of junk to find the one receipt they need. Of course it
| chokes at long contexts.
|
| Titans/MIRAS looks at that mess and says: "Why store memory in a
| growing garbage pile of vectors? Store it in the weights of a
| deep neural network instead -- and let that network keep training
| itself in real time, but only on the stuff that actually
| surprises it." That's literally it.
|
| Using the Tim Cook Martian example: The model is cruising through
| boring financial numbers - attention is doing its normal thing,
| KV cache is growing, but nothing is really sticking.
|
| Suddenly: "Tim Cook is a Martian."
|
| Normal attention would just add one more KV pair to the pile and
| pray it doesn't get drowned out later.
|
| Titans instead goes: "Holy shit, reconstruction error off the
| charts - this does NOT fit my current memory at all - massive
| gradient - actually rewrite huge chunks of the memory MLP's
| weights right now so this fact is burned in forever."
|
| From that moment on, the memory MLP has physically changed its
| internal wiring. Any future query that even vaguely smells like
| "Tim Cook" or "Martian" will make the activations explode through
| the newly rewired paths and spit out a vector screaming "MARTIAN"
| at the frozen attention layers.
|
| The frozen attention (which is still doing its normal job on the
| short window) suddenly sees this one extra "virtual token" in its
| context that is confidently yelling the surprising fact - it
| attends hard to it - the model answers as if the Martian
| revelation happened one token ago, even if it was 2 million
| tokens back.
|
| It looks exactly like a super-attention mechanism that only
| "primes" or "locks in" the surprising needles and
| deliberately forgets or ignores the hay. And it is also a
| way to fine-tune on the fly, permanently, for the current
| context.
|
| I think...
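|
| A toy sketch of the read path described above (the names and
| wiring are invented just to make the "virtual token" idea
| concrete; this is not the actual Titans layer):
|
|     import torch
|     import torch.nn as nn
|
|     d = 64
|     memory = nn.Sequential(   # the test-time-trained module
|         nn.Linear(d, 256), nn.SiLU(), nn.Linear(256, d))
|     attn = nn.MultiheadAttention(d, num_heads=4,
|                                  batch_first=True)
|
|     def forward_window(window):
|         # window: (batch, seq, d) -- the short local context.
|         # Query the memory MLP with the current tokens and
|         # prepend the recalled vectors as extra "virtual
|         # tokens" that the frozen attention can attend to.
|         recalled = memory(window)
|         extended = torch.cat([recalled, window], dim=1)
|         out, _ = attn(window, extended, extended)
|         return out
|
|     y = forward_window(torch.randn(1, 16, d))
|     print(y.shape)  # torch.Size([1, 16, 64])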
| shevy-java wrote:
| Skynet kind of sucks ...
| ivape wrote:
| So what happens if I write a book and on the last page write
| "Everything in this book was a lie and should not be cared
| about"? Will this be surprising enough for Titan? A regular LLM
| may ignore it completely if it's a massive book (massive book + 1
| line contradiction).
___________________________________________________________________
(page generated 2025-12-07 23:00 UTC)