[HN Gopher] Challenges and Research Directions for Large Languag...
___________________________________________________________________
Challenges and Research Directions for Large Language Model
Inference Hardware
Author : transpute
Score : 92 points
Date : 2026-01-25 02:48 UTC (13 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| jauntywundrkind wrote:
| > _To address these challenges, we highlight four architecture
| research opportunities:_ High Bandwidth Flash _for 10X memory
| capacity with HBM-like bandwidth;_ Processing-Near-Memory _and_
| 3D memory-logic stacking _for high memory bandwidth; and_ low-
| latency interconnect _to speed up communication._
|
| A piece on High Bandwidth Flash (HBF) got submitted 6 hours ago!
| It's a _great_ article, with fantastic coverage of a wide swath of
| this rapidly moving industry.
| https://news.ycombinator.com/item?id=46700384
| https://blocksandfiles.com/2026/01/19/a-window-into-hbf-prog...
|
| HBF is about having many dozens or hundreds of channels of flash
| memory. The idea of Processing-Near-Memory for HBF, spread out,
| perhaps in a mixed 3D design, would not surprise me at all. One of
| the main challenges for HBF is building improved vias and improved
| stacking, and if that tech advances, the idea of mixing compute
| layers in among the NAND, rather than just NAND stacks, perhaps
| opens up too.
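| A rough back-of-the-envelope sketch of why channel count is the
| whole game (Python; the channel count and per-channel rate below
| are my own illustrative assumptions, not figures from the paper):
|
|     # Aggregate bandwidth of a many-channel flash stack, compared
|     # to a single HBM stack. All numbers are illustrative guesses.
|     def aggregate_bw_gbs(channels: int, per_channel_gbs: float) -> float:
|         """Aggregate bandwidth scales linearly with channel count."""
|         return channels * per_channel_gbs
|
|     # Hypothetical HBF stack: 128 channels at ~8 GB/s each.
|     hbf_bw = aggregate_bw_gbs(128, 8.0)  # 1024 GB/s
|     hbm_bw = 800.0                       # ballpark HBM3 stack, GB/s
|     print(f"HBF ~{hbf_bw:.0f} GB/s vs HBM ~{hbm_bw:.0f} GB/s")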
|
| These are all really exciting possible next steps.
| amelius wrote:
| Why is persistence such a big thing here? Non-flash memory just
| needs a tiny bit of power to keep its data. I don't see the
| revolutionary use case.
| Gracana wrote:
| Density is the key here, not persistence.
| amelius wrote:
| Thanks! This explains it.
|
| Now I'm wondering how you deal with the limited number of
| write cycles of flash memory. Or maybe that is not an issue
| in some applications?
| mrob wrote:
| During inference, most of the memory is read only.
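| A minimal endurance sketch (Python; the P/E rating and update
| cadence are assumptions for illustration, not HBF specs):
|
|     # NAND wears on program/erase (write) cycles, not on reads.
|     # Serving inference only reads the weights; writes happen when
|     # the model is updated, so endurance is rarely the bottleneck.
|     PE_CYCLE_RATING = 3000   # assumed rated P/E cycles per block
|     REWRITES_PER_DAY = 1.0   # assumed full-weight rewrites per day
|
|     years_of_life = PE_CYCLE_RATING / (REWRITES_PER_DAY * 365)
|     print(f"~{years_of_life:.1f} years to exhaust rated cycles")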
| amelius wrote:
| Sounds fair. That's not the kind of machine I'd want as a
| development system, though. And usually development systems
| are beefier than production systems, so I'm curious how
| they'd solve that.
| bluehat974 wrote:
| Also related: https://www.sdxcentral.com/news/ai-inference-crisis-
| google-e...
| random_duck wrote:
| Yup, reads like the executive summary (in a good way).
| random3 wrote:
| David Patterson is such a legend! From RAID to RISC to one of
| the best books in computer architecture, he's in my personal hall
| of fame.
|
| Several years ago I was at one of the Berkeley AMP Lab retreats at
| Asilomar, and as I was hanging out, I couldn't figure out how I
| knew the person in front of me, until an hour later when I saw his
| name during a panel :)).
|
| It was always the network. And David Patterson, after RISC,
| started working on IRAM, which tackled a related problem.
|
| NVIDIA bought Mellanox/InfiniBand, but Google has historically
| excelled at networking, and the TPU seems to be designed to scale
| out in the best possible way.
| suggeststrongid wrote:
| Can't we credit the first author in the title too? Come on.
| random_duck wrote:
| No, we can't; that would be a crime against royalty :)
| transpute wrote:
| The current title uses 79 characters of an 80-character budget:
|     75% = title written by first author
|     22% = name of second author, endorsing work of first author
|
| HN mods can revert the title to the original headline, without
| any author.
| amelius wrote:
| That appendix of memory prices looks interesting, but misses the
| recent trend.
| zozbot234 wrote:
| Weird to see no mention in this paper of persistent memory
| technologies beyond NAND flash. Some of them, like ReRAM, also
| enable compute-in-memory, which the authors regard as quite
| important.
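| A minimal sketch of the compute-in-memory idea (Python/NumPy; the
| crossbar size and conductances are illustrative assumptions):
|
|     import numpy as np
|
|     # In a ReRAM crossbar, weights are stored as cell conductances G.
|     # Driving the rows with voltages V yields column currents
|     # I = G.T @ V (Ohm's law plus Kirchhoff's current law): a
|     # matrix-vector multiply computed where the weights live.
|     rng = np.random.default_rng(0)
|     G = rng.uniform(0.0, 1.0, (4, 3))  # 4 rows x 3 columns of cells
|     V = rng.uniform(0.0, 0.5, 4)       # input voltages on the rows
|
|     I = G.T @ V                        # analog MVM, read per column
|     print(I)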
___________________________________________________________________
(page generated 2026-01-25 16:00 UTC)