Post AWkRAaeMlfqNp0UsVs by Meandres@hackers.town
(DIR) More posts by Meandres@hackers.town
(DIR) Post #AWkQOGkTSSYiCMJS5I by niconiconi@mk.absturztau.be
2023-06-16T11:42:25.829Z
0 likes, 0 repeats
If your problem is memory-bound, like iterative stencil computing or sparse matrix operations, it would run at less than 1% of the peak floating-point performance on the world's top supercomputer. In-Memory Computing when? #hpc
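[Editor's note: a roofline-style back-of-envelope makes the "<1% of peak" claim concrete. The machine numbers below are illustrative assumptions, not specs for any particular supercomputer.]

```python
# Roofline-style estimate of why a memory-bound kernel reaches only a
# tiny fraction of peak FLOPS. peak_flops and mem_bw are assumed,
# illustrative values, not vendor figures.

peak_flops = 1.0e18   # assumed ~1 EFLOP/s machine peak
mem_bw     = 5.0e15   # assumed aggregate memory bandwidth, bytes/s

# A 7-point 3D stencil does ~8 flops per grid point while streaming
# roughly 8 bytes in and 8 bytes out per point (double precision,
# assuming no cache reuse):
flops_per_point = 8
bytes_per_point = 16
arithmetic_intensity = flops_per_point / bytes_per_point  # flops/byte

# Attainable performance is capped by whichever roof you hit first:
attainable = min(peak_flops, mem_bw * arithmetic_intensity)
print(f"attainable: {attainable:.2e} FLOP/s "
      f"({100 * attainable / peak_flops:.2f}% of peak)")
```

With these assumed numbers the stencil tops out around 0.25% of peak, entirely limited by the bandwidth roof.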
(DIR) Post #AWkRAaeMlfqNp0UsVs by Meandres@hackers.town
2023-06-16T11:50:13Z
1 like, 0 repeats
@niconiconi Soon! ;) CXL and specialized processing units may also allow processing between CPU and memory 🤔.
(DIR) Post #AWkRSWiO8ZnApUvnAO by ignaloidas@not.acu.lt
2023-06-16T11:54:24.255Z
1 like, 0 repeats
@niconiconi@mk.absturztau.be In-memory compute (single non-stacked die, SRAM only, so very low density). Honestly, it would be interesting to see some FDTD done on it; it probably would work great.
(DIR) Post #AWkRqhCenI5cJAc7Jw by niconiconi@mk.absturztau.be
2023-06-16T11:58:46.435Z
1 like, 0 repeats
@ignaloidas@not.acu.lt "Mom, can we have in-memory computing?"
"Yes, we have in-memory computing at home."
*the in-memory computer at home*
(DIR) Post #AWkRzz6BczAAPu15Ci by ignaloidas@not.acu.lt
2023-06-16T12:00:26.780Z
2 likes, 0 repeats
@niconiconi@mk.absturztau.be FWIW it is very much a memory computer
(DIR) Post #AWkSV7IfBZuopUhIbQ by niconiconi@mk.absturztau.be
2023-06-16T12:06:03.639Z
0 likes, 0 repeats
@ignaloidas@not.acu.lt "would be interesting to see some FDTD done on it" — try selling a few chips to the US DoD; they are probably more than happy to use them for running their fighter jet Radar Cross Section FDTD simulations.
(DIR) Post #AWkSnO0PoTWlqzsDxY by ignaloidas@not.acu.lt
2023-06-16T12:09:22.156Z
1 like, 0 repeats
@niconiconi@mk.absturztau.be I'm pretty sure DoD already has them, since at least NETL has already done some stuff with it https://www.cerebras.net/blog/real-time-computational-physics-with-wafer-scale-processing
(DIR) Post #AWkSnklJCo6sQnyz6e by p@raru.re
2023-06-16T12:09:26Z
1 like, 0 repeats
https://en.wikipedia.org/wiki/Adaptive_mesh_refinement is what I'm using in an HPC project @niconiconi
(DIR) Post #AWkWcmQX4hb4E5Hu2S by gsuberland@chaos.social
2023-06-16T12:50:03Z
0 likes, 0 repeats
@ignaloidas @niconiconi I hate to think what the yields are on an ASIC that size.
(DIR) Post #AWkWcnAGKgB0VugQJk by ignaloidas@not.acu.lt
2023-06-16T12:52:16.897Z
0 likes, 0 repeats
@gsuberland@chaos.social @niconiconi@mk.absturztau.be 100%, it has redundancy built-in
(DIR) Post #AWkWsKQsvOfe2KrnQO by gsuberland@chaos.social
2023-06-16T12:54:00Z
0 likes, 0 repeats
@ignaloidas @niconiconi I did wonder. I guess once you get to that physical scale it's waaaay cheaper to build in a ton of redundancy than it is to have even one unit fail.
(DIR) Post #AWkWsNRhio6nNkGzcO by ignaloidas@not.acu.lt
2023-06-16T12:55:04.233Z
0 likes, 0 repeats
@gsuberland@chaos.social @niconiconi@mk.absturztau.be yeah, and it's not too difficult to do so given it's pretty uniform
(DIR) Post #AWkXCsiJB7b29IohSS by gsuberland@chaos.social
2023-06-16T12:49:35Z
0 likes, 0 repeats
@niconiconi I love how the style of the presentation looks very 1998. It took me a moment to realise that it wasn't actually super old.
(DIR) Post #AWkXCtOUeHLAG8YODA by niconiconi@mk.absturztau.be
2023-06-16T12:58:46.111Z
0 likes, 0 repeats
@gsuberland@chaos.social It's HPC veteran Jack Dongarra's (of LINPACK fame) Turing-award lecture. https://www.youtube.com/watch?v=cSO0Tc2w5Dg
(DIR) Post #AWkXuwXoIUVOlsutOK by gsuberland@chaos.social
2023-06-16T12:56:56Z
0 likes, 0 repeats
@ignaloidas @niconiconi I wonder how much extra they'd charge for a zero-defect binned one with all the redundant cores unlocked :D (I'm guessing about as much as a house)
(DIR) Post #AWkXuxDzleFWsiea92 by ignaloidas@not.acu.lt
2023-06-16T13:06:45.830Z
0 likes, 0 repeats
@gsuberland@chaos.social @niconiconi@mk.absturztau.be Too much, a random die defect calculator gives a yield for the whole chip at 0.2%, and this is with me assuming that TSMC has been able to very significantly improve their defect density over the last 4 years
(DIR) Post #AWkY5WDyUiQGKKgJY8 by gsuberland@chaos.social
2023-06-16T13:07:24Z
0 likes, 0 repeats
@ignaloidas @niconiconi 0.2%? ooouuuch.
(DIR) Post #AWkY5XuuCaMDZlm76W by ignaloidas@not.acu.lt
2023-06-16T13:08:40.173Z
0 likes, 0 repeats
@gsuberland@chaos.social @niconiconi@mk.absturztau.be it's the whole wafer, what do you expect 😄 corner to corner it's ~300 mm, with ~215 mm side lengths
(DIR) Post #AWkdCsdq2uFJeO6nZo by acsawdey@fosstodon.org
2023-06-16T13:49:20Z
0 likes, 0 repeats
@ignaloidas @gsuberland @niconiconi Much better to use smaller chips and spend the effort on packaging magic. Then again, as expensive as that kind of development work is, are you better off spending 100x more on a conventional computer that runs at 1% of peak rate? If algorithmic research can get your conventional computer to 10% of peak, developing a bespoke computer starts to look less promising.
(DIR) Post #AWkdCtPLCIFA1iKjcO by ignaloidas@not.acu.lt
2023-06-16T14:06:01.742Z
0 likes, 0 repeats
@acsawdey@fosstodon.org @gsuberland@chaos.social @niconiconi@mk.absturztau.be packaging magic can't really bring you 200 Pb/s of interconnect bandwidth; there's a lot that you gain from not needing to jump out to the interposer and into another die.

The main argument for smaller chips and packaging magic is usually motivated either by the costs eaten by the lower yields of big chips, or by the fact that die size is limited and you can't increase it beyond the reticle limit without stepping on the patents that Cerebras has. The first problem Cerebras has solved by just having redundancy; the second is an advantage for them. And it's not like they're very concerned about prices either: a single one of those in a system goes for a couple million, and it's designed for some very particular problems that general-purpose computers just cannot get more performance on.

It's the age of purpose-specific compute, after all; problem-specific accelerators are the headline new features of many new computing products. One of the coolest talks at Hot Chips last year, for me, was from a group that designed their own chip purely for molecular dynamics simulation, a cluster of which outperforms essentially all conventional supercomputers on a massive scale (10x or something like that).
(DIR) Post #AWkdL4PYAef4TBKYpk by ignaloidas@not.acu.lt
2023-06-16T14:07:30.905Z
0 likes, 0 repeats
@acsawdey@fosstodon.org @gsuberland@chaos.social @niconiconi@mk.absturztau.be sorry, did I say 10x? I meant 100x
(DIR) Post #AWlwOQKtChYzclQIbY by acsawdey@fosstodon.org
2023-06-16T19:43:17Z
0 likes, 0 repeats
@ignaloidas @gsuberland @niconiconi That's pretty impressive ... Anton 3 is a 461 mm^2 chip in 7nm ... they are not fooling around.

However ... the original problem was posed as being related to "processor in memory", i.e. "mem bandwidth isn't keeping up with flops" ... if that's truly what you're trying to do, then you don't have an interconnect problem. If you _actually_ want to run some kind of stencil code ... well then yes, if you can fit it on a wafer you win because: more wires.
(DIR) Post #AWlwORCPzgNiImT32W by ignaloidas@not.acu.lt
2023-06-17T05:15:42.436Z
0 likes, 0 repeats
@acsawdey@fosstodon.org @gsuberland@chaos.social @niconiconi@mk.absturztau.be well, the problem @niconiconi@mk.absturztau.be is currently working on is indeed stencil code, so interconnect bandwidth is very important once you're trying to parallelize across multiple machines. That's usually not that useful, though, since the code is usually limited by memory bandwidth within a single machine; you really only look at parallelizing across machines once you can no longer fit the problem in the memory of a single machine, which is quite rare for most uses.

With Cerebras, that gets switched around quite dramatically: as each compute element (CE) only has access to 48KB of memory, you essentially must parallelize across CEs. There's no implicit cross-CE way of communicating; everything is explicit sends and receives. But because the memory bandwidth and interconnect bandwidth are so large, they found that the problem actually becomes compute-limited, and the simulated size doesn't really matter as long as they can fit it within the wafer.

https://www.cerebras.net/blog/totalenergies-and-cerebras-create-massively-scalable-stencil-algorithm/
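[Editor's note: the explicit send/receive style described above can be sketched in a few lines. This is a pure-Python stand-in for the programming model — hypothetical helper names, not Cerebras's actual API — showing a 1D 3-point stencil split across small per-element memories, with halo cells exchanged by hand each step.]

```python
# Each "CE" holds only its own chunk; neighbour edge values (halos)
# must be communicated explicitly every step, as on the wafer.

def step(local, left_halo, right_halo):
    """One Jacobi-style 3-point averaging step over a local chunk."""
    padded = [left_halo] + local + [right_halo]
    return [(padded[i - 1] + padded[i] + padded[i + 1]) / 3
            for i in range(1, len(padded) - 1)]

def run(grid, n_chunks, n_steps):
    """Split grid across n_chunks 'CEs' (len(grid) divisible by n_chunks)."""
    size = len(grid) // n_chunks
    chunks = [grid[i * size:(i + 1) * size] for i in range(n_chunks)]
    for _ in range(n_steps):
        # Explicit halo exchange: each chunk "sends" its edge value to
        # its neighbour; the two global boundary values are held fixed.
        lefts  = [grid[0]] + [c[-1] for c in chunks[:-1]]
        rights = [c[0] for c in chunks[1:]] + [grid[-1]]
        chunks = [step(c, l, r)
                  for c, l, r in zip(chunks, lefts, rights)]
    return [x for c in chunks for x in c]
```

Because every chunk sees exactly the same neighbour values it would in a monolithic run, `run(grid, 4, n)` produces the same result as `run(grid, 1, n)` — the decomposition changes where data lives, not what is computed.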