[HN Gopher] GPUs Go Brrr
       ___________________________________________________________________
        
       GPUs Go Brrr
        
       Author : nmstoker
       Score  : 1013 points
        Date   : 2024-05-12 22:05 UTC (1 day ago)
        
 (HTM) web link (hazyresearch.stanford.edu)
 (TXT) w3m dump (hazyresearch.stanford.edu)
        
       | nmstoker wrote:
       | Some related material here too:
       | 
       | https://twitter.com/bfspector/status/1789749117104894179?t=k...
        
       | jauntywundrkind wrote:
       | The ThunderKittens mascot has great kitten/Sony-Aibo vibes.
        | Nicely generated by AI (I presume).
       | https://github.com/HazyResearch/ThunderKittens
        
         | layer8 wrote:
         | It looks off because the head isn't centered on the neck.
        
           | john_minsk wrote:
           | Great attention to detail! I, like the parent, was surprised
            | by the quality as well. However, now I can't unsee it :-)
        
           | Satam wrote:
           | Easy fix: https://imgur.com/a/Ahwt6tr (although not sure
           | which one is actually better)
        
       | perfmode wrote:
       | This article rekindles the joy I experienced during CS 149
       | Parallel Programming.
        
         | figbert wrote:
         | Appreciate the recommendation, will check out the course!
        
       | behnamoh wrote:
       | tangential: When @sama talks about "Universal Basic Compute"
       | (UBC) as a substitute for Universal Basic Income, obviously he
       | means GPU, right? Who's going to benefit from such policies? Only
        | nvidia? It just seems like such a dystopian future to live in: imagine
       | you can sell your UBC to others who know better how to use it, or
       | you can use it to mine bitcoin or whatever. But all the compute
       | is actually created by one company.
       | 
       | There are many reasons to hate nvidia, but honestly if this UBC
       | policy is even remotely being considered in some circles, I'd
       | join Linus Torvalds and say "nvidia, fuck you".
        
         | jra101 wrote:
         | You're blaming NVIDIA for Sam Altman's dumb idea?
        
           | behnamoh wrote:
            | nvidia's CEO literally keeps saying "the more GPUs you buy,
           | the more you save"--it's hard to believe nvidia has nothing
           | to do with such ideas.
        
             | WanderPanda wrote:
             | Him saying this always puts me off. Gives hard old sales-
             | guy vibes. I really wonder who/which demographic is
              | influenced in nvidia's favor by this rhetoric.
        
             | coffeebeqn wrote:
             | GPU CEO wants to sell more GPUs? What on earth
        
           | WanderPanda wrote:
           | One's "dumb idea" is another marketers "genius stroke". Seems
           | like he is playing the media puppets while he can
        
         | callalex wrote:
         | You're looking for logic. The only logic is "when a sucker buys
         | WorldCoin, sama bank account go brrrr".
         | 
         | That's the whole logic.
        
       | Animats wrote:
       | " _And we ask: if your matrix multiply is smaller than 16x16, are
       | you sure what you're doing is AI?_
       | 
       |  _From a philosophical point of view, we think a frame shift is
       | in order. A "register" certainly shouldn't be a 32-bit word like
       | on the CPUs of old. And a 1024-bit wide vector register, as CUDA
       | uses, is certainly a step in the right direction. But to us a
       | "register" is a 16x16 tile of data. We think AI wants this. "_
       | 
        | The hardware needs of AI are coming into focus. GPUs, after all,
       | were designed for an entirely different job. They're used for AI
       | because they have good matrix multiply hardware. "AI GPUs" get to
       | leave out some of the stuff in a real GPU (does an H100 even have
       | texture fill units?). Then there's a trend towards much shorter
       | numbers. 16 bit floating point? 8 bit? 2 bit? 1 bit? That will
       | settle out at some point. This paper indicates that hardware that
       | likes 16x16 tiles makes a lot of sense. It's certainly possible
       | to build such hardware. Someone reading this is probably writing
       | it in VHDL right now, or will be soon.
       | 
       | Then we'll see somewhat simpler, less general, and cheaper
       | devices that do "AI" operations with as little excess hardware
       | baggage as possible. Nice.
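        | 
        | To make the 16x16-tile framing concrete, here's a toy numpy
        | sketch (illustrative only; not ThunderKittens' abstraction or
        | any real ISA) of a matmul done entirely in 16x16 blocks, the
        | granularity tensor-core-style hardware works at:
        | 
        |     import numpy as np
        | 
        |     # Toy version of "a register is a 16x16 tile": compute
        |     # C = A @ B entirely in 16x16 blocks.
        |     T = 16
        | 
        |     def tiled_matmul(A, B):
        |         n = A.shape[0]      # assume square, multiple of T
        |         C = np.zeros((n, n), dtype=np.float32)
        |         for i in range(0, n, T):
        |             for j in range(0, n, T):
        |                 # the "tile register" being accumulated into
        |                 acc = np.zeros((T, T), dtype=np.float32)
        |                 for k in range(0, n, T):
        |                     # one 16x16 multiply-accumulate
        |                     acc += A[i:i+T, k:k+T] @ B[k:k+T, j:j+T]
        |                 C[i:i+T, j:j+T] = acc
        |         return C
        | 
        |     A = np.random.rand(64, 64).astype(np.float32)
        |     B = np.random.rand(64, 64).astype(np.float32)
        |     assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)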
        
         | mvkel wrote:
         | Would you say this is ultimately "ASICs for AI"?
        
           | dartos wrote:
           | In the same way that CPUs are ASICs for integer operations,
           | that makes sense to me.
        
             | saagarjha wrote:
             | Most CPUs do just fine on floating point too.
        
               | actionfromafar wrote:
               | I'm still getting used to that.
        
               | dartos wrote:
               | Floating point arithmetic _is_ integer arithmetic on the
               | cpu level because of how floating point numbers work.
        
               | fwip wrote:
               | That's a good point - floating point operations are
               | implemented with integer-math circuits (or at least can
               | be - I'm not privy to how modern chip manufacturers
               | implement them). E.g: your ALU may have an 11-bit adder
               | specifically to add your f64 exponents.
               | 
               | Some slides to get the gist of it: https://users.encs.con
               | cordia.ca/~asim/COEN_6501/Lecture_Note...
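                | 
                | A quick Python sketch of the integer fields inside an
                | f64 (illustration only; it says nothing about how any
                | particular ALU is actually wired):
                | 
                |     import struct
                | 
                |     def f64_fields(x):
                |         # View an IEEE-754 double as a 64-bit integer.
                |         bits = struct.unpack("<Q", struct.pack("<d", x))[0]
                |         sign = bits >> 63              # 1 bit
                |         exp = (bits >> 52) & 0x7FF     # 11 bits, bias 1023
                |         mant = bits & ((1 << 52) - 1)  # 52 bits
                |         return sign, exp - 1023, mant
                | 
                |     for x in (1.0, 6.0, 0.75):
                |         s, e, m = f64_fields(x)
                |         print(x, s, e, hex(m))
                |     # 1.0  -> 0  0 0x0
                |     # 6.0  -> 0  2 0x8000000000000  (1.5 * 2**2)
                |     # 0.75 -> 0 -1 0x8000000000000  (1.5 * 2**-1)
                | 
                | That biased 11-bit exponent field is what an exponent
                | adder like the one in those slides operates on.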
        
         | dvt wrote:
         | > Then we'll see somewhat simpler, less general, and cheaper
         | devices that do "AI" operations with as little excess hardware
         | baggage as possible. Nice.
         | 
         | Apple has already been doing this for a few years now. The NPU
         | is totally different from the GPU or CPU on the die itself[1].
         | Nvidia is likely working on this as well, but I think a device
         | that's a gaming/entertainment/crypto/AI bundle (i.e. sticking
         | with the video card) is probably a better business move.
         | 
         | [1] https://github.com/hollance/neural-
         | engine/blob/master/docs/a...
        
           | talldayo wrote:
           | The NPUs on a lot of different systems occupy an awkward
           | spot. For extremely small models, they're the way to go for
           | low-power inference. But once you reach LLM or vision
           | transformer size, it makes a lot more sense to switch to GPU
           | shaders for that extra bit of large-model performance. For
           | stuff like Llama and Stable Diffusion, those Neural Engines
           | are practically wasted silicon. The biggest saving grace is
           | projects like ONNX attempting to sew them into a unified
           | non-15-competing-standards API, but even that won't change
           | how underpowered they are.
           | 
           | Nvidia escapes this by designing their GPU architecture to
           | incorporate NPU concepts at a fundamental level. It's less
           | redundant silicon and enables you to scale a single
           | architecture instead of flip-flopping to whichever one is
           | most convenient.
        
             | nxobject wrote:
             | It's currently doable for Apple - I think their strategy is
             | to slowly enhance iPhones, bit by bit, with special-purpose
             | models for dealing with media like photo subject
             | identification, OCR (in every language!), voice
             | transcription, etc. Apple's currently learning from
             | Microsoft's attempts to make AI stick everywhere.
        
               | joquarky wrote:
               | Soon our phones will dream beside us every night
               | (integrating new data into our personal model while on
               | the charger)
        
               | serialx wrote:
               | Well, iPhone already does that with photos. :)
        
               | pierrefermat1 wrote:
               | Do you have a link to where they breakdown what inference
               | for photos happens in realtime vs overnight/charging?
        
               | everforward wrote:
               | I think Apple is more interested in features that work
               | consistently than in giving power users the ability to
               | play with essentially alpha or beta AI features.
               | 
               | I would guess that their strategy is to not include
               | powerful client-side hardware, and supplement that with
               | some kind of "AiCloud" subscription to do the battery-
               | draining, heat-generating stuff on their cloud. They're
               | trading off their branding as a privacy focused company
               | under the (probably correct) belief that people will be
               | more willing to upload their data to iCloud's AI than
               | Microsoft's.
               | 
               | Fwiw, I think they're probably correct. It has always
               | struck me as odd that people want to run AI on their
               | phone. My impression of AI is that it creates very
               | generalized solutions to problems that would be difficult
               | to code, at the cost of being very compute inefficient.
               | 
               | I don't really want code like that running on my phone;
               | it's a poor platform for it. Thermal dissipation and form
               | factor limit the available processing power, and
               | batteries limit how long you can use the processing power
               | you have. I don't really want to waste either trying to
               | do subject identification locally. I'm going to upload
               | the photos to iCloud anyways; let me pay an extra
               | $1/month or whatever to have that identification happen
               | in the cloud, on a server built for it that has data
               | center thermal dissipation and is plugged into the wall.
        
               | talldayo wrote:
               | The pinch (as far as I can see it) is that you're right,
               | and Apple can't sell a freestanding service to save their
               | life. If we do get an AppleGPT pay-as-you-go service,
               | it's certain to be extraordinarily censored and locked-
               | down as the exclusive first-party option on iPhone. It
               | will feature "vertical integration" that no other AI can
               | have, alongside censorship so prudish that it would make
                | Maury Povich gasp.
               | 
               | So... I think users will be stuck. They'll want to run
               | uncensored models on their phone, but Apple will want to
               | keep them in the walled garden at any cost. It feels like
               | the whole "Fortnite" situation all over again, where
                | _users_ can agree they want something but Apple can't
               | decide.
        
               | unethical_ban wrote:
               | > It has always struck me as odd that people want to run
               | AI on their phone. My impression of AI is that it creates
               | very generalized solutions to problems that would be
               | difficult to code, at the cost of being very compute
               | inefficient.
               | 
               | I don't equate AI with coding. I want AI locally for
               | photo sorting and album management, for general questions
               | answering/list making that I use GPT for, and any number
               | of other things.
               | 
               | I try not to upload personal data to sites that aren't
               | E2E encrypted, so iCloud/Google photos is a no-go.
        
             | WhitneyLand wrote:
             | Anyone checked out the NPU on the new iPad? It's supposed
             | to be a bazillion times better according to Apple but I
             | haven't had a chance to dig into the reality.
             | 
             | I guess we can assume this is going to be what's used in
             | what's being called Apple's first AI phone, iPhone 16.
        
               | fassssst wrote:
               | It has 38 TOPS of INT8 performance. Not very remarkable
                | compared to consumer Nvidia GPUs, which are like one or two
               | two orders of magnitude faster.
        
               | talldayo wrote:
               | For reference, Nvidia's Jetson Orin NX robotics platform
               | is 35-50 TOPS on average. Apple _is_ catching up, but
               | Nvidia still has by-far the more flexible (and better
               | scaled) platform.
        
               | numpad0 wrote:
                | That 38 TOPS figure was a bit weird; it's literally below the
                | baseline (45 TOPS) for the "AI PC" branding
                | Qualcomm/Intel/Microsoft is launching this June, and also
                | 10x less than typical GPUs. I think it was just clever
                | marketing exploiting the fact that the "AI PC" branding
               | hasn't launched yet.
        
           | eru wrote:
           | And Google has their TPUs.
        
           | yosefk wrote:
            | For inference, Nvidia has had DLA since 2017-ish if I remember
           | correctly, which is completely separate from the GPU.
        
         | WanderPanda wrote:
         | Wait but nvidia tensor-cores are exactly the hardware that
         | likes 16x16 tiles, no? I thought that was the whole point? The
         | hardware is already here and I'm sceptical if there is another
         | order of magnitude in performance to be gained from even more
         | specialized designs.
        
           | wtallis wrote:
           | What's the ratio of tensor cores to regular SIMD compute
           | ("CUDA cores") on NVIDIA's current chips?
        
             | creato wrote:
             | This is in the article: if you aren't using the tensor
             | cores, you aren't utilizing ~94% of the FLOPs available.
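              | 
              | Back-of-the-envelope in Python with approximate H100 SXM
              | spec-sheet numbers (assumed round figures; the exact ratio
              | depends on clocks and which precisions you compare):
              | 
              |     # Assumed approximate dense peak throughput, H100 SXM.
              |     fp32_cuda_cores = 67.0    # TFLOPS, non-tensor FP32
              |     bf16_tensor = 989.0       # TFLOPS, BF16 on tensor cores
              | 
              |     missed = 1 - fp32_cuda_cores / bf16_tensor
              |     print(f"~{missed:.0%} of peak FLOPs need tensor cores")
              | 
              | which prints ~93%, in the same ballpark as the article's
              | ~94%.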
        
               | wtallis wrote:
               | Knowing what portion of the FLOPs are in the tensor cores
               | isn't quite the right thing to be looking at. The key
               | question is how much more tensor core performance can be
                | gained by reducing or eliminating the die area devoted
               | to non-tensor compute and higher precision arithmetic.
               | Most of NVIDIA's GPUs are still designed primarily for
               | graphics: they have some fixed function units that can be
               | deleted in an AI-only chip, and a lot of die space
               | devoted to non-tensor compute because the tensor cores
               | don't naturally lend themselves to graphics work (though
               | NVIDIA has spent years coming up with ways to not leave
               | the tensor cores dark during graphics work, most notably
               | DLSS).
               | 
               | So the claims that NVIDIA's GPUs are already thoroughly
               | optimized for AI and that there's no low-hanging fruit
               | for further specialization don't seem too plausible,
               | unless you're only talking about the part of the
               | datacenter lineup that has already had nearly all fixed-
               | function graphics hardware excised. And even for Hopper
               | and Blackwell, there's some fat to be trimmed if you can
               | narrow your requirements.
        
               | incrudible wrote:
               | There is not a lot of fixed function left in the modern
                | graphics pipeline; economies of scale dictate that there
               | is no net benefit in trimming it.
        
               | wtallis wrote:
               | And yet, even NVIDIA _does_ trim it from chips like the
               | H100, which has no display outputs, RT cores, or video
               | encoders (though they keep the decoders), and only has
               | ROPs for two of the 72 TPCs.
        
               | smallmancontrov wrote:
               | Mind the Dark Silicon Fraction.
               | 
               | Some fraction of your transistors MUST go unused on
               | average or you melt the silicon. This was already a thing
               | in the 20nm days and I'm sure it has only gotten worse.
               | 100% TDP utilization might correspond to 60% device
               | utilization.
        
               | wtallis wrote:
               | That's true for CPUs. Does it really apply to GPUs and
               | other accelerators for embarrassingly parallel problems
               | where going slower but wider is always a valid option?
        
               | Sharlin wrote:
               | On the H100 specifically. The figure is likely different
               | on consumer cards.
        
         | choppaface wrote:
         | "NVidia's LIES..
         | 
         | On kernels such as flash attention, TMA and the L2 cache are
         | both fast enough so as to hide these problems reasonably well.
         | But to make the full use of the hardware, memory request must
         | be coalesced and bank conflicts avoided "
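          | 
          | For anyone who hasn't fought bank conflicts: shared memory is
          | split into 32 four-byte banks, and lanes of a warp that hit
          | the same bank at different addresses get serialized. A tiny
          | Python model of the usual 32-bank layout (a sketch, not a
          | simulator) shows why strided access hurts:
          | 
          |     BANKS = 32                  # 4-byte banks
          | 
          |     def bank(word_index):       # word = byte_address // 4
          |         return word_index % BANKS
          | 
          |     for stride in (1, 2, 32, 33):
          |         hit = {bank(lane * stride) for lane in range(32)}
          |         print(f"stride {stride:2}: {len(hit):2} banks, "
          |               f"{32 // len(hit)}-way conflict")
          | 
          | Stride 1 and the padded stride 33 touch all 32 banks (no
          | conflict); stride 32 is a 32-way pile-up, which is exactly
          | what a naive column walk produces.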
         | 
         | The depth of the competition is also starting to become
         | apparent. There's no way the documentation error was totally an
         | accident. Diagrams are the easiest to steal / copy and there
         | must have been some utility for nvidia to have left this in
         | place. Remember when Naveen Rao's Nervana was writing NVidia
         | Maxwell drivers that out-performed NVidia's own? Not every
         | documentation mishap in a high-growth product is a competition
         | counter-measure, but given that the researchers spent so long
         | reverse-engineering wgmma and given the China-US political
         | situation of the H100 in particular, it seems NVidia is up to
         | its old tricks to protect its moat.
         | 
         | So don't over-study the H100 peculiarities, as "what hardware
         | does AI want?" really encompasses the commercial situation as
         | well.
        
           | wiz21c wrote:
           | I don't understand. If they document their stuff with errors,
            | it will hurt users, be they Chinese or US? Or is it expected
            | that US users will call Nvidia to ask for the correct
            | documentation?
        
             | acka wrote:
             | It could be a case of classic market segmentation. The
             | lower tier customers get the incomplete or error-ridden
             | documentation, and the upper tier trusted
             | customers^W'partners' get access to the juicy stuff:
             | complete and mostly correct documentation, including stuff
             | intentionally left out of the lower tier package like
             | application notes containing secret hardware handshakes to
             | unlock hidden features, all under strict NDA of course.
        
             | choppaface wrote:
              | The vast majority of users use NVidia's own kernels rather
              | than optimizing their own. And those who do write custom
              | kernels are typically not trying to compete with NVidia's
              | own GEMM.
        
         | jiveturkey wrote:
         | hasn't google been building such devices for a decade now?
        
           | yayr wrote:
           | yep, and the main engineers have founded groq.com with an
            | architecture that, among other things, precisely solved the
            | memory management issues
        
         | bcatanzaro wrote:
         | GPUs have evolved to be AI machines with as little baggage as
         | possible. People have been arguing GPUs were old technology and
         | therefore unsuited for AI since at least 2014 (when Nervana was
         | founded), but what they perhaps didn't expect is that the GPU
         | would evolve so quickly to be an AI machine.
        
           | celrod wrote:
           | Bill Dally from Nvidia argues that there is "no gain in
           | building a specialized accelerator", in part because current
           | overhead on top of the arithmetic is in the ballpark of 20%
            | (16% for IMMA and 22% for HMMA units)
           | https://www.youtube.com/watch?v=gofI47kfD28
        
             | AnthonyMouse wrote:
             | There does seem to be a somewhat obvious advantage: If all
             | it has to do is matrix multiplication and not every other
             | thing a general purpose GPU has to be good at then it costs
             | less to _design_. So now someone other than Nvidia or AMD
             | can do it, and then very easily distinguish themselves by
             | just sticking a ton of VRAM on it. Which is currently
             | reserved for GPUs that are extraordinarily expensive, even
                | though the extra VRAM doesn't cost a fraction of the price
             | difference between those and an ordinary consumer GPU.
        
               | bjornsing wrote:
               | Exactly. And that means you not only save the 22% but
               | also a large chunk of the Nvidia margin.
        
               | Animats wrote:
               | And, sure enough, there's a new AI chip from
               | Intellifusion in China that's supposed to be 90% cheaper.
               | 48 TOPS in int8 training performance for US$140.[1]
               | 
               | [1] https://www.tomshardware.com/tech-
               | industry/artificial-intell...
        
               | pfdietz wrote:
               | I wonder what the cost of power to run these chips is. If
               | the power cost ends up being large compared to the
               | hardware cost, it could make sense to buy more chips and
               | run them when power is cheap. They could become a large
               | source of dispatchable demand.
        
               | papruapap wrote:
               | I really hope we see AI-PU (or with some other name,
               | INT16PU, why not) for the consumer market sometime soon.
                | Or be able to expand GPU memory using a PCIe socket
               | (not sure if technically possible).
        
               | hhsectech wrote:
               | Isn't this what resizeable BAR and direct storage are
               | for?
        
               | PeterisP wrote:
                | The whole point of GPU memory is that it's faster to
               | access than going to memory (like your main RAM) through
               | the PCIe bottleneck.
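                | 
                | Rough numbers (assumed, just to show the size of the
                | gap) for streaming 16 GB of weights once over each link:
                | 
                |     # Assumed round bandwidth figures, bytes/s.
                |     links = {"PCIe 4.0 x16": 32e9,
                |              "GDDR6X (consumer card)": 1000e9,
                |              "HBM3 (H100 SXM)": 3350e9}
                |     model_bytes = 16e9
                |     for name, bw in links.items():
                |         ms = model_bytes / bw * 1000
                |         print(f"{name:24s} {ms:7.1f} ms per pass")
                | 
                | LLM inference reads roughly all the weights per token,
                | so that ~30x gap is the whole point of putting the
                | memory on the card.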
        
               | throwaway4aday wrote:
               | My uninformed question about this is why can't we make
               | the VRAM on GPUs expandable? I know that you need to
               | avoid having the data traverse some kind of bus that
               | trades overhead for wide compatibility like PCIe but if
               | you only want to use it for more RAM then can't you just
               | add more sockets whose traces go directly to where
               | they're needed? Even if it's only compatible with a
               | specific type of chip it would seem worthwhile for the
               | customer to buy a base GPU and add on however much VRAM
               | they need. I've heard of people replacing existing RAM
               | chips on their GPUs[0] so why can't this be built in as a
               | socket like motherboards use for RAM and CPUs?
               | 
               | [0] https://www.tomshardware.com/news/16gb-rtx-3070-mod
        
               | carbotaniuman wrote:
               | Replacing RAM chips on GPUs involves resoldering and
               | similar things - those (for the most part) maintain the
               | signal integrity and performance characteristics of the
               | original RAM. Adding sockets complicates the signal path
               | (iirc), so it's harder for the traces to go where they're
               | needed, and realistically given a trade-off between
               | speed/bandwidth and expandability I think the market goes
               | with the former.
        
               | giobox wrote:
               | Expandable VRAM on GPUs has been tried before - the
               | industry just hates it. It's like Apple devices - want
               | more internal storage? Buy a new computer so we can have
               | the fat margins.
               | 
                | The original REV A iMac in the late 90s had slotted memory
                | for its ATI card, as one example - shipped with 2MB and
                | could be upgraded to 6MB after the fact with a 4MB SGRAM
               | DIMM. There are also a handful of more recent examples
               | floating around.
               | 
               | While I'm sure there are also packaging advantages to be
               | had by directly soldering memory chips instead of
               | slotting them etc, I strongly suspect the desire to keep
               | buyers upgrading the whole card ($$$) every few years
               | trumps this massively if you are a GPU vendor.
               | 
               | Put another way, what's in it for the GPU vendor to offer
               | memory slots? Possibly reduced revenue, if it became
               | industry norm.
        
               | Majromax wrote:
               | Expansion has to answer one fundamental question: if
               | you're likely to need more X tomorrow, why aren't you
               | just buying it today?
               | 
               | The answer to this question almost has to be "because it
               | will be cheaper to buy it tomorrow." However, GPUs bundle
               | together RAM and compute. If RAM is likely to be cheaper
               | tomorrow, isn't compute also probably going to be
               | cheaper?
               | 
               | If both RAM _and_ compute are likely cheaper tomorrow,
               | then the calculus still probably points towards a
                | wholesale replacement. Why not run/train models twice as
               | quickly alongside the RAM upgrades?
               | 
               | > I strongly suspect the desire to keep buyers upgrading
               | the whole card ($$$) every few years trumps this
               | massively if you are a GPU vendor.
               | 
               | Remember as well that expandable RAM doesn't unlock
               | higher-bandwidth interconnects. If you could take the
               | card from five years ago and load it up with 80 GB of
               | VRAM, you'd still not see the memory bandwidth of a
               | newly-bought H100.
               | 
               | If instead you just need the VRAM and don't care much
               | about bandwidth/latency, then it seems like you'd be
               | better off using unified memory and having system RAM be
               | the ultimate expansion.
        
               | hellofellows wrote:
               | hmm seems you're replying as a customer, but not as a GPU
               | vendor...
               | 
               | the thing is, there's not enough competition in the AI-
               | GPU space.
               | 
                | Currently the only option for not wasting time on running
                | some random research project from github? Buy some card from
               | nvidia. cuda can run almost anything on github.
               | 
               | AMD gpu cards? that really depends...
               | 
                | and gamers often don't need more than ~12GB of GPU RAM
                | for running games at 4K, so most high-VRAM customers are
                | in the AI field.
               | 
               | > If you could take the card from five years ago and load
               | it up with 80 GB of VRAM, you'd still not see the memory
               | bandwidth of a newly-bought H100.
               | 
               | this is exactly what nvidia will fight against tooth-and-
               | nail -- if this is possible, its profit margin could be
               | slashed to 1/2 or even 1/8
        
               | AnthonyMouse wrote:
               | > The answer to this question almost has to be "because
               | it will be cheaper to buy it tomorrow."
               | 
               | No, it doesn't. It could just as easily be "because I
               | will have more money tomorrow." If faster compute is $300
               | and more VRAM is $200 and I have $300 today and will have
               | another $200 two years from now, I might very well like
               | to buy the $300 compute unit and enjoy the faster compute
               | for two years before I buy the extra VRAM, instead of
               | waiting until I have $500 to buy both together.
               | 
               | But for something which is already a modular component
               | like a GPU it's mostly irrelevant. If you have $300 now
               | then you buy the $300 GPU, then in two years when you
               | have another $200 you sell the one you have for $200 and
               | buy the one that costs $400, which is the same one that
               | cost $500 two years ago.
               | 
               | This is a much different situation than fully integrated
               | systems because the latter have components that lose
                | value at different _rates_, or that make sense to
               | upgrade separately. You buy a $1000 tablet and then the
               | battery goes flat and it doesn't have enough RAM, so you
               | want to replace the battery and upgrade the RAM, but you
               | can't. The battery is proprietary and discontinued and
               | the RAM is soldered. So now even though that machine has
               | a satisfactory CPU, storage, chassis, screen and power
               | supply, which is still $700 worth of components, the
               | machine is only worth $150 because nothing is modular and
               | nobody wants it because it doesn't have enough RAM and
               | the battery dies after 10 minutes.
        
               | PeterisP wrote:
               | Technically we definitely can, but are there sufficiently
               | many people willing to pay a sufficiently high premium
               | for that feature? How much more would you be willing to
               | pay for an otherwise identical card that has the option
               | to expand RAM, and do you expect that a significant
               | portion of buyers would want to pay a non-trivial up-
               | front cost for that possibility?
        
               | throwaway48476 wrote:
                | It's a minor technical challenge with no financial benefit
               | for the GPU makers.
        
               | rdsubhas wrote:
               | Isn't that what NPUs are technically?
               | 
               | https://en.m.wikipedia.org/wiki/AI_accelerator
        
               | WithinReason wrote:
               | Designing it is easy and always has been. Programming it
               | is the bottleneck. Otherwise Nvidia wouldn't be in the
               | lead.
        
               | markhahn wrote:
               | but programming it is "import pytorch" - nothing nvidia-
               | specific there.
               | 
               | the mass press is very impressed by Cuda, but at least if
               | we're talking AI (and this article is, exclusively), it's
               | not the right interface.
               | 
               | and in fact, Nv's lead, if it exists, is because they
               | pushed tensor hardware earlier.
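                | 
                | To make the "import pytorch" point concrete, a minimal
                | sketch (standard torch backend names, nothing exotic;
                | all the vendor-specific work lives below this line):
                | 
                |     import torch
                | 
                |     # Same user code on whichever backend exists; the
                |     # hard part is the backend, not this script.
                |     if torch.cuda.is_available():
                |         device = "cuda"   # NVIDIA
                |     elif torch.backends.mps.is_available():
                |         device = "mps"    # Apple
                |     else:
                |         device = "cpu"
                | 
                |     x = torch.randn(16, 16, device=device)
                |     print(device, (x @ x).relu().shape)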
        
               | WithinReason wrote:
               | I'm talking about adding Pytorch support for your special
               | hardware.
               | 
               | Nv's lead is due to them having Pytorch support.
        
               | achierius wrote:
               | Someone does, in fact, have to implement everything
               | underneath that `import` call, and that work is _very_
               | hard to do for things that don't closely match Nvidia's
               | SIMT architecture. There's a reason people don't like
               | using dataflow architectures, even though from a pure
               | hardware PoV they're very powerful -- you can't map
               | CUDA's, or Pytorch's, or Tensorflow's model of the world
               | onto them.
        
               | KaoruAoiShiho wrote:
               | Eh if you're running in production you'll want something
               | lower level and faster than pytorch.
        
               | cma wrote:
               | There are other operations for things like normalization
               | in training, which is why most successful custom stuff
               | has focused on inference I think. As architectures
               | changed and needed various different things some custom
               | built training hardware got obsoleted, Keller talked
               | about that affecting Tesla's Dojo and making it less
               | viable (they bought a huge nvidia cluster after it was
               | up). I don't know if TPU ran into this, or they made
               | enough iterations fast enough to keep adding what they
               | needed as they needed it.
        
         | muyuu wrote:
         | it's going to be awkward in consumer hardware either way
         | 
         | if you segregate AI units from the GPU, the thing is both AI
         | and GPUs will continue to need massive amounts of matrix
         | multiplication and as little memory latency as possible
         | 
         | the move to have more of it wrapped in the GPU makes sense but
         | at least in the short and medium term, most devices won't be
         | able to justify the gargantuan silicon wafer space/die growth
         | that this would entail - also currently Nvidia's tech is ahead
         | and they don't make state of the art x86 or ARM CPUs
         | 
         | for the time being I think the current paradigm makes the most
         | sense, with small compute devices making inroads in the
         | consumer markets as non-generalist computers - note that more
         | AI-oriented pseudo-GPUs already exist and are successful since
         | the earlier Nvidia Tesla lineup and then the so-called "Nvidia
         | Data Center GPUs"
        
           | rfoo wrote:
           | > as little memory latency as possible
           | 
           | Should be "as much memory bandwidth as possible". GPUs are
           | designed to be (relatively) more insensitive to memory
            | latency than CPUs.
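            | 
            | A roofline-style sanity check with assumed round numbers
            | for an H100 SXM (not exact specs):
            | 
            |     peak_flops = 989e12    # assumed bf16 tensor FLOP/s
            |     peak_bw = 3.35e12      # assumed HBM3 bytes/s
            |     balance = peak_flops / peak_bw   # ~295 FLOPs per byte
            | 
            |     N = 4096               # square bf16 matmul
            |     intensity = (2 * N**3) / (3 * N * N * 2)  # ~1365
            |     print(intensity > balance)   # True: compute-bound
            | 
            | Latency itself gets hidden by oversubscription: keep enough
            | warps and tiles in flight and the math never has to wait on
            | any one outstanding load.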
        
             | muyuu wrote:
             | yep that's true, although AI compute modules do get
             | significant benefit from low latency cache as well
        
         | FuriouslyAdrift wrote:
          | AMD is already in their second generation of the Versal line.
         | 
         | https://www.amd.com/en/products/accelerators/alveo/v80.html
         | 
         | XDNA Architecture
         | 
         | https://www.amd.com/en/technologies/xdna.html
        
         | UncleOxidant wrote:
         | > Then there's a trend towards much shorter numbers. 16 bit
         | floating point? 8 bit? 2 bit? 1 bit?
         | 
         | There was that recent paper titled "The Era of 1-bit LLMs" [0]
          | which was actually suggesting a 1.58-bit LLM (2 bits in
         | practice).
         | 
         | > Someone reading this is probably writing it in VHDL right
         | now, or will be soon.
         | 
         | Yeah, I think I'm in the "will be soon" camp - FPGA board has
         | been ordered. Especially with the 2-bit data types outlined in
         | that paper [0] and more details in [1]. There's really a need
         | for custom hardware to do that 2-bit math efficiently.
         | Customizing one of the simpler open source RISC-V integer
         | implementations seems like something to try here adding in the
         | tiled matrix registers and custom instructions for dealing with
         | them (with the 2 bit data types).
         | 
         | [0] https://arxiv.org/abs/2402.17764 [1]
         | https://github.com/microsoft/unilm/blob/master/bitnet/The-Er...
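          | 
          | For a sense of what the packing might look like, a toy
          | numpy sketch (an ad-hoc 2-bit encoding of my own, not the
          | layout from either paper) where the "multiply" degenerates
          | into add/subtract/skip:
          | 
          |     import numpy as np
          | 
          |     # Ternary weights {-1, 0, +1} as 2-bit codes, 4 per byte.
          |     ENC = {0: 0b00, 1: 0b01, -1: 0b10}
          |     DEC = np.array([0, 1, -1, 0], dtype=np.int8)
          | 
          |     def pack(w):
          |         c = np.array([ENC[int(v)] for v in w], np.uint8)
          |         c = c.reshape(-1, 4)
          |         return c[:, 0] | c[:, 1] << 2 | c[:, 2] << 4 | c[:, 3] << 6
          | 
          |     def dot(packed, x):
          |         codes = np.stack([(packed >> s) & 3
          |                           for s in (0, 2, 4, 6)], axis=1).ravel()
          |         return int(DEC[codes] @ x)   # adds/subtracts only
          | 
          |     w = np.array([1, -1, 0, 1, 0, 0, -1, 1])
          |     x = np.arange(8)
          |     assert dot(pack(w), x) == int(w @ x)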
        
       | uyzstvqs wrote:
       | What is needed are true NPUs as dedicated co-processors,
       | especially for prosumer desktop systems (devs, other
       | professionals, gamers). GPUs work in the enterprise, but they're
       | a hassle to use for AI on the personal computing side of the
       | market. Especially VRAM limitations, but also the lack of a
       | standard open API other than Vulkan (again, using video stuff for
       | AI).
        
         | dartos wrote:
          | Fwiw, Vulkan isn't specifically a graphics API and has had
          | compute-specific features for a while now. (Potentially since
         | its inception)
        
           | the__alchemist wrote:
           | Compared to CUDA, Vulkan is... not fun to code compute in!
           | The serialization bridge and duplicating data structures and
           | functions between CPU and GPU is tedious.
        
             | dartos wrote:
             | I hear both CUDA and Vulkan are not fun to code in.
             | 
             | But yeah Vulkan is famously verbose. It takes about 1000
             | LoC to draw a triangle
        
               | KeplerBoy wrote:
               | CUDA is very much fun to code in!
               | 
               | Nvidia provides devs with great tools (Nsight Systems and
               | Nsight Compute), so you know where you have to optimize.
        
       | jokoon wrote:
        | this is why people should study neuroscience and psychology
       | if they want to advance research in AI.
       | 
       | also things related to graph topology in neural networks maybe,
       | but probably not related to artificial NN.
       | 
       | I was given this video, which I found was pretty interesting:
       | https://www.youtube.com/watch?v=nkdZRBFtqSs (How Developers might
       | stop worrying about AI taking software jobs and Learn to Profit
       | from LLMs - YouTube)
        
         | dartos wrote:
         | I don't think psychology will have any bearing on AI.
         | 
         | I doubt neuroscience will either, but I'm not as sure on that.
         | 
         | The more impressive AI systems we have moved further away from
         | the neuron analogy that came from perceptions.
         | 
         | The whole "intelligence" and "neural" part of AI is a red
         | herring imo. Really poor ambiguous word choice for a specific,
         | technical idea.
        
           | sva_ wrote:
           | > I doubt neuroscience will either, but I'm not as sure on
           | that
           | 
           | The stuff on spiking networks and neuromorphic computing is
           | definitely interesting and inspired by neuroscience, but it
           | currently seems mostly like vaporware
        
             | dartos wrote:
             | Yep, I've heard about spiking networks, but haven't read
             | into them much yet.
        
           | fastball wrote:
           | *perceptrons
        
             | dartos wrote:
             | Darn autocorrect. Thank you.
        
               | actionfromafar wrote:
               | Haha, I didn't get it when I read "perceptions". Thought
               | ... of what? :-D
        
           | nradov wrote:
           | The question is whether current AI technologies represent any
           | progress towards a true human equivalent artificial _general_
           | intelligence. Most likely not, but no one knows for sure. If
           | the answer turns out to be no then real progress will likely
           | require theoretical insights from psychology, neuroscience,
           | and other fields.
        
             | dartos wrote:
             | Fwiw, I don't think we're any closer to general
              | intelligence than we were 5 years ago.
             | 
             | Other than that, I agree, especially since you added "and
             | other fields." Psychology might eventually give us a useful
             | definition of "intelligence," so that'd be something.
             | 
             | Obviously all research can influence other areas of
             | research.
        
           | Symmetry wrote:
           | It's easy to overstate, but shouldn't be understated either
           | with, as an example, solving problems with learning in AI
           | providing insights into how dopamine works in brains.
           | 
           | https://www.technologyreview.com/2020/01/15/130868/deepmind-.
           | ..
           | 
           | There are obvious, huge differences between what goes on in a
            | computer and what happens in a brain. Neurons can't do back
           | propagation is a glaring one. But they do do something that
           | ends up being analogous to back propagation and you can't
           | tell _a priori_ whether some property of AI or neuroscience
           | might be applicable to the other or not.
           | 
           | The best way to learn about AI isn't to learn neuroscience.
            | It's to learn AI. But if I were an AI lab I'd still hire
           | someone to read neuroscience papers and check to see whether
           | they might have something useful in them.
        
         | renewiltord wrote:
         | There are loads of psychologists and neuroscientists today. Has
         | any of them in the last few years produced anything advancing
         | AI? The proof of the pudding is in the eating so if they have
         | at a higher rate than just straight CS/Mathematics and related
         | then there's probably some truth to it.
        
         | chmod775 wrote:
         | I can't seem to figure out the connection between this comment
         | and the article at hand, except that they're both about AI.
        
       | WanderPanda wrote:
       | Is this "just" CUTLASS in user friendly?
        
       | phinnaeus wrote:
       | FYI the caption of the "spirit animals" image says "canadian
       | goose" instead of "Canada Goose".
        
         | fastball wrote:
         | Canadian goose seems better in [current year], to avoid
         | confusion with the clothing brand.
        
         | wglb wrote:
         | An error too often made.
        
         | downrightmike wrote:
         | Don't worry, the Geese are en route to location, resolution
         | incoming. Stand by.
        
           | hoherd wrote:
           | In my experience, Canadian geese are never en route to
           | anywhere. They stay next to the pond year round and crap
           | everywhere you might want to step. EG:
           | https://sanjosespotlight.com/going-to-santa-clara-central-
           | pa...
        
         | adzm wrote:
         | Likely a regional thing; they are consistently called Canadian
         | Geese where I grew up and where I currently live.
        
         | bombcar wrote:
         | It's a Canada Goose from Canada. A Canadian Canada Goose, or
         | Canadian Goose.
        
           | gosub100 wrote:
           | https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffal.
           | ..
        
         | xarope wrote:
         | I am missing the reference to the canadian goose and the
         | retriever puppy as spirit animals. Is that to say the H100 is
         | an ornery thing, but the RTX4090 is friendly?
        
           | Mtinie wrote:
           | I'd assumed (like you) it meant that the H100 is ornery AND
           | pickier about what it consumes, while the RTX4090 is playful
           | and will eat damn near anything within reach of its mouth
           | (with its sharp, velociraptor-like puppy teeth), whether you
           | want it to or not.
           | 
           | But that may be straining the meme somewhat. :)
        
         | adrian_b wrote:
          | I consider the English habit of also using nouns as adjectives
          | a bad one, because it causes many ambiguities, some of which
         | can be very annoying, even if they are a rich source of jokes
         | and word plays.
         | 
         | In most languages the use of a noun as an adjective is marked,
         | by a particle or by an affix or at least by a different stress
         | pattern (like moving the stress to the last syllable), which
         | removes the ambiguities.
         | 
         | So for most non-native speakers "Canadian goose" makes much
         | more sense than "Canada goose" (which may feel like "Canada and
         | a goose" or "a goose that is also Canada" and not like "a goose
         | from Canada").
        
           | actionfromafar wrote:
           | Now you made me think of ways to English Adjective my text
           | for word play... make it stop.
        
           | kitd wrote:
           | "Canada" isn't being used as an adjective though. The name of
           | the species is "Canada Goose", like "Long Island Shellfish"
           | or "Dublin Bay Prawns".
        
           | p0w3n3d wrote:
            | The former noun always describes the latter. "Butter fly" is
            | not a flying butter (as my children's teacher told them, to
            | make a joke about "butterfly") but a fly made of butter
            | instead.
        
         | bn-l wrote:
         | Who cares
        
         | silisili wrote:
          | In my entire lifetime, I've only heard people call them Canadian
         | Geese.
         | 
         | The only time I've ever even seen or heard of Canada
         | Goose/Geese are people on the internet telling others they are
         | wrong.
         | 
         | I think it's time to just accept it as correct.
        
           | FearNotDaniel wrote:
           | Absolutely, it's like living in London and eventually having
           | to accept that tourists will always say "Big Ben" when they
           | mean the clock tower of the Palace of Westminster, which
           | encloses the bell whose actual name is Big Ben. The name of
           | the tower is, de facto, Big Ben, and life gets so much easier
           | when you drop the urge to tell people they are wrong all the
           | time...
           | 
           | Edit: TIL the tower was properly renamed "Elizabeth Tower" in
           | 2012 [0] but I seriously doubt a single person in the last 12
           | years has ever used that name...
           | 
           | [0] https://en.wikipedia.org/wiki/Big_Ben
        
             | globular-toast wrote:
             | I wouldn't put that in the same category. If you say Canada
             | Goose everyone still knows what you mean. If you say
             | Elizabeth Tower, they probably don't.
        
           | hatthew wrote:
           | In real life, I have only ever heard Canada Goose.
        
       | apsec112 wrote:
       | Interesting! Would this support fp8? Does anyone know how it
       | would compare to Triton?
        
       | renonce wrote:
       | > NVIDIA's lies. This is an extraordinarily misleading
       | representation of the actual 128b swizzled wgmma layout. This
       | diagram cost us three weeks of life that we will not get back,
       | hence the public shaming.
       | 
       | Wondering if anyone would be surprised that a huge amount of
       | progress in AI is on the engineering side (optimizing matmuls),
       | and that a huge portion of the engineering is about reverse
       | engineering NVIDIA chips
        
         | DeathArrow wrote:
         | Architecture doesn't make a difference. Big enough models
         | trained with big enough data tend to give the same results
         | regardless of architecture. So yes, most advances in AI are
         | mostly due to the fact we can now multiply matrices very fast.
        
           | elcomet wrote:
           | That's not completely true. The architecture must behave well
           | for scaling, which is not trivial. Basic multi-layer
            | perceptrons, for example, do not scale well: the gradient will
           | vanish or explode deeper in the network.
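            | 
            | A quick numpy illustration of the scaling problem (a
            | linearized toy, not a real training run): the backward
            | signal through D plain dense layers grows like gain^D, so
            | it collapses or blows up unless the architecture keeps the
            | gain near 1 (residuals, normalization, careful init):
            | 
            |     import numpy as np
            | 
            |     rng = np.random.default_rng(0)
            | 
            |     def back_signal(scale, depth=50, width=256):
            |         g = np.ones(width)
            |         for _ in range(depth):
            |             W = rng.normal(0, scale / np.sqrt(width),
            |                            (width, width))
            |             g = W.T @ g        # linearized backward pass
            |         return np.linalg.norm(g)
            | 
            |     print(back_signal(0.5))    # ~1e-14: vanishes
            |     print(back_signal(2.0))    # ~1e+16: explodes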
        
             | 3abiton wrote:
             | And data quality. Ensuring the sourcing and quality is very
             | important to get a good model.
        
               | fleischhauf wrote:
                | This. If you have money to spend on improving your model,
                | more training data is the first thing I'd take a look at.
        
             | Tarrosion wrote:
             | How do modern foundation models avoid multi-layer
             | perceptron scaling issues? Don't they have big feed-forward
             | components in addition to the transformers?
        
               | heavenlyblue wrote:
               | They don't do global optimisation of all layers at the
               | same time, instead training all layers independently of
               | each other.
        
           | rfoo wrote:
           | idk, they do give the same results, but given the memory
           | bottleneck it feels like we are at a point when architecture
           | innovations matter again, for example check out DeepSeek V2
           | tech report, they modded model arch specifically for lower
           | cost inference (by making k/v cache smaller)
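            | 
            | The k/v cache really is the dominant term at long context;
            | rough arithmetic with assumed toy numbers (not DeepSeek's
            | actual config):
            | 
            |     # Assumed toy transformer config, bf16 cache.
            |     layers, kv_heads, head_dim = 32, 8, 128
            |     seq_len, bytes_per = 8192, 2
            |     kv = (2 * layers * kv_heads * head_dim
            |           * seq_len * bytes_per)        # keys and values
            |     print(kv / 2**30, "GiB per sequence")  # 1.0
            | 
            | Shrink that per-token footprint (fewer k/v heads, latent
            | compression, quantized cache) and you fit more concurrent
            | sequences into the same HBM, which is where serving cost
            | actually goes.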
        
           | __loam wrote:
           | Different architecture can result in hundreds of millions of
           | dollars more in training costs no?
        
       | latchkey wrote:
       | Really impressed by the writing style of this post and very much
       | looking forward to this on AMD MI300x. Let me know if you want
       | some time on mine.
        
         | jsemrau wrote:
         | Really? It gives me PTSD from the Wallstreetbets days.
        
           | forrestthewoods wrote:
           | I also enjoyed the article's style. I utterly despise
           | "academic paper speak". It is, imho, not the most effective
           | style to communicate complex ideas. I find it so much easier
           | to learn from a more casual "blog post" or in-person
           | presentation over stiff, rigid academic speak.
        
             | kaycey2022 wrote:
             | I find both to be useful in different stages. The casual
             | style is very helpful when starting out. But once I have
             | put in a few weeks or months of study in, then the rigor
             | and preciseness of academic style is good as well.
             | 
             | I agree with you in the sense that something has "died" in
              | writings that follow academic paper speak these days. Just
             | yesterday I saw an ancient article surfaced by Scientific
             | American and Peter Norvig on System Analysis by Strachey.
             | It uses quite a bit of formal language but is super
             | approachable at the same time. That kind of skill is rarely
             | seen these days.
        
           | david927 wrote:
           | > _the Wallstreetbets days._
           | 
           | https://twitter.com/TheRoaringKitty/status/17900418133798504.
           | ..
        
         | globular-toast wrote:
         | Good writing is clear and unambiguous. With speech there is an
         | opportunity to interrupt and ask for clarification. Writing has
         | one chance to get the message across. A reader shouldn't have
         | to consult knowyourmeme.com to figure out what the heck the
         | authors are trying to say. I don't even know what the title
         | means here. That's how far they've missed the mark.
        
           | _obviously wrote:
           | Wow that really sucks for you. I just read it in 5 minutes
            | and feel much more informed about the subject of nvidia
           | memory twizzlization. It's kind of funny to me that
           | presumably young college guys are writing in a style that's
           | very readable for my old ass.
        
             | unethical_ban wrote:
             | >that really sucks for you
             | 
             | How can I put this in your vernacular...
             | 
             | "Most polite genZ meme enjoyer"
        
           | aetimmes wrote:
           | Even if you're not familiar with the "go brrr" meme (which is
           | the only use of meme-idiom in the article and is used exactly
           | twice), its meaning is easily inferred via context clues from
           | the opening paragraphs.
           | 
           | Good writing is also entertaining and engaging.
        
             | globular-toast wrote:
             | Keyword being _also_.
        
             | throwaway1492 wrote:
             | As someone who witnessed A-10 CAS fuck some stuff up in a
              | combat zone, i.e. the real "brrrrt", I've been mystified by
              | the meme and current usage. No one knows where it comes from
             | nor the slaughter it represents.
        
               | aoeusnth1 wrote:
               | You're mistaken, the "go brrr" format comes from the
               | money printer meme in 2020.
        
               | onemiketwelve wrote:
                | as intense as an A-10 might be, it's short-lived and only
               | affects a few dudes on the receiving end. When the
               | federal reserve goes brrr, it has far reaching impact
               | that affects every single person in the global economy.
               | 
               | https://brrr.money/
        
         | tracker1 wrote:
         | Have you done much AI work against AMD products? I'm not going
         | to plunk down $2500+ for an RTX 4090, but have been considering
         | an RX 7900XTX for playing around with, or at least getting
         | started. Just curious how well it will or won't work in
         | practice, or if saving a bit more and getting a 7900 XT over
         | the XTX might be a better option, and how much less vram might
         | impact usefulness in practice.
        
           | latchkey wrote:
           | My only work with consumer AMD GPUs was mining ethereum, I
           | had 150,000 of them.
           | 
           | If you want to use enterprise AMD gpus, I'm renting them.
           | That said, I haven't even had a chance to run/play with them
           | myself yet, they have been rented since I got them last
           | month.
           | 
           | Yes, we are getting more.
        
       | brcmthrowaway wrote:
       | NVIDIA needs to be broken up
        
         | huhlig wrote:
         | Into what? Where would you draw such lines?
        
           | robocat wrote:
           | Into tiles ;-p
           | 
           | GPU compute is already broken up - there is a supply chain of
           | other cooperating players that work together to deliver GPU
           | compute to end users:
           | 
           | TSMC, SK hynix, Synopsys, cloud providers (Azure/Amazon
           | etcetera), model providers (OpenAI/Anthropic etcetera).
           | 
           | Why single out NVidia in the chain? Plus the different
           | critical parts of the chain are in different jurisdictions.
           | Split up NVidia and somebody else will take over that spot in
           | the ecosystem. This interview with Synopsys is rather
           | enlightening: https://www.acquired.fm/episodes/the-software-
           | behind-silicon...
           | 
           | How does the profit currently get split between the different
           | links? Profit is the forcing variable for market cap and
           | profit is the indicator of advantage. Break up NVidia and
           | where does the profit move?
        
         | latchkey wrote:
          | The better alternative is to root for AMD and others to develop
          | their own products so that, regardless of whether NV gets broken
          | up, there are alternative solutions for people to use. They all
          | leapfrog each other with new releases now anyway. Why put all
          | your eggs in one basket?
        
           | simondotau wrote:
           | George Hotz went down the AMD rabbit hole for a while and
           | concluded that the driver software -- more precisely the
           | firmware which runs on the cards themselves -- is so badly
           | written that there's no hope of them becoming serious
           | contenders in AI without some major changes in AMD's
           | priorities.
        
             | latchkey wrote:
             | I'm not defending their software. It does honestly have a
             | ton of issues.
             | 
             | George Hotz tried to get a consumer card to work. He also
             | refused my public invitations to have free time on my
             | enterprise cards, calling me an AMD shill.
             | 
             | AMD listened and responded to him and gave him even the
             | difficult things that he was demanding. He has the tools to
             | make it work now and if he needs more, AMD already seems
             | willing to give it. That is progress.
             | 
             | To simply throw out George as the be-all and end-all of a
             | $245B company... frankly absurd.
        
               | shmerl wrote:
                | Indeed, AMD being willing to open its firmware is
                | something Nvidia has never done.
        
               | creato wrote:
               | The fact that consumer and "pro"(?) GPUs don't use
               | (mostly) the same software is not confidence inspiring.
               | It means that AMD's already apparently limited capacity
               | for software development is stretched thinner than it
               | otherwise would be.
               | 
               | Also, if the consumer GPUs are hopelessly broken but the
               | enterprise GPUs are fine, that greatly limits the number
               | of people that can contribute to making the AMD AI
               | software ecosystem better. How much of the utility of the
               | NVIDIA software ecosystem comes from gaming GPU owners
               | tinkering in their free time? Or grad students doing
               | small scale research?
               | 
                | I think these kinds of things are a big part of why
                | NVIDIA's software is so much better than AMD's right now.
        
               | wruza wrote:
               | _that greatly limits the number of people that can
               | contribute to making the AMD AI software ecosystem
               | better_
               | 
               | I'd say it simply dials it down to zero. No one's gonna
               | buy an enterprise AMD card for playing with AI, so no
               | one's gonna contribute to that either. As a local AI
               | enthusiast, this "but he used consumer card" complaint
               | makes no sense to me.
        
               | latchkey wrote:
               | > _No one's gonna buy an enterprise AMD card for playing
               | with AI_
               | 
               | My hypothesis is that the buying mentality stems from the
               | inability to rent. Hence, me opening up a rental
               | business.
               | 
               | Today, you can buy 7900's and they work with ROCm. As
               | George pointed out, there are some low level issues with
               | them, that AMD is working with him to resolve. That
               | doesn't mean they absolutely don't work.
               | 
               | https://rocm.docs.amd.com/projects/install-on-
               | linux/en/lates...
        
               | latchkey wrote:
               | Agreed that AMD needs to work on the developer flywheel.
               | Again, not defending their software.
               | 
               | One way to improve the flywheel and make the ecosystem
               | better, is to make their hardware available for rent.
               | Something that previously was not available outside of
               | hyperscalers and HPC.
        
               | simondotau wrote:
               | > To simply throw out George as the be-all and end-all of
               | a $245B company... frankly absurd.
               | 
               | I didn't do that, and I don't appreciate this misreading
               | of my post. Please don't drag me into whatever drama
               | is/was going on between you two.
               | 
               | The only point I was making was that George's experience
               | with AMD products reflected poorly on AMD software
               | engineering circa 2023. Whether George is ultimately
               | successful in convincing AMD to publicly release what he
                | needs is beside the point. Whether he is ultimately
                | successful in convincing their GPUs to perform to his
                | expectations is beside the point.
        
               | latchkey wrote:
               | > _The only point I was making was that George 's
               | experience with AMD products reflected poorly on AMD
               | software engineering circa 2023._
               | 
                | Except that isn't what you said...
               | 
               | "there's no hope of them becoming serious contenders in
               | AI without some major changes in AMD's priorities"
               | 
                | My point in showing you (not dragging you into) the drama
                | is that George is not a credible witness for your beliefs.
        
             | callalex wrote:
             | Egohotz is brilliant in many ways, but taking him at his
             | word when it comes to working with others has been a
             | mistake since at least around 2010. This is well
             | documented.
        
               | simondotau wrote:
               | Who said anything about taking him at his word?
               | Everything he has done regarding AMD GPUs has been in
               | public. I'm sure there are plenty of valid criticisms one
               | can make of his skills/strategy/attitude/approach, but
               | accusing him of being _generally_ untrustworthy in this
               | endeavour is utterly nonsensical.
        
               | imtringued wrote:
               | I can reliably crash my system using kobold.cpp with
               | Vulkan running an AMD GPU. All it takes is a slightly too
               | high batch size.
        
               | latchkey wrote:
               | What is slightly too high of a batch size? If max size is
               | 100 and you're at 99, of course 100 will crash it.
        
           | PeterisP wrote:
           | We've rooted for that for years, but looking at what AMD does
           | and doesn't do, I've lost hope for this. AMD don't seem to
           | want to do what it takes; it's not that they're trying and
           | failing, but they're simply not even committing to attempt to
           | do the same things that nVidia does for their software
           | infrastructure.
        
             | latchkey wrote:
             | We are still early. I started my bet on Lisa Su around
             | August of last year... she publicly doubled down on AI
             | around October/November. Dec 6th, MI300x was announced.
             | 
             | Big ships take time to course correct. Look at their hiring
             | for AI related positions and release schedule for ROCm. As
             | well as multiple companies like mine springing up to
             | purchase MI300x and satisfy rental demand.
             | 
             | It is only May. We didn't even receive our AIA's until
             | April. Another company just announced their MI300x hardware
             | server offering today.
        
         | silveraxe93 wrote:
          | NVIDIA is so damn good at its job that it took over the market.
          | There are no regulatory or similar barriers to entry. It's
          | literally that they do a damn good job and the competition
          | can't be as good.
          | 
          | You look at that and want to take a sledgehammer to a golden
          | goose? I don't get these people.
        
           | michaelt wrote:
           | True: nvidia has been consistently investing for over a
           | decade.
           | 
           | They saw there was nascent compute use of GPUs, using
           | programmable shaders. They produced CUDA, made it accessible
           | on every one of their GPUs (not just the high-markup
           | professional products) and they put resources into it year
           | after year after year.
           | 
           | Not just investing in the product, also the support tools
           | (e.g. a full graphical profiler for your kernels) and
           | training materials (e.g. providing free cloud GPU credits for
           | Udacity courses) and libraries and open source contributions.
           | 
           | This is what it looks like when a company has a vision, plans
           | beyond the next quarter, and makes long-term investments.
        
       | diginova wrote:
        | What should I do if I want to understand such articles in full?
        | Where should I start on the roadmap?
        
         | kolinko wrote:
          | This is a good course on GPU programming. Around lesson 4.0
          | you'll have the required basics:
          | https://youtube.com/playlist?list=PLzn6LN6WhlN06hIOA_ge6Srgd...
         | 
         | Also, write your own cuda kernel to do vector-matrix
         | multiplication (if you use pycuda, you can focus on the kernel,
         | and write everything else with python). Just tell chatgpt that
         | you want to write your own implementation that multiplies a
         | 4000-element vector by 4000x12000 matrix, and to guide you
         | through the whole process.
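          | 
          | A minimal pycuda sketch of that exercise (the kernel and launch
          | configuration are just an illustration: one thread per output
          | column, no shared-memory tiling or any other optimization):
          | 
          |     import numpy as np
          |     import pycuda.autoinit              # creates a CUDA context
          |     import pycuda.driver as drv
          |     from pycuda.compiler import SourceModule
          | 
          |     N, M = 4000, 12000                  # vector length, matrix columns
          | 
          |     mod = SourceModule("""
          |     __global__ void vecmat(const float *vec, const float *mat,
          |                            float *out, int n, int m)
          |     {
          |         // one thread per output column, walking down that column
          |         int col = blockIdx.x * blockDim.x + threadIdx.x;
          |         if (col >= m) return;
          |         float acc = 0.0f;
          |         for (int i = 0; i < n; i++)
          |             acc += vec[i] * mat[i * m + col];   // row-major mat[i][col]
          |         out[col] = acc;
          |     }
          |     """)
          |     vecmat = mod.get_function("vecmat")
          | 
          |     vec = np.random.randn(N).astype(np.float32)
          |     mat = np.random.randn(N, M).astype(np.float32)
          |     out = np.zeros(M, dtype=np.float32)
          | 
          |     block = (256, 1, 1)
          |     grid = ((M + block[0] - 1) // block[0], 1)
          |     vecmat(drv.In(vec), drv.In(mat), drv.Out(out),
          |            np.int32(N), np.int32(M), block=block, grid=grid)
          | 
          |     assert np.allclose(out, vec @ mat, atol=1e-1)
          | 
          | Once the naive version works, profiling it and then adding
          | coalescing-friendly access patterns and shared-memory tiling is
          | where most of the learning happens.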
         | 
          | For renting GPUs, runpods is great - right now they have
          | everything from lower-tier GPUs to H100s. You can start with a
          | lesser GPU at the beginning.
        
         | abstractcontrol wrote:
         | For a deep dive, maybe take a look at the Spiral matrix
         | multiplication playlist:
         | https://www.youtube.com/playlist?list=PL04PGV4cTuIWT_NXvvZsn...
         | 
         | I spent 2 months implementing a matmult kernel in Spiral and
         | optimizing it.
        
           | justplay wrote:
            | sorry for the noob question, but how is gpu programming helpful?
        
             | abstractcontrol wrote:
             | NNs for example are (mostly) a sequence of matrix
             | multiplication operations, and GPUs are very good at those.
             | Much better than CPUs. AI is hot at the moment, and Nvidia
             | is producing the kind of hardware that can run large models
             | efficiently which is why it's a 2 trillion-dollar company
             | right now.
             | 
             | However, in the Spiral series, I aim to go beyond just
             | making an ML library for running NN models and break new
             | ground.
             | 
             | Newer GPUs actually support dynamic memory allocation,
             | recursion, and the GPU threads have their own stacks, so
             | you could in fact treat them as sequential devices and
              | write games and simulators directly on them. I think once I
              | finish the NL Holdem game, I'll be able to get over 100x
              | improvements by running the whole program on the GPU
             | versus the old approach of writing the sequential part on a
             | CPU and only using the GPU to accelerate a NN model
             | powering the computer agents.
             | 
             | I am not sure if this is a good answer, but this is how GPU
             | programming would be helpful to me. It all comes down to
             | performance.
             | 
             | The problem with programming them is that the program you
             | are trying to speed up needs to be specially structured, so
             | it utilizes the full capacity of the device.
        
           | selimthegrim wrote:
           | Are Winograd's algorithms useful to implement as a learning
           | exercise?
        
             | abstractcontrol wrote:
              | Never tried those, so I couldn't say. I guess they would be.
             | 
             | Even so, creating all the abstractions needed to implement
             | even regular matrix multiplication in Spiral in a generic
             | fashion took me two months, so I'd consider that good
             | enough exercise.
             | 
             | You could do it a lot faster by specializing for specific
             | matrix sizes, like in the Cuda examples repo by Nvidia, but
             | then you'd miss the opportunity to do the tensor magic that
             | I did in the playlist.
        
               | selimthegrim wrote:
               | You are the author of the playlist/maker of the videos?
        
       | DeathArrow wrote:
        | So do their kernels and library also speed up the RTX 4090?
        
       | cl3misch wrote:
       | > The unswizzled shared memory layouts suffer from very poor
       | coalescing
       | 
       | If I didn't know any better I'd consider it technobabble
        
       | imiric wrote:
       | Hasn't this research been done by teams building NPUs today? E.g.
       | chips built by Groq use an architecture built specifically for
       | AI, which is why they're able to deliver the performance they do.
       | On the consumer side, Apple silicon is also quite capable.
       | 
       | I'm not in this field at all, but it seems to me that using
       | general purpose processors that communicate over (relatively)
       | slow lanes can only get us so far. Rethinking the design at the
       | hardware level, and eventually bringing the price down for the
       | consumer market seems like a better long-term strategy.
        
         | resource_waste wrote:
         | >On the consumer side, Apple silicon is also quite capable.
         | 
         | I am not sure that is true. A glance/or long stay at the reddit
         | localllama subreddit basically has a bunch of frustrated CPU
         | users trying their absolute best to get anything to work at
         | useful speeds.
         | 
          | When you can get an Nvidia GPU for a few hundred dollars, or a
          | full-blown gaming laptop with a 4050 and 6GB of VRAM for $900,
          | it's hard to call CPU-based AI capable.
          | 
          | Heck, we don't have GPUs at work, and CPU-based is just not
          | really reasonable without using tiny models and waiting. We
          | ended up requesting GPU computers.
         | 
         | I think there is a 'this is technically possible', and there is
         | a 'this is really nice'. Nvidia has been really nice to use.
         | CPU has been miserable and frustrating.
        
           | imiric wrote:
           | I don't think NVIDIA's reign will last long. The recent AI
           | resurgence is not even a decade old. We can't expect the
           | entire industry to shift overnight, but we are seeing rapid
           | improvements in the capability of non-GPU hardware to run AI
           | workloads. The architecture change has been instrumental for
           | this, and Apple is well positioned to move the field forward,
           | even if their current gen hardware is lacking compared to
           | traditional GPUs. Their silicon is not even 5 years old, yet
           | it's unbeatable for traditional workloads and power
            | efficiency, and competitive for AI ones. What do you think it
            | will be capable of 5 years from now? The same goes for Groq and
           | other NPU manufacturers. Betting on NVIDIA doesn't seem like
           | a good long-term strategy, unless they also shift their
           | architecture.
        
           | serialx wrote:
            | Actually, llama.cpp running on Apple silicon uses the GPU
            | (Metal compute shaders) to run inference on LLMs. Token
            | generation is also very memory-bandwidth bottlenecked. On
            | high-end Apple silicon it's about 400GB/s to 800GB/s,
            | comparable to the NVIDIA RTX 4090, which has a memory
            | bandwidth of about 1000GB/s. Not to mention that Apple
            | silicon has a unified memory architecture and offers
            | high-memory models (128GB, up to 192GB), which is necessary
            | to run large LLMs like Llama 3 70B, which roughly takes
            | 40~75GB of RAM to work reasonably.
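            | 
            | As a rough back-of-the-envelope sketch (the numbers below are
            | assumptions, not measurements): batch-1 token generation has
            | to stream essentially all of the weights from memory once per
            | token, so bandwidth puts an upper bound on tokens per second:
            | 
            |     # upper bound ~= memory bandwidth / bytes of weights read per token
            |     def max_tokens_per_sec(model_bytes, bandwidth_bytes_per_sec):
            |         return bandwidth_bytes_per_sec / model_bytes
            | 
            |     llama3_70b_q4 = 40e9    # ~40 GB of weights at ~4-bit (assumed)
            |     apple_hi_end  = 800e9   # ~800 GB/s unified memory (assumed)
            |     rtx_4090      = 1008e9  # ~1 TB/s GDDR6X (assumed)
            | 
            |     print(max_tokens_per_sec(llama3_70b_q4, apple_hi_end))  # ~20 tok/s
            |     print(max_tokens_per_sec(llama3_70b_q4, rtx_4090))      # ~25 tok/s,
            |                                                             # if it fit in 24 GB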
        
             | resource_waste wrote:
             | These are really nice rants of techno-blabble.
             | 
             | The reality of things: Its not useful.
             | 
             | No one actually uses it.
             | 
              | You can post Apple's official tech specs, but it doesn't
              | change the fact that people aren't using it, because it
              | doesn't work (or at least isn't as cost effective).
             | 
             | >Not to mention that Apple silicon has unified memory
             | architecture and has high memory models (128GB, up to
             | 192GB)
             | 
             | This NEEDS to end. This integrated GPU nonsense is not
             | equivalent and is disinformation. It is immoral to continue
             | to push this narrative.
             | 
              | Also, 128GB isn't high memory. 512GB is high memory.
        
               | brrrrrm wrote:
               | I use it all the time?
        
               | imtringued wrote:
               | The number of people running llama3 70b on NVidia gaming
               | GPUs is absolutely tiny. You're going to need at least
               | two of the highest end 24 GB VRAM GPUs and even then you
               | are still reliant on 4 bit quantization with almost
               | nothing left for your context window.
        
       | roschdal wrote:
       | ChatGPT - the largest electricity bill in the world.
        
       | winternewt wrote:
       | I believe that reducing the power consumption and increasing the
       | speed of AI inference will be best served by switching to analog,
       | approximate circuits. We don't need perfect floating-point
        | multiplication and addition, we just need something that takes
        | two input voltages and produces an output voltage that is close
       | enough to what multiplying the input voltages would yield.
        
         | brap wrote:
         | I don't know why you're being downvoted, that's an active area
         | of research AFAIK
        
           | gitfan86 wrote:
           | Maybe because that is a VERY different problem than the one
           | discussed here.
           | 
            | Building a single analog chip with 1 billion neurons would
            | cost billions of dollars in a best-case scenario. An Nvidia
            | card with 1 billion digital neurons is in the hundreds-of-
            | dollars range.
           | 
           | Those costs could come down eventually, but at that point
           | CUDA may be long gone.
        
         | brazzy wrote:
        | Sounds pretty impossible to me to do that with a sufficient
        | combination of range and precision.
        
           | atoav wrote:
            | What do you mean by impossible? You are aware that what radio
            | equipment does is often the equivalent of analog operations
            | like multiplication, addition, etc., just at high frequencies?
            | 
            | Sure, accuracy is an issue, but this is not as impossible as
            | you may think. The main question will be whether the benefits
            | of going analog outweigh the issues arising from it.
        
             | Symmetry wrote:
              | In general the problem with analog is that every sequential
              | operation introduces noise. If you're just doing a couple
              | of multiplications to frequency-shift a signal up and down,
              | that's fine. But it becomes a real problem if you've got
              | hundreds of steps and you're also trying to pack huge
              | numbers of parallel steps into a very small physical area.
        
         | dnedic wrote:
         | How do you inspect what is happening then without having ADCs
         | sampling every weight, taking up huge die area?
        
         | jkaptur wrote:
         | Maybe a silly question (I don't know anything about this) - how
         | do you program / reprogram it?
        
           | Arch485 wrote:
           | Realistically, you'd train your model the same way it's done
           | today and then custom-order analog ones with the weights
           | programmed in. The advantage here would be faster inference
           | (assuming analog circuits actually work out), but custom
           | manufacturing circuits would only really work at scale.
           | 
           | I don't think reprogrammable analog circuits would really be
           | feasible, at least with today's tech. You'd need to modify
           | the resistors etc. to make it work.
        
         | rsp1984 wrote:
         | TBH that sounds like a nightmare to debug.
        
         | danielheath wrote:
          | I know someone working in this direction; they've described the
          | big challenges as:
          | 
          | * Finding ways to use extant chip fab technology to produce
          |   something that can do analog logic. I've heard CMOS flash
          |   presented as a plausible option.
          | 
          | * Designing something that isn't an antenna.
          | 
          | * You would likely have to finetune your model for each
          |   physical chip you're running it on (the manufacturing
          |   tolerances aren't going to give exact results).
         | 
         | The big advantage is that instead of using 16 wires to
         | represent a float16, you use the voltage on 1 wire to represent
         | that number (which plausibly has far more precision than a
         | float32). Additionally, you can e.g. wire two values directly
         | together rather than loading numbers into an ALU, so the die
         | space & power savings are potentially many, many orders of
         | magnitude.
        
           | bobmcnamara wrote:
           | > which plausibly has far more precision than a float32
           | 
            | +/- 1e-45 to 3.4e38. Granted, roughly half of that range is
            | between -1 and 1.
           | 
           | When we worked with low power silicon, much of the
           | optimization was running with minimal headroom - no point
           | railing the bits 0/1 when .4/.6 will do just fine.
           | 
           | > Additionally, you can e.g. wire two values directly
           | together rather than loading numbers into an ALU
           | 
           | You may want an adder. Wiring two circuit outputs directly
           | together makes them fight, which is usually bad for signals.
        
           | tasty_freeze wrote:
           | > which plausibly has far more precision than a float32
           | 
           | If that was true, then a DRAM cell could represent 32 bits
           | instead of one bit. But the analog world is noisy and lossy,
           | so you couldn't get anywhere near 32 bits of
           | precision/accuracy.
           | 
           | Yes, very carefully designed analog circuits can get over 20
           | bits of precision, say A/D converters, but they are huge
           | (relative to digital circuits), consume a lot of power, have
           | low bandwidth as compared to GHz digital circuits, and
           | require lots of shielding and power supply filtering.
           | 
            | This is spit-balling, but the precision of the circuits you
            | could create for a neural-network-type chip is certainly under
            | 8 bits, maybe 6 bits. But it gets worse. Unlike digital
            | circuits, where a signal can be copied losslessly, a chain of
            | analog circuits compounds the noise and accuracy losses stage
            | by stage. To make it work you'd need frequent requantization
            | to prevent getting nothing but mud out.
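            | 
            | A toy simulation of that compounding (purely illustrative
            | numbers: a unit-scale signal pushed through a chain of stages
            | that each add a little Gaussian noise, with and without
            | snapping back to ~6-bit levels between stages):
            | 
            |     import numpy as np
            | 
            |     rng = np.random.default_rng(0)
            |     signal = rng.uniform(-1, 1, size=10_000)
            | 
            |     def run_chain(x, stages=1000, noise=3e-3, requantize=False):
            |         for _ in range(stages):
            |             x = x + rng.normal(0.0, noise, size=x.shape)  # per-stage analog noise
            |             if requantize:
            |                 x = np.round(x * 32) / 32   # snap back to ~6-bit levels
            |         return x
            | 
            |     for rq in (False, True):
            |         err = run_chain(signal, requantize=rq) - signal
            |         print("requantize =", rq, "error std =", round(err.std(), 4))
            | 
            | Without requantization, 1000 stages of 0.3% noise drift to
            | roughly 10% error (it grows like the square root of the number
            | of stages); with it, the error stays pinned near the initial
            | ~1% quantization error.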
        
         | cptroot wrote:
         | Here's an example of Veritasium talking about this from 2022:
         | https://www.youtube.com/watch?v=GVsUOuSjvcg
        
         | Symmetry wrote:
          | I think we're far away from analog circuits being practically
          | useful, but one place where we might embrace the tolerance for
          | imprecision is in noisy digital circuits: accepting that one in
          | a million, say, bits in an output will be flipped to achieve a
          | better performance/power ratio. Probably not when working with
          | float32s, where a single infinity[1] could totally mess things
          | up, but for int8s the occasional 128 when you wanted a 0 seems
          | like something that should be tolerable.
          | 
          | [1] Are H100s' matrix floating point units actually IEEE 754
          | compliant? I don't actually know.
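          | 
          | A tiny illustration of that asymmetry (the error rate and values
          | below are assumptions, nothing measured): toggle the high bit of
          | a few int8 activations, versus turning a single float32 into
          | +inf, then sum:
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     acts8 = rng.integers(-20, 20, size=1_000_000, dtype=np.int8)
          |     acts32 = acts8.astype(np.float32)
          |     clean_sum = int(acts8.astype(np.int64).sum())
          | 
          |     hit = rng.choice(acts8.size, size=3, replace=False)  # a few rare bit errors
          |     acts8[hit] ^= np.int8(-128)   # toggle the int8 high bit: 0 becomes -128, etc.
          |     acts32[hit[0]] = np.inf       # the float32 failure mode
          | 
          |     print(clean_sum, int(acts8.astype(np.int64).sum()))  # shifts by a few hundred at most
          |     print(acts32.sum())           # a single inf poisons the whole reduction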
        
       | _spl wrote:
       | It reminds me of when I first read about superscalar CPU
       | architecture and was amazed. GPUs are really next level.
        
       | DeathArrow wrote:
       | It would be nice if such improvements find their way in pytorch
       | and scikit-learn.
        
         | kmacdough wrote:
          | I'm sure they will. Right now, though, it's bleeding edge, and
          | it'll take some time for these ideas to mature and be adapted
          | to the particular idioms of these more stable packages.
        
       | bombela wrote:
        | I cannot tell for sure if the units are really all powers of 10.
        | 
        | I found a datasheet that states 80GB of VRAM, and a BAR of 80GiB.
        | All caches are also in powers of two. The bandwidths are all
        | powers of 10, though.
       | 
       | https://www.nvidia.com/content/dam/en-zz/Solutions/gtcs22/da...
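        | 
        | For what it's worth, the two prefixes differ by about 7% at that
        | size:
        | 
        |     >>> 80 * 10**9           # 80 GB (decimal)
        |     80000000000
        |     >>> 80 * 2**30           # 80 GiB (binary)
        |     85899345920
        |     >>> 80 * 2**30 / 80e9
        |     1.073741824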
        
       | joaquincabezas wrote:
       | wow their graphs at the GitHub README
       | (https://github.com/HazyResearch/ThunderKittens/blob/main/att...)
       | make me extremely dizzy. Are these wavy bars even legal? :P
        
         | bogtog wrote:
         | I second this. It's like they're trying to incorporate some
         | optical illusion. I'd even prefer just seeing numbers without
         | any bars
        
         | hoosieree wrote:
         | It looks like the xkcd theme for matplotlib[1]. But I agree the
         | waves are too extreme.
         | 
         | [1]:
         | https://matplotlib.org/stable/gallery/showcase/xkcd.html#sph...
        
       | badgersnake wrote:
        | That's the whole point: VCs invested heavily in GPUs anticipating
        | a crypto boom, and when that never happened they had to find some
        | other snake oil to peddle that happened to require GPUs.
        
         | verbify wrote:
          | My experience is that when crypto was in the news, my non-
          | technical friends, family, and colleagues would ask me what
          | bitcoin is and were generally confused.
         | 
         | My experience with the AI boom couldn't be more different -
         | everyone from my colleagues to my mum are using chatgpt as a
         | daily tool.
         | 
         | I really don't think that AI and crypto are comparable in terms
         | of their current practical usage.
        
           | kkielhofner wrote:
           | Comparing crypto and AI is really tired and you make the best
           | point - real people are using these GPUs to actually do
           | things of value and improve their daily lives.
           | 
           | At the peak of the crypto boom/hype cycle I took on a little
           | project to look at the top 10 blockchain
           | networks/coins/whatever.
           | 
           | From what I could tell a very, very, very generous estimate
           | is that crypto at best has MAUs in the low tens of millions.
           | 
           | ChatGPT alone got to 100 million MAUs within a year of
           | release and has only grown since.
           | 
           | ChatGPT 10x'd actual real world usage of GPUs (and resulting
           | power and other resources) in a year vs ~15 years for crypto.
           | 
           | > I really don't think that AI and crypto are comparable in
           | terms of their current practical usage.
           | 
           | A massive understatement!
        
             | latchkey wrote:
             | GPUs stopped being used for crypto because Ethereum
             | switched from PoW to PoS and that decimated the whole gpu
             | mining industry. Ethereum was the only profitable thing to
             | mine, that also had a usecase. The rest of the chains
             | dumped in price and became unprofitable to mine at scale.
             | Not enough market depth to unload the tokens at scale.
             | 
             | In other words, it has nothing to do with AI.
        
           | adra wrote:
            | Wow, what a difference in perspective. I've met maybe a few
            | people, period, who have even mentioned that they've ever
            | used AI tools in their personal lives, frequency be damned.
            | Maybe you're just a lot more insistent in weaving questions
            | about using AI tools into daily conversation.
            | 
            | In a work setting at a tech company, there seems to be a
            | handful who are very much in love with AI, a bunch who use it
            | here or there, and a large majority who (at least publicly)
            | don't even use it. It'd be interesting to see what
            | company-enforced spyware would say about AI uptake though,
            | for real.
        
       | panki27 wrote:
       | Warp scheduler, 4 quadrants, tensor memory accelerator,
       | unswizzled wgmma layouts...
       | 
       | The line between GPU lingo and Star Trek technobabble fades away
       | further and further.
        
         | Agentlien wrote:
         | Your comment prompted me to take a step back and look at these
         | terms with new eyes. That made me smile, because you're so
         | right.
        
         | araes wrote:
          | I was somewhat aware of it while reading the article, yet
          | "we're warping through the quadrant in our tensor accelerator"
          | is pretty Trek.
          | 
          | I've had that thought occasionally with some of the other
          | articles: what it must read like to somebody who gets a ref
          | link to an article over here, as if they'd wandered into some
          | Trek nerd convention discussing warp cores.
        
           | winwang wrote:
           | I mean, if we're talking about "accelerating by modifying the
           | metric tensor" then yeah, that would be pretty sci-fi :)
           | 
           | https://en.wikipedia.org/wiki/Metric_tensor_(general_relativ.
           | ..
        
       | weinzierl wrote:
       | _" For this post, we're going to focus on the NVIDIA H100 [...
       | because] we think the trends it implies are going to continue in
       | future generations, and probably from other manufacturers, too."_
       | 
       | Is it though? Wouldn't we expect to see more advanced packaging
       | technology eventually?
       | 
        | If that happens, the increased memory bandwidth could be an
        | enabler for a unified memory architecture like in the Nvidia
        | Jetson line. In turn, that would make a lot of what the article
        | says makes GPUs go brrr today moot.
        
       | lucidrains wrote:
       | would be interested to see thunderkittens (great name!) tackle
       | the flash attention backwards pass, which is an order of
       | magnitude harder than the forward
        
       | LordShredda wrote:
        | A Stanford research team just published an article with a wojak
        | in it. That by itself is bigger news than AI.
        
       | chefandy wrote:
       | One of my biggest struggles in doing AI stuff on consumer
       | hardware is heat. I noticed zero discussion of this so I assume
       | it's an implementation detail on small systems that doesn't
        | really factor into more robust setups. Is that really the case,
       | or is this just diving into the comp sci layer of hardware
       | utilization and ignoring things like heat because it's not
       | salient to this subtopic?
        
         | nostrebored wrote:
         | It factors into robust setups but is part and parcel of doing
         | any HPC where you're pushing through a ton of TFLOPS. It's a
         | problem that is assumed to have been solved when you're doing
         | this kind of work.
        
       | danjl wrote:
        | I bet traditional image processing would love to be implemented
        | in ThunderKittens.
        
       ___________________________________________________________________
       (page generated 2024-05-13 23:01 UTC)