[HN Gopher] GPUs Go Brrr
___________________________________________________________________
GPUs Go Brrr
Author : nmstoker
Score : 1013 points
Date : 2024-05-12 22:05 UTC (1 day ago)
(HTM) web link (hazyresearch.stanford.edu)
(TXT) w3m dump (hazyresearch.stanford.edu)
| nmstoker wrote:
| Some related material here too:
|
| https://twitter.com/bfspector/status/1789749117104894179?t=k...
| jauntywundrkind wrote:
| The ThunderKittens mascot has great kitten/Sony-Aibo vibes.
| Nicely generated, AI (I presume).
| https://github.com/HazyResearch/ThunderKittens
| layer8 wrote:
| It looks off because the head isn't centered on the neck.
| john_minsk wrote:
| Great attention to detail! I, like the parent, was surprised
| by the quality as well. However now I can't unsee it:-)
| Satam wrote:
| Easy fix: https://imgur.com/a/Ahwt6tr (although not sure
| which one is actually better)
| perfmode wrote:
| This article rekindles the joy I experienced during CS 149
| Parallel Programming.
| figbert wrote:
| Appreciate the recommendation, will check out the course!
| behnamoh wrote:
| tangential: When @sama talks about "Universal Basic Compute"
| (UBC) as a substitute for Universal Basic Income, obviously he
| means GPU, right? Who's going to benefit from such policies? Only
| nvidia? It just seems like such a dystopian future to live in: imagine
| you can sell your UBC to others who know better how to use it, or
| you can use it to mine bitcoin or whatever. But all the compute
| is actually created by one company.
|
| There are many reasons to hate nvidia, but honestly if this UBC
| policy is even remotely being considered in some circles, I'd
| join Linus Torvalds and say "nvidia, fuck you".
| jra101 wrote:
| You're blaming NVIDIA for Sam Altman's dumb idea?
| behnamoh wrote:
| nvidia's CEO literally keeps saying "the more GPUs you buy,
| the more you save"--it's hard to believe nvidia has nothing
| to do with such ideas.
| WanderPanda wrote:
| Him saying this always puts me off. Gives hard old sales-
| guy vibes. I really wonder who/which demographic is
| influenced in nvidia's favor by this rhetoric.
| coffeebeqn wrote:
| GPU CEO wants to sell more GPUs? What on earth
| WanderPanda wrote:
| One's "dumb idea" is another marketers "genius stroke". Seems
| like he is playing the media puppets while he can
| callalex wrote:
| You're looking for logic. The only logic is "when a sucker buys
| WorldCoin, sama bank account go brrrr".
|
| That's the whole logic.
| Animats wrote:
| " _And we ask: if your matrix multiply is smaller than 16x16, are
| you sure what you're doing is AI?_
|
| _From a philosophical point of view, we think a frame shift is
| in order. A "register" certainly shouldn't be a 32-bit word like
| on the CPUs of old. And a 1024-bit wide vector register, as CUDA
| uses, is certainly a step in the right direction. But to us a
| "register" is a 16x16 tile of data. We think AI wants this. "_
|
| The hardware needs of AI are starting to come into focus. GPUs,
| after all, were designed for an entirely different job. They're
| used for AI
| because they have good matrix multiply hardware. "AI GPUs" get to
| leave out some of the stuff in a real GPU (does an H100 even have
| texture fill units?). Then there's a trend towards much shorter
| numbers. 16 bit floating point? 8 bit? 2 bit? 1 bit? That will
| settle out at some point. This paper indicates that hardware that
| likes 16x16 tiles makes a lot of sense. It's certainly possible
| to build such hardware. Someone reading this is probably writing
| it in VHDL right now, or will be soon.
|
| Then we'll see somewhat simpler, less general, and cheaper
| devices that do "AI" operations with as little excess hardware
| baggage as possible. Nice.
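|
| For anyone who hasn't touched them: CUDA's warp-level wmma API
| already exposes more or less this 16x16-tile view of a
| "register". A minimal sketch (my own toy kernel, not from the
| article; fp16 inputs with fp32 accumulation, one warp owning
| one tile, needs sm_70 or newer):
|
|   #include <cuda_fp16.h>
|   #include <mma.h>
|   using namespace nvcuda;
|
|   // One warp computes a single 16x16 output tile on the
|   // tensor cores: D = A*B + C.
|   __global__ void tile_mma(const half *a, const half *b,
|                            const float *c, float *d) {
|     wmma::fragment<wmma::matrix_a, 16, 16, 16, half,
|                    wmma::row_major> a_frag;
|     wmma::fragment<wmma::matrix_b, 16, 16, 16, half,
|                    wmma::col_major> b_frag;
|     wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
|
|     wmma::load_matrix_sync(a_frag, a, 16);  // leading dim 16
|     wmma::load_matrix_sync(b_frag, b, 16);
|     wmma::load_matrix_sync(acc, c, 16, wmma::mem_row_major);
|
|     wmma::mma_sync(acc, a_frag, b_frag, acc);  // tensor core op
|
|     wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
|   }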
| mvkel wrote:
| Would you say this is ultimately "ASICs for AI"?
| dartos wrote:
| In the same way that CPUs are ASICs for integer operations,
| that makes sense to me.
| saagarjha wrote:
| Most CPUs do just fine on floating point too.
| actionfromafar wrote:
| I'm still getting used to that.
| dartos wrote:
| Floating point arithmetic _is_ integer arithmetic at the CPU
| level because of how floating point numbers work.
| fwip wrote:
| That's a good point - floating point operations are
| implemented with integer-math circuits (or at least can
| be - I'm not privy to how modern chip manufacturers
| implement them). E.g: your ALU may have an 11-bit adder
| specifically to add your f64 exponents.
|
| Some slides to get the gist of it:
| https://users.encs.concordia.ca/~asim/COEN_6501/Lecture_Note...
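|
| To make it concrete, here's a tiny sketch of my own (plain
| C/C++, nothing GPU-specific) that pulls the fields out of a
| double. A hardware FP multiplier does an integer add on those
| 11-bit exponents and an integer multiply on the mantissas,
| then renormalizes:
|
|   #include <cstdint>
|   #include <cstdio>
|   #include <cstring>
|
|   int main() {
|     double x = 6.25;                  // 1.5625 * 2^2
|     uint64_t bits;
|     std::memcpy(&bits, &x, sizeof bits);
|
|     uint64_t sign = bits >> 63;                 // 1 bit
|     uint64_t expo = (bits >> 52) & 0x7FF;       // 11 bits
|     uint64_t mant = bits & ((1ULL << 52) - 1);  // 52 bits
|
|     std::printf("sign=%llu exp=%lld mant=%#llx\n",
|                 (unsigned long long)sign,
|                 (long long)expo - 1023,   // remove the bias
|                 (unsigned long long)mant);
|   }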
| dvt wrote:
| > Then we'll see somewhat simpler, less general, and cheaper
| devices that do "AI" operations with as little excess hardware
| baggage as possible. Nice.
|
| Apple has already been doing this for a few years now. The NPU
| is totally different from the GPU or CPU on the die itself[1].
| Nvidia is likely working on this as well, but I think a device
| that's a gaming/entertainment/crypto/AI bundle (i.e. sticking
| with the video card) is probably a better business move.
|
| [1] https://github.com/hollance/neural-engine/blob/master/docs/a...
| talldayo wrote:
| The NPUs on a lot of different systems occupy an awkward
| spot. For extremely small models, they're the way to go for
| low-power inference. But once you reach LLM or vision
| transformer size, it makes a lot more sense to switch to GPU
| shaders for that extra bit of large-model performance. For
| stuff like Llama and Stable Diffusion, those Neural Engines
| are practically wasted silicon. The biggest saving grace is
| projects like ONNX attempting to sew them into a unified
| non-15-competing-standards API, but even that won't change
| how underpowered they are.
|
| Nvidia escapes this by designing their GPU architecture to
| incorporate NPU concepts at a fundamental level. It's less
| redundant silicon and enables you to scale a single
| architecture instead of flip-flopping to whichever one is
| most convenient.
| nxobject wrote:
| It's currently doable for Apple - I think their strategy is
| to slowly enhance iPhones, bit by bit, with special-purpose
| models for dealing with media like photo subject
| identification, OCR (in every language!), voice
| transcription, etc. Apple's currently learning from
| Microsoft's attempts to make AI stick everywhere.
| joquarky wrote:
| Soon our phones will dream beside us every night
| (integrating new data into our personal model while on
| the charger)
| serialx wrote:
| Well, iPhone already does that with photos. :)
| pierrefermat1 wrote:
| Do you have a link to where they breakdown what inference
| for photos happens in realtime vs overnight/charging?
| everforward wrote:
| I think Apple is more interested in features that work
| consistently than in giving power users the ability to
| play with essentially alpha or beta AI features.
|
| I would guess that their strategy is to not include
| powerful client-side hardware, and supplement that with
| some kind of "AiCloud" subscription to do the battery-
| draining, heat-generating stuff on their cloud. They're
| trading off their branding as a privacy focused company
| under the (probably correct) belief that people will be
| more willing to upload their data to iCloud's AI than
| Microsoft's.
|
| Fwiw, I think they're probably correct. It has always
| struck me as odd that people want to run AI on their
| phone. My impression of AI is that it creates very
| generalized solutions to problems that would be difficult
| to code, at the cost of being very compute inefficient.
|
| I don't really want code like that running on my phone;
| it's a poor platform for it. Thermal dissipation and form
| factor limit the available processing power, and
| batteries limit how long you can use the processing power
| you have. I don't really want to waste either trying to
| do subject identification locally. I'm going to upload
| the photos to iCloud anyways; let me pay an extra
| $1/month or whatever to have that identification happen
| in the cloud, on a server built for it that has data
| center thermal dissipation and is plugged into the wall.
| talldayo wrote:
| The pinch (as far as I can see it) is that you're right,
| and Apple can't sell a freestanding service to save their
| life. If we do get an AppleGPT pay-as-you-go service,
| it's certain to be extraordinarily censored and locked-
| down as the exclusive first-party option on iPhone. It
| will feature "vertical integration" that no other AI can
| have, alongside censorship so prudish that it would make
| Maury Povich gasp.
|
| So... I think users will be stuck. They'll want to run
| uncensored models on their phone, but Apple will want to
| keep them in the walled garden at any cost. It feels like
| the whole "Fortnite" situation all over again, where
| _users_ can agree they want something but Apple can't
| decide.
| unethical_ban wrote:
| > It has always struck me as odd that people want to run
| AI on their phone. My impression of AI is that it creates
| very generalized solutions to problems that would be
| difficult to code, at the cost of being very compute
| inefficient.
|
| I don't equate AI with coding. I want AI locally for
| photo sorting and album management, for general questions
| answering/list making that I use GPT for, and any number
| of other things.
|
| I try not to upload personal data to sites that aren't
| E2E encrypted, so iCloud/Google photos is a no-go.
| WhitneyLand wrote:
| Anyone checked out the NPU on the new iPad? It's supposed
| to be a bazillion times better according to Apple but I
| haven't had a chance to dig into the reality.
|
| I guess we can assume this is going to be what's used in
| what's being called Apple's first AI phone, iPhone 16.
| fassssst wrote:
| It has 38 TOPS of INT8 performance. Not very remarkable
| compared to consumer Nvidia GPUs, which are like one or
| two orders of magnitude faster.
| talldayo wrote:
| For reference, Nvidia's Jetson Orin NX robotics platform
| is 35-50 TOPS on average. Apple _is_ catching up, but
| Nvidia still has by-far the more flexible (and better
| scaled) platform.
| numpad0 wrote:
| That 38 TOPS figure was a bit weird; it's literally below the
| baseline (45 TOPS) for the "AI PC" branding
| Qualcomm/Intel/Microsoft is launching this June, and also
| 10x less than typical GPUs. I think it was just clever
| marketing exploiting the fact that the "AI PC" branding
| hasn't launched yet.
| eru wrote:
| And Google has their TPUs.
| yosefk wrote:
| For inference, Nvidia has had DLA since 2017-ish if I remember
| correctly, which is completely separate from the GPU.
| WanderPanda wrote:
| Wait but nvidia tensor-cores are exactly the hardware that
| likes 16x16 tiles, no? I thought that was the whole point? The
| hardware is already here and I'm sceptical if there is another
| order of magnitude in performance to be gained from even more
| specialized designs.
| wtallis wrote:
| What's the ratio of tensor cores to regular SIMD compute
| ("CUDA cores") on NVIDIA's current chips?
| creato wrote:
| This is in the article: if you aren't using the tensor
| cores, you aren't utilizing ~94% of the FLOPs available.
| wtallis wrote:
| Knowing what portion of the FLOPs are in the tensor cores
| isn't quite the right thing to be looking at. The key
| question is how much more tensor core performance can be
| gained by reducing or eliminating the die area devoted
| to non-tensor compute and higher precision arithmetic.
| Most of NVIDIA's GPUs are still designed primarily for
| graphics: they have some fixed function units that can be
| deleted in an AI-only chip, and a lot of die space
| devoted to non-tensor compute because the tensor cores
| don't naturally lend themselves to graphics work (though
| NVIDIA has spent years coming up with ways to not leave
| the tensor cores dark during graphics work, most notably
| DLSS).
|
| So the claims that NVIDIA's GPUs are already thoroughly
| optimized for AI and that there's no low-hanging fruit
| for further specialization don't seem too plausible,
| unless you're only talking about the part of the
| datacenter lineup that has already had nearly all fixed-
| function graphics hardware excised. And even for Hopper
| and Blackwell, there's some fat to be trimmed if you can
| narrow your requirements.
| incrudible wrote:
| There is not a lot of fixed function left in the modern
| graphics pipeline; economies of scale dictate that there
| is no net benefit in trimming it.
| wtallis wrote:
| And yet, even NVIDIA _does_ trim it from chips like the
| H100, which has no display outputs, RT cores, or video
| encoders (though they keep the decoders), and only has
| ROPs for two of the 72 TPCs.
| smallmancontrov wrote:
| Mind the Dark Silicon Fraction.
|
| Some fraction of your transistors MUST go unused on
| average or you melt the silicon. This was already a thing
| in the 20nm days and I'm sure it has only gotten worse.
| 100% TDP utilization might correspond to 60% device
| utilization.
| wtallis wrote:
| That's true for CPUs. Does it really apply to GPUs and
| other accelerators for embarrassingly parallel problems
| where going slower but wider is always a valid option?
| Sharlin wrote:
| On the H100 specifically. The figure is likely different
| on consumer cards.
| choppaface wrote:
| "NVidia's LIES..
|
| On kernels such as flash attention, TMA and the L2 cache are
| both fast enough so as to hide these problems reasonably well.
| But to make full use of the hardware, memory requests must
| be coalesced and bank conflicts avoided "
|
| The depth of the competition is also starting to become
| apparent. There's no way the documentation error was totally an
| accident. Diagrams are the easiest to steal / copy and there
| must have been some utility for nvidia to have left this in
| place. Remember when Naveen Rao's Nervana was writing NVidia
| Maxwell drivers that out-performed NVidia's own? Not every
| documentation mishap in a high-growth product is a competition
| counter-measure, but given that the researchers spent so long
| reverse-engineering wgmma and given the China-US political
| situation of the H100 in particular, it seems NVidia is up to
| its old tricks to protect its moat.
|
| So don't over-study the H100 peculiarities, as "what hardware
| does AI want?" really encompasses the commercial situation as
| well.
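|
| (For anyone newer to this, "coalesced" just means the 32
| threads of a warp touch consecutive addresses so the hardware
| can merge their loads into a few wide transactions. A toy
| sketch of my own, not from the post:)
|
|   // Good: thread i reads element i, so a warp's 32 loads
|   // fall in one or two 128-byte segments.
|   __global__ void copy_coalesced(const float *in, float *out,
|                                  int n) {
|     int i = blockIdx.x * blockDim.x + threadIdx.x;
|     if (i < n) out[i] = in[i];
|   }
|
|   // Bad: neighboring threads read addresses 'stride' floats
|   // apart, so the same warp needs many more transactions for
|   // the same amount of useful data.
|   __global__ void copy_strided(const float *in, float *out,
|                                int n, int stride) {
|     int i = blockIdx.x * blockDim.x + threadIdx.x;
|     if (i < n) out[i] = in[(i * stride) % n];
|   }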
| wiz21c wrote:
| I don't understand. If they document their stuff with errors,
| it will hurt users, be they Chinese or US? Or is it expected
| that US users will call Nvidia to ask for the correct
| documentation?
| acka wrote:
| It could be a case of classic market segmentation. The
| lower tier customers get the incomplete or error-ridden
| documentation, and the upper tier trusted
| customers^W'partners' get access to the juicy stuff:
| complete and mostly correct documentation, including stuff
| intentionally left out of the lower tier package like
| application notes containing secret hardware handshakes to
| unlock hidden features, all under strict NDA of course.
| choppaface wrote:
| The vast majority of users use NVidia's own kernels rather
| than optimizing their own. And those who do write custom
| kernels are typically not trying to compete with NVidia's own
| GEMMs.
| jiveturkey wrote:
| hasn't google been building such devices for a decade now?
| yayr wrote:
| yep, and the main engineers have founded groq.com with an
| architecture that, among other things, specifically addresses
| the memory management issues
| bcatanzaro wrote:
| GPUs have evolved to be AI machines with as little baggage as
| possible. People have been arguing GPUs were old technology and
| therefore unsuited for AI since at least 2014 (when Nervana was
| founded), but what they perhaps didn't expect is that the GPU
| would evolve so quickly to be an AI machine.
| celrod wrote:
| Bill Dally from Nvidia argues that there is "no gain in
| building a specialized accelerator", in part because current
| overhead on top of the arithmetic is in the ballpark of 20%
| (16% for IMMA and 22% for HMMA units)
| https://www.youtube.com/watch?v=gofI47kfD28
| AnthonyMouse wrote:
| There does seem to be a somewhat obvious advantage: If all
| it has to do is matrix multiplication and not every other
| thing a general purpose GPU has to be good at then it costs
| less to _design_. So now someone other than Nvidia or AMD
| can do it, and then very easily distinguish themselves by
| just sticking a ton of VRAM on it. Which is currently
| reserved for GPUs that are extraordinarily expensive, even
| though the extra VRAM doesn't cost a fraction of the price
| difference between those and an ordinary consumer GPU.
| bjornsing wrote:
| Exactly. And that means you not only save the 22% but
| also a large chunk of the Nvidia margin.
| Animats wrote:
| And, sure enough, there's a new AI chip from
| Intellifusion in China that's supposed to be 90% cheaper.
| 48 TOPS in int8 training performance for US$140.[1]
|
| [1] https://www.tomshardware.com/tech-industry/artificial-intell...
| pfdietz wrote:
| I wonder what the cost of power to run these chips is. If
| the power cost ends up being large compared to the
| hardware cost, it could make sense to buy more chips and
| run them when power is cheap. They could become a large
| source of dispatchable demand.
| papruapap wrote:
| I really hope we see an AI-PU (or under some other name,
| INT16PU, why not) for the consumer market sometime soon.
| Or be able to expand GPU memory using a PCIe socket
| (not sure if that's technically possible).
| hhsectech wrote:
| Isn't this what Resizable BAR and DirectStorage are
| for?
| PeterisP wrote:
| The whole point of GPU memory is that it's faster to
| access than going to memory (like your main RAM) through
| the PCIe bottleneck.
| throwaway4aday wrote:
| My uninformed question about this is why can't we make
| the VRAM on GPUs expandable? I know that you need to
| avoid having the data traverse some kind of bus that
| trades overhead for wide compatibility like PCIe but if
| you only want to use it for more RAM then can't you just
| add more sockets whose traces go directly to where
| they're needed? Even if it's only compatible with a
| specific type of chip it would seem worthwhile for the
| customer to buy a base GPU and add on however much VRAM
| they need. I've heard of people replacing existing RAM
| chips on their GPUs[0] so why can't this be built in as a
| socket like motherboards use for RAM and CPUs?
|
| [0] https://www.tomshardware.com/news/16gb-rtx-3070-mod
| carbotaniuman wrote:
| Replacing RAM chips on GPUs involves resoldering and
| similar things - those (for the most part) maintain the
| signal integrity and performance characteristics of the
| original RAM. Adding sockets complicates the signal path
| (iirc), so it's harder for the traces to go where they're
| needed, and realistically given a trade-off between
| speed/bandwidth and expandability I think the market goes
| with the former.
| giobox wrote:
| Expandable VRAM on GPUs has been tried before - the
| industry just hates it. It's like Apple devices - want
| more internal storage? Buy a new computer so we can have
| the fat margins.
|
| The original Rev A iMac in the late 90s had slotted memory
| for its ATI card, as one example - shipped with 2MB and
| could be upgraded to 6MB after the fact with a 4MB SGRAM
| DIMM. There are also a handful of more recent examples
| floating around.
|
| While I'm sure there are also packaging advantages to be
| had by directly soldering memory chips instead of
| slotting them etc, I strongly suspect the desire to keep
| buyers upgrading the whole card ($$$) every few years
| trumps this massively if you are a GPU vendor.
|
| Put another way, what's in it for the GPU vendor to offer
| memory slots? Possibly reduced revenue, if it became
| industry norm.
| Majromax wrote:
| Expansion has to answer one fundamental question: if
| you're likely to need more X tomorrow, why aren't you
| just buying it today?
|
| The answer to this question almost has to be "because it
| will be cheaper to buy it tomorrow." However, GPUs bundle
| together RAM and compute. If RAM is likely to be cheaper
| tomorrow, isn't compute also probably going to be
| cheaper?
|
| If both RAM _and_ compute are likely cheaper tomorrow,
| then the calculus still probably points towards a
| wholesale replacement. Why not run/train models twice as
| quickly alongside the RAM upgrades?
|
| > I strongly suspect the desire to keep buyers upgrading
| the whole card ($$$) every few years trumps this
| massively if you are a GPU vendor.
|
| Remember as well that expandable RAM doesn't unlock
| higher-bandwidth interconnects. If you could take the
| card from five years ago and load it up with 80 GB of
| VRAM, you'd still not see the memory bandwidth of a
| newly-bought H100.
|
| If instead you just need the VRAM and don't care much
| about bandwidth/latency, then it seems like you'd be
| better off using unified memory and having system RAM be
| the ultimate expansion.
| hellofellows wrote:
| hmm seems you're replying as a customer, but not as a GPU
| vendor...
|
| the thing is, there's not enough competition in the AI-
| GPU space.
|
| Currently the only option for not wasting time on running some
| random research project from github? Buy some card from
| nvidia. CUDA can run almost anything on github.
|
| AMD gpu cards? that really depends...
|
| and gamers often don't need more than 12GB of GPU ram
| for running games at 4K.. so most high-vram customers are
| in the AI field.
|
| > If you could take the card from five years ago and load
| it up with 80 GB of VRAM, you'd still not see the memory
| bandwidth of a newly-bought H100.
|
| this is exactly what nvidia will fight against tooth-and-
| nail -- if this is possible, its profit margin could be
| slashed to 1/2 or even 1/8
| AnthonyMouse wrote:
| > The answer to this question almost has to be "because
| it will be cheaper to buy it tomorrow."
|
| No, it doesn't. It could just as easily be "because I
| will have more money tomorrow." If faster compute is $300
| and more VRAM is $200 and I have $300 today and will have
| another $200 two years from now, I might very well like
| to buy the $300 compute unit and enjoy the faster compute
| for two years before I buy the extra VRAM, instead of
| waiting until I have $500 to buy both together.
|
| But for something which is already a modular component
| like a GPU it's mostly irrelevant. If you have $300 now
| then you buy the $300 GPU, then in two years when you
| have another $200 you sell the one you have for $200 and
| buy the one that costs $400, which is the same one that
| cost $500 two years ago.
|
| This is a much different situation than fully integrated
| systems because the latter have components that lose
| value at different _rates_, or that make sense to
| upgrade separately. You buy a $1000 tablet and then the
| battery goes flat and it doesn't have enough RAM, so you
| want to replace the battery and upgrade the RAM, but you
| can't. The battery is proprietary and discontinued and
| the RAM is soldered. So now even though that machine has
| a satisfactory CPU, storage, chassis, screen and power
| supply, which is still $700 worth of components, the
| machine is only worth $150 because nothing is modular and
| nobody wants it because it doesn't have enough RAM and
| the battery dies after 10 minutes.
| PeterisP wrote:
| Technically we definitely can, but are there sufficiently
| many people willing to pay a sufficiently high premium
| for that feature? How much more would you be willing to
| pay for an otherwise identical card that has the option
| to expand RAM, and do you expect that a significant
| portion of buyers would want to pay a non-trivial up-
| front cost for that possibility?
| throwaway48476 wrote:
| It's a minor technical challenge with no financial benefit
| for the GPU makers.
| rdsubhas wrote:
| Isn't that what NPUs are technically?
|
| https://en.m.wikipedia.org/wiki/AI_accelerator
| WithinReason wrote:
| Designing it is easy and always has been. Programming it
| is the bottleneck. Otherwise Nvidia wouldn't be in the
| lead.
| markhahn wrote:
| but programming it is "import pytorch" - nothing nvidia-
| specific there.
|
| the mass press is very impressed by Cuda, but at least if
| we're talking AI (and this article is, exclusively), it's
| not the right interface.
|
| and in fact, Nv's lead, if it exists, is because they
| pushed tensor hardware earlier.
| WithinReason wrote:
| I'm talking about adding Pytorch support for your special
| hardware.
|
| Nv's lead is due to them having Pytorch support.
| achierius wrote:
| Someone does, in fact, have to implement everything
| underneath that `import` call, and that work is _very_
| hard to do for things that don't closely match Nvidia's
| SIMT architecture. There's a reason people don't like
| using dataflow architectures, even though from a pure
| hardware PoV they're very powerful -- you can't map
| CUDA's, or Pytorch's, or Tensorflow's model of the world
| onto them.
| KaoruAoiShiho wrote:
| Eh if you're running in production you'll want something
| lower level and faster than pytorch.
| cma wrote:
| There are other operations for things like normalization
| in training, which is why most successful custom stuff
| has focused on inference I think. As architectures
| changed and needed various different things some custom
| built training hardware got obsoleted, Keller talked
| about that affecting Tesla's Dojo and making it less
| viable (they bought a huge nvidia cluster after it was
| up). I don't know if TPU ran into this, or they made
| enough iterations fast enough to keep adding what they
| needed as they needed it.
| muyuu wrote:
| it's going to be awkward in consumer hardware either way
|
| if you segregate AI units from the GPU, the thing is both AI
| and GPUs will continue to need massive amounts of matrix
| multiplication and as little memory latency as possible
|
| the move to have more of it wrapped in the GPU makes sense but
| at least in the short and medium term, most devices won't be
| able to justify the gargantuan silicon wafer space/die growth
| that this would entail - also currently Nvidia's tech is ahead
| and they don't make state of the art x86 or ARM CPUs
|
| for the time being I think the current paradigm makes the most
| sense, with small compute devices making inroads in the
| consumer markets as non-generalist computers - note that more
| AI-oriented pseudo-GPUs have existed and been successful since
| the earlier Nvidia Tesla lineup and then the so-called "Nvidia
| Data Center GPUs"
| rfoo wrote:
| > as little memory latency as possible
|
| Should be "as much memory bandwidth as possible". GPUs are
| designed to be (relatively) more insensitive to memory
| latency than CPU.
| muyuu wrote:
| yep that's true, although AI compute modules do get
| significant benefit from low latency cache as well
| FuriouslyAdrift wrote:
| AMD is already on their second generation of the Versal line.
|
| https://www.amd.com/en/products/accelerators/alveo/v80.html
|
| XDNA Architecture
|
| https://www.amd.com/en/technologies/xdna.html
| UncleOxidant wrote:
| > Then there's a trend towards much shorter numbers. 16 bit
| floating point? 8 bit? 2 bit? 1 bit?
|
| There was that recent paper titled "The Era of 1-bit LLMs" [0]
| which was actually suggesting a 1.58 bit LLM (2 bits in
| practice).
|
| > Someone reading this is probably writing it in VHDL right
| now, or will be soon.
|
| Yeah, I think I'm in the "will be soon" camp - FPGA board has
| been ordered. Especially with the 2-bit data types outlined in
| that paper [0] and more details in [1]. There's really a need
| for custom hardware to do that 2-bit math efficiently.
| Customizing one of the simpler open source RISC-V integer
| implementations seems like something to try here adding in the
| tiled matrix registers and custom instructions for dealing with
| them (with the 2 bit data types).
|
| [0] https://arxiv.org/abs/2402.17764 [1]
| https://github.com/microsoft/unilm/blob/master/bitnet/The-Er...
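|
| To make the hardware appeal concrete: the inner loop of a
| BitNet-style ternary matmul degenerates into add/subtract/skip;
| the multiplier disappears entirely. A plain C-style sketch of
| my own (the 2-bit encoding here is arbitrary, not the paper's):
|
|   #include <cstdint>
|
|   // Pack 16 ternary weights {-1, 0, +1} into one uint32_t:
|   // 00 -> 0, 01 -> +1, 10 -> -1 (11 unused).
|   uint32_t pack16(const int8_t *w) {
|     uint32_t p = 0;
|     for (int i = 0; i < 16; ++i) {
|       uint32_t code = (w[i] == 1) ? 1u : (w[i] == -1) ? 2u : 0u;
|       p |= code << (2 * i);
|     }
|     return p;
|   }
|
|   // Dot product of 16 packed weights with int8 activations.
|   int32_t dot16(uint32_t p, const int8_t *x) {
|     int32_t acc = 0;
|     for (int i = 0; i < 16; ++i) {
|       uint32_t code = (p >> (2 * i)) & 3u;
|       if (code == 1u)      acc += x[i];
|       else if (code == 2u) acc -= x[i];
|     }
|     return acc;
|   }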
| uyzstvqs wrote:
| What is needed are true NPUs as dedicated co-processors,
| especially for prosumer desktop systems (devs, other
| professionals, gamers). GPUs work in the enterprise, but they're
| a hassle to use for AI on the personal computing side of the
| market. Especially VRAM limitations, but also the lack of a
| standard open API other than Vulkan (again, using video stuff for
| AI).
| dartos wrote:
| Fwiw, Vulkan isn't specifically a graphics API and has had
| compute-specific features for a while now. (Potentially since
| its inception)
| the__alchemist wrote:
| Compared to CUDA, Vulkan is... not fun to code compute in!
| The serialization bridge and duplicating data structures and
| functions between CPU and GPU is tedious.
| dartos wrote:
| I hear both CUDA and Vulkan are not fun to code in.
|
| But yeah Vulkan is famously verbose. It takes about 1000
| LoC to draw a triangle
| KeplerBoy wrote:
| CUDA is very much fun to code in!
|
| Nvidia provides devs with great tools (Nsight Systems and
| Nsight Compute), so you know where you have to optimize.
| jokoon wrote:
| this is why people would do better to study neuroscience and
| psychology if they want to advance research in AI.
|
| also things related to graph topology in neural networks maybe,
| but probably not related to artificial NN.
|
| I was given this video, which I found was pretty interesting:
| https://www.youtube.com/watch?v=nkdZRBFtqSs (How Developers might
| stop worrying about AI taking software jobs and Learn to Profit
| from LLMs - YouTube)
| dartos wrote:
| I don't think psychology will have any bearing on AI.
|
| I doubt neuroscience will either, but I'm not as sure on that.
|
| The more impressive AI systems we have moved further away from
| the neuron analogy that came from perceptions.
|
| The whole "intelligence" and "neural" part of AI is a red
| herring imo. Really poor ambiguous word choice for a specific,
| technical idea.
| sva_ wrote:
| > I doubt neuroscience will either, but I'm not as sure on
| that
|
| The stuff on spiking networks and neuromorphic computing is
| definitely interesting and inspired by neuroscience, but it
| currently seems mostly like vaporware
| dartos wrote:
| Yep, I've heard about spiking networks, but haven't read
| into them much yet.
| fastball wrote:
| *perceptrons
| dartos wrote:
| Darn autocorrect. Thank you.
| actionfromafar wrote:
| Haha, I didn't get it when I read "perceptions". Thought
| ... of what? :-D
| nradov wrote:
| The question is whether current AI technologies represent any
| progress towards a true human equivalent artificial _general_
| intelligence. Most likely not, but no one knows for sure. If
| the answer turns out to be no then real progress will likely
| require theoretical insights from psychology, neuroscience,
| and other fields.
| dartos wrote:
| Fwiw, I don't think we're any closer to general
| intelligence than we were 5 years ago.
|
| Other than that, I agree, especially since you added "and
| other fields." Psychology might eventually give us a useful
| definition of "intelligence," so that'd be something.
|
| Obviously all research can influence other areas of
| research.
| Symmetry wrote:
| It's easy to overstate, but it shouldn't be understated
| either; as an example, solving learning problems in AI has
| provided insights into how dopamine works in brains.
|
| https://www.technologyreview.com/2020/01/15/130868/deepmind-...
|
| There are obvious, huge differences between what goes on in a
| computer and what happens in a brain. That neurons can't do
| backpropagation is a glaring one. But they do do something that
| ends up being analogous to back propagation and you can't
| tell _a priori_ whether some property of AI or neuroscience
| might be applicable to the other or not.
|
| The best way to learn about AI isn't to learn neuroscience.
| It's to learn AI. But if I were an AI lab I'd still hire
| someone to read neuroscience papers and check to see whether
| they might have something useful in them.
| renewiltord wrote:
| There are loads of psychologists and neuroscientists today. Has
| any of them in the last few years produced anything advancing
| AI? The proof of the pudding is in the eating, so if they have
| done so at a higher rate than people from straight
| CS/mathematics and related fields, then there's probably some
| truth to it.
| chmod775 wrote:
| I can't seem to figure out the connection between this comment
| and the article at hand, except that they're both about AI.
| WanderPanda wrote:
| Is this "just" CUTLASS in user friendly?
| phinnaeus wrote:
| FYI the caption of the "spirit animals" image says "canadian
| goose" instead of "Canada Goose".
| fastball wrote:
| Canadian goose seems better in [current year], to avoid
| confusion with the clothing brand.
| wglb wrote:
| An error too often made.
| downrightmike wrote:
| Don't worry, the Geese are en route to location, resolution
| incoming. Stand by.
| hoherd wrote:
| In my experience, Canadian geese are never en route to
| anywhere. They stay next to the pond year round and crap
| everywhere you might want to step. EG:
| https://sanjosespotlight.com/going-to-santa-clara-central-pa...
| adzm wrote:
| Likely a regional thing; they are consistently called Canadian
| Geese where I grew up and where I currently live.
| bombcar wrote:
| It's a Canada Goose from Canada. A Canadian Canada Goose, or
| Canadian Goose.
| gosub100 wrote:
| https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffal...
| xarope wrote:
| I am missing the reference to the canadian goose and the
| retriever puppy as spirit animals. Is that to say the H100 is
| an ornery thing, but the RTX4090 is friendly?
| Mtinie wrote:
| I'd assumed (like you) it meant that the H100 is ornery AND
| pickier about what it consumes, while the RTX4090 is playful
| and will eat damn near anything within reach of its mouth
| (with its sharp, velociraptor-like puppy teeth), whether you
| want it to or not.
|
| But that may be straining the meme somewhat. :)
| adrian_b wrote:
| I consider the English habit of also using nouns as
| adjectives a bad one, because it causes many ambiguities, some
| of which
| can be very annoying, even if they are a rich source of jokes
| and word plays.
|
| In most languages the use of a noun as an adjective is marked,
| by a particle or by an affix or at least by a different stress
| pattern (like moving the stress to the last syllable), which
| removes the ambiguities.
|
| So for most non-native speakers "Canadian goose" makes much
| more sense than "Canada goose" (which may feel like "Canada and
| a goose" or "a goose that is also Canada" and not like "a goose
| from Canada").
| actionfromafar wrote:
| Now you made me think of ways to English Adjective my text
| for word play... make it stop.
| kitd wrote:
| "Canada" isn't being used as an adjective though. The name of
| the species is "Canada Goose", like "Long Island Shellfish"
| or "Dublin Bay Prawns".
| p0w3n3d wrote:
| The former noun always describes the latter. A butterfly
| is not a flying butter (as my children's teacher joked to
| them about the word) but a fly made of butter
| instead.
| bn-l wrote:
| Who cares
| silisili wrote:
| In my entire lifetime, I've only heard people call them
| Canadian Geese.
|
| The only time I've ever even seen or heard of Canada
| Goose/Geese are people on the internet telling others they are
| wrong.
|
| I think it's time to just accept it as correct.
| FearNotDaniel wrote:
| Absolutely, it's like living in London and eventually having
| to accept that tourists will always say "Big Ben" when they
| mean the clock tower of the Palace of Westminster, which
| encloses the bell whose actual name is Big Ben. The name of
| the tower is, de facto, Big Ben, and life gets so much easier
| when you drop the urge to tell people they are wrong all the
| time...
|
| Edit: TIL the tower was properly renamed "Elizabeth Tower" in
| 2012 [0] but I seriously doubt a single person in the last 12
| years has ever used that name...
|
| [0] https://en.wikipedia.org/wiki/Big_Ben
| globular-toast wrote:
| I wouldn't put that in the same category. If you say Canada
| Goose everyone still knows what you mean. If you say
| Elizabeth Tower, they probably don't.
| hatthew wrote:
| In real life, I have only ever heard Canada Goose.
| apsec112 wrote:
| Interesting! Would this support fp8? Does anyone know how it
| would compare to Triton?
| renonce wrote:
| > NVIDIA's lies. This is an extraordinarily misleading
| representation of the actual 128b swizzled wgmma layout. This
| diagram cost us three weeks of life that we will not get back,
| hence the public shaming.
|
| Wondering if anyone would be surprised that a huge amount of
| progress in AI is on the engineering side (optimizing matmuls),
| and that a huge portion of the engineering is about reverse
| engineering NVIDIA chips
| DeathArrow wrote:
| Architecture doesn't make a difference. Big enough models
| trained with big enough data tend to give the same results
| regardless of architecture. So yes, most advances in AI are
| due to the fact that we can now multiply matrices very fast.
| elcomet wrote:
| That's not completely true. The architecture must behave well
| for scaling, which is not trivial. Basic multi-layer
| perceptrons, for example, do not scale well: the gradient will
| vanish or explode deeper in the network.
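|
| Roughly speaking (my own back-of-the-envelope, not a precise
| claim): for an L-layer MLP with h_l = f(W_l h_{l-1}), the
| backpropagated gradient contains a product of L Jacobians,
|
|   dh_L/dh_0 = prod_{l=1..L} diag(f'(z_l)) W_l
|
| so its norm scales roughly like s^L for a typical per-layer
| gain s: it vanishes for s < 1 and explodes for s > 1, which is
| what residual connections and normalization exist to tame.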
| 3abiton wrote:
| And data quality. Ensuring the sourcing and quality is very
| important to get a good model.
| fleischhauf wrote:
| This. If you have money to spend on improving your model,
| more training data is the first thing I'd take a look at.
| Tarrosion wrote:
| How do modern foundation models avoid multi-layer
| perceptron scaling issues? Don't they have big feed-forward
| components in addition to the transformers?
| heavenlyblue wrote:
| They don't do global optimisation of all layers at the
| same time, instead training all layers independently of
| each other.
| rfoo wrote:
| idk, they do give the same results, but given the memory
| bottleneck it feels like we are at a point when architecture
| innovations matter again, for example check out DeepSeek V2
| tech report, they modded model arch specifically for lower
| cost inference (by making k/v cache smaller)
| __loam wrote:
| Different architecture can result in hundreds of millions of
| dollars more in training costs no?
| latchkey wrote:
| Really impressed by the writing style of this post and very much
| looking forward to this on AMD MI300x. Let me know if you want
| some time on mine.
| jsemrau wrote:
| Really? It gives me PTSD from the Wallstreetbets days.
| forrestthewoods wrote:
| I also enjoyed the article's style. I utterly despise
| "academic paper speak". It is, imho, not the most effective
| style to communicate complex ideas. I find it so much easier
| to learn from a more casual "blog post" or in-person
| presentation over stiff, rigid academic speak.
| kaycey2022 wrote:
| I find both to be useful in different stages. The casual
| style is very helpful when starting out. But once I have
| put in a few weeks or months of study, then the rigor
| and preciseness of academic style is good as well.
|
| I agree with you in the sense that something has "died" in
| writings that follow academic paper speak these days. Just
| yesterday I saw an ancient article surfaced by Scientific
| American and Peter Norvig on System Analysis by Strachey.
| It uses quite a bit of formal language but is super
| approachable at the same time. That kind of skill is rarely
| seen these days.
| david927 wrote:
| > _the Wallstreetbets days._
|
| https://twitter.com/TheRoaringKitty/status/17900418133798504...
| globular-toast wrote:
| Good writing is clear and unambiguous. With speech there is an
| opportunity to interrupt and ask for clarification. Writing has
| one chance to get the message across. A reader shouldn't have
| to consult knowyourmeme.com to figure out what the heck the
| authors are trying to say. I don't even know what the title
| means here. That's how far they've missed the mark.
| _obviously wrote:
| Wow that really sucks for you. I just read it in 5 minutes
| and feel much more informed about the subject of nvidia
| memory twizzlization. It's kind of funny to me that
| presumably young college guys are writing in a style that's
| very readable for my old ass.
| unethical_ban wrote:
| >that really sucks for you
|
| How can I put this in your vernacular...
|
| "Most polite genZ meme enjoyer"
| aetimmes wrote:
| Even if you're not familiar with the "go brrr" meme (which is
| the only use of meme-idiom in the article and is used exactly
| twice), its meaning is easily inferred via context clues from
| the opening paragraphs.
|
| Good writing is also entertaining and engaging.
| globular-toast wrote:
| Keyword being _also_.
| throwaway1492 wrote:
| As someone who witnessed A-10 CAS fuck some stuff up in a
| combat zone ie the real "brrrrt" I've been mystified by the
| meme and current useage. No one knows where it comes from
| nor the slaughter it represents.
| aoeusnth1 wrote:
| You're mistaken, the "go brrr" format comes from the
| money printer meme in 2020.
| onemiketwelve wrote:
| as intense as an A-10 might be, it's short-lived and only
| affects a few dudes on the receiving end. When the
| federal reserve goes brrr, it has far reaching impact
| that affects every single person in the global economy.
|
| https://brrr.money/
| tracker1 wrote:
| Have you done much AI work against AMD products? I'm not going
| to plunk down $2500+ for an RTX 4090, but have been considering
| an RX 7900XTX for playing around with, or at least getting
| started. Just curious how well it will or won't work in
| practice, or if saving a bit more and getting a 7900 XT over
| the XTX might be a better option, and how much less vram might
| impact usefulness in practice.
| latchkey wrote:
| My only work with consumer AMD GPUs was mining ethereum, I
| had 150,000 of them.
|
| If you want to use enterprise AMD gpus, I'm renting them.
| That said, I haven't even had a chance to run/play with them
| myself yet, they have been rented since I got them last
| month.
|
| Yes, we are getting more.
| brcmthrowaway wrote:
| NVIDIA needs to be broken up
| huhlig wrote:
| Into what? Where would you draw such lines?
| robocat wrote:
| Into tiles ;-p
|
| GPU compute is already broken up - there is a supply chain of
| other cooperating players that work together to deliver GPU
| compute to end users:
|
| TSMC, SK hynix, Synopsys, cloud providers (Azure/Amazon
| etcetera), model providers (OpenAI/Anthropic etcetera).
|
| Why single out NVidia in the chain? Plus the different
| critical parts of the chain are in different jurisdictions.
| Split up NVidia and somebody else will take over that spot in
| the ecosystem. This interview with Synopsys is rather
| enlightening:
| https://www.acquired.fm/episodes/the-software-behind-silicon...
|
| How does the profit currently get split between the different
| links? Profit is the forcing variable for market cap and
| profit is the indicator of advantage. Break up NVidia and
| where does the profit move?
| latchkey wrote:
| The better alternative is to root for AMD and others to develop
| their own products so that regardless of breaking NV up or not,
| there are alternative solutions for people to use. They all
| leapfrog each other with new releases now anyway. Why put all
| your eggs in one basket?
| simondotau wrote:
| George Hotz went down the AMD rabbit hole for a while and
| concluded that the driver software -- more precisely the
| firmware which runs on the cards themselves -- is so badly
| written that there's no hope of them becoming serious
| contenders in AI without some major changes in AMD's
| priorities.
| latchkey wrote:
| I'm not defending their software. It does honestly have a
| ton of issues.
|
| George Hotz tried to get a consumer card to work. He also
| refused my public invitations to have free time on my
| enterprise cards, calling me an AMD shill.
|
| AMD listened and responded to him and gave him even the
| difficult things that he was demanding. He has the tools to
| make it work now and if he needs more, AMD already seems
| willing to give it. That is progress.
|
| To simply throw out George as the be-all and end-all of a
| $245B company... frankly absurd.
| shmerl wrote:
| Indeed, AMD willing to open firmware is something Nvidia
| never has done.
| creato wrote:
| The fact that consumer and "pro"(?) GPUs don't use
| (mostly) the same software is not confidence inspiring.
| It means that AMD's already apparently limited capacity
| for software development is stretched thinner than it
| otherwise would be.
|
| Also, if the consumer GPUs are hopelessly broken but the
| enterprise GPUs are fine, that greatly limits the number
| of people that can contribute to making the AMD AI
| software ecosystem better. How much of the utility of the
| NVIDIA software ecosystem comes from gaming GPU owners
| tinkering in their free time? Or grad students doing
| small scale research?
|
| I think these kinds of things are a big part of why
| NVIDIA's software is so much better than AMD right now.
| wruza wrote:
| _that greatly limits the number of people that can
| contribute to making the AMD AI software ecosystem
| better_
|
| I'd say it simply dials it down to zero. No one's gonna
| buy an enterprise AMD card for playing with AI, so no
| one's gonna contribute to that either. As a local AI
| enthusiast, this "but he used consumer card" complaint
| makes no sense to me.
| latchkey wrote:
| > _No one's gonna buy an enterprise AMD card for playing
| with AI_
|
| My hypothesis is that the buying mentality stems from the
| inability to rent. Hence, me opening up a rental
| business.
|
| Today, you can buy 7900's and they work with ROCm. As
| George pointed out, there are some low level issues with
| them, that AMD is working with him to resolve. That
| doesn't mean they absolutely don't work.
|
| https://rocm.docs.amd.com/projects/install-on-linux/en/lates...
| latchkey wrote:
| Agreed that AMD needs to work on the developer flywheel.
| Again, not defending their software.
|
| One way to improve the flywheel and make the ecosystem
| better, is to make their hardware available for rent.
| Something that previously was not available outside of
| hyperscalers and HPC.
| simondotau wrote:
| > To simply throw out George as the be-all and end-all of
| a $245B company... frankly absurd.
|
| I didn't do that, and I don't appreciate this misreading
| of my post. Please don't drag me into whatever drama
| is/was going on between you two.
|
| The only point I was making was that George's experience
| with AMD products reflected poorly on AMD software
| engineering circa 2023. Whether George is ultimately
| successful in convincing AMD to publicly release what he
| needs is beside the point. Whether he is ultimately
| successful convincing their GPUs to perform his
| expectations is beside the point.
| latchkey wrote:
| > _The only point I was making was that George's
| experience with AMD products reflected poorly on AMD
| software engineering circa 2023._
|
| Except that isn't the point you made...
|
| "there's no hope of them becoming serious contenders in
| AI without some major changes in AMD's priorities"
|
| My point in showing you (not dragging you into) the
| drama, is to tell you that George is not a credible
| witness for your beliefs.
| callalex wrote:
| Egohotz is brilliant in many ways, but taking him at his
| word when it comes to working with others has been a
| mistake since at least around 2010. This is well
| documented.
| simondotau wrote:
| Who said anything about taking him at his word?
| Everything he has done regarding AMD GPUs has been in
| public. I'm sure there are plenty of valid criticisms one
| can make of his skills/strategy/attitude/approach, but
| accusing him of being _generally_ untrustworthy in this
| endeavour is utterly nonsensical.
| imtringued wrote:
| I can reliably crash my system using kobold.cpp with
| Vulkan running an AMD GPU. All it takes is a slightly too
| high batch size.
| latchkey wrote:
| What is slightly too high of a batch size? If max size is
| 100 and you're at 99, of course 100 will crash it.
| PeterisP wrote:
| We've rooted for that for years, but looking at what AMD does
| and doesn't do, I've lost hope for this. AMD don't seem to
| want to do what it takes; it's not that they're trying and
| failing, but they're simply not even committing to attempt to
| do the same things that nVidia does for their software
| infrastructure.
| latchkey wrote:
| We are still early. I started my bet on Lisa Su around
| August of last year... she publicly doubled down on AI
| around October/November. Dec 6th, MI300x was announced.
|
| Big ships take time to course correct. Look at their hiring
| for AI related positions and release schedule for ROCm. As
| well as multiple companies like mine springing up to
| purchase MI300x and satisfy rental demand.
|
| It is only May. We didn't even receive our AIA's until
| April. Another company just announced their MI300x hardware
| server offering today.
| silveraxe93 wrote:
| NVIDIA is so damn good at its job that it took over the market.
| There are no regulatory or similar barriers to entry. It's
| literally that they do a damn good job and the competition
| can't be as good.
|
| You look at that and want to take a sledgehammer to a golden
| goose? I don't get these people
| michaelt wrote:
| True: nvidia has been consistently investing for over a
| decade.
|
| They saw there was nascent compute use of GPUs, using
| programmable shaders. They produced CUDA, made it accessible
| on every one of their GPUs (not just the high-markup
| professional products) and they put resources into it year
| after year after year.
|
| Not just investing in the product, also the support tools
| (e.g. a full graphical profiler for your kernels) and
| training materials (e.g. providing free cloud GPU credits for
| Udacity courses) and libraries and open source contributions.
|
| This is what it looks like when a company has a vision, plans
| beyond the next quarter, and makes long-term investments.
| diginova wrote:
| What should I do if I want to understand such articles
| completely? Where should I start on the roadmap?
| kolinko wrote:
| This is a good course on GPU programming. Around lesson 4.0
| you'll have the required basics:
| https://youtube.com/playlist?list=PLzn6LN6WhlN06hIOA_ge6Srgd...
|
| Also, write your own cuda kernel to do vector-matrix
| multiplication (if you use pycuda, you can focus on the kernel,
| and write everything else with python). Just tell chatgpt that
| you want to write your own implementation that multiplies a
| 4000-element vector by 4000x12000 matrix, and to guide you
| through the whole process.
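|
| A first working version of that kernel can be as small as the
| sketch below (my own naive version; all the tuning the course
| covers, like tiling and shared memory, comes later). With a
| row-major 4000 x 12000 matrix, one thread per output column
| keeps the global loads coalesced:
|
|   __global__ void vecmat(const float *x, const float *M,
|                          float *y, int rows, int cols) {
|     // y[j] = sum_i x[i] * M[i*cols + j]
|     int j = blockIdx.x * blockDim.x + threadIdx.x;
|     if (j >= cols) return;
|     float acc = 0.0f;
|     for (int i = 0; i < rows; ++i)
|       acc += x[i] * M[i * cols + j];
|     y[j] = acc;
|   }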
|
| For renting GPUs, RunPod is great - right now they have
| everything from lower-tier GPUs to H100s. You can start with a
| lesser GPU at the beginning.
| abstractcontrol wrote:
| For a deep dive, maybe take a look at the Spiral matrix
| multiplication playlist:
| https://www.youtube.com/playlist?list=PL04PGV4cTuIWT_NXvvZsn...
|
| I spent 2 months implementing a matmult kernel in Spiral and
| optimizing it.
| justplay wrote:
| sorry for the noob question, but how is GPU programming helpful?
| abstractcontrol wrote:
| NNs for example are (mostly) a sequence of matrix
| multiplication operations, and GPUs are very good at those.
| Much better than CPUs. AI is hot at the moment, and Nvidia
| is producing the kind of hardware that can run large models
| efficiently which is why it's a 2 trillion-dollar company
| right now.
|
| However, in the Spiral series, I aim to go beyond just
| making an ML library for running NN models and break new
| ground.
|
| Newer GPUs actually support dynamic memory allocation,
| recursion, and the GPU threads have their own stacks, so
| you could in fact treat them as sequential devices and
| write games and simulators directly on them. I think once I
| finish the NL Holdem game, I'll be able to get over 100-
| fold improvements by running the whole program on the GPU
| versus the old approach of writing the sequential part on a
| CPU and only using the GPU to accelerate a NN model
| powering the computer agents.
|
| I am not sure if this is a good answer, but this is how GPU
| programming would be helpful to me. It all comes down to
| performance.
|
| The problem with programming them is that the program you
| are trying to speed up needs to be specially structured, so
| it utilizes the full capacity of the device.
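|
| For the curious, the dynamic allocation and recursion are
| directly exposed in CUDA C++ today. A toy sketch of my own
| (nothing to do with Spiral itself): each thread recurses on
| its own stack and allocates from the device heap.
|
|   #include <cstdio>
|
|   // Recursion on a per-thread stack.
|   __device__ unsigned long long fact(unsigned int n) {
|     return n < 2 ? 1ull : n * fact(n - 1);
|   }
|
|   __global__ void demo() {
|     // Device-side heap allocation.
|     int *scratch = (int *)malloc(4 * sizeof(int));
|     if (scratch) {
|       scratch[0] = threadIdx.x;
|       printf("thread %d: 10! = %llu, scratch=%d\n",
|              threadIdx.x, fact(10), scratch[0]);
|       free(scratch);
|     }
|   }
|
|   int main() {
|     // Optional: give each thread a larger stack for recursion.
|     cudaDeviceSetLimit(cudaLimitStackSize, 4096);
|     demo<<<1, 4>>>();
|     cudaDeviceSynchronize();
|   }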
| selimthegrim wrote:
| Are Winograd's algorithms useful to implement as a learning
| exercise?
| abstractcontrol wrote:
| Never tried those, so I couldn't say. I guess it would.
|
| Even so, creating all the abstractions needed to implement
| even regular matrix multiplication in Spiral in a generic
| fashion took me two months, so I'd consider that good
| enough exercise.
|
| You could do it a lot faster by specializing for specific
| matrix sizes, like in the Cuda examples repo by Nvidia, but
| then you'd miss the opportunity to do the tensor magic that
| I did in the playlist.
| selimthegrim wrote:
| You are the author of the playlist/maker of the videos?
| DeathArrow wrote:
| So do their kernels and library also speed up RTX 4090?
| cl3misch wrote:
| > The unswizzled shared memory layouts suffer from very poor
| coalescing
|
| If I didn't know any better I'd consider it technobabble
| imiric wrote:
| Hasn't this research been done by teams building NPUs today? E.g.
| chips built by Groq use an architecture built specifically for
| AI, which is why they're able to deliver the performance they do.
| On the consumer side, Apple silicon is also quite capable.
|
| I'm not in this field at all, but it seems to me that using
| general purpose processors that communicate over (relatively)
| slow lanes can only get us so far. Rethinking the design at the
| hardware level, and eventually bringing the price down for the
| consumer market seems like a better long-term strategy.
| resource_waste wrote:
| >On the consumer side, Apple silicon is also quite capable.
|
| I am not sure that is true. A glance at (or long stay in) the
| reddit LocalLLaMA subreddit basically shows a bunch of
| frustrated CPU users trying their absolute best to get
| anything to work at useful speeds.
|
| When you can get an Nvidia GPU for a few hundred dollars or a
| full blown gaming laptop with a 4050 6gb vram for $900, its
| hard to call a CPU based AI capable.
|
| Heck we don't have GPUs at work, and CPU based is just not
| really reasonable without using tiny models and waiting. We
| ended up requesting GPU computers.
|
| I think there is a 'this is technically possible', and there is
| a 'this is really nice'. Nvidia has been really nice to use.
| CPU has been miserable and frustrating.
| imiric wrote:
| I don't think NVIDIA's reign will last long. The recent AI
| resurgence is not even a decade old. We can't expect the
| entire industry to shift overnight, but we are seeing rapid
| improvements in the capability of non-GPU hardware to run AI
| workloads. The architecture change has been instrumental for
| this, and Apple is well positioned to move the field forward,
| even if their current gen hardware is lacking compared to
| traditional GPUs. Their silicon is not even 5 years old, yet
| it's unbeatable for traditional workloads and power
| efficiency, and competitive for AI ones. What do you think it
| will be capable of in 5 years from now? Same for Groq, and
| other NPU manufacturers. Betting on NVIDIA doesn't seem like
| a good long-term strategy, unless they also shift their
| architecture.
| serialx wrote:
| Actually, llama.cpp running on Apple silicon uses the GPU
| (Metal Compute Shader) to run inference on LLM models. Token
| generation is also very memory-bandwidth bottlenecked. On
| high-end Apple silicon it's about 400GB/s to 800GB/s,
| comparable to the NVIDIA RTX 4090, which has a memory
| bandwidth of about 1000GB/s. Not to mention that Apple silicon
| has a unified memory architecture and offers high-memory
| models (128GB, up to 192GB), which is
| necessary to run large LLMs like Llama 3 70B, which roughly
| takes 40~75GB of RAM to work reasonably.
| resource_waste wrote:
| These are really nice rants of techno-blabble.
|
| The reality of things: Its not useful.
|
| No one actually uses it.
|
| You can post Apple's official tech specs, but it doesn't
| change that people aren't using it because it doesn't work
| (or at least isn't as cost effective).
|
| >Not to mention that Apple silicon has unified memory
| architecture and has high memory models (128GB, up to
| 192GB)
|
| This NEEDS to end. This integrated GPU nonsense is not
| equivalent and is disinformation. It is immoral to continue
| to push this narrative.
|
| Also, 128GB isnt high memory. 512GB is high memory.
| brrrrrm wrote:
| I use it all the time?
| imtringued wrote:
| The number of people running Llama 3 70B on Nvidia gaming
| GPUs is absolutely tiny. You're going to need at least two of
| the highest-end 24GB VRAM GPUs, and even then you are still
| reliant on 4-bit quantization with almost nothing left for
| your context window.
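|
| A minimal sketch of the arithmetic (ignoring the KV cache and
| runtime overhead, which only add to the total):
|
|     # Raw weight storage for a 70B-parameter model at various
|     # quantization levels (KV cache and overhead not included).
|     def weight_gb(params_b, bits_per_param):
|         return params_b * bits_per_param / 8
|
|     for bits in (16, 8, 4):
|         print(f"{bits}-bit: ~{weight_gb(70, bits):.0f} GB")
|     # 16-bit: ~140 GB, 8-bit: ~70 GB, 4-bit: ~35 GB, before
|     # the KV cache -- two 24GB cards are already tight.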
| roschdal wrote:
| ChatGPT - the largest electricity bill in the world.
| winternewt wrote:
| I believe that reducing the power consumption and increasing the
| speed of AI inference will be best served by switching to analog,
| approximate circuits. We don't need perfect floating-point
| multiplication and addition; we just need something that takes
| two input voltages and produces an output voltage that is close
| enough to what multiplying the input voltages would yield.
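|
| A toy model of what "close enough" could mean (the noise level
| here is purely an illustrative assumption, not a property of any
| real circuit):
|
|     import random
|
|     # approximate multiplier: true product plus relative noise
|     def analog_mul(a, b, rel_noise=0.01):
|         return a * b * (1 + random.gauss(0, rel_noise))
|
|     errs = [abs(analog_mul(0.7, 0.3) / 0.21 - 1)
|             for _ in range(10_000)]
|     print(sum(errs) / len(errs))  # mean relative error ~0.8%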
| brap wrote:
| I don't know why you're being downvoted; that's an active area
| of research, AFAIK.
| gitfan86 wrote:
| Maybe because that is a VERY different problem than the one
| discussed here.
|
| Building a single analog chip with 1 billion neurons would
| cost billions of dollars in a best-case scenario. An Nvidia
| card with 1 billion digital neurons costs in the hundreds of
| dollars.
|
| Those costs could come down eventually, but at that point
| CUDA may be long gone.
| brazzy wrote:
| Sounds pretty impossible to me to do that with a sufficient
| combination of range and precision.
| atoav wrote:
| What do you mean by impossible? You are aware that what radio
| equipment does is often the equivalent of analog operations
| like multiplication, addition, etc., just at high frequencies?
|
| Sure, accuracy is an issue, but this is not as impossible as
| you may think. The main question will be whether the benefits
| of going analog outweigh the issues arising from it.
| Symmetry wrote:
| In general the problem with analog is that every sequential
| operation introduces noise. If you're just doing a couple
| of multiplications to frequency shift a signal up and down,
| that's fine. But if you've got hundreds of steps, and you're
| also trying to pack huge numbers of parallel steps into a
| very small physical area, the errors compound quickly.
| dnedic wrote:
| How do you inspect what is happening then without having ADCs
| sampling every weight, taking up huge die area?
| jkaptur wrote:
| Maybe a silly question (I don't know anything about this) - how
| do you program / reprogram it?
| Arch485 wrote:
| Realistically, you'd train your model the same way it's done
| today and then custom-order analog ones with the weights
| programmed in. The advantage here would be faster inference
| (assuming analog circuits actually work out), but custom
| manufacturing circuits would only really work at scale.
|
| I don't think reprogrammable analog circuits would really be
| feasible, at least with today's tech. You'd need to modify
| the resistors etc. to make it work.
| rsp1984 wrote:
| TBH that sounds like a nightmare to debug.
| danielheath wrote:
| I know someone working in this direction; they've described the
| big challenges as:
|
| * Finding ways to use extant chip fab technology to produce
| something that can do analog logic. I've heard CMOS flash
| presented as a plausible option.
|
| * Designing something that isn't an antenna.
|
| * You would likely have to finetune your model for each
| physical chip you're running it on (the manufacturing
| tolerances aren't going to give exact results).
|
| The big advantage is that instead of using 16 wires to
| represent a float16, you use the voltage on 1 wire to represent
| that number (which plausibly has far more precision than a
| float32). Additionally, you can e.g. wire two values directly
| together rather than loading numbers into an ALU, so the die
| space & power savings are potentially many, many orders of
| magnitude.
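|
| Whether one wire really beats a float32 comes down to the noise
| floor. As a rough rule of thumb (the standard ADC ENOB formula;
| the SNR figures below are just assumed examples):
|
|     # effective bits carried by one analog value at a given
|     # signal-to-noise ratio (ADC ENOB rule of thumb)
|     def effective_bits(snr_db):
|         return (snr_db - 1.76) / 6.02
|
|     for snr in (40, 60, 80):
|         print(snr, "dB ->", round(effective_bits(snr), 1))
|     # 40 dB -> 6.4 bits, 60 dB -> 9.7, 80 dB -> 13.0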
| bobmcnamara wrote:
| > which plausibly has far more precision than a float32
|
| +/- 1e-45 to 3.4e38. Granted, roughly half of that range is
| between -1 and 1.
|
| When we worked with low power silicon, much of the
| optimization was running with minimal headroom - no point
| railing the bits 0/1 when .4/.6 will do just fine.
|
| > Additionally, you can e.g. wire two values directly
| together rather than loading numbers into an ALU
|
| You may want an adder. Wiring two circuit outputs directly
| together makes them fight, which is usually bad for signals.
| tasty_freeze wrote:
| > which plausibly has far more precision than a float32
|
| If that was true, then a DRAM cell could represent 32 bits
| instead of one bit. But the analog world is noisy and lossy,
| so you couldn't get anywhere near 32 bits of
| precision/accuracy.
|
| Yes, very carefully designed analog circuits can get over 20
| bits of precision, say A/D converters, but they are huge
| (relative to digital circuits), consume a lot of power, have
| low bandwidth as compared to GHz digital circuits, and
| require lots of shielding and power supply filtering.
|
| This is spit-balling, but the precision of the circuits you
| could create for a neural-network-type chip is certainly under
| 8 bits, maybe 6 bits. But it gets worse. Unlike digital
| circuits where signal can be copied losslessly, a chain of
| analog circuits compounds the noise and accuracy losses stage
| by stage. To make it work you'd need frequent requantization
| to prevent getting nothing but mud out.
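|
| A toy simulation of that compounding (the noise level, stage
| count, and grid size are all illustrative assumptions):
|
|     import random
|
|     # a value passes through a chain of noisy analog stages;
|     # requantizing to a coarse grid between stages absorbs the
|     # noise before it can accumulate
|     def chain(x, stages=100, rel_noise=0.01, levels=None):
|         for _ in range(stages):
|             x *= 1 + random.gauss(0, rel_noise)
|             if levels:
|                 x = round(x * levels) / levels
|         return x
|
|     print(chain(0.5))             # typically a few percent off
|     print(chain(0.5, levels=16))  # snaps back to 0.5 each stage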
| cptroot wrote:
| Here's an example of Veritasium talking about this from 2022:
| https://www.youtube.com/watch?v=GVsUOuSjvcg
| Symmetry wrote:
| I think we're far away from analog circuits being practically
| useful, but one place where we might embrace the tolerance
| for imprecision is in noisy digital circuits: accepting that
| one in a million, say, bits in an output will be flipped to
| achieve a better performance/power ratio. Probably not when
| working with float32s, where a single infinity[1] could totally
| mess things up, but for int8s the occasional 128 when you wanted
| a 0 seems like something that should be tolerable.
|
| [1] Are H100s' matrix floating point units actually IEEE 754
| compliant? I don't actually know.
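|
| A toy sense of scale for the int8 case (the flip rate, value
| ranges, and vector length are illustrative assumptions):
|
|     import random
|
|     # int8-style dot product where each partial product has a
|     # tiny chance of a flipped high bit (worth +/-128)
|     def noisy_dot(a, b, flip_rate=1e-6):
|         total = 0
|         for x, y in zip(a, b):
|             p = x * y
|             if random.random() < flip_rate:
|                 p ^= 1 << 7
|             total += p
|         return total
|
|     a = [random.randint(-8, 8) for _ in range(4096)]
|     b = [random.randint(-8, 8) for _ in range(4096)]
|     print(noisy_dot(a, b) - sum(x * y for x, y in zip(a, b)))
|     # almost always 0; very rarely off by about 128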
| _spl wrote:
| It reminds me of when I first read about superscalar CPU
| architecture and was amazed. GPUs are really next level.
| DeathArrow wrote:
| It would be nice if such improvements found their way into
| PyTorch and scikit-learn.
| kmacdough wrote:
| I'm sure they will. Right now, though, it's bleeding edge, and
| it'll take some time for these ideas to mature and be adapted
| to the particular idioms of these more stable packages.
| bombela wrote:
| I cannot tell for sure if the units are really all powers of 10.
|
| I found a datasheet that states 80GB of VRAM, and a BAR of
| 80GiB. All caches are also in powers of two. The bandwidths are
| all powers of 10, though.
|
| https://www.nvidia.com/content/dam/en-zz/Solutions/gtcs22/da...
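|
| For reference, the two unit systems differ by about 7% at this
| size (just the conversion arithmetic, no claim about what the
| card actually ships with):
|
|     GB, GiB = 10**9, 2**30   # decimal vs binary units
|
|     print(80 * GB / GiB)     # 80 GB  is ~74.5 GiB
|     print(80 * GiB / GB)     # 80 GiB is ~85.9 GB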
| joaquincabezas wrote:
| wow, the graphs in the GitHub README
| (https://github.com/HazyResearch/ThunderKittens/blob/main/att...)
| make me extremely dizzy. Are these wavy bars even legal? :P
| bogtog wrote:
| I second this. It's like they're trying to incorporate some
| optical illusion. I'd even prefer just seeing numbers without
| any bars.
| hoosieree wrote:
| It looks like the xkcd theme for matplotlib[1]. But I agree the
| waves are too extreme.
|
| [1]:
| https://matplotlib.org/stable/gallery/showcase/xkcd.html#sph...
| badgersnake wrote:
| That's the whole point: VCs invested heavily in GPUs anticipating
| a crypto boom, and when that never happened they had to find some
| other snake oil to peddle that happened to require GPUs.
| verbify wrote:
| My experience is that when crypto was in the news, my non-
| technical friends, family, and colleagues would ask me what
| bitcoin is and were generally confused.
|
| My experience with the AI boom couldn't be more different -
| everyone from my colleagues to my mum are using chatgpt as a
| daily tool.
|
| I really don't think that AI and crypto are comparable in terms
| of their current practical usage.
| kkielhofner wrote:
| Comparing crypto and AI is really tired and you make the best
| point - real people are using these GPUs to actually do
| things of value and improve their daily lives.
|
| At the peak of the crypto boom/hype cycle I took on a little
| project to look at the top 10 blockchain
| networks/coins/whatever.
|
| From what I could tell a very, very, very generous estimate
| is that crypto at best has MAUs in the low tens of millions.
|
| ChatGPT alone got to 100 million MAUs within a year of
| release and has only grown since.
|
| ChatGPT 10x'd actual real world usage of GPUs (and resulting
| power and other resources) in a year vs ~15 years for crypto.
|
| > I really don't think that AI and crypto are comparable in
| terms of their current practical usage.
|
| A massive understatement!
| latchkey wrote:
| GPUs stopped being used for crypto because Ethereum
| switched from PoW to PoS, and that decimated the whole GPU
| mining industry. Ethereum was the only profitable thing to
| mine that also had a use case. The rest of the chains
| dumped in price and became unprofitable to mine at scale:
| not enough market depth to unload the tokens at scale.
|
| In other words, it has nothing to do with AI.
| adra wrote:
| Wow, what a difference in perspective. I've met maybe a few
| people, period, who have at least mentioned that they've ever
| used AI tools in their personal lives, frequency be
| damned. Maybe you're just a lot more insistent on weaving
| questions about AI tools into daily conversation.
|
| In a work setting at a tech company, there seems to be a
| handful who are very in love with AI, a bunch who use it
| here or there, and a large majority who (at least
| publicly) don't even use it. It'd be interesting to see
| what company-enforced spyware would say about AI uptake,
| though, for real.
| panki27 wrote:
| Warp scheduler, 4 quadrants, tensor memory accelerator,
| unswizzled wgmma layouts...
|
| The line between GPU lingo and Star Trek technobabble fades away
| further and further.
| Agentlien wrote:
| Your comment prompted me to take a step back and look at these
| terms with new eyes. That made me smile, because you're so
| right.
| araes wrote:
| There was some awareness of this reading the article, yet
| "we're warping through the quadrant in our tensor accelerator"
| is pretty Trek.
|
| I've had that thought occasionally with some of the other
| articles: what it must read like to somebody who follows a
| ref link for an article over here and wanders into some Trek
| nerd convention discussing warp cores.
| winwang wrote:
| I mean, if we're talking about "accelerating by modifying the
| metric tensor" then yeah, that would be pretty sci-fi :)
|
| https://en.wikipedia.org/wiki/Metric_tensor_(general_relativ.
| ..
| weinzierl wrote:
| _" For this post, we're going to focus on the NVIDIA H100 [...
| because] we think the trends it implies are going to continue in
| future generations, and probably from other manufacturers, too."_
|
| Is it though? Wouldn't we expect to see more advanced packaging
| technology eventually?
|
| If that happens, the increased memory bandwidth could be an
| enabler for a unified memory architecture like in the Nvidia
| Jetson line. In turn, that would make a lot of what the
| article says makes GPUs go brrr today moot.
| lucidrains wrote:
| would be interested to see ThunderKittens (great name!) tackle
| the flash attention backwards pass, which is an order of
| magnitude harder than the forward pass
| LordShredda wrote:
| A Stanford research team just published an article with a wojak
| in it. That by itself is bigger news than AI.
| chefandy wrote:
| One of my biggest struggles in doing AI stuff on consumer
| hardware is heat. I noticed zero discussion of this so I assume
| it's an implementation detail on small systems that doesn't
| really factor into more robust setups. Is that really the case,
| or is this just diving into the comp sci layer of hardware
| utilization and ignoring things like heat because it's not
| salient to this subtopic?
| nostrebored wrote:
| It factors into robust setups but is part and parcel of doing
| any HPC where you're pushing through a ton of TFLOPS. It's a
| problem that is assumed to have been solved when you're doing
| this kind of work.
| danjl wrote:
| I bet traditional image processing would love to be implemented
| in ThunderKittens.
___________________________________________________________________
(page generated 2024-05-13 23:01 UTC)