[HN Gopher] Grace Hopper, Nvidia's Halfway APU
       ___________________________________________________________________
        
       Grace Hopper, Nvidia's Halfway APU
        
       Author : PaulHoule
       Score  : 118 points
       Date   : 2024-08-09 22:52 UTC (1 days ago)
        
 (HTM) web link (chipsandcheese.com)
 (TXT) w3m dump (chipsandcheese.com)
        
       | tedunangst wrote:
       | Irrelevant, but the intro reminded me that nvidia also used to
       | dabble in chipsets like nforce, back when there was supplier
       | variety in such.
        
         | jauntywundrkind wrote:
         | SoundStorm vs Dolby is such a turning point story. Nvidia had a
         | 5 billion op/s DSP and Dolby digital encoding on that chipset.
         | Computers were coming into their own as powerful universal
         | systems that could do anything.
         | 
         | Then Dolby cancelled the license. To this day you still need
         | very fancy sound cards or exotic motherboards to be able to
         | output good surround sound to a large number of av receivers.
         | There are some open DTS standards that Linux can do too, dunno
         | about windows/Mac.
         | 
          | But it just felt like we slid so far down after Dolby went and
          | made everything so much worse.
         | 
         | (Media software can do Dolby pass-through to let the high
         | quality sound files through, yes. But this means you can't do
         | any effect processing, like audio normalization/compression for
          | example. And if you are playing games your amp may be getting
          | only basic low quality surround sound, not the good many-
          | channel stuff.)
        
           | throwaway81523 wrote:
           | Do you mean AC3? Ffmpeg has been able to do that since
           | forever.
           | 
           | https://en.wikipedia.org/wiki/Dolby_Digital
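            | 
            | For concreteness, a minimal sketch (hypothetical filenames)
            | of encoding a 5.1 track to 640 kbit/s AC3 with ffmpeg's
            | built-in encoder, here driven from Python:
            | 
            |   import subprocess
            | 
            |   # Hypothetical filenames; ffmpeg's native "ac3" encoder,
            |   # 640 kbit/s, 6 channels (5.1).
            |   subprocess.run([
            |       "ffmpeg", "-i", "input_5.1.wav",
            |       "-c:a", "ac3", "-b:a", "640k", "-ac", "6",
            |       "output.ac3",
            |   ], check=True)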
        
             | jauntywundrkind wrote:
              | There's some debate about what patents apply, but even Dolby
             | had to admit defeat as of 2017. So yes, a 640kbit/s 6
             | channel format is available for encoding on ffmpeg & some
             | others.
             | 
             | I don't know if games are smart enough to use this?
             | 
             | It also feels like a very low bar. It's not awful bitrate
             | for 6 channels but neither is it great. It's not a pitiful
             | number of channels but again neither is it great.
             | 
             | Last & most crucially, just because one piece of software
             | can emit ac3 doesn't make it particularly useful for a
             | system. I should be able to have multiple different apps
             | doing surround sound, sending notifications to back
             | channels or panning sounds as I prefer. Yes ffmpeg can
             | encode 5.1 media audio to an AVR but that doesn't really
             | substitute for an actual surround system.
             | 
             | This is more a software problem, now that the 5.1 AC3
             | patents are expired. And there have been some stacks in the
             | past where this worked on Linux for example. But it seems
             | like modern hardware (with a Sound Open Firmware) has
             | changed a bit and PipeWire needs to come up with a new way
             | of doing ac3/a52 encoding. https://gitlab.freedesktop.org/p
             | ipewire/pipewire/-/issues/32...
        
               | ssl-3 wrote:
               | I once went down a rabbit hole of trying to get realtime
               | AC3 encoding on my desktop PC, and I broadly failed.
               | 
               | That was a long time ago. It is now 2024.
               | 
               | Do we still need that today? For modern AVRs we have
               | HDMI, with 8 channels worth of up to 24bit 192kHz
               | lossless digital audio baked in.
               | 
               | For old AVRs with multichannel analog inputs,
               | motherboards with 6 or 8 channels of built-in audio are
               | still common-enough, as are separate sound cards with
               | similar functionality.
               | 
               | What's the advantage of realtime AC3 encoding today, do
               | you suppose?
        
               | throwaway81523 wrote:
               | One reason to want Dolby encoding is to play back on your
                | consumer home theater gear that decodes it. Alternatively
               | though, just don't use that kind of gear.
        
           | izacus wrote:
            | I'm a bit confused about your last paragraph - what's low
           | quality about Dolby Atmos / DTS:X output you get for games
           | these days?
        
         | MegaDeKay wrote:
         | One place you'll find said chipset is in the OG XBox, where
         | they provided the Southbridge "MCPX" chip as well as the GPU.
         | 
         | https://classic.copetti.org/writings/consoles/xbox/#io
        
         | m463 wrote:
         | I think that stopped when intel said nvidia couldn't produce
         | chipsets for some cpu architecture they were coming out with.
         | 
         | I don't know if this was market savvy or a footshoot that made
         | their ecosystem weaker.
        
           | wtallis wrote:
           | The transition point was when Intel moved the DRAM controller
           | and PCIe root complex onto the CPU die, merging in the
           | northbridge and leaving the southbridge as the only separate
           | part of the chipset. The disappearance of the Front Side Bus
           | meant Intel platforms no longer had a good place for an
           | integrated GPU other than on the CPU package itself, and it
           | was years before Intel's iGPUs caught up to the Nvidia 9400M
           | iGPU.
           | 
           | In principle, Nvidia could have made chipsets for Intel's
           | newer platforms where the southbridge connects to the CPU
           | over what is essentially four lanes of PCIe, but Intel locked
           | out third parties from that market. But there wasn't much
           | room for Nvidia to provide any significant advantages over
           | Intel's own chipsets, except perhaps by undercutting some of
           | Intel's product segmentation.
           | 
           | (On the AMD side, the DRAM controller was on the CPU starting
           | in 2003, but there was still a separate northbridge for
           | providing AGP/PCIe, with a relatively high-speed
           | HyperTransport link to the CPU. AMD dropped HT starting with
           | their APUs in 2011 and the rest of the desktop processors
           | starting with the introduction of the Ryzen family.)
        
             | whaleofatw2022 wrote:
             | The argument was before that transition.
             | 
             | AFAIR the contentious point was that Nvidia had a license
             | to the bus for P6 arch (by virtue of Xbox) but did not have
             | a license for the P4 bus.
             | 
             | AMD was also more than happy to have NVDA build chipsets
             | for Hammer/etc especially due to them not having a video
             | core... -at the time-.
             | 
             | Once the AMD/ATI merger started, that was the real writing
             | on the wall.
        
       | MobiusHorizons wrote:
        | I am really surprised to see that the performance of the CPU,
        | and especially its latency characteristics, is so poor. The
        | article alludes to the design likely being tuned for specific
        | workloads, which seems like a good explanation. But I can't
        | help but wonder if
       | throughput at the cost of high memory latency is just not a good
       | strategy for CPUs even with the excellent branch predictors and
       | clever OOO work that modern CPUs bring to the table. Is this a
       | bad take? Are we just not seeing the intended use-case where this
       | thing really shines compared to anything else?
        
         | p1necone wrote:
         | This kind of hardware makes sense for video games, and I guess
         | GPU heavy workloads like AI might be similar? Most games have
         | middling compute requirements but will take as much GPU power
         | as you can give them if you're trying to run at high
         | resolutions/settings. Although getting smooth gameplay at very
         | high frame rates (~120hz+) does need a decent CPU in a lot of
         | games.
         | 
         | Look at how atrocious the CPUs were in the PS4/Xbone generation
         | for an example of this.
        
           | wmf wrote:
           | Grace Hopper was not designed for games though.
        
           | pjmlp wrote:
            | And yet the PS4 / Xbox One still rule the games console
            | market, because for a large market segment more polygons
            | alone isn't worth buying a PS5 or Xbox Series, hence the
            | declining sales and the attempts to cater to PC gamers as an
            | alternative.
        
         | edward28 wrote:
         | These CPUs are intended just to run miscellaneous tasks, such
         | as loading AI models or running the cluster operating system.
         | They don't need to be performant, just efficient, as the GPU
          | does all the heavy lifting. NVIDIA also provides an option to
          | swap the Grace chips out for an x86 chip, which could deliver
          | better performance depending on the remaining power budget.
        
           | MobiusHorizons wrote:
            | If this is all there is to it, why do they have the high
            | frequency and large L3 cache? Those seem to be optimizing
            | for something, not just a "good enough" configuration for a
            | part that is not the bottleneck.
        
             | riotnrrd wrote:
             | Data augmentation in CPU-space is often compute-light, but
             | requires rapid access to memory. There are libraries (like
             | NVIDIA's Dali) that can do augmentation on the GPU, but
             | this takes up GPU resources that could be used by training.
             | Having a multi-core CPU with fast caches is a good
             | compromise.
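              | 
              | A minimal sketch of that compromise with DALI (paths and
              | sizes are made up; assumes the nvidia-dali package is
              | installed):
              | 
              |   from nvidia.dali import pipeline_def, fn
              | 
              |   @pipeline_def
              |   def augment():
              |       jpegs, labels = fn.readers.file(
              |           file_root="/data/train", random_shuffle=True)
              |       # device="cpu" keeps decode + augmentation on the
              |       # Grace cores; device="mixed" would push the decode
              |       # onto the GPU, eating into training resources.
              |       images = fn.decoders.image(jpegs, device="cpu")
              |       images = fn.resize(images, resize_x=224,
              |                          resize_y=224)
              |       return images, labels
              | 
              |   pipe = augment(batch_size=64, num_threads=16,
              |                  device_id=0)
              |   pipe.build()
              |   images, labels = pipe.run()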
        
         | tonyarkles wrote:
         | We're using the Orin AGX for edge ML. Not the same setup
         | (Ampere) but it's a similar situation. The GPU is excellent for
         | what we need it to do, but the CPU cores are painful. We're
         | lucky... the CPUs aren't great but there's 12 of them and we
         | can get away with carefully pipelining our data flows across
         | multiple threads to get the throughput we need even though some
         | individual stage latencies aren't what we'd like.
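          | 
          | A stripped-down sketch of that pattern (stage names, queue
          | depths and frame counts are made up): each stage runs on its
          | own thread and hands work to the next through a bounded
          | queue, so per-stage latencies overlap instead of adding up.
          | 
          |   import queue, threading
          | 
          |   def stage(work, q_in, q_out):
          |       # Pull, process, push; None signals shutdown.
          |       while (item := q_in.get()) is not None:
          |           q_out.put(work(item))
          |       q_out.put(None)
          | 
          |   def decode(frame): return frame   # placeholder work
          |   def infer(frame):  return frame   # placeholder work
          | 
          |   q0 = queue.Queue(maxsize=4)   # camera -> decode
          |   q1 = queue.Queue(maxsize=4)   # decode -> infer
          |   q2 = queue.Queue()            # results, drained below
          |   threading.Thread(target=stage,
          |                    args=(decode, q0, q1)).start()
          |   threading.Thread(target=stage,
          |                    args=(infer, q1, q2)).start()
          | 
          |   for frame in range(100):      # pretend camera frames
          |       q0.put(frame)
          |   q0.put(None)
          |   while q2.get() is not None:
          |       pass                      # consume results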
        
         | freeqaz wrote:
         | What's the point of having the GPU on die for this? Are they
         | expecting people to deploy one of these nodes without dedicated
         | GPUs? It has a ton of NVLink connections which makes me think
         | that these will often be deployed alongside GPUs which feels
         | weird.
         | 
         | The flip side of this is if the GPU can access the main system
         | memory then I could see this being useful for loading big
         | models with much more efficient "offloading" of layers. Even
         | though bandwidth between GPU->LPDDR5 is going to be slow, it's
         | still faster than what traditional PCI-E would allow.
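          | 
          | Back-of-the-envelope, with rough from-memory link speeds
          | (treat the numbers as assumptions, not specs):
          | 
          |   # Approximate one-direction bandwidths in GB/s.
          |   PCIE5_X16  = 64    # PCIe 5.0 x16
          |   NVLINK_C2C = 450   # GH200 CPU<->GPU NVLink-C2C
          |   layer_gb   = 3.5   # hypothetical offloaded layer size
          | 
          |   for name, bw in [("PCIe 5.0 x16", PCIE5_X16),
          |                    ("NVLink-C2C  ", NVLINK_C2C)]:
          |       ms = layer_gb / bw * 1000
          |       print(f"{name}: ~{ms:.1f} ms per layer")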
         | 
         | The caveat here is that I imagine these machines are $$$ and
         | enterprise only. If something like this was brought to the
         | consumer market though I think it would be very enticing.
         | 
         | (If anybody from AMD is reading this, I feel like an
         | architecture like this would be awesome to have. I would love
         | to run Llama 3.1 405b at home and today I see zero path towards
         | doing that for any "reasonable" amount of money (<$10k?).)
         | 
         | Edit: It's at the bottom of the article. These are designed to
         | be meshed together via NVLink into one big cluster.
         | 
         | Makes sense. I'm really curious how the system RAM would be
         | used in LLM training scenarios, or if these boxes are going to
         | be used for totally different tasks that I have little context
         | into.
        
       | dagmx wrote:
        | The article talks about the difference in the prefetcher between
        | the two Neoverse setups (Graviton and Grace Hopper). However,
        | isn't the prefetcher part of the core design in Neoverse? How
        | would they differ?
        
         | MobiusHorizons wrote:
          | I believe the difference is in the cache hierarchy (more L3,
          | less L2) and the generally high latency to DRAM, and even
          | higher latency to HBM. This makes the prefetcher behave
          | differently between the two implementations, because the L2
          | cache isn't able to absorb the latency.
        
           | dagmx wrote:
           | That was my initial read but they have this line which made
           | me wonder if it was somehow more than that
           | 
           | > I suspect Grace has a very aggressive prefetcher willing to
           | queue up a ton of outstanding requests from a single core.
        
             | MobiusHorizons wrote:
             | Oh good point, maybe that is configurable as well.
        
       | rkwasny wrote:
       | Yeah so I also benchmarked GH200 yesterday and I am also a bit
       | puzzled TBH:
       | 
       | https://github.com/mag-/gpu_benchmark
        
         | adrian_b wrote:
         | I suggest that wherever you write "TFLOPS", you should also
         | write the data type for which they were measured.
         | 
         | Without knowing whether the operations have been performed on
         | FP32 or on FP16 or on another data type, all the numbers
         | written on that page are meaningless.
        
       | bmacho wrote:
       | > The first signs of trouble appeared when vi, a simple text
       | editor, took more than several seconds to load.
       | 
       | Can it run vi?
        
       | erulabs wrote:
       | If AI remains in the cloud, nvidia wins. But I can't help but
       | think that if AI becomes "self-hosted", if we return to a world
        | where people own their own machines, AMD's APUs and interconnect
        | technology will be absolutely dominant. Training may still be
        | Nvidia's wheelhouse, but for a single device able to do all the
       | things (inference, rendering, and computing), AMD, at least
       | currently, would seem to be the winner. I'd love someone more
       | knowledgeable in AI scaling to correct me here though.
       | 
       | Maybe that's all far enough afield to make the current state of
       | things irrelevant?
        
         | teaearlgraycold wrote:
         | You need orders of magnitude more compute for training than for
         | inference. Nvidia still wins in your scenario.
         | 
          | Currently rendering and local GPGPU compute are Nvidia
          | dominated, and I don't see AMD competently going after those
          | market segments.
        
           | demaga wrote:
            | But you also run inference orders of magnitude more times,
           | so it should still amount to more compute than training?
        
             | teaearlgraycold wrote:
             | That matters more to the electricity company than the
             | silicon company. The profit margins on the datacenter
             | training hardware are stupidly high compared to an AMD APU.
        
               | binary132 wrote:
               | If there are tens of thousands of training GPUs but
               | billions of APUs, then what? BTW, training is such a high
               | cost that it seems like a major motive for the customer
               | to reduce costs there.
        
               | k__ wrote:
               | This.
               | 
               | Most will probably use something like Llama as base.
        
               | talldayo wrote:
               | > If there are tens of thousands of training GPUs but
               | billions of APUs, then what?
               | 
               | Believe it or not, we've actually been grappling with
               | this scenario for almost a decade at this point.
               | Originally the answer was to unite hardware manufacturers
               | around a common featureset that could compete with
               | (albeit not replace) CUDA. Khronos was prepared to
               | elevate OpenCL to an industry standard, but Apple pulled
               | their support for it and let the industry collapse into
               | proprietary competition again. I bet they're kicking
               | themselves over that one, if they still hold a stronger
               | grudge against Nvidia than Khronos at least.
               | 
               | So - logically, there's actually a one-size-fits-all
               | solution for this problem. It was even going to get
               | managed by the same people handling Vulkan. The problem
               | was corporate greed and shortsighted investment that let
               | OpenCL languish while CUDA was under active heavy
               | development.
               | 
               | > BTW, training is such a high cost that it seems like a
               | major motive for the customer to reduce costs there.
               | 
               | Eh, that's kinda like saying "app development is so
               | expensive that consumers will eventually care". Consumers
               | just buy the end product; they are never exposed to
               | building the software or concerned with the cost of the
               | development. This is especially true with businesses like
               | OpenAI that just give you free access to a decent LLM (or
               | Apple and their "it's free for now" mentality).
        
             | marcosdumay wrote:
             | Besides, if you separate them, the people doing the
             | training will put way more effort into optimizing their
             | hardware ROI than the ones doing inference.
        
           | mistercow wrote:
           | I think this is the big point of uncertainty in Nvidia's
           | future: will we find new training techniques which require
           | significantly less compute, and/or are better suited to some
           | fundamentally different architecture than GPUs? I'm reluctant
           | to bet no on that long term, and "long term" for ML right now
           | is not very long.
        
             | brigadier132 wrote:
             | If we find a new training technique that is that much more
              | efficient, why do you think we won't just increase the
              | amount of training we do by n times? (or even more since
              | it's now
             | accessible to train custom models for smaller businesses)
        
               | mistercow wrote:
               | We might, but it's also plausible that it would change
               | the ecosystem so much that centralized models are no
               | longer so prominent. For example, suppose that with much
               | cheaper training, most training is on your specific data
               | and behaviors so that you have a model (or ensemble of
               | models) tailored to your own needs. You still need a
               | foundation model, but those are much smaller so that they
               | can run on device, so even with overparameterization and
               | distillation, the training costs are orders of magnitude
               | smaller.
               | 
               | Or, in the small business case (mind you, "long term" for
               | tech reaching small businesses is looooong), these
               | businesses again need much smaller models because a) they
               | don't need a model well versed in Shakespeare and multi
               | variable calculus, and b) they want inference to be as
               | low cost as possible.
               | 
               | These are just scenarios off the top of my head. The
               | broader point is that a dramatic drop in training cost is
               | a wildcard whose effects are really hard to predict.
        
             | marcosdumay wrote:
             | I'd bet that any AI that is really useful for the tasks
             | people want to push LLMs into will answer "yes" to both
             | parts of your question.
             | 
             | But I don't know what "long term" is exactly, and have no
             | idea how to time this thing. Besides, I'd bet the sibling
              | invoking Jevons paradox is correct.
        
             | acchow wrote:
             | I'm betting the opposite: new model architectures will
             | unlock greater abilities at the cost of massive compute.
        
         | passion__desire wrote:
         | If compute is gonna play the role of electricity in coming
         | decades, then having a compute wall similar to Tesla powerwall
         | is a necessity.
        
           | ta988 wrote:
           | Only if improvements in speed and energy savings slow down
        
             | CooCooCaCha wrote:
             | And if models don't get any larger, which they will
        
           | CooCooCaCha wrote:
           | Powerwall and electric car in the garage, compute wall in the
           | closet, 3d printer and other building tools in the
           | manufacturing room, hydroponics setup in the indoor farm
           | room, and AI assistant to help manage it all. The home
           | becomes a mixed living area and factory.
        
             | crowcroft wrote:
             | The vision of this sounds so cool, but man, for a lot of
             | use cases at the moment most 'smart home' stuff is still
             | complicated and temperamental.
             | 
             | How do we get from here to there, cause I want to get there
             | so bad.
        
           | rbanffy wrote:
           | Powerwall makes sense because you can't generate energy at
           | any time and, therefore, you store it. Computers are not like
           | that - you don't "store" computations for when you need them
           | - you either use capacity or you don't. That makes it
           | practical to centralise computing and only pay for what you
           | use.
        
             | jfoutz wrote:
             | I was going to make a pedantic argument/joke about
             | memoization.
             | 
              | It is kind of an interesting thought though. A big wall of
              | SSDs is a fabulous amount of storage, and maybe a clever
              | read-only architecture would be cheaper than SSD. And with
              | a clever data structure for shared high-order bits, maybe,
              | maybe there is potential for some device to look up matrix
              | multiply results, or close approximations that could be
              | cheaply refined.
              | 
              | Right now, I doubt it. But a big static cache is a kind of
              | interesting idea to kick around on a Saturday afternoon.
        
               | rbanffy wrote:
               | > maybe there is potential for some device to look up
               | matrix multiply results, or close approximations that
               | could be cheaply refined.
               | 
               | Shard that across the planet and you'd have a global
               | cache for calculations. Or a lookup for every possible AI
               | prompt and its results.
        
               | marcosdumay wrote:
               | > I was going to make a pedantic argument/joke about
               | memoization.
               | 
               | You are reading the GP the wrong way around.
               | 
               | You store partial results exactly because you can't store
               | computation. Computation is perishable1, you either use
               | it or lose it. And one way to use it is to create partial
               | results you can save for later.
               | 
               | 1 - Well, partially so. Hardware utilization is
               | perishable, but computation also consumes inputs (mostly
                | energy) that aren't. How perishable it is depends on
               | the ratio of those two costs, and your mobile phone has a
               | completely different outlook from a supercomputer.
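                | 
                | A toy illustration of saving partial results because
                | the computation itself can't be saved (made-up
                | workload):
                | 
                |   from functools import lru_cache
                | 
                |   @lru_cache(maxsize=None)  # stored partial result
                |   def expensive(x):
                |       # Stand-in for real work.
                |       return sum(i * i for i in range(x))
                | 
                |   expensive(10_000_000)  # burns compute once...
                |   expensive(10_000_000)  # ...a lookup the 2nd time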
        
         | moffkalast wrote:
         | Nvidia still has 12-16GB VRAM offerings for around $300-400,
         | which are exceptionally well optimized and supported on the
         | software side. Still by far the most cost effective option if
          | you also value your time imo. The Strix Halo had better have
          | high-tier Mac-level bandwidth plus ROCm support and be priced
          | below $1k, or it's just not competitive, because it'll still
          | be slower than even partial CUDA offloading.
        
         | wmf wrote:
         | You may be seeing something that isn't there. I don't even know
         | if MI300A is available to buy, what it costs, or if you'll be
         | forced to buy four of them which would push prices close to DGX
         | territory anyway.
        
         | BaculumMeumEst wrote:
         | I remember reading geohot advocate for 7900XTX as a cost
         | effective card for deep learning. I read AMD is backing off
         | from the high end GPU market, though. Is there any chance they
         | will at least continue to offer cards with lots of VRAM?
        
         | acchow wrote:
         | The cloud is more efficient at utilizing hardware. Except for
         | low-latency or low-connection requirements, the move to cloud
         | will continue.
        
       | benreesman wrote:
       | I'm torn: NVIDIA has a fucking insane braintrust of some of the
       | most elite hackers in both software and extreme cutting edge
       | digital logic. You do not want to meet an NVIDIA greybeard in a
       | dark alley, they will fuck you up.
       | 
       | But this bullshit with Jensen signing girls' breasts like he's
       | Robert Plant and telling young people to learn prompt engineering
       | instead of C++ and generally pulling a pump and dump shamelessly
       | while wearing a leather jacket?
       | 
       | Fuck that: if LLMs could write cuDNN-caliber kernels that's how
       | you would do it.
       | 
       | It's ok in my book to live the rockstar life for the 15 minutes
       | until someone other than Lisa Su ships an FMA unit.
       | 
       | The 3T cap and the forward PE and the market manipulation and the
       | dated signature apparel are still cringe and if I had the capital
       | and trading facility to LEAP DOOM the stock? I'd want as much as
       | there is.
       | 
       | The fact that your CPU sucks ass just proves this isn't about
       | real competition just now.
        
         | almostgotcaught wrote:
         | Sir this is a Wendy's
        
           | benreesman wrote:
           | This is Y-Combinator. Garry Tan is still tweeting
           | embarrassing Utopianism to faint applause and @pg is still
           | vaguely endorsing a rapidly decaying pseudo-argument that
           | we're north of securities fraud.
           | 
           | At Wendy's I get a burger that's a little smaller every year.
           | 
           | On this I get Enron but smoothed over by Dustin's
           | OpenPhilanthropy lobbyism.
           | 
           | I'll take the burger.
           | 
           | edit:
           | 
           | tinygrad IS brat.
           | 
           | YC is old and quite weird.
        
             | defrost wrote:
             | Hell to the Yeah, it's filled with old weird posts:
             | https://news.ycombinator.com/item?id=567736
        
               | benreesman wrote:
               | I'm not important enough to do opposition research on, it
               | bewilders me that anyone cares.
               | 
               | I was 25 when I apologized for trolling too much on HN,
               | and frankly I've posted worse comments since: it's a
               | hazard of posting to a noteworthy and highly scrutinized
               | community under one's own name over decades.
               | 
               | I'd like to renew the apology for the low-quality, low-
               | value comments that have happened since. I answer to the
               | community on that.
               | 
               | To you specifically, I'll answer in the way you imply to
               | anyone with the minerals to grow up online under their
               | trivially permanent handle.
               | 
               | My job opportunities and livelihood move up and down with
               | the climate on my attitudes in this forum but I never
               | adopted a pseudonym.
               | 
               | In spite of your early join date which I respect in
               | general as a default I remain perplexed at what you've
               | wagered to the tune of authenticity.
        
               | defrost wrote:
               | There's no drama as far as I'm concerned, I got a
               | sensible chuckle from your comment & figured it deserved
               | a tickle in return; the obvious vector being anyone here
               | since 2008 has earned a tweak for calling the HN crowd
               | 'old' (something many can agree with).
               | 
               | My "opposition research" was entirely two clicks, profile
               | (see account age), Submissions (see oldest).
               | 
                | As for pseudonyms, I've been online since Usenet and
               | have never once felt the need to advertise on the new
               | fangled web (1.0, 2.0, or 3), handles were good enough
               | for Ham Radio, and TWKM - Those Who Know Me Know Who I Am
               | (and it's not at all that interesting unless you like
               | yarns about onions on belts and all that jazz).
        
               | benreesman wrote:
               | I'm pretty autistic, after haggling with Mistral this is
               | what it says a neurotypical person would say to diffuse a
               | conflict:
               | 
               | I want to apologize sincerely for my recent comments,
               | particularly my last response to you. Upon reflection, I
               | realize that my words were hurtful, disrespectful, and
               | completely inappropriate, especially given the light-
               | hearted nature of your previous comment. I am truly sorry
               | for any offense or harm I may have caused.
               | 
               | Your comment was clearly intended as a friendly jest, and
               | I regret that I responded with such hostility. There is
               | no excuse for my behavior, and I am committed to learning
               | from this mistake and ensuring it does not happen again.
               | 
               | I also want to address my earlier comments in this
               | thread. I now understand that my attempts to justify my
               | past behavior and dismiss genuine concerns came across as
               | defensive and disrespectful. Instead of taking
               | responsibility for my actions, I tried to deflect and
               | downplay their impact, which only served to escalate the
               | situation.
               | 
               | I value this community and the opportunity it provides
               | for open dialogue and growth. I understand that my
               | actions have consequences, and I am determined to be more
               | mindful, respectful, and considerate in my future
               | interactions. I promise to strive for higher quality
               | participation and to treat all members of this community
               | with the kindness and respect they deserve.
               | 
               | Once again, I am truly sorry for my offensive remarks and
               | any harm they may have caused. I appreciate the
               | understanding and patience you and the community have
               | shown, and I hope that my future actions will reflect my
               | commitment to change and help rebuild any trust that may
               | have been lost.
        
               | defrost wrote:
               | Cheers for that, it's a good apology.
               | 
               | Again, no drama - my sincere apologies for inadvertently
               | poking an old issue, there was no intent to be hurtful on
               | my part.
               | 
               | I have a thick skin, I'm Australian, we're frostily
               | polite to those we despise and call you names if we like
               | you - it can be offputting to some. :)
        
               | benreesman wrote:
               | The best hacker I know is from Perth, I picked up the
               | habit of the word "legend" as a result.
               | 
               | You've been a good sport legend.
        
               | benreesman wrote:
               | It's my hope that this thread is over.
               | 
               | You joined early, I've been around even longer.
               | 
               | You can find a penitent post from me about an aspiration
               | of higher quality participation, I don't have automation
               | set up to cherry-pick your comments in under a minute.
               | 
               | My username is my real name, my profile includes further
               | PII. Your account looks familiar but if anyone recognizes
               | it on sight it's a regime that post-dates @pg handing the
               | steering wheel to Altman in a "Battery Club" sort of way.
               | 
               | With all the respect to a fellow community member
               | possible, and it's not much, kindly fuck yourself with
               | something sharp.
        
               | defrost wrote:
               | Err .. you getting enough sleep there?
        
       | waynecochran wrote:
       | Side note: The acronym APU was used in the title but not once
       | defined or referenced in the article?
        
         | falcor84 wrote:
         | Here's my reasoning of what an APU is based on letter indices:
         | if A is 1, C is 3 and G is 7, then to get an APU, you need to
         | do what it takes to go from GPU to a CPU, and then apply an
         | extra 50% effort.
        
           | sebastiennight wrote:
           | This... is technically wrong, but it's the best kind of
           | wrong.
        
         | layer8 wrote:
         | It's an established term (originally by AMD) for a combination
         | of CPU and GPU on a single die. In other words, it's a CPU with
         | integrated accelerated graphics (iGPU). APU stands for
         | Accelerated Processing Unit.
         | 
         | Nvidia's Grace Hopper isn't quite that (it's primarily a GPU
         | with a bit of CPU sprinkled in), hence "halfway" I guess.
        
       | alexhutcheson wrote:
       | Somewhat tangential, but did Nvidia ever confirm if they
       | cancelled their project to develop custom cores implementing the
       | ARM instruction set (Project Denver, and later Carmel)?
       | 
       | It's interesting to me that they've settled on using standard
       | Neoverse cores, when almost everything else is custom designed
       | and tuned for the expected workloads.
        
         | adrian_b wrote:
          | Already in Nvidia Orin, which replaced Xavier (with its Carmel
          | cores) a couple of years ago, the CPU cores have been
          | Cortex-A78AE.
          | 
          | So Nvidia gave up on designing its own CPU cores some years
          | ago.
          | 
          | The Carmel core had performance similar to Cortex-A75, even
          | though it was launched by the time Cortex-A76 was already
          | available. Moreover, Carmel had very low clock frequencies,
          | which diminished its performance even more. Like Qualcomm and
          | Samsung, Nvidia has not been able to keep up with the Arm
          | Holdings design teams. (Now Qualcomm is back in the CPU design
          | business only because it acquired Nuvia.)
        
       | jokoon wrote:
        | It always made sense to have a single chip instead of two; I
        | just want to buy a single package with both things on the same
        | die.
        | 
        | That might make things much simpler for people who write
        | kernels, drivers and video games.
        | 
        | The history of CPU and GPU prevented that: it was always more
        | profitable for CPU and GPU vendors to sell them separately.
        | 
        | Having 2 specialized chips makes more sense because it's
        | flexible, but since frequencies are stagnating, having more
        | cores makes sense, and AI means massively parallel things are
        | not only for graphics.
        | 
        | Smartphones are much more modern in that regard. Nobody
        | upgrades their GPU or CPU anymore, so you might as well have a
        | single, soldered product that lasts a long time instead.
        | 
        | That may not be the end of building your own computer, but I
        | just hope it will make things simpler and in a smaller package.
        
         | tliltocatl wrote:
          | It's not about profit, it's about power and pin budget. A
          | proper GPU needs lots of memory bandwidth = lots of memory-
          | dedicated pins (HBM kinda solves this, but has tons of other
          | issues). And on the power/thermal side, having two chips, each
          | with dedicated power circuits, heatsinks and radiators, is
          | always better than one. The only reason NOT to have two chips
          | is either space (that's why we have integrated graphics and it
          | sucks performance-wise), packaging costs (not really a concern
          | for consumer GPU/CPU where we are now) or interconnect costs
          | (but for both gaming and compute, CPU-GPU bandwidth is
          | negligible compared to GPU-RAM).
        
       | astromaniak wrote:
        | This is good for datacenters, but... Nvidia has stopped doing
        | anything for the consumer market.
        
       | rbanffy wrote:
       | > The downside is Genoa-X has more than 1 GB of last level cache,
       | and a single core only allocates into 96 MB of it.
       | 
       | I wonder if AMD could license the IBM Telum cache implementation
       | where one core complex could offer unused cache lines to other
       | cores, increasing overall occupancy.
       | 
        | Would be quite neat; even if cross-complex bandwidth and latency
        | are not awesome, it should still be better than hitting DRAM.
        
       | sirlancer wrote:
       | In my tests of a Supermicro ARS-111GL-NHR with a Nvidia GH200
       | chipset, I found that my benchmarks performed far better with the
       | RHEL 9 aarch64+64k kernel versus the standard aarch64 kernel.
       | Particularly with LLM workloads. Which kernel was used in these
       | tests?
        
         | metadat wrote:
         | "Far better" is a little vague, what was the actual difference?
        
           | magicalhippo wrote:
           | Not OP but was curious about the "+64k" thing and found
            | this[1] article claiming around a 15% increase across several
           | different workloads using GH200.
           | 
           | FWIW for those unaware like me, 64k refers to 64kB pages, in
           | contrast to the typical 4kB.
           | 
           | [1]: https://www.phoronix.com/review/aarch64-64k-kernel-perf
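            | 
            | If you're unsure which kernel flavor you're running, the
            | page size is easy to check from userspace (getconf
            | PAGESIZE, or from Python):
            | 
            |   import resource
            | 
            |   # 4096 on a stock 4k kernel, 65536 on aarch64+64k.
            |   print(resource.getpagesize())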
        
       ___________________________________________________________________
       (page generated 2024-08-10 23:01 UTC)