[HN Gopher] Grace Hopper, Nvidia's Halfway APU
___________________________________________________________________
Grace Hopper, Nvidia's Halfway APU
Author : PaulHoule
Score : 118 points
Date : 2024-08-09 22:52 UTC (1 day ago)
(HTM) web link (chipsandcheese.com)
(TXT) w3m dump (chipsandcheese.com)
| tedunangst wrote:
| Irrelevant, but the intro reminded me that nvidia also used to
| dabble in chipsets like nForce, back when there was supplier
| variety in that space.
| jauntywundrkind wrote:
| SoundStorm vs Dolby is such a turning point story. Nvidia had a
| 5 billion op/s DSP and Dolby Digital encoding on that chipset.
| Computers were coming into their own as powerful universal
| systems that could do anything.
|
| Then Dolby cancelled the license. To this day you still need
| very fancy sound cards or exotic motherboards to be able to
| output good surround sound to a large number of AV receivers.
| There are some open DTS standards that Linux can do too, dunno
| about Windows/Mac.
|
| But it just felt like we slid so far down; Dolby went and
| made everything so much worse.
|
| (Media software can do Dolby pass-through to let the high
| quality sound files through, yes. But this means you can't do
| any effect processing, like audio normalization/compression for
| example. And if you are playing games your amp may be getting
| only basic low-quality surround, not the good many-channel
| stuff.)
| throwaway81523 wrote:
| Do you mean AC3? Ffmpeg has been able to do that since
| forever.
|
| https://en.wikipedia.org/wiki/Dolby_Digital
| jauntywundrkind wrote:
| There's some debate about what patents apply, but even Dolby
| had to admit defeat as of 2017. So yes, a 640 kbit/s
| 6-channel format is available for encoding in ffmpeg and
| some others.
|
| I don't know if games are smart enough to use this?
|
| It also feels like a very low bar. It's not awful bitrate
| for 6 channels but neither is it great. It's not a pitiful
| number of channels but again neither is it great.
|
| Last & most crucially, just because one piece of software
| can emit ac3 doesn't make it particularly useful for a
| system. I should be able to have multiple different apps
| doing surround sound, sending notifications to back
| channels or panning sounds as I prefer. Yes ffmpeg can
| encode 5.1 media audio to an AVR but that doesn't really
| substitute for an actual surround system.
|
| This is more a software problem, now that the 5.1 AC3
| patents are expired. And there have been some stacks in the
| past where this worked on Linux for example. But it seems
| like modern hardware (with a Sound Open Firmware) has
| changed a bit and PipeWire needs to come up with a new way
| of doing ac3/a52 encoding.
| https://gitlab.freedesktop.org/pipewire/pipewire/-/issues/32...
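As a concrete illustration of the encoding side (not the realtime
PipeWire path discussed above), here is a minimal sketch of 5.1 AC3
encoding by shelling out to ffmpeg from Python; ffmpeg being on PATH
and the file names are assumptions.

    # Minimal sketch: transcode a 6-channel source to 640 kbit/s AC3 with ffmpeg.
    # Assumes ffmpeg is installed and on PATH; file names are placeholders.
    import subprocess

    def encode_ac3(src: str, dst: str, bitrate: str = "640k") -> None:
        """Encode a multichannel file to AC3 (A/52) for playback on an AVR."""
        subprocess.run(
            ["ffmpeg", "-y", "-i", src, "-c:a", "ac3", "-b:a", bitrate, dst],
            check=True,
        )

    if __name__ == "__main__":
        encode_ac3("surround_5_1.wav", "surround_5_1.ac3")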
| ssl-3 wrote:
| I once went down a rabbit hole of trying to get realtime
| AC3 encoding on my desktop PC, and I broadly failed.
|
| That was a long time ago. It is now 2024.
|
| Do we still need that today? For modern AVRs we have
| HDMI, with 8 channels' worth of up to 24-bit/192 kHz
| lossless digital audio baked in.
|
| For old AVRs with multichannel analog inputs,
| motherboards with 6 or 8 channels of built-in audio are
| still common enough, as are separate sound cards with
| similar functionality.
|
| What's the advantage of realtime AC3 encoding today, do
| you suppose?
| throwaway81523 wrote:
| One reason to want Dolby encoding is to play back on your
| consumer home theater gear that decodes it. Alternatively
| though, just don't use that kind of gear.
| izacus wrote:
| I'm a bit confused about your last paragraph - what's low
| quality about Dolby Atmos / DTS:X output you get for games
| these days?
| MegaDeKay wrote:
| One place you'll find said chipset is in the OG Xbox, where
| they provided the southbridge "MCPX" chip as well as the GPU.
|
| https://classic.copetti.org/writings/consoles/xbox/#io
| m463 wrote:
| I think that stopped when Intel said Nvidia couldn't produce
| chipsets for some CPU architecture they were coming out with.
|
| I don't know if this was market savvy or a shot in the foot
| that made their ecosystem weaker.
| wtallis wrote:
| The transition point was when Intel moved the DRAM controller
| and PCIe root complex onto the CPU die, merging in the
| northbridge and leaving the southbridge as the only separate
| part of the chipset. The disappearance of the Front Side Bus
| meant Intel platforms no longer had a good place for an
| integrated GPU other than on the CPU package itself, and it
| was years before Intel's iGPUs caught up to the Nvidia 9400M
| iGPU.
|
| In principle, Nvidia could have made chipsets for Intel's
| newer platforms where the southbridge connects to the CPU
| over what is essentially four lanes of PCIe, but Intel locked
| out third parties from that market. But there wasn't much
| room for Nvidia to provide any significant advantages over
| Intel's own chipsets, except perhaps by undercutting some of
| Intel's product segmentation.
|
| (On the AMD side, the DRAM controller was on the CPU starting
| in 2003, but there was still a separate northbridge for
| providing AGP/PCIe, with a relatively high-speed
| HyperTransport link to the CPU. AMD dropped HT starting with
| their APUs in 2011 and the rest of the desktop processors
| starting with the introduction of the Ryzen family.)
| whaleofatw2022 wrote:
| The argument was before that transition.
|
| AFAIR the contentious point was that Nvidia had a license
| to the bus for P6 arch (by virtue of Xbox) but did not have
| a license for the P4 bus.
|
| AMD was also more than happy to have NVDA build chipsets
| for Hammer etc., especially since they didn't have a video
| core of their own... -at the time-.
|
| Once the AMD/ATI merger started, that was the real writing
| on the wall.
| MobiusHorizons wrote:
| I am really surprised to see that the performance of the CPU, and
| especially its latency characteristics, are so poor. The article
| alludes to the design likely being tuned for specific workloads,
| which seems like a good explanation. But I can't help wondering if
| throughput at the cost of high memory latency is just not a good
| strategy for CPUs even with the excellent branch predictors and
| clever OOO work that modern CPUs bring to the table. Is this a
| bad take? Are we just not seeing the intended use-case where this
| thing really shines compared to anything else?
| p1necone wrote:
| This kind of hardware makes sense for video games, and I guess
| GPU heavy workloads like AI might be similar? Most games have
| middling compute requirements but will take as much GPU power
| as you can give them if you're trying to run at high
| resolutions/settings. Although getting smooth gameplay at very
| high frame rates (~120hz+) does need a decent CPU in a lot of
| games.
|
| Look at how atrocious the CPUs were in the PS4/Xbone generation
| for an example of this.
| wmf wrote:
| Grace Hopper was not designed for games though.
| pjmlp wrote:
| And yet the PS4 / Xbox One still rule the games console market,
| because for a large market segment more polygons alone aren't
| worth buying a PS5 or Xbox Series, hence the negative sales
| growth and the attempts to cater to PC gamers as an alternative.
| edward28 wrote:
| These CPUs are intended just to run miscellaneous tasks, such
| as loading AI models or running the cluster operating system.
| They don't need to be performant, just efficient, as the GPU
| does all the heavy lifting. NVIDIA also provides an option to
| swap the Grace chip out for an x86 chip, which could deliver
| better performance depending on the remaining power budget.
| MobiusHorizons wrote:
| If this is all there is to it, why do they have the high
| frequency and large L3 cache? Those seem to be optimizing for
| something, not just a "good enough" configuration for a part
| that is not the bottleneck.
| riotnrrd wrote:
| Data augmentation in CPU-space is often compute-light, but
| requires rapid access to memory. There are libraries (like
| NVIDIA's DALI) that can do augmentation on the GPU, but
| this takes up GPU resources that could be used by training.
| Having a multi-core CPU with fast caches is a good
| compromise.
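A rough sketch of that division of labor, assuming a PyTorch-style
training loop: augmentation stays in CPU worker processes so the GPU
only sees ready batches (DALI would move the same work onto the GPU).
The dataset and transforms below are invented for illustration.

    # Sketch: cheap CPU-side augmentation in DataLoader workers, leaving the
    # GPU free for the model. Synthetic tensors stand in for decoded images.
    import torch
    from torch.utils.data import Dataset, DataLoader

    class AugmentedImages(Dataset):
        def __len__(self) -> int:
            return 1024

        def __getitem__(self, idx: int) -> torch.Tensor:
            img = torch.rand(3, 224, 224)              # placeholder decoded image
            if torch.rand(()) < 0.5:                   # random horizontal flip on the CPU
                img = torch.flip(img, dims=[2])
            return img + 0.01 * torch.randn_like(img)  # a bit of noise, also on the CPU

    if __name__ == "__main__":
        loader = DataLoader(AugmentedImages(), batch_size=64,
                            num_workers=8, pin_memory=True)
        device = "cuda" if torch.cuda.is_available() else "cpu"
        for batch in loader:
            batch = batch.to(device, non_blocking=True)
            # ... forward/backward pass on the GPU would go here ...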
| tonyarkles wrote:
| We're using the Orin AGX for edge ML. Not the same setup
| (Ampere) but it's a similar situation. The GPU is excellent for
| what we need it to do, but the CPU cores are painful. We're
| lucky... the CPUs aren't great but there are 12 of them and we
| can get away with carefully pipelining our data flows across
| multiple threads to get the throughput we need even though some
| individual stage latencies aren't what we'd like.
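A minimal sketch of that kind of staged pipeline in plain Python, with
bounded queues providing back-pressure between stages; the stage bodies
are stand-ins for the real capture/preprocess/inference steps.

    # Sketch: pipeline stages connected by bounded queues so slow stages apply
    # back-pressure instead of buffering without limit. Stage bodies are stubs.
    import queue
    import threading

    SENTINEL = object()

    def stage(fn, inbox, outbox):
        while True:
            item = inbox.get()
            if item is SENTINEL:
                outbox.put(SENTINEL)
                break
            outbox.put(fn(item))

    capture_q, pre_q, out_q = (queue.Queue(maxsize=4) for _ in range(3))

    threads = [
        threading.Thread(target=stage, args=(lambda x: x, capture_q, pre_q)),  # preprocess stub
        threading.Thread(target=stage, args=(lambda x: x, pre_q, out_q)),      # inference stub
    ]
    for t in threads:
        t.start()

    for frame in range(100):      # stand-in for a camera feed
        capture_q.put(frame)
    capture_q.put(SENTINEL)

    while (result := out_q.get()) is not SENTINEL:
        pass                      # consume results here

    for t in threads:
        t.join()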
| freeqaz wrote:
| What's the point of having the GPU on die for this? Are they
| expecting people to deploy one of these nodes without dedicated
| GPUs? It has a ton of NVLink connections which makes me think
| that these will often be deployed alongside GPUs which feels
| weird.
|
| The flip side of this is if the GPU can access the main system
| memory then I could see this being useful for loading big
| models with much more efficient "offloading" of layers. Even
| though bandwidth between GPU->LPDDR5 is going to be slow, it's
| still faster than what traditional PCI-E would allow.
|
| The caveat here is that I imagine these machines are $$$ and
| enterprise only. If something like this was brought to the
| consumer market though I think it would be very enticing.
|
| (If anybody from AMD is reading this, I feel like an
| architecture like this would be awesome to have. I would love
| to run Llama 3.1 405b at home and today I see zero path towards
| doing that for any "reasonable" amount of money (<$10k?).)
|
| Edit: It's at the bottom of the article. These are designed to
| be meshed together via NVLink into one big cluster.
|
| Makes sense. I'm really curious how the system RAM would be
| used in LLM training scenarios, or if these boxes are going to
| be used for totally different tasks that I have little context
| into.
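To make the offloading idea concrete, here is a hedged sketch of what
manual layer offloading looks like in PyTorch today: weights that don't
fit in VRAM stay in host RAM and get copied across per layer. Coherent
NVLink-C2C access to LPDDR5 is what would let the GPU skip much of that
copying. The model and sizes are invented for illustration.

    # Sketch: keep transformer blocks in system RAM and stream them to the GPU
    # one at a time during the forward pass. Sizes and model are illustrative only.
    import torch
    import torch.nn as nn

    device = "cuda" if torch.cuda.is_available() else "cpu"

    blocks = nn.ModuleList(
        [nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
         for _ in range(12)]
    )  # all parameters start out in host memory

    def forward_offloaded(x: torch.Tensor) -> torch.Tensor:
        x = x.to(device)
        for block in blocks:
            block.to(device)   # PCIe/NVLink copy in; free VRAM limits how many fit
            x = block(x)
            block.to("cpu")    # copy back out to make room for the next block
        return x

    out = forward_offloaded(torch.randn(1, 128, 512))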
| dagmx wrote:
| The article talks about the difference in the prefetcher between
| the two Neoverse setups (Graviton and Grace Hopper). However,
| isn't the prefetcher part of the core design in Neoverse? How
| would they differ?
| MobiusHorizons wrote:
| I believe the difference is in the cache hierarchy (more L3,
| less L2) and the generally high latency to DRAM, and even higher
| latency to HBM. This makes the prefetcher behave differently
| between the two implementations, because the L2 cache isn't able
| to absorb the latency.
| dagmx wrote:
| That was my initial read, but they have this line which made
| me wonder if it was somehow more than that:
|
| > I suspect Grace has a very aggressive prefetcher willing to
| queue up a ton of outstanding requests from a single core.
| MobiusHorizons wrote:
| Oh good point, maybe that is configurable as well.
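One way to feel why the prefetcher matters so much here: streaming
access lets the hardware prefetcher hide DRAM latency, while dependent
random accesses expose it. A rough NumPy sketch (absolute numbers are
machine-dependent, and the Python loop adds its own overhead):

    # Sketch: compare prefetch-friendly sequential sums against a dependent
    # pointer chase that defeats the prefetcher. Numbers are machine-specific.
    import time
    import numpy as np

    N = 1 << 24                                          # ~128 MB of int64, well past L3
    data = np.arange(N, dtype=np.int64)
    chain = np.random.permutation(N).astype(np.int64)    # random "next" pointers

    t0 = time.perf_counter()
    sequential = data.sum()                              # streaming access, prefetcher shines
    t1 = time.perf_counter()

    idx, hops = 0, 1_000_000
    for _ in range(hops):                                # each load depends on the previous one
        idx = chain[idx]
    t2 = time.perf_counter()

    print(f"sequential sum: {t1 - t0:.3f}s, {hops} dependent loads: {t2 - t1:.3f}s")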
| rkwasny wrote:
| Yeah so I also benchmarked GH200 yesterday and I am also a bit
| puzzled TBH:
|
| https://github.com/mag-/gpu_benchmark
| adrian_b wrote:
| I suggest that wherever you write "TFLOPS", you should also
| write the data type for which they were measured.
|
| Without knowing whether the operations have been performed on
| FP32 or on FP16 or on another data type, all the numbers
| written on that page are meaningless.
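For instance, a quick way to label the numbers per data type, assuming
PyTorch and a CUDA device are available (a sketch, not the benchmark
used in the linked repo), is to time the same GEMM at each precision:

    # Sketch: measure matmul throughput separately per data type so the reported
    # "TFLOPS" figure is unambiguous. Requires PyTorch with a CUDA device.
    import torch

    def gemm_tflops(dtype: torch.dtype, n: int = 8192, iters: int = 20) -> float:
        a = torch.randn(n, n, device="cuda", dtype=dtype)
        b = torch.randn(n, n, device="cuda", dtype=dtype)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            a @ b
        end.record()
        torch.cuda.synchronize()
        seconds = start.elapsed_time(end) / 1000.0
        return 2 * n**3 * iters / seconds / 1e12   # 2*n^3 FLOPs per GEMM

    for dt in (torch.float32, torch.float16, torch.bfloat16):
        print(dt, f"{gemm_tflops(dt):.1f} TFLOPS")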
| bmacho wrote:
| > The first signs of trouble appeared when vi, a simple text
| editor, took more than several seconds to load.
|
| Can it run vi?
| erulabs wrote:
| If AI remains in the cloud, nvidia wins. But I can't help but
| think that if AI becomes "self-hosted", if we return to a world
| where people own their own machines, AMD's APUs and interconnect
| technology will be absolutely dominant. Training may still be
| Nvidia's wheelhouse, but for a single device able to do all the
| things (inference, rendering, and computing), AMD, at least
| currently, would seem to be the winner. I'd love someone more
| knowledgeable in AI scaling to correct me here though.
|
| Maybe that's all far enough afield to make the current state of
| things irrelevant?
| teaearlgraycold wrote:
| You need orders of magnitude more compute for training than for
| inference. Nvidia still wins in your scenario.
|
| Currently, rendering and local GPGPU compute are Nvidia-dominated,
| and I don't see AMD competently going after those market
| segments.
| demaga wrote:
| But you also run inference orders of magnitude more times,
| so it should still amount to more compute than training?
| teaearlgraycold wrote:
| That matters more to the electricity company than the
| silicon company. The profit margins on the datacenter
| training hardware are stupidly high compared to an AMD APU.
| binary132 wrote:
| If there are tens of thousands of training GPUs but
| billions of APUs, then what? BTW, training is such a high
| cost that it seems like a major motive for the customer
| to reduce costs there.
| k__ wrote:
| This.
|
| Most will probably use something like Llama as base.
| talldayo wrote:
| > If there are tens of thousands of training GPUs but
| billions of APUs, then what?
|
| Believe it or not, we've actually been grappling with
| this scenario for almost a decade at this point.
| Originally the answer was to unite hardware manufacturers
| around a common featureset that could compete with
| (albeit not replace) CUDA. Khronos was prepared to
| elevate OpenCL to an industry standard, but Apple pulled
| their support for it and let the industry collapse into
| proprietary competition again. I bet they're kicking
| themselves over that one, if they still hold a stronger
| grudge against Nvidia than against Khronos, at least.
|
| So - logically, there's actually a one-size-fits-all
| solution for this problem. It was even going to get
| managed by the same people handling Vulkan. The problem
| was corporate greed and shortsighted investment that let
| OpenCL languish while CUDA was under active heavy
| development.
|
| > BTW, training is such a high cost that it seems like a
| major motive for the customer to reduce costs there.
|
| Eh, that's kinda like saying "app development is so
| expensive that consumers will eventually care". Consumers
| just buy the end product; they are never exposed to
| building the software or concerned with the cost of the
| development. This is especially true with businesses like
| OpenAI that just give you free access to a decent LLM (or
| Apple and their "it's free for now" mentality).
| marcosdumay wrote:
| Besides, if you separate them, the people doing the
| training will put way more effort into optimizing their
| hardware ROI than the ones doing inference.
| mistercow wrote:
| I think this is the big point of uncertainty in Nvidia's
| future: will we find new training techniques which require
| significantly less compute, and/or are better suited to some
| fundamentally different architecture than GPUs? I'm reluctant
| to bet no on that long term, and "long term" for ML right now
| is not very long.
| brigadier132 wrote:
| If we find a new training technique that is that much more
| efficient, why do you think we won't just increase the amount
| of training we do by n times? (Or even more, since it's now
| accessible for smaller businesses to train custom models.)
| mistercow wrote:
| We might, but it's also plausible that it would change
| the ecosystem so much that centralized models are no
| longer so prominent. For example, suppose that with much
| cheaper training, most training is on your specific data
| and behaviors so that you have a model (or ensemble of
| models) tailored to your own needs. You still need a
| foundation model, but those are much smaller so that they
| can run on device, so even with overparameterization and
| distillation, the training costs are orders of magnitude
| smaller.
|
| Or, in the small business case (mind you, "long term" for
| tech reaching small businesses is looooong), these
| businesses again need much smaller models because a) they
| don't need a model well versed in Shakespeare and
| multivariable calculus, and b) they want inference to be as
| low cost as possible.
|
| These are just scenarios off the top of my head. The
| broader point is that a dramatic drop in training cost is
| a wildcard whose effects are really hard to predict.
| marcosdumay wrote:
| I'd bet that any AI that is really useful for the tasks
| people want to push LLMs into will answer "yes" to both
| parts of your question.
|
| But I don't know what "long term" is exactly, and have no
| idea how to time this thing. Besides, I'd bet the sibling
| invoking Jevons paradox is correct.
| acchow wrote:
| I'm betting the opposite: new model architectures will
| unlock greater abilities at the cost of massive compute.
| passion__desire wrote:
| If compute is gonna play the role of electricity in coming
| decades, then having a compute wall similar to the Tesla
| Powerwall is a necessity.
| ta988 wrote:
| Only if improvements in speed and energy savings slow down
| CooCooCaCha wrote:
| And if models don't get any larger, which they will
| CooCooCaCha wrote:
| Powerwall and electric car in the garage, compute wall in the
| closet, 3d printer and other building tools in the
| manufacturing room, hydroponics setup in the indoor farm
| room, and AI assistant to help manage it all. The home
| becomes a mixed living area and factory.
| crowcroft wrote:
| The vision of this sounds so cool, but man, for a lot of
| use cases at the moment most 'smart home' stuff is still
| complicated and temperamental.
|
| How do we get from here to there, cause I want to get there
| so bad.
| rbanffy wrote:
| Powerwall makes sense because you can't generate energy at
| any time and, therefore, you store it. Computers are not like
| that - you don't "store" computations for when you need them
| - you either use capacity or you don't. That makes it
| practical to centralise computing and only pay for what you
| use.
| jfoutz wrote:
| I was going to make a pedantic argument/joke about
| memoization.
|
| It is kind of an interesting thought though. A big wall of
| SSDs is a fabulous amount of storage, and maybe a clever
| read-only architecture would be cheaper than SSD. With a
| clever data structure for shared high-order bits, maybe,
| maybe there is potential for some device to look up matrix
| multiply results, or close approximations that could be
| cheaply refined.
|
| Right now, I doubt it. But a big static cache is the kind of
| interesting idea to kick around on a Saturday afternoon.
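To make the memoization aside concrete: caching results only pays off
when identical inputs recur, which is exactly why "storing computation"
is awkward for things like matrix multiplies whose operands almost
never repeat. A toy sketch:

    # Toy sketch of "storing computation": results are cached per input, so the
    # cache only helps when identical inputs recur - rare for real matmul operands.
    from functools import lru_cache

    @lru_cache(maxsize=4096)
    def expensive(key: tuple) -> int:
        return sum(x * x for x in key)     # stand-in for a costly kernel

    expensive((1, 2, 3))   # computed
    expensive((1, 2, 3))   # served from the cache
    print(expensive.cache_info())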
| rbanffy wrote:
| > maybe there is potential for some device to look up
| matrix multiply results, or close approximations that
| could be cheaply refined.
|
| Shard that across the planet and you'd have a global
| cache for calculations. Or a lookup for every possible AI
| prompt and its results.
| marcosdumay wrote:
| > I was going to make a pedantic argument/joke about
| memoization.
|
| You are reading the GP the wrong way around.
|
| You store partial results exactly because you can't store
| computation. Computation is perishable [1], you either use
| it or lose it. And one way to use it is to create partial
| results you can save for later.
|
| [1] - Well, partially so. Hardware utilization is
| perishable, but computation also consumes inputs (mostly
| energy) that aren't. How perishable it is depends on
| the ratio of those two costs, and your mobile phone has a
| completely different outlook from a supercomputer.
| moffkalast wrote:
| Nvidia still has 12-16GB VRAM offerings for around $300-400,
| which are exceptionally well optimized and supported on the
| software side. Still by far the most cost effective option if
| you also value your time imo. The Strix Halo had better have
| high-tier, Mac-level bandwidth plus ROCm support and be priced
| below $1k, or it's just not competitive with that, because
| it'll still be slower than even partial CUDA offloading.
| wmf wrote:
| You may be seeing something that isn't there. I don't even know
| if MI300A is available to buy, what it costs, or if you'll be
| forced to buy four of them which would push prices close to DGX
| territory anyway.
| BaculumMeumEst wrote:
| I remember reading geohot advocating for the 7900 XTX as a
| cost-effective card for deep learning. I read AMD is backing off
| from the high end GPU market, though. Is there any chance they
| will at least continue to offer cards with lots of VRAM?
| acchow wrote:
| The cloud is more efficient at utilizing hardware. Except for
| low-latency or low-connection requirements, the move to cloud
| will continue.
| benreesman wrote:
| I'm torn: NVIDIA has a fucking insane braintrust of some of the
| most elite hackers in both software and extreme cutting edge
| digital logic. You do not want to meet an NVIDIA greybeard in a
| dark alley, they will fuck you up.
|
| But this bullshit with Jensen signing girls' breasts like he's
| Robert Plant and telling young people to learn prompt engineering
| instead of C++ and generally pulling a pump and dump shamelessly
| while wearing a leather jacket?
|
| Fuck that: if LLMs could write cuDNN-caliber kernels that's how
| you would do it.
|
| It's ok in my book to live the rockstar life for the 15 minutes
| until someone other than Lisa Su ships an FMA unit.
|
| The 3T cap and the forward PE and the market manipulation and the
| dated signature apparel are still cringe and if I had the capital
| and trading facility to LEAP DOOM the stock? I'd want as much as
| there is.
|
| The fact that your CPU sucks ass just proves this isn't about
| real competition just now.
| almostgotcaught wrote:
| Sir this is a Wendy's
| benreesman wrote:
| This is Y-Combinator. Garry Tan is still tweeting
| embarrassing Utopianism to faint applause and @pg is still
| vaguely endorsing a rapidly decaying pseudo-argument that
| we're north of securities fraud.
|
| At Wendy's I get a burger that's a little smaller every year.
|
| On this I get Enron but smoothed over by Dustin's
| OpenPhilanthropy lobbyism.
|
| I'll take the burger.
|
| edit:
|
| tinygrad IS brat.
|
| YC is old and quite weird.
| defrost wrote:
| Hell to the Yeah, it's filled with old weird posts:
| https://news.ycombinator.com/item?id=567736
| benreesman wrote:
| I'm not important enough to do opposition research on, it
| bewilders me that anyone cares.
|
| I was 25 when I apologized for trolling too much on HN,
| and frankly I've posted worse comments since: it's a
| hazard of posting to a noteworthy and highly scrutinized
| community under one's own name over decades.
|
| I'd like to renew the apology for the low-quality, low-
| value comments that have happened since. I answer to the
| community on that.
|
| To you specifically, I'll answer in the way you imply to
| anyone with the minerals to grow up online under their
| trivially permanent handle.
|
| My job opportunities and livelihood move up and down with
| the climate on my attitudes in this forum but I never
| adopted a pseudonym.
|
| In spite of your early join date which I respect in
| general as a default I remain perplexed at what you've
| wagered to the tune of authenticity.
| defrost wrote:
| There's no drama as far as I'm concerned, I got a
| sensible chuckle from your comment & figured it deserved
| a tickle in return; the obvious vector being anyone here
| since 2008 has earned a tweak for calling the HN crowd
| 'old' (something many can agree with).
|
| My "opposition research" was entirely two clicks, profile
| (see account age), Submissions (see oldest).
|
| As for pseudonyms, I've been online since Usenet and
| have never once felt the need to advertise on the
| newfangled web (1.0, 2.0, or 3); handles were good enough
| for Ham Radio, and TWKM - Those Who Know Me Know Who I Am
| (and it's not at all that interesting unless you like
| yarns about onions on belts and all that jazz).
| benreesman wrote:
| I'm pretty autistic, after haggling with Mistral this is
| what it says a neurotypical person would say to defuse a
| conflict:
|
| I want to apologize sincerely for my recent comments,
| particularly my last response to you. Upon reflection, I
| realize that my words were hurtful, disrespectful, and
| completely inappropriate, especially given the light-
| hearted nature of your previous comment. I am truly sorry
| for any offense or harm I may have caused.
|
| Your comment was clearly intended as a friendly jest, and
| I regret that I responded with such hostility. There is
| no excuse for my behavior, and I am committed to learning
| from this mistake and ensuring it does not happen again.
|
| I also want to address my earlier comments in this
| thread. I now understand that my attempts to justify my
| past behavior and dismiss genuine concerns came across as
| defensive and disrespectful. Instead of taking
| responsibility for my actions, I tried to deflect and
| downplay their impact, which only served to escalate the
| situation.
|
| I value this community and the opportunity it provides
| for open dialogue and growth. I understand that my
| actions have consequences, and I am determined to be more
| mindful, respectful, and considerate in my future
| interactions. I promise to strive for higher quality
| participation and to treat all members of this community
| with the kindness and respect they deserve.
|
| Once again, I am truly sorry for my offensive remarks and
| any harm they may have caused. I appreciate the
| understanding and patience you and the community have
| shown, and I hope that my future actions will reflect my
| commitment to change and help rebuild any trust that may
| have been lost.
| defrost wrote:
| Cheers for that, it's a good apology.
|
| Again, no drama - my sincere apologies for inadvertently
| poking an old issue, there was no intent to be hurtful on
| my part.
|
| I have a thick skin, I'm Australian, we're frostily
| polite to those we despise and call you names if we like
| you - it can be off-putting to some. :)
| benreesman wrote:
| The best hacker I know is from Perth, I picked up the
| habit of the word "legend" as a result.
|
| You've been a good sport legend.
| benreesman wrote:
| It's my hope that this thread is over.
|
| You joined early, I've been around even longer.
|
| You can find a penitent post from me about an aspiration
| of higher quality participation, I don't have automation
| set up to cherry-pick your comments in under a minute.
|
| My username is my real name, my profile includes further
| PII. Your account looks familiar but if anyone recognizes
| it on sight it's a regime that post-dates @pg handing the
| steering wheel to Altman in a "Battery Club" sort of way.
|
| With all the respect to a fellow community member
| possible, and it's not much, kindly fuck yourself with
| something sharp.
| defrost wrote:
| Err .. you getting enough sleep there?
| waynecochran wrote:
| Side note: the acronym APU is used in the title but never
| defined or referenced in the article.
| falcor84 wrote:
| Here's my reasoning of what an APU is based on letter indices:
| if A is 1, C is 3 and G is 7, then to get an APU, you need to
| do what it takes to go from GPU to a CPU, and then apply an
| extra 50% effort.
| sebastiennight wrote:
| This... is technically wrong, but it's the best kind of
| wrong.
| layer8 wrote:
| It's an established term (originally by AMD) for a combination
| of CPU and GPU on a single die. In other words, it's a CPU with
| integrated accelerated graphics (iGPU). APU stands for
| Accelerated Processing Unit.
|
| Nvidia's Grace Hopper isn't quite that (it's primarily a GPU
| with a bit of CPU sprinkled in), hence "halfway" I guess.
| alexhutcheson wrote:
| Somewhat tangential, but did Nvidia ever confirm if they
| cancelled their project to develop custom cores implementing the
| ARM instruction set (Project Denver, and later Carmel)?
|
| It's interesting to me that they've settled on using standard
| Neoverse cores, when almost everything else is custom designed
| and tuned for the expected workloads.
| adrian_b wrote:
| Already in Nvidia Orin, which replaced Xavier (with its Carmel
| cores) a couple of years ago, the CPU cores have been
| Cortex-A78AE.
|
| So Nvidia gave up on designing its own CPU cores some years
| ago.
|
| The Carmel core had performance similar to the Cortex-A75, even
| though it launched by the time the Cortex-A76 was already
| available. Moreover, Carmel had very low clock frequencies,
| which diminished its performance even more. Like Qualcomm
| and Samsung, Nvidia has not been able to keep up with the Arm
| Holdings design teams. (Now Qualcomm is back in the CPU design
| business only because they have acquired Nuvia.)
| jokoon wrote:
| It always made sense to have a single chip instead of two; I just
| want to buy a single package with both things on the same die.
|
| That might make things much simpler for people who write kernels,
| drivers and video games.
|
| The history of CPU and GPU prevented that, it was always more
| profitable for CPU and GPU vendors to sell them separately.
|
| Having two specialized chips makes more sense because it's
| flexible, but since frequencies are stagnating, having more cores
| makes sense, and AI means massively parallel workloads are no
| longer only for graphics.
|
| Smartphones are much more modern in that regard. Nobody upgrades
| their GPU or CPU anymore, so we might as well have a single,
| soldered product that lasts a long time instead.
|
| That may not be the end of building your own computer, but I just
| hope it will make things simpler and in a smaller package.
| tliltocatl wrote:
| It's not about profit, it's about power and pin budget. A proper
| GPU needs lots of memory bandwidth, which means lots of
| memory-dedicated pins (HBM kinda solves this, but has tons of
| other issues). And on the power/thermal side, having two chips,
| each with dedicated power circuits, heatsinks and radiators, is
| always better than one. The only reason NOT to have two chips is
| either space (that's why we have integrated graphics and it sucks
| performance-wise), packaging costs (not really a concern for
| consumer GPU/CPU where we are now) or interconnect costs (but
| for both gaming and compute, CPU-GPU bandwidth is negligible
| compared to GPU-RAM).
| astromaniak wrote:
| This is good for datacenters, but... Nvidia has stopped doing
| anything for the consumer market.
| rbanffy wrote:
| > The downside is Genoa-X has more than 1 GB of last level cache,
| and a single core only allocates into 96 MB of it.
|
| I wonder if AMD could license the IBM Telum cache implementation
| where one core complex could offer unused cache lines to other
| cores, increasing overall occupancy.
|
| Would be quite neat, even if cross-complex bandwidth and latency
| is not awesome, it still should be better than hitting DRAM.
| sirlancer wrote:
| In my tests of a Supermicro ARS-111GL-NHR with a Nvidia GH200
| chipset, I found that my benchmarks performed far better with the
| RHEL 9 aarch64+64k kernel versus the standard aarch64 kernel.
| Particularly with LLM workloads. Which kernel was used in these
| tests?
| metadat wrote:
| "Far better" is a little vague, what was the actual difference?
| magicalhippo wrote:
| Not OP but was curious about the "+64k" thing and found
| this[1] article claiming around a 15% increase across several
| different workloads using GH200.
|
| FWIW, for those unaware like me, 64k refers to 64 kB pages, in
| contrast to the typical 4 kB.
|
| [1]: https://www.phoronix.com/review/aarch64-64k-kernel-perf
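For anyone checking which flavor they're actually booted into, the page
size can be read from userspace; a quick sanity check, assuming Linux:

    # Report the kernel page size: 4096 on the standard kernel,
    # 65536 on the aarch64+64k build. Unix-only.
    import os

    print(os.sysconf("SC_PAGE_SIZE"))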
___________________________________________________________________
(page generated 2024-08-10 23:01 UTC)