[HN Gopher] AMD Ryzen APU turned into a 16GB VRAM GPU and it can...
___________________________________________________________________
AMD Ryzen APU turned into a 16GB VRAM GPU and it can run Stable
Diffusion
Author : virgulino
Score : 136 points
Date : 2023-08-17 15:01 UTC (8 hours ago)
(HTM) web link (old.reddit.com)
(TXT) w3m dump (old.reddit.com)
| acumenical wrote:
| The post is short so I'll paste it here. If this is against the
| rules please ban me.
|
| -----begin copy paste-----
|
| The 4600G is currently selling for $95. It includes a 6-core CPU
| and a 7-CU integrated GPU. The 5600G is also inexpensive - around
| $130, with a better CPU but the same GPU as the 4600G.
|
| It can be turned into a 16GB VRAM GPU under Linux, and it works
| similarly to AMD discrete GPUs such as the 5700 XT, 6700 XT, ....
| It thus supports the AMD software stack, ROCm, and therefore
| PyTorch and TensorFlow. You can run most AI applications.
|
| 16GB of VRAM is also a big deal, as it beats most discrete GPUs.
| Even where those GPUs have more computing power, they will hit
| out-of-memory errors if an application requires 12GB or more of
| VRAM. Although speed is an issue, it's better than out-of-memory
| errors.
|
| For Stable Diffusion, it can generate a 50-step 512x512 image in
| around 1 minute and 50 seconds. This is better than some high-end
| CPUs.
|
| The 5600G was a very popular product, so if you have one, I
| encourage you to test it. I made some video tutorials for it.
| Please search for tech-practice9805 on YouTube and subscribe to
| the channel for future content. Or see the video links in the
| comments.
|
| Please also follow me on X: https://twitter.com/TechPractice1
| Thanks for reading!
| [deleted]
| rvnx wrote:
| Video + tutorial: https://www.youtube.com/watch?v=HPO7fu7Vyw4
| thereisnospork wrote:
| How well do these workloads parallelize, especially over
| consumer-tier interconnects? What's stopping someone from picking
| up 100 of these to set up a cool little 1.6TB-VRAM cluster? The
| whole thing would probably cost less than an H100.
| [deleted]
| nre wrote:
| The 4600G supports two channels of DDR4-3200, which gives a
| maximum memory bandwidth of around 50GB/s (actual graphics cards
| are in the hundreds). While this chip may be decent for SD and
| other compute-bound AI apps, it won't be good for LLMs, as
| inference speed is pretty much capped by memory bandwidth.
|
| Apple Silicon has extremely high memory bandwidth, which is why
| it performs so well with LLMs.
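|
| (For reference, the back-of-the-envelope arithmetic behind that
| figure, assuming the standard 64-bit DDR channel width - a quick
| Python sketch:)
|
|     # Peak DDR bandwidth = transfers/s x bytes/transfer x channels.
|     # DDR4-3200, dual channel, 64-bit (8-byte) channels:
|     print(3200e6 * 8 * 2 / 1e9)  # -> 51.2 GB/s, i.e. "around 50GB/s"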
| acumenical wrote:
| Odd. The high memory bandwidth of the M2 intrigues me, but I
| have not seen many people having success with AI apps on Apple
| Silicon. Which LLMs run better on Apple Silicon than on
| comparably priced Nvidia cards?
| AnthonyMouse wrote:
| They don't run better on AS than on GPUs with even more
| memory bandwidth. They run better on AS than on consumer PC
| CPUs (or presumably iGPUs) with less memory bandwidth.
| seniorivn wrote:
| There are no comparably priced Nvidia cards; that's the point of
| comparing Apple SoCs/APUs with AMD/Intel. Specialized hardware is
| and always will be better.
| [deleted]
| AnthonyMouse wrote:
| > The 4600G supports two channels of DDR4-3200 which has a
| maximum memory bandwidth of around 50GB/s (actual graphics
| cards are in the hundreds).
|
| DDR4-4800 exists. 76.8GB/s. You can also get a Ryzen 7000
| series for around $200 that can use DDR5-8000, which is
| 128GB/s. By contrast, the M1 is 68GB/s and the M2 is 100GB/s.
| (The Pro and Max are more, but they're also solidly in "buy a
| GPU" price range.)
| toast0 wrote:
| Doesn't change the meat of your argument, but the 4000G series
| runs DDR4-3600 pretty well (even if the spec sheet only goes to
| 3200), and it's almost a crime to run an APU at anything worse
| than DDR4-3600 CL16. You can go higher than that too, but
| depending on your particular chip, once you get much above 3600
| you may not be able to run the RAM base clock at the same rate as
| the Infinity Fabric, which isn't ideal.
| Roark66 wrote:
| I wonder how it compares to running the same model on the same
| CPU in RAM (making sure a fast CPU ML library that utilises AVX2
| is used, for example Intel MKL)?
|
| Also, when doing a test like this it's important to compare the
| same bit depths, so fp32 on both.
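|
| A minimal sketch of such an apples-to-apples timing, assuming a
| ROCm build of PyTorch (sizes and iteration counts here are
| illustrative, not a rigorous benchmark):
|
|     import time
|     import torch
|
|     def time_matmul(device, n=4096, iters=10):
|         # Same fp32 workload on both devices, per the point above.
|         a = torch.randn(n, n, dtype=torch.float32, device=device)
|         b = torch.randn(n, n, dtype=torch.float32, device=device)
|         start = time.time()
|         for _ in range(iters):
|             _ = a @ b
|         if device != "cpu":
|             torch.cuda.synchronize()  # flush queued GPU work
|         return (time.time() - start) / iters
|
|     print("cpu:", time_matmul("cpu"))
|     if torch.cuda.is_available():  # true on ROCm builds as well
|         print("gpu:", time_matmul("cuda"))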
| dragontamer wrote:
| My untested rule of thumb is that if something fits inside of
| GPU register space, it's probably faster on the GPU. (GPUs have
| more architectural registers than CPUs, since CPUs are using all
| those "shadow registers" to implement out-of-order stuff.)
| Compilers will automatically turn your most-used variables into
| registers today, though you might need to verify by looking at
| the assembly code whether or not you're actually in register
| space.
|
| But if something fits inside of CPU-cache space, it's probably
| faster on the CPU. (Intel is up to 2MB of L2 cache on some
| systems; AMD is up to 192MB of L3 cache on some systems.)
|
| But if something is outside of CPU-cache space yet fits inside of
| GPU-VRAM space, it's probably faster on the GPU. (Better to be
| bandwidth-limited at 500GB/s GPU-VRAM speeds than at 50GB/s DDR5
| speeds.) E.g., 16GB of GPU VRAM puts you back into GPU-sized
| solutions to a problem.
|
| Then if something only fits inside of CPU-RAM space, which is
| like 2TB in practice (!!!!), the CPU is probably faster, because
| there's no point taking a 2TB DDR5-RAM-limited problem, passing
| some of that data through 16GB or 96GB of GPU VRAM, and then
| waiting on DDR5 RAM anyway.
|
| ------------
|
| Not that I've tested this theory out. But it makes sense in my
| brain at least, lol.
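|
| A toy encoding of that hierarchy, purely illustrative - the
| thresholds below are order-of-magnitude guesses, not measured
| cutoffs:
|
|     def probably_faster_on(working_set_bytes,
|                            gpu_regs=256 << 10,   # register file, per CU/SM
|                            cpu_cache=192 << 20,  # e.g. a big AMD L3
|                            gpu_vram=16 << 30):   # e.g. this APU's carveout
|         """Rule-of-thumb device choice by working-set size."""
|         if working_set_bytes <= gpu_regs:
|             return "GPU"  # fits in register space
|         if working_set_bytes <= cpu_cache:
|             return "CPU"  # fits in cache, TB/s-class bandwidth
|         if working_set_bytes <= gpu_vram:
|             return "GPU"  # VRAM bandwidth beats DDR5
|         return "CPU"      # spills past VRAM; stay in system RAM
|
|     print(probably_faster_on(8 << 30))  # 8GB working set -> "GPU"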
| [deleted]
| chickenpotpie wrote:
| The ability to get huge amounts of VRAM/$ is what I find
| incredibly interesting. A lot of diffusion techniques are very
| VRAM-intensive, and high-VRAM consumer cards are rare and
| expensive. I'll gladly take the slower speeds of an APU if it
| means I can load the entire model in memory instead of having to
| offload chunks of it.
| delusional wrote:
| A PCIe 4.0 x16 link should provide around 32GB/s of bandwidth,
| close to the 33GB/s of DDR5-3200. In a perfect world, it would
| seem to me that doing 100% offloading (streaming everything as
| needed from system memory) should be equivalent to doing the
| calculations in system memory in the first place. The GPU memory
| just acts as a cache, and should only speed up the processing.
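|
| (The ~32GB/s figure checks out, assuming PCIe 4.0's 16 GT/s per
| lane and 128b/130b encoding - a quick Python check:)
|
|     # Per-lane payload bandwidth, each direction:
|     lane = 16e9 * (128 / 130) / 8  # ~1.97 GB/s
|     print(lane * 16 / 1e9)         # x16 link -> ~31.5 GB/s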
| dragontamer wrote:
| Every modern motherboard is dual-channel DDR5 at a minimum,
| maybe quad-channel.
|
| 2x sticks of 32GB/s RAM, properly configured, will run at 64GB/s
| of bandwidth. Modern servers are quad-, hex-, or oct-channel (4x,
| 6x, or 8x parallel sticks of RAM) in practice. Or even more (ex:
| Intel Xeon Platinums are 6 channels per socket, so an 8-socket
| Xeon Platinum system will be like 48 parallel DDR5 RAM channels).
|
| ----------
|
| PCIe x16, by the way, is 16 parallel lanes of PCIe. All the
| parallelism is already innate in modern systems.
|
| ---------
|
| L3 cache is TB/s bandwidth, IIRC. CPUs working inside cache space
| will automatically be caching a lot of those RAM accesses, so you
| can go above the RAM-bandwidth limitations in practice, though it
| depends on how your code accesses RAM.
|
| GPUs have very small caches, and higher latency to those caches.
| Instead, GPUs rely upon register space and extreme amounts of
| SMT/wavefronts to kinda-sorta hyperthread their cores to hide all
| that latency.
| BobbyJo wrote:
| The full mem-swap conga line is DRAM<->PCIe<->VRAM<->GPU. PCIe is
| the weak link, but I'd be willing to bet that those transfers
| aren't 100% overlapped, and the PCIe transfer rate represents a
| best-case speed as opposed to the expected one. In the case of
| unidirectional writes, you'd have to cut that speed in half.
| gautamcgoel wrote:
| This argument makes sense only if you assume that your model
| is too large to fit in the GPU's RAM and hence has to reside
| mainly in the CPU's RAM.
| delusional wrote:
| That's true. The comment I was responding to talked about
| offloading. I was assuming he was talking about offloading part
| of the core model to system RAM, which would need to be reloaded
| frequently.
| dragontamer wrote:
| Using DDR5 as VRAM, however, means you're only getting 50GB/s to
| 100GB/s of read/write speed instead of the 500GB/s available on a
| proper GPU.
|
| That might be a fine tradeoff for some kernels. But my
| understanding is that Stable Diffusion is very VRAM-bandwidth
| heavy and actually benefits from the higher-speed GDDR6 or HBM
| RAM on a proper high-end GPU.
| justinclift wrote:
| If you're OK with older-generation Nvidia gear, but still want
| 24GB of RAM, then some people are using things like this:
|
| https://www.ebay.com.au/itm/126046615075
|
| That's $250 in Australian dollars though, which is about
| US$160. I'm not affiliated with that seller btw, I just
| remembered the search result from looking a while back. :)
| cptskippy wrote:
| I bought one of those earlier this week, and I'm in the process
| of getting it set up.
| lhl wrote:
| For not much more (US$200) you can find lots of P40s, which
| is a generation newer and will give you double the memory
| bandwidth and FP32. That being said, used 3090s are going for
| about $600 now and are much better bang/buck and easier (software
| and hardware) to set up.
| choppaface wrote:
| These have lots of memory but are pretty slow - they can be
| 50x-100x slower than a gamer card from the past couple of years,
| plus lots of heat / power inefficiency. If you have any software
| that can take advantage of tensor cores / matrix cores or int8 /
| fp16 ops, then modern hardware will probably win.
| [deleted]
| sp332 wrote:
| (It's a Tesla M40 with a blower fan retrofit, "Buy it Now"
| price $300 Australian dollars.)
| AuryGlenz wrote:
| I mean, it's ridiculously slower. They're getting something like
| 0.5 it/s at 512x512. I believe my 3080 gets 10? Maybe more.
|
| If AMD improves their speed with this configuration on later
| APUs, though, that could really hurt Nvidia.
| andrewstuart wrote:
| You mean in the Reddit post?
|
| That's using a 4600G.
| kramerger wrote:
| This is still a 5-10x improvement over the CPU.
|
| And most people playing with AI don't have an expensive GPU
| with 12+ GB VRAM.
| Zetobal wrote:
| Incomprehensible, no receipts... script kiddie garbage.
| fisf wrote:
| You demand receipts and call it script kiddie garbage. What
| irony.
|
| ROCm has _kinda_ worked for some APUs in 5.x for a while now.
| As much as things are expected to work on AMD hardware anyway.
|
| So basically: install ROCm 5.5 and check whether rocminfo lists
| your APU as a device. Assign more VRAM in your BIOS if possible.
|
| There are really no fundamental secret tricks involved. I run
| PyTorch on a 5600G. It's not great and breaks all the time, but
| that's not really APU-specific.
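|
| A quick visibility check once ROCm and a ROCm build of PyTorch
| are installed (the HSA_OVERRIDE_GFX_VERSION line is a commonly
| reported workaround for Vega-based APUs that ROCm doesn't
| officially support - treat it as an assumption, not gospel):
|
|     import os
|     # Set before importing torch so the ROCm runtime sees it.
|     os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "9.0.0")
|
|     import torch
|     print(torch.cuda.is_available())  # ROCm answers via the CUDA API
|     if torch.cuda.is_available():
|         print(torch.cuda.get_device_name(0))  # should name the iGPU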
| krylm wrote:
| I suppose he used something like this:
| https://bruhnspace.com/en/bruhnspace-rocm-for-amd-apus/
| rbanffy wrote:
| When it says "turned", I'm assuming there are some kernel boot
| parameters or driver configurations needed for it to allocate
| 16GB of main RAM to the GPU. Did they publish those, or is this
| behavior out of the box?
| nickthegreek wrote:
| My Asus ROG Ally (AMD-powered handheld Win11 gaming machine)
| comes with 16GB of RAM. In the BIOS I can set the GPU RAM to 2,
| 4, 6, 7 or 8GB, which will lower my available system RAM.
| However, I can also change this in the Armoury Crate software in
| Windows. So, in my case, it came like this out of the box.
| NotCamelCase wrote:
| How do you like the machine otherwise? I am kinda considering
| getting one (or a Steam Deck), but the very short battery life
| puts me off.
| nickthegreek wrote:
| I don't play while outside the house for extended periods,
| so battery isn't a concern for me. I mostly just wanted to
| be able to play my games on the couch, whether that was
| using the Ally as a handheld or by being able to dock it to
| the TV.
|
| The biggest issue that keeps me from recommending it to anyone is
| that it is a Windows 11 machine. That is both its strongest and
| weakest point. Turn it on for the first time and you get Windows
| setup on a tiny screen with no physical mouse or keyboard. And
| then there is messing with game graphics settings to eke out the
| performance you want. The SD slot is complete garbage - most
| likely a hardware issue that will not be able to be fixed. It
| stops reading cards and in some cases destroys the card. It is
| fairly easy to upgrade the internal SSD, though. ASUS has also
| been pretty quiet on graphics driver updates.
|
| As long as you are cool with that stuff, it's an awesome machine.
| The Steam Deck has more of a 'just works' console feel, but is
| kinda outdated hardware-wise. Lenovo Legion Go images just leaked
| today.
| zokier wrote:
| The billion-dollar question here is how to use the HSA feature of
| APUs to avoid needing to split the RAM between GPU and CPU; in
| theory at least, they both should be able to access the same
| memory.
| 1MachineElf wrote:
| Good question!
|
| Around 2016 I bought a Lenovo ThinkPad E465 sporting an AMD
| Carrizo APU specifically to take advantage of HSA. It seemed like
| the feature, or at least the toolchain to take advantage of it,
| never really materialized. I'm glad at least someone else
| remembers it.
| andrewstuart wrote:
| https://github.com/GPUOpen-LibrariesAndSDKs/AGS_SDK/
|
| https://gpuopen.com/ags-sdk-5-4-improves-handling-video-memo...
|
| Which says: "For APUs, this distinction is important as all
| memory is shared memory, with an OS typically budgeting half of
| the remaining total memory for graphics after the operating
| system fulfils its functional needs. As a result, the
| traditional queries to Dedicated Video Memory in these
| platforms will only return the dedicated carveout - and often
| represent a fraction of what is actually available for
| graphics. Most of the available graphics budget will actually
| come in the form of shared memory which is carefully OS-managed
| for performance."
|
| The implication seems to be that you can have an arbitrary
| amount of graphics RAM, which would be appealing for AI use
| cases, even though the GPU itself is relatively underpowered.
|
| Still, the question remains open: how do you precisely control
| APU/GPU memory allocation on Linux, and what are the limitations?
| bladderlover23 wrote:
| If you have 32GB of RAM, the option to force-allocate 16GB will
| be available in the BIOS. I think it lets you set a maximum of
| half your RAM as reserved for the iGPU.
| Havoc wrote:
| This is not consistently true - I can only allocate 8GB out of
| 64GB.
| jejones3141 wrote:
| Nice. Just ordered 64 GB of RAM to max out my computer with the
| 5600G at 128 GB.
| FloatArtifact wrote:
| There might be a max that you can assign to the APU.
| jejones3141 wrote:
| True. We'll see. I run VMs on the computer, so it's nice
| to have the RAM anyway and I might as well max it out.
| FloatArtifact wrote:
| Do let me know how it goes.
| fotcorn wrote:
| In the BIOS of my Lenovo laptop (T13 Gen3 AMD) I can select how
| much of the RAM should be reserved as VRAM. I guess they are
| doing something similar.
|
| ROCm and PyTorch recognize the GPU, as in,
| `torch.cuda.is_available()` returns true, but I haven't actually
| run any models yet.
|
| The maximum I can select on my 32GB RAM laptop is 8GB. Sounds
| like there are desktop AM4 mainboards where you can go up to
| 64GB.
|
| One interesting thing I have seen reported by `glxinfo` is
| auxiliary memory, which in my case is another 12GB, for a total
| of 20GB reported available GPU memory. Unclear if this could be
| used by ROCm/PyTorch.
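|
| One way to probe how much of that reported memory is actually
| usable would be to allocate chunks until allocation fails (a
| rough sketch, assuming a working ROCm build of PyTorch):
|
|     import torch
|
|     chunks, step = [], 512 * 1024 * 1024  # grab VRAM in 512MB chunks
|     try:
|         while True:
|             chunks.append(
|                 torch.empty(step, dtype=torch.uint8, device="cuda"))
|     except RuntimeError:  # out-of-memory surfaces as RuntimeError
|         pass
|     print(f"usable: ~{len(chunks) * step / 2**30:.1f} GiB")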
| delijati wrote:
| Woot, AMD now supports APUs? I sold my notebook as I hit a wall
| when trying ROCm [1]. Is there a list of working APUs?
|
| [1] https://github.com/RadeonOpenCompute/ROCm/issues/1587
| tiffanyh wrote:
| From the Reddit thread: _" That's about 0.55 iterations per
| second. For $95, can't really complain."_
|
| https://old.reddit.com/r/Amd/comments/15t0lsm/i_turned_a_95_...
| [deleted]
| Const-me wrote:
| Would be interesting to try newer AMD Phoenix APU, specifically
| 7840H, 7840HS, 7940H or 7940HS:
| https://en.wikipedia.org/wiki/Template:AMD_Ryzen_Mobile_7040...
|
| These chips don't have a socket and were designed for laptops.
| However, they have up to 54W TDP, which is not quite laptop
| territory. Luckily, there are mini-PCs on the market with them.
| The form factor is similar to an Intel NUC or Mac Mini. An
| example is the Minisforum UM790 Pro (disclaimer: I have never
| used one, so far I've only read a review).
|
| The integrated Radeon 780M GPU includes 12 compute units of
| RDNA3; peak FP32 performance is about 9 TFlops, peak FP16 about
| 18 TFlops. The CPU supports two channels of DDR5-5600; a properly
| built computer has 138 GB/second memory bandwidth.
| gautamcgoel wrote:
| Using two channels of DDR5-5600, you could achieve at most 90
| GB/s of memory bandwidth. It seems to me that you would need
| DDR5-8400 to hit 138GB/s.
| Const-me wrote:
| You might be correct, but do you have a source for that? I
| found that DDR5-5600 = 69.21 GB/second on that web page:
| https://www.crucial.com/articles/about-memory/everything-
| abo...
|
| I have tried to find an authoritative source but failed, because
| the DDR5 spec on jedec.org is sold for $370 :-(
| lhl wrote:
| You can use a calculator:
| https://www.unitsconverters.com/en/Mt/S-To-
| Gb/S/Utu-6007-376... - enter 5600 MT/s and get 44.8 GB/s
| x 2 channels = 89.6 GB/s.
|
| I have a 7940HS on my desk now, and Memtest reports the same
| number.
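|
| (Spelled out, that's the same transfers-per-second times
| bus-width arithmetic as earlier in the thread:)
|
|     print(5600e6 * 8 / 1e9)      # one 64-bit channel: 44.8 GB/s
|     print(5600e6 * 8 * 2 / 1e9)  # two channels: 89.6 GB/s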
___________________________________________________________________
(page generated 2023-08-17 23:02 UTC)