[HN Gopher] AMD Ryzen APU turned into a 16GB VRAM GPU and it can...
       ___________________________________________________________________
        
       AMD Ryzen APU turned into a 16GB VRAM GPU and it can run Stable
       Diffusion
        
       Author : virgulino
       Score  : 136 points
       Date   : 2023-08-17 15:01 UTC (8 hours ago)
        
 (HTM) web link (old.reddit.com)
 (TXT) w3m dump (old.reddit.com)
        
       | acumenical wrote:
       | The post is short so I'll paste it here. If this is against the
       | rules please ban me.
       | 
       | -----begin copy paste-----
       | 
        | The 4600G is currently selling for $95. It includes a 6-core
        | CPU and a 7-core GPU. The 5600G is also inexpensive - around
        | $130, with a better CPU but the same GPU as the 4600G.
        | 
        | It can be turned into a 16GB VRAM GPU under Linux, and it works
        | similarly to AMD discrete GPUs such as the 5700XT, 6700XT, ....
        | It thus supports AMD's software stack, ROCm, and therefore
        | PyTorch and TensorFlow. You can run most AI applications.
       | 
        | 16GB of VRAM is also a big deal, as it beats most discrete
        | GPUs. Even though those GPUs have more computing power, they
        | will get out-of-memory errors if an application requires 12 or
        | more GB of VRAM. Although the speed is an issue, it's better
        | than out-of-memory errors.
       | 
        | For Stable Diffusion, it can generate a 50-step 512x512 image
        | in around 1 minute and 50 seconds. This is better than some
        | high-end CPUs.
       | 
        | The 5600G was a very popular product, so if you have one, I
        | encourage you to test it. I made some video tutorials for it.
        | Please search for tech-practice9805 on YouTube and subscribe
        | to the channel for future content. Or see the video links in
        | the comments.
       | 
       | Please also follow me on X: https://twitter.com/TechPractice1
        | Thanks for reading!
        | 
        | -----end copy paste-----
        
         | [deleted]
        
       | rvnx wrote:
       | Video + tutorial: https://www.youtube.com/watch?v=HPO7fu7Vyw4
        
       | thereisnospork wrote:
        | How well do these workloads parallelize? Especially over
        | consumer-tier interconnects? What's stopping someone from
        | picking up 100 of these to set up a cool little 1.6TB-VRAM
        | cluster? The whole thing would probably cost less than an
        | H100.
        
       | [deleted]
        
       | nre wrote:
        | The 4600G supports two channels of DDR4-3200, which gives a
        | maximum memory bandwidth of around 50GB/s (actual graphics
        | cards are in the hundreds). While this chip may be decent for
        | SD and other compute-bound AI apps, it won't be good for LLMs,
        | as inference speed is pretty much capped by memory bandwidth.
       | 
       | Apple Silicon has extremely high memory bandwidth which is why it
       | performs so well with LLMs.
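        | 
        | As a rough back-of-the-envelope (assuming a 7B-parameter model
        | in fp16, and that generating one token streams every weight
        | from memory once):
        | 
        |     # tokens/sec ~= memory bandwidth / model size in bytes
        |     model_bytes = 7e9 * 2      # 7B params x 2 bytes (fp16)
        |     for name, bw in [("4600G, DDR4-3200", 50e9),
        |                      ("Apple M2", 100e9),
        |                      ("RTX 3090, GDDR6X", 936e9)]:
        |         print(f"{name}: ~{bw / model_bytes:.1f} tokens/sec")
        | 
        | So roughly 3.5 tokens/sec is the ceiling for a 7B fp16 model
        | on this APU, no matter how much compute the iGPU has.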
        
         | acumenical wrote:
          | Odd. The high memory bandwidth of the M2 intrigues me, but I
          | have not seen many people having success with AI apps on
          | Apple Silicon. Which LLMs run better on Apple Silicon than
          | on comparably priced Nvidia cards?
        
           | AnthonyMouse wrote:
           | They don't run better on AS than on GPUs with even more
           | memory bandwidth. They run better on AS than on consumer PC
           | CPUs (or presumably iGPUs) with less memory bandwidth.
        
           | seniorivn wrote:
            | There are no comparably priced Nvidia cards; that's the
            | point of comparing Apple SoCs/APUs with AMD/Intel.
            | Specialized hardware is and always will be better.
        
         | [deleted]
        
         | AnthonyMouse wrote:
         | > The 4600G supports two channels of DDR4-3200 which has a
         | maximum memory bandwidth of around 50GB/s (actual graphics
         | cards are in the hundreds).
         | 
         | DDR4-4800 exists. 76.8GB/s. You can also get a Ryzen 7000
         | series for around $200 that can use DDR5-8000, which is
         | 128GB/s. By contrast, the M1 is 68GB/s and the M2 is 100GB/s.
         | (The Pro and Max are more, but they're also solidly in "buy a
         | GPU" price range.)
        
         | toast0 wrote:
          | Doesn't change the meat of your argument, but the 4000G
          | series runs DDR4-3600 pretty well (even if the spec sheet
          | only goes to 3200), and it's almost a crime to run an APU at
          | anything worse than DDR4-3600 CL16. You can go higher than
          | that too, but depending on your particular chip, when you
          | get much above 3600 you may not be able to run the RAM base
          | clock at the same rate as the Infinity Fabric, which isn't
          | ideal.
        
       | Roark66 wrote:
        | I wonder how it compares to running the same model on the same
        | CPU in RAM (making sure a fast CPU ML library that utilises
        | AVX2 is used, for example Intel MKL)?
        | 
        | Also, when doing a test like this it's important to compare
        | the same bit depths, so fp32 on both.
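        | 
        | Something like this would keep the comparison apples-to-
        | apples - a minimal timing sketch (assuming a ROCm build of
        | PyTorch, which exposes the APU through the torch.cuda API):
        | 
        |     import time
        |     import torch
        | 
        |     def bench_tflops(device, n=4096, iters=10):
        |         a = torch.randn(n, n, device=device, dtype=torch.float32)
        |         b = torch.randn(n, n, device=device, dtype=torch.float32)
        |         (a @ b).sum().item()   # warm-up; .item() forces a sync
        |         t0 = time.time()
        |         for _ in range(iters):
        |             c = a @ b
        |         c.sum().item()         # wait for the last matmul
        |         return iters * 2 * n**3 / (time.time() - t0) / 1e12
        | 
        |     print("cpu :", bench_tflops("cpu"))
        |     print("igpu:", bench_tflops("cuda"))  # ROCm shows up as "cuda"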
        
         | dragontamer wrote:
          | My untested rule of thumb is that if something fits inside
          | of GPU-register space, it's probably faster on the GPU.
          | (GPUs have more architectural registers than CPUs, since
          | CPUs are using all those "shadow registers" to implement
          | out-of-order stuff.) Compilers will automatically turn your
          | most-used variables into registers today, though you might
          | need to verify by looking at the assembly code whether or
          | not you're actually in register space.
          | 
          | But if something fits inside of CPU-cache space, it's
          | probably faster on the CPU. (Intel is up to 2MB of L2 cache
          | on some systems, AMD is up to 192MB of L3 cache on some
          | systems.)
          | 
          | But if something is outside of CPU-cache space yet fits
          | inside of GPU-VRAM space, it's probably faster on the GPU.
          | (Better to be bandwidth-limited at 500GB/s GPU-VRAM speeds
          | than at 50GB/s DDR5 speeds.) Ex: 16GB of GPU-VRAM brings
          | problems of that size back into GPU territory.
          | 
          | Then if something only fits inside of CPU-RAM space, which
          | is like 2TB in practice (!!!!), CPUs are probably faster,
          | because there's no point taking a 2TB DDR5-RAM-limited
          | problem, passing some of that data through 16GB or 96GB of
          | GPU-VRAM, and then waiting for DDR5 RAM anyway.
         | 
         | ------------
         | 
         | Not that I've tested out this theory. But it'd make sense in my
         | brain at least, lol.
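          | 
          | The rule of thumb above as a literal sketch, where the
          | thresholds are the hand-wavy part:
          | 
          |     def likely_winner(working_set_bytes):
          |         KB, MB, GB = 1e3, 1e6, 1e9
          |         if working_set_bytes < 256 * KB:  # register file
          |             return "GPU (register space)"
          |         if working_set_bytes < 192 * MB:  # biggest L3 caches
          |             return "CPU (cache space)"
          |         if working_set_bytes < 16 * GB:   # fits in VRAM
          |             return "GPU (VRAM bandwidth)"
          |         return "CPU (DDR5-limited either way)"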
        
           | [deleted]
        
       | chickenpotpie wrote:
       | The ability to get huge amounts of VRAM/$ is what I find
       | incredibly interesting. A lot of diffusion techniques are
       | incredibly VRAM intensive and high VRAM consumer cards are rare
       | and expensive. I'll gladly take the slower speeds of an APU if it
       | means I can load the entire model in memory instead of having to
       | offload chunks of it.
        
         | delusional wrote:
         | A PCIE 4.0 16x link should provide around 32GB/s of bandwidth,
         | close to the 33GB/s of DDR5-3200. In a perfect world, it would
         | seem to me that doing 100% offloading (streaming everything as
         | needed from system memory) should be equivalent to doing the
         | calculations in system memory in the first place. The GPU
         | memory just acts as a cache, and should only speed up the
         | processing.
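          | 
          | Where the ~32GB/s comes from: PCIe 4.0 runs 16 GT/s per
          | lane with 128b/130b encoding, per direction:
          | 
          |     per_lane = 16e9 * (128 / 130) / 8  # ~1.97 GB/s per lane
          |     print(16 * per_lane / 1e9)         # x16 link: ~31.5 GB/s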
        
           | dragontamer wrote:
            | Every modern motherboard is dual-channel DDR5 at a
            | minimum, maybe quad-channel.
            | 
            | 2x sticks of 32GB/s RAM, properly configured, will run at
            | 64GB/s of bandwidth. Modern servers are quad-, hex-, or
            | oct-channel (4x, 6x, or 8x parallel sticks of RAM) in
            | practice. Or even more (ex: Intel Xeon Platinums have 6
            | channels per socket, so an 8-socket Xeon Platinum system
            | has something like 48 parallel RAM channels).
           | 
           | ----------
           | 
           | PCIe x16 by the way, is 16x parallel lanes of PCIe. All the
           | parallelism is already innate in modern systems.
           | 
           | ---------
           | 
            | L3 cache is TB/s bandwidth IIRC. Code whose working set
            | stays in CPU-cache space will automatically have a lot of
            | those RAM accesses served from cache, so you can go above
            | the RAM-bandwidth limitations in practice, though it
            | depends on how your code accesses RAM.
            | 
            | GPUs have very small caches, and have higher latency to
            | those caches. Instead, GPUs rely on register space and
            | extreme amounts of SMT/wavefronts to kinda-sorta
            | hyperthread their cores to hide all that latency.
        
           | BobbyJo wrote:
            | The full mem-swap conga line is DRAM<->PCIE<->VRAM<->GPU.
            | PCIE is the weak link, but I'd be willing to bet that
            | those transfers aren't 100% overlapped, and the PCIE
            | transfer rate represents a best-case speed as opposed to
            | an expected one. In the case of unidirectional writes,
            | you'd have to cut that speed in half.
        
           | gautamcgoel wrote:
           | This argument makes sense only if you assume that your model
           | is too large to fit in the GPU's RAM and hence has to reside
           | mainly in the CPU's RAM.
        
             | delusional wrote:
              | That's true. The comment I was responding to talked
              | about offloading. I was assuming he was talking about
              | offloading part of the core model to the system RAM,
              | which would need to be reloaded frequently.
        
         | dragontamer wrote:
          | Using DDR5 as VRAM, however, means you're only getting
          | 50GB/s to 100GB/s read/write speed instead of the 500GB/s
          | available on a proper GPU.
         | 
         | That might be a fine tradeoff for some kernels. But my
         | understanding is that Stable Diffusion is very VRAM-bandwidth
         | heavy and actually benefits from the higher-speed GDDR6 or HBM
         | RAM on a proper high-end GPU.
        
         | justinclift wrote:
          | If you're OK with older-generation Nvidia gear, but still
          | with 24GB of RAM, then some people are using things like
          | this:
         | 
         | https://www.ebay.com.au/itm/126046615075
         | 
         | That's $250 in Australian dollars though, which is about
         | US$160. I'm not affiliated with that seller btw, I just
         | remembered the search result from looking a while back. :)
        
           | cptskippy wrote:
            | I bought one of those earlier this week and I'm in the
            | process of getting it set up.
        
           | lhl wrote:
            | For not much more (US$200) you can find lots of P40s,
            | which are a generation newer and will give you double the
            | memory bandwidth and FP32 throughput. That being said,
            | used 3090s are going for about $600 now and are much
            | better bang/buck and easier (software- and hardware-wise)
            | to set up.
        
           | choppaface wrote:
            | These have lots of memory but are pretty slow - they can
            | be 50x-100x slower than a gamer card from the past couple
            | of years, plus lots of heat / power inefficiency. If you
            | have any software that can take advantage of tensor cores
            | / matrix cores or int8 / fp16 ops, then modern hardware
            | will probably win.
        
             | [deleted]
        
           | sp332 wrote:
           | (It's a Tesla M40 with a blower fan retrofit, "Buy it Now"
           | price $300 Australian dollars.)
        
         | AuryGlenz wrote:
          | I mean, it's ridiculously slower. They're getting something
          | like 0.5 it/s at 512x512. I believe my 3080 gets 10? Maybe
          | more.
         | 
         | If AMD improves their speed with this configuration on later
         | APUs though - that could really hurt Nvidia.
        
           | andrewstuart wrote:
           | You mean in the Reddit post?
           | 
           | That's using a 4600G.
        
           | kramerger wrote:
           | This is still a 5-10x improvement over the CPU.
           | 
           | And most people playing with AI don't have an expensive GPU
           | with 12+ GB VRAM.
        
       | Zetobal wrote:
       | Incomprehensible, no receipts... script kiddie garbage.
        
         | fisf wrote:
          | You demand receipts and call this script kiddie garbage.
          | The irony.
         | 
         | ROCm has _kinda_ worked for some APUs in 5.x for a while now.
         | As much as things are expected to work on AMD hardware anyway.
         | 
          | So basically: install ROCm 5.5 and check whether rocminfo
          | lists your APU as a device. Assign more VRAM in your BIOS
          | if possible.
         | 
          | There are really no fundamental secret tricks involved. I
          | run PyTorch on a 5600G. It's not great and breaks all the
          | time, but that's not really APU-specific.
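          | 
          | The whole check, sketched out. The env var is the override
          | people commonly use for Vega APUs like the 4600G/5600G,
          | which report as gfx90c and aren't officially supported;
          | check.py is whatever your own script is called:
          | 
          |     # run as: HSA_OVERRIDE_GFX_VERSION=9.0.0 python check.py
          |     import torch
          | 
          |     print(torch.cuda.is_available())      # ROCm builds answer
          |     print(torch.cuda.get_device_name(0))  # via the CUDA API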
        
         | krylm wrote:
         | I suppose he used something like this:
         | https://bruhnspace.com/en/bruhnspace-rocm-for-amd-apus/
        
       | rbanffy wrote:
       | When it says "turned" I'm assuming there are some kernel boot
       | parameters or driver configurations that are needed for it to
       | allocate 16GB of main RAM for the GPU. Did they publish those or
       | is this behavior out of the box?
        
         | nickthegreek wrote:
          | My Asus ROG Ally (AMD-powered handheld Win11 gaming machine)
          | comes with 16GB of RAM. In the BIOS I can set the GPU RAM to
          | 2, 4, 6, 7, or 8GB, which will lower my available system
          | RAM. However, I can also change this in the Asus Armoury
          | Crate software in Windows. So, in my case, it came like this
          | out of the box.
        
           | NotCamelCase wrote:
            | How do you like that machine otherwise? I am kinda
            | considering getting one (or a Steam Deck), but the very
            | short battery life puts me off.
        
             | nickthegreek wrote:
             | I don't play while outside the house for extended periods,
             | so battery isn't a concern for me. I mostly just wanted to
             | be able to play my games on the couch, whether that was
             | using the Ally as a handheld or by being able to dock it to
             | the TV.
             | 
              | The biggest issue that keeps me from recommending it to
              | anyone is that it is a Windows 11 machine. That is both
              | its strongest and weakest point. Turn it on for the
              | first time and you get Windows setup on a tiny screen
              | with no physical mouse or keyboard. And then there is
              | messing with game graphics settings to eke out the
              | performance you want. The SD slot is complete garbage -
              | most likely a hardware issue that will never be fixed.
              | It stops reading cards and in some cases destroys the
              | card. It is fairly easy to upgrade the internal SSD,
              | though. ASUS has also been pretty quiet on graphics
              | driver updates.
             | 
              | As long as you are cool with that stuff, it's an awesome
              | machine. The Steam Deck has more of a 'just works'
              | console feel, but kinda outdated hardware-wise. Lenovo
              | Go images just leaked today.
        
         | zokier wrote:
          | The billion-dollar question here is how to use the HSA
          | feature of APUs to avoid needing to split the RAM between
          | GPU and CPU; in theory, at least, they both should be able
          | to access the same memory.
        
           | 1MachineElf wrote:
           | Good question!
           | 
           | ~2016 I bought a Lenovo ThinkPad E465 sporting an AMD Carrizo
           | APU specifically to take advantage of HSA. It seemed like the
           | feature, or at least the toolchain to take advantage of it,
           | never really materialized. I'm glad at least someone else
           | remembers it.
        
         | andrewstuart wrote:
         | https://github.com/GPUOpen-LibrariesAndSDKs/AGS_SDK/
         | 
         | https://gpuopen.com/ags-sdk-5-4-improves-handling-video-memo...
         | 
         | Which says: "For APUs, this distinction is important as all
         | memory is shared memory, with an OS typically budgeting half of
         | the remaining total memory for graphics after the operating
         | system fulfils its functional needs. As a result, the
         | traditional queries to Dedicated Video Memory in these
         | platforms will only return the dedicated carveout - and often
         | represent a fraction of what is actually available for
         | graphics. Most of the available graphics budget will actually
         | come in the form of shared memory which is carefully OS-managed
         | for performance."
         | 
         | The implication seems to be that you can have an arbitrary
         | amount of graphics RAM, which would be appealing for AI use
         | cases, even though the GPU itself is relatively underpowered.
         | 
          | Still, the question remains open: how do you precisely
          | control APU/GPU memory allocation on Linux, and what are
          | the limitations?
        
         | bladderlover23 wrote:
          | If you have 32GB of RAM, the option for force-allocating
          | 16GB will be available in the BIOS. I think it lets you
          | reserve a maximum of half your RAM for the iGPU.
        
           | Havoc wrote:
            | This is not consistently true - I can allocate only 8GB
            | out of 64GB.
        
           | jejones3141 wrote:
            | Nice. Just ordered the 64GB of RAM to max out my computer
            | w/5600G at 128GB.
        
             | FloatArtifact wrote:
             | There might be a max that you can assign to the APU
        
               | jejones3141 wrote:
               | True. We'll see. I run VMs on the computer, so it's nice
               | to have the RAM anyway and I might as well max it out.
        
               | FloatArtifact wrote:
               | Do let me know how it goes.
        
         | fotcorn wrote:
         | In the BIOS of my Lenovo laptop (T13 Gen3 AMD) I can select how
         | much of the RAM should be reserved as VRAM. I guess they are
         | doing something similar.
         | 
          | ROCm and PyTorch recognize the GPU, as in,
          | `torch.cuda.is_available()` returns true, but I haven't
          | actually run any models yet.
         | 
         | The maximum I can select on my 32GB RAM laptop is 8GB. Sounds
         | like there are desktop AM4 mainboards where you can go up to
         | 64GB.
         | 
          | One interesting thing I have seen reported by `glxinfo` is
          | auxiliary memory, which in my case is another 12GB, for a
          | total of 20GB of reported available GPU memory. Unclear if
          | this could be used by ROCm/PyTorch.
        
       | delijati wrote:
        | Woot, AMD now supports APUs? I sold my notebook as I hit a
        | wall when trying ROCm [1]. Is there a list of working APUs?
       | 
       | [1] https://github.com/RadeonOpenCompute/ROCm/issues/1587
        
       | tiffanyh wrote:
       | From the Reddit thread: _" That's about 0.55 iterations per
       | second. For $95, can't really complain."_
       | 
       | https://old.reddit.com/r/Amd/comments/15t0lsm/i_turned_a_95_...
        
         | [deleted]
        
       | Const-me wrote:
        | Would be interesting to try the newer AMD Phoenix APUs,
        | specifically the 7840H, 7840HS, 7940H or 7940HS:
       | https://en.wikipedia.org/wiki/Template:AMD_Ryzen_Mobile_7040...
       | 
        | These chips don't have a socket and were designed for laptops.
        | However, they have up to a 54W TDP, which is not quite laptop
        | territory. Luckily, there are mini-PCs on the market with
        | them. The form factor is similar to an Intel NUC or Mac Mini.
        | An example is the Minisforum UM790 Pro (disclaimer: I have
        | never used one; so far I have only read a review).
       | 
        | The integrated Radeon 780M GPU includes 12 compute units of
        | RDNA3; peak FP32 performance is about 9 TFlops, peak FP16
        | about 18 TFlops. The CPU supports two channels of DDR5-5600; a
        | properly built computer has 138 GB/second of memory bandwidth.
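        | 
        | Where the ~9 TFlops figure comes from, assuming a ~2.9GHz
        | boost clock: 12 CUs x 64 lanes = 768 ALUs, RDNA3 dual-issues
        | FP32, and an FMA counts as 2 flops.
        | 
        |     alus = 12 * 64                 # 12 RDNA3 CUs x 64 lanes
        |     tflops = alus * 2 * 2 * 2.9e9  # dual-issue x FMA x clock
        |     print(tflops / 1e12)           # ~8.9 TFlops FP32, FP16 ~2x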
        
         | gautamcgoel wrote:
         | Using two channels of DDR5-5600, you could achieve at most 90
         | GB/s of memory bandwidth. It seems to me that you would need
         | DDR5-8400 to hit 138GB/s.
        
           | Const-me wrote:
            | You might be correct, but do you have a source for that?
            | I found DDR5-5600 = 69.21 GB/second on this web page:
            | https://www.crucial.com/articles/about-memory/everything-
            | abo...
            | 
            | I have tried to find an authoritative source but failed,
            | because the DDR5 spec on jedec.org is sold for $370 :-(
        
             | lhl wrote:
              | You can use a calculator:
              | https://www.unitsconverters.com/en/Mt/S-To-
              | Gb/S/Utu-6007-376... - enter 5600 MT/s and you get 44.8
              | GB/s x 2 channels = 89.6 GB/s.
              | 
              | I have a 7940HS on my desk now, and Memtest reports the
              | same number.
        
       ___________________________________________________________________
       (page generated 2023-08-17 23:02 UTC)