[HN Gopher] Zluda: Run CUDA code on Intel GPUs, unmodified
___________________________________________________________________
Zluda: Run CUDA code on Intel GPUs, unmodified
Author : goranmoomin
Score : 217 points
Date : 2023-06-15 14:42 UTC (8 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| raphaelj wrote:
| Related question, what is the best way to handle kernel
| compatibility for CUDA, OpenCL, etc ... ?
|
| I had to write a cross-platform kernel a few weeks ago, and I
| ended up using pre-processor guards to make it work with both
| the OpenCL and CUDA compilers [1].
|
| [1]
| https://github.com/RaphaelJ/libhum/blob/main/libhum/match.ke...
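|
| A minimal sketch of that guard approach (hypothetical macro
| names, not the ones used in the linked kernel):
|
|   #ifdef __OPENCL_VERSION__            /* OpenCL C compiler */
|   #define KERNEL       __kernel
|   #define GLOBAL       __global
|   #define GLOBAL_ID()  get_global_id(0)
|   #else                                /* CUDA: nvcc or clang */
|   #define KERNEL       extern "C" __global__
|   #define GLOBAL
|   #define GLOBAL_ID()  (blockIdx.x * blockDim.x + threadIdx.x)
|   #endif
|
|   /* The same kernel body compiles under both toolchains. */
|   KERNEL void scale(GLOBAL float *data, float factor, int n) {
|       int i = GLOBAL_ID();
|       if (i < n)
|           data[i] *= factor;
|   }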
| vkaku wrote:
| The answer is unconventional: run CI/CD. It's way easier to
| see whether things will break when the code is actually being
| built and run-tested against each of these stacks.
| dang wrote:
| Related:
|
| _Zluda: CUDA on Intel GPUs_ -
| https://news.ycombinator.com/item?id=26262038 - Feb 2021 (77
| comments)
| 01100011 wrote:
| > Is ZLUDA a drop-in replacement for CUDA? Yes, but certain
| applications use CUDA in ways which make it incompatible with
| ZLUDA
|
| So no then.
| glimshe wrote:
| It's a drop-in replacement in the sense that you don't need to
| modify your code. But it has limitations/incompatibilities.
| Contrast to something that isn't a drop-in replacement... That
| would require changes to the application.
| 01100011 wrote:
| This statement makes no sense.
|
| "It's compatible with CUDA as long as you don't use all the
| features of CUDA."
|
| So it's a drop in replacement for some subset of modern CUDA.
| I feel like most folks who are upvoting this don't program
| CUDA professionally or aren't very advanced in their usage of
| it.
| dragonwriter wrote:
| "Drop-in" = to the extent it works, it requires no app
| changes.
|
| "Complete" = it covers everything.
|
| It is drop-in, but not complete.
| dahart wrote:
| The issue is that in the original question "drop-in"
| implies complete; that's what being a drop-in replacement
| actually means in other contexts. If it's not complete,
| then it's not _really_ drop-in, even though I don't
| necessarily disagree with your definition. You can be
| right, and parent can be right too, IMHO. The FAQ
| question is stated ambiguously and in misleadingly black
| and white terms, and the answer really does look kinda
| funny starting with the word "Yes" and then following
| that with "but... not exactly". Wouldn't it be better to
| say drop-in is the goal, and because it's not complete,
| we're not there yet?
| glimshe wrote:
| Does the statement _make no sense_, or do you simply not
| like it? I can understand what it says; it sounds like plain
| English to me.
|
| I think that what you're trying to say is that before
| claiming to be a "drop-in replacement", make sure that your
| supported feature set is representative enough of mainline
| CUDA development.
| jjallen wrote:
| Yeah, it would probably be good to include how often it
| doesn't work. 1% of the time? 10%?
| mattkrause wrote:
| It's near the bottom:
|
| "What is the status of the project?
|
| This project is a Proof of Concept. About the only thing that
| works currently is Geekbench. It's amazingly buggy and
| incomplete. You should not rely on it for anything serious."
| meepmorp wrote:
| 50% of the time it works every time.
| catchnear4321 wrote:
| that's not quite accurate.
|
| it works 100% of the time. until it does not.
| jlebar wrote:
| Oh wow, I think this translates PTX (nvidia's high-level assembly
| code) to SPIR-V? Am I reading this right? That's...a lot.
|
| A note to any systems hackers who might try something like
| this: you can also retarget clang's CUDA support to SPIR-V
| via an LLVM-to-SPIR-V translator. I can say with confidence
| that this works. :)
| rapatel0 wrote:
| Intel should pour money into this project until the code is
| hosted in Scrooge's money bin.
| YetAnotherNick wrote:
| If a single dev could do it, why can't AMD do the same for
| their GPUs?
| ftxbro wrote:
| it's because AMD drivers generate dead loops:
| https://youtu.be/Mr0rWJhv9jU?t=320
| simcop2387 wrote:
| Patents and software licenses probably.
| mrstone wrote:
| I imagine they could, but it is probably more of a legal thing.
| viewtransform wrote:
| Yes, it is a legal issue. AMD cannot implement CUDA.
|
| However, they have worked around that by creating HIP, which
| is a CUDA-adjacent language that runs on AMD and also
| translates to CUDA for Nvidia GPUs. There is also the HIPify
| tool to automatically convert existing sources from CUDA to
| HIP. https://docs.amd.com/bundle/HIP-Programming-
| Guide-v5.3/page/...
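|
| For a rough idea of what HIPified code ends up looking like
| (my own hedged sketch, not taken from the AMD docs), the
| runtime calls get renamed while the kernel and launch syntax
| stay essentially the same:
|
|   #include <hip/hip_runtime.h>
|
|   __global__ void add_one(float *x, int n) {
|       int i = blockIdx.x * blockDim.x + threadIdx.x;
|       if (i < n) x[i] += 1.0f;
|   }
|
|   int main() {
|       const int n = 1024;
|       float *d;
|       hipMalloc((void **)&d, n * sizeof(float)); // was cudaMalloc
|       hipMemset(d, 0, n * sizeof(float));        // was cudaMemset
|       add_one<<<(n + 255) / 256, 256>>>(d, n);   // launch unchanged
|       hipDeviceSynchronize();    // was cudaDeviceSynchronize
|       hipFree(d);                // was cudaFree
|   }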
| CoastalCoder wrote:
| Legal question:
|
| Let's suppose that an open-source CUDA API is in a legal
| gray zone that could only be clarified by a judge.
|
| Could a company like AMD create a wholly owned subsidiary
| to make an attempt, without exposing the parent company to
| legal liability?
| remram wrote:
| Is that true? AMD implemented the x86 instruction set,
| Google implemented the Java APIs, what is different about
| CUDA?
| hardware2win wrote:
| Don't AMD and Intel have an agreement on x86?
| remram wrote:
| I see
| constantcrying wrote:
| AMD has an x86 license from Intel and Intel has an x86-64
| license from AMD.
|
| You can guess how much money lawyers have been paid over
| that circumstance.
| circuit10 wrote:
| x86 is protected by patents and only a few companies can
| use them, that's why you don't see random companies
| making x86 CPUs like they can with ARM (which also needs
| licensing but it's much easier to get)
| cesarb wrote:
| The first x86-64 processor was released more than 20
| years ago, so any patents on that base architecture
| (which includes SSE2) have already expired.
| riemannzeta wrote:
| Good luck enforcing those software patents, though, in the
| current environment. My sense is that hardware companies
| put more faith in the enforceability of patent claims to
| something like CUDA than software companies with
| experience with litigating patent claims to APIs would.
| Post Oracle v. Google, something like CUDA is vulnerable
| to being knocked off.
|
| And FWIW, that seems to be a reasonable result given the
| overall market structure at the moment. Having all eggs
| in the Nvidia basket is great for Nvidia shareholders,
| but not for customers and probably not even for the
| health of the surrounding industry.
| rany_ wrote:
| Genuine question: why don't these protections apply to
| emulators? How can emulators get away with emulating
| x86 when chip manufacturers cannot use x86 for their chips
| without a license?
| remram wrote:
| Probably fair use, which is a subjective thing, but one
| Intel must be confident enough that they would lose. Or
| just a lack of incentive; there is no money for Intel to
| gain even if they win.
| YetAnotherNick wrote:
| HIPify is such a half-baked effort, in everything from
| installation to benchmarks to marketing. It doesn't look
| like they are trying to support this method at all.
| Onavo wrote:
| Didn't SCOTUS affirm that APIs are not copyrightable?
| pavon wrote:
| No, they ruled that APIs _are_ copyrightable, but that
| Google's re-implementation was fair use. Based on the
| reasoning of the decision one would expect that in most
| cases independently reimplementing an API would generally
| be fair use. However, from a practical point of view, if
| you are defending yourself in a copyright lawsuit, fair
| use decisions happen much later in the process and are
| more subjective.
|
| Furthermore, CUDA is a language (dialect of C/C++) not an
| API, so that precedent may not have much weight.
| hedora wrote:
| In the end, Google reimplemented Java, and the Supreme
| Court ruled on a very narrow piece of the
| reimplementation. I think it came down to a former
| Sun/Oracle employee at Google actually copy-pasting code
| from the original Java code base.
|
| I'm reasonably sure they could reimplement CUDA from a
| copyright / trademark perspective. It's possible that
| they could be blocked with patents though.
| monocasa wrote:
| > think it came down to a former sun/oracle employee at
| google actually copy pasting code from the original java
| code base.
|
| IIRC, the verbatim copying of rangeCheck didn't make it
| to SCOTUS. They really did instead rule on the
| copyrightability of the "structure, sequence, and
| organization" of the Java API as a whole.
| COGlory wrote:
| Then why isn't Microsoft suing Valve for Proton/DXVK?
| constantcrying wrote:
| Because these lawsuits are costly, a PR nightmare,
| losing them is a serious possibility, and going around
| fighting your competition with lawsuits can put you in
| a bad place with government agencies.
|
| Playing games on Linux is not a threat to Microsoft. The
| money they lose on that is minuscule.
| hobofan wrote:
| Probably because they don't care?
| dotnet00 wrote:
| MS has been building lots of goodwill with gamers by
| bringing games to PC and subverting expectations by not
| being opposed to using game pass on Steam Deck. Suing
| Valve or trying to shut down Proton/DXVK would instantly
| burn all that.
| easyThrowaway wrote:
| Because they ran their numbers and realized they have
| more to lose by going against Valve than by amicably
| finding a compromise.
|
| A more aggressive approach was tried during the Xbox 360
| era with the Games for Windows Live framework and by
| removing their games from the Steam store. It ended
| catastrophically badly and they had to backtrack on both
| decisions.
|
| The irony of Proton de facto killing any chance of
| native Linux ports of Windows games isn't lost on them,
| either.
| CoastalCoder wrote:
| Not sure if this is relevant, but IIUC Proton and Wine
| implement Windows' ABI, rather than something involving
| copyrighted header files.
| hedora wrote:
| Probably for the same reason JWZ reimplemented OpenGL 1.3 on
| top of OpenGL ES 1.1 in three days, but the vendors can't do
| it:
|
| https://www.jwz.org/blog/2012/06/i-have-ported-xscreensaver-...
|
| https://news.ycombinator.com/item?id=4134426
| voxadam wrote:
| It's probably a good idea to hide the referrer on links to
| jwz's site; he holds some fairly strong opinions about HN.
|
| https://dereferer.me/?https%3A//www.jwz.org/blog/2012/06/i-h.
| ..
| tempaccount420 wrote:
| Although true, I don't think we should be trying to
| circumvent his block.
| Ygg2 wrote:
| It's a better suggestion to just not visit JWZ's site :P
|
| Went there a few days ago. Got a colonoscopy picture. Even
| without a dereferrer/referrer.
| bee_rider wrote:
| From the README:
|
| > Is ZLUDA a drop-in replacement for CUDA?
|
| > Yes, but certain applications use CUDA in ways which make it
| incompatible with ZLUDA
|
| > What is the status of the project?
|
| > This project is a Proof of Concept. About the only thing that
| works currently is Geekbench. It's amazingly buggy and
| incomplete. You should not rely on it for anything serious
|
| It is a cool proof of concept but we don't know how far away it
| is from becoming something that a company would willingly
| endorse. And I suspect AMD or Intel wouldn't want to put a ton
| of effort into... helping people continue to write code in
| their competitor's ecosystem.
| jabradoodle wrote:
| CUDA has won though; it's not about helping people write
| code for your competitors, it's about allowing the most used
| packages for ML to run on your hardware.
| NelsonMinar wrote:
| This looks really interesting but also early days. "this is a
| very incomplete proof of concept. It's probably not going to work
| with your application." I hope it develops into something broadly
| usable!
| pjmlp wrote:
| CUDA is a polyglot development environment; all these
| projects usually fall short by focusing only on C++.
|
| I failed to find any information regarding Fortran, Haskell,
| Julia, .NET, or Java support for CUDA workloads.
| misterbishop wrote:
| Why is this trending when there hasn't been a commit since Jan
| 2021? There's a comment here like "it's early days"... the repo
| has been dormant for longer than it was active.
| trostaft wrote:
| This project has been unmaintained for a while.
|
| Both Intel and AMD have the opportunity to create some actual
| competition for NVIDIA in the GPGPU space. Intel, at least, I can
| forgive since they only just entered the market. Why AMD has
| struggled so hard to get anything going for so long, I don't
| know...
| garbagecoder wrote:
| * and Apple.
| moooo99 wrote:
| AMD has made several attempts; their most recent effort
| apparently is the ROCm [0] software platform. There is an
| official PyTorch distro for Linux that supports ROCm [1] for
| acceleration. There are also frameworks like tinygrad [2]
| that claim support for all sorts of accelerators. That's as
| far as the claims go; I don't know how it handles the real
| world. If the occasional livestream from George Hotz (the
| creator of tinygrad) is anything to go by, AMD has to iron
| out a lot of driver issues to be any actual competition for
| team green.
|
| I really hope AMD manages a comeback like the one they showed
| a few years ago with their CPUs. Intel joining the market is
| certainly helping, but having three big players competing
| would certainly be desirable for all sorts of applications
| that require GPUs. AMD cards like the 7900 XTX are already
| fairly promising on paper with fairly large VRAM; they'd
| probably be much more cost effective than NVIDIA cards if
| software support was anywhere near comparable.
|
| [0]: https://www.amd.com/en/graphics/servers-solutions-rocm
|
| [1]: https://pytorch.org/
|
| [2]: https://github.com/geohot/tinygrad
| mmis1000 wrote:
| I think the weirdest thing is that ROCm works fine on
| Linux. At least some dedicated workstations can use it with
| specific cards, and it has already existed for many years.
| But somehow they haven't made a single card work on Windows
| after all this time. (Or they don't want to, for some
| reason?) It's really weird, given they already have a working
| implementation (just not for Windows), so they are not
| lacking the ability to make it work.
| jacooper wrote:
| The issue with ROCm is that it's completely inaccessible
| for most users. It only supports high-end GPUs.
|
| Meanwhile, CUDA works on a 1050 Ti.
| kkielhofner wrote:
| Supported on CUDA 12 no less!
|
| To get an idea, the 1050 Ti is a card that had an MSRP of
| $140 when it was released almost seven years ago. Between
| the driver and CUDA support matrix it will likely end up
| with a 10-year support life.
|
| While it's not going to impress with an LLM or such, it's
| the minimum supported card for speech to text with
| Willow Inference Server (I'm the creator) and it still puts
| up impressive numbers.
|
| Same for the GTX 1060/1070, which you can get with up to
| 8GB VRAM today for ~$100 used. Again, not impressive for
| the hotter LLMs, etc but it will do ASR, TTS, video
| encoding/decoding, Frigate, Plex transcoding, and any
| number of other things with remarkable performance
| (considering cost). People also run LLMs on them and from a
| price/performance/power standpoint it's not even close
| compared to CPU.
|
| The 15 year investment and commitment to universal support
| for any Nvidia GPU across platforms (with very long support
| lifecycles) is extremely hard to compete with (as we see
| again and again with AMD attempts in the space).
| jacooper wrote:
| To be fair, AMD does offer good long term support for
| cards, just not with ROCm.
| vkaku wrote:
| Agreed. What they've also been doing is stalling/removing
| support for gfx803, which could have let people keep
| using those GPUs for many decent small nets.
| joe_the_user wrote:
| " _AMD has made several attempts..._ "
|
| And failed to make any of them work, which to my mind means
| they've burned their possibilities _more_ than if they flat-
| out did nothing.
| kadoban wrote:
| Does ROCm count as an attempt? They burned so many people by
| not supporting any of the cards anyone cares about.
| Tostino wrote:
| All it would take to remedy that is actually providing
| good support going forward and a bit of advertising. Not a
| huge barrier IMO.
| MuffinFlavored wrote:
| > Both Intel and AMD have the opportunity to create some actual
| competition for NVIDIA in the GPGPU space.
|
| Apple Silicon being on Metal Performance Shaders (I think they
| deprecated OpenCL support?) kind of makes this all more
| confusing.
|
| It definitely feels like CUDA is the leader and anything else
| is backseat/a non-starter, which is fine. The community
| support isn't there.
|
| I haven't heard anybody talk about AMD Radeon GPUs in a looong
| time.
| pjmlp wrote:
| All the competition fails on tooling and on not being
| polyglot like CUDA.
|
| So it is already a non-starter if they can't meet those
| baselines.
| easythrees wrote:
| AMD has HIP.
| trostaft wrote:
| Yes, but I don't think it's debatable to say that the entire
| ecosystem is firmly behind NVIDIA. Usually it comes as a
| surprise when something does support AMD's framework, whether
| ROCm directly or even HIP, which should be easier...
|
| I shouldn't be surprised that AMD's ecosystem is lagging
| behind, since their GPU division spent a good decade
| struggling to even stay relevant. Not to mention that NVIDIA
| has spent a lot of effort on their HPC tools.
|
| I don't want this to be too negative towards AMD; they have
| been steadily growing in this space. Some things do work
| well, e.g. Stable Diffusion is totally fine on AMD GPUs. So
| they seem to be catching up. I just feel a little impatient,
| especially since their cards are more than powerful enough to
| be useful. I suppose my point is that the gap in HPC software
| between NVIDIA and AMD is much larger than the actual
| capability gap in their hardware, and that's a shame.
| Eisenstein wrote:
| Apple was a few months from bankruptcy during most of the
| 90s competing with IBM and Microsoft, then turned around to
| become the most profitable company on the planet. It takes
| a leader and a plan and a lot of talent and the exact right
| conditions, but industry behemoths get pulled down from the
| top spot all the time.
| tracker1 wrote:
| Apple's success is mostly UX and marketing with a walled
| garden for an application tax. AMD has to actually
| achieve on the hardware side, not just marketing. Beyond
| this, AMD has demonstrated that they are, indeed, working
| on closing the gap and pulling ahead. AMD is well ahead of
| Intel on the server CPU front, and they're neck and neck
| on desktop, having pulled ahead at points in the past few
| years. And on the GPU side, they've closed a lot of gaps.
|
| While I am a little bit of a fan of AMD, there's still
| work to do. I think AMD really needs to take advantage of
| their production margins to gain more market share. They
| also need something a bit closer to the 4090 as a
| performance GPU + entry workstation API/GPGPU workload
| card. The 7900 XTX is really close, but if they had
| something with say 32-48 GB of VRAM in the sub-$2000 space
| it would really get a lot of the hobbyist and SOHO types
| to consider them.
| elzbardico wrote:
| Yeah, sure, changing their platform 3 times in the space
| of some twenty years is just marketing and UX from Apple.
| They are just a bunch of MBAs. Sometimes I feel like I am
| reading slashdot.
| AnthonyMouse wrote:
| The platform changes had very little to do with their
| success. They switched from PowerPC to Intel because
| PowerPC was uncompetitive, but that doesn't explain why
| they did any better than Dell or anyone else using the
| exact same chips. Then they developed their own chips
| because Intel was stagnant, but they barely came out
| before AMD had something competitive and it's not obvious
| they'd have been in a meaningfully different position had
| they just used that.
|
| Their hardware is good but if all they were selling was
| Macbooks and iPhones with Windows and Android on them,
| they wouldn't have anything near their current margins.
| robbiep wrote:
| You're not really making sense.
|
| If they hadn't made those platform changes they would never
| have been able to turn into what they are today. I hardly
| think that is 'little to do'.
|
| They would likely barely exist. They have 'achieved
| product-market fit', as the saying goes, which requires
| more than just a sharp UI, as their history shows.
| freedomben wrote:
| Yeah, but try running ML projects on your AMD card and you'll
| quickly see that they're an afterthought nearly everywhere,
| even in projects that use PyTorch (which has backend support
| for AMD). If consumers can't use it, they're going to learn
| Nvidia, and experience has shown that people opt for the
| enterprise tech they're familiar with, and most people get
| familiar by hacking on it locally.
| zackmorris wrote:
| Are there any projects that go the opposite direction, to run CPU
| code on GPU? I understand that there might be limitations, like
| not being able to access system calls or the filesystem. What I'm
| mainly looking for is a way to write C-style code and have it run
| auto-parallelized, without having to drop into a different
| language or annotate my code or manually manage buffers.
| andy_ppp wrote:
| I think C code is full of branches, and CPUs are designed to
| guess and parallelise those (possible) decisions. Graphics
| cards are designed for running tiny programs against millions
| of pixels per second. I'm not sure it's possible to make
| these two different concepts the same.
| karim79 wrote:
| > Is ZLUDA a drop-in replacement for CUDA? Yes, but certain
| applications use CUDA in ways which make it incompatible with
| ZLUDA
|
| I think this might get better if and when people redesign their
| dev workflows, CI/CD pipelines, builds, etc, to deploy code to
| both hardware platforms to ensure matching functionality and
| stability. I'm not going to hold my breath just yet. But it would
| be _really_ great to have two _viable_ platforms/players in
| this space where code can be run and behave identically.
| Tade0 wrote:
| I appreciate how the name translates to "delusion", considering
| how "cuda" translates to "miracles" in the same language
| (Polish).
| tgtweak wrote:
| How good is the tool at reporting missing CUDA functionality?
| SomeRndName11 wrote:
| [flagged]
| raphlinus wrote:
| This is something that fundamentally can't work, unfortunately.
| One showstopper (and there may be others) is subgroup size.
| Nvidia hardware has a subgroup (warp) size of 32, while Intel's
| subgroup size story is far more complicated, and depends on a
| compiler heuristic to tune. The short version of the story is
| that it's usually 16 but can be 8 if there's a lot of register
| pressure, or 32 for a big workgroup and not much register
| pressure (and for those who might reasonably question whether
| forcing subgroup size to 32 can solve the compatibility issue,
| the answer is that it will frequently cause registers to spill
| and performance to tank). CUDA code is not written to be agile in
| subgroup size, so there is no automated translation that works
| efficiently on Intel GPU hardware.
|
| Longer term, I think we _can_ write GPU code that is portable,
| but it will require building out the infrastructure for it.
| Vulkan compute shaders are one good starting point, and as of
| Vulkan 1.3 the "subgroup size control" feature is mandatory.
| WebGPU is another possible path to get there, but it's currently
| lacking a lot of important features, including subgroups at all.
| There's more discussion of subgroups as a potential WebGPU
| feature in [1], including how to handle subgroup size.
|
| [1]: https://github.com/gpuweb/gpuweb/issues/3950
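|
| To make the "not agile in subgroup size" point concrete, here
| is a typical CUDA idiom (an illustrative snippet, not from
| ZLUDA): the shuffle reduction bakes in exactly 32 lanes, so
| naively mapping it onto an 8- or 16-wide Intel subgroup
| silently drops contributions.
|
|   // Classic CUDA warp reduction assuming warpSize == 32.
|   __device__ float warp_sum(float v) {
|       for (int offset = 16; offset > 0; offset >>= 1)
|           v += __shfl_down_sync(0xffffffffu, v, offset);
|       return v;  // lane 0 now holds the 32-lane total
|   }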
| varelse wrote:
| [dead]
| AnthonyMouse wrote:
| Things like this are often useful even if they're not optimal.
| Before you had a piece of code that simply would not run on
| your GPU. Now it runs. Even if it's slower than it should be,
| that's better than not running at all. Which makes more people
| willing to buy the GPU.
|
| Then they go to the developers and ask why the implementation
| isn't optimized for this hardware lots of people have and the
| solution is to do an implementation in Vulkan etc.
| pjmlp wrote:
| Only if SPIR-V tooling ever gets half as good as the PTX
| ecosystem.
| [deleted]
| fancyfredbot wrote:
| The CUDA block size is likely to be a good proxy for register
| pressure, so if the block size is small you can try running
| with a small subgroup, etc.
|
| NVIDIA used to discourage code which relies on the subgroup or
| warp size. I'm not sure how much this is true of real world
| code though.
| vkaku wrote:
| Beautiful! Waiting for someone to use this and get benchmarks
| with PyTorch now.
| replete wrote:
| Llama.cpp just added CUDA GPU acceleration yesterday, so this
| would be very interesting for the emerging space of running
| local LLMs on commodity hardware.
|
| Running CUDA on an AMD RDNA3 APU is what I'd like to see, as
| it's probably the cheapest 16GB shared VRAM solution (via the
| UMA Frame Buffer BIOS setting) and creates the possibility of
| running a 13B LLM locally on an underutilized iGPU.
|
| Aaand it's been dead for years, shame.
| brucethemoose2 wrote:
| - llama.cpp already has OpenCL acceleration. It has had it for
| some time.
|
| - AMD already has a CUDA translator: ROCm. It _should_ work
| with llama.cpp's CUDA backend, but in practice... _shrug_
|
| - The copies the CUDA/OpenCL code makes (which are unavoidable
| for discrete GPUs) are problematic for IGPs. Right now
| acceleration regresses performance on IGPs.
|
| Llama.cpp would need tailor-made IGP acceleration. And I'm not
| even sure which API has the most appropriate zero-copy
| mechanism. Vulkan? OneAPI? Something inside ROCm?
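|
| For reference, plain OpenCL already has a zero-copy recipe
| that an IGP backend could in principle use (a hedged sketch
| of the standard API calls, nothing llama.cpp-specific): on
| integrated GPUs the runtime can usually back such a buffer
| with shared memory, so mapping it replaces the usual
| clEnqueueWriteBuffer copy.
|
|   #include <CL/cl.h>
|   #include <string.h>
|
|   // Allocate a host-visible buffer and fill it in place.
|   cl_mem make_shared_buf(cl_context ctx, cl_command_queue q,
|                          size_t bytes, cl_int *err) {
|       cl_mem buf = clCreateBuffer(ctx,
|           CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
|           bytes, NULL, err);
|       void *p = clEnqueueMapBuffer(q, buf, CL_TRUE,
|           CL_MAP_WRITE, 0, bytes, 0, NULL, NULL, err);
|       memset(p, 0, bytes);  /* write weights here instead */
|       clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
|       return buf;
|   }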
| varelse wrote:
| [dead]
___________________________________________________________________
(page generated 2023-06-15 23:01 UTC)