[HN Gopher] Zluda: CUDA on Intel GPUs
       ___________________________________________________________________
        
       Zluda: CUDA on Intel GPUs
        
       Author : fho
       Score  : 242 points
       Date   : 2021-02-25 12:12 UTC (2 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | exhilaration wrote:
       | Upvoting so someone from Intel sees it and hires these devs to
       | make this actually happen. CUDA absolutely dominates, it would be
       | a game changer to have it work with another GPU platform.
        
         | k_sze wrote:
         | I think the correct way to solve it is to make something
         | competitive for Vulkan. Intel, AMD, and maybe ARM should really
         | cooperate on that front.
        
       | einpoklum wrote:
       | Very nice! I look forward to trying this out.
       | 
       | There are a few issues which come to mind though:
       | 
       | 1. How do these benchmark results compare to writing x86_64
       | implementations directly? That is, is it enough to reach OpenCL-
       | level performance?
       | 
       | 2. What about if I want to use both a CPU and a GPU? It seems
       | that would not be supported, as this library produces a
       | `libcuda.so`.
       | 
       | 3. What about AMD chips?
       | 
       | 4. The README says:
       | 
       | > Authors of CUDA benchmarks used CUDA functions atomicInc
       | > and atomicDec which have direct hardware support on NVIDIA
       | > cards, but no hardware support on Intel cards. They have
       | > to be emulated in software, which limits performance
       | 
       | Is there really no support, not even in the near future, for
       | atomic increment on Intel chips?
       | 
       | 5. This is all written in Rust. It's a bit fishy to me for a C
       | library to be implemented in Rust, though maybe I'm just a little
       | prejudiced.
       | 
       | 6. Where is the documentation of the semantic differences between
       | proper CUDA and ZLUDA? e.g. - what do streams do? How do events
       | work? etc.
        
         | johnnycerberus wrote:
         | To be fair I have the same feeling about Rust. C++ is the
         | language positioned for high performance computing, almost
         | everything that screams HPC is in C++. We already have problems
         | in our company modifying some old Fortran libraries that are
         | called from C++ and in the future God knows what Rust or Zig
         | training we will require for basically no upside for HPC. But
         | if safety would be my concern, I would totally look towards
         | Rust. Or hopefully, Julia will save the day.
        
       | makapuf wrote:
       | Does the same exist for AMD? Sorry for asking what may be obvious,
       | but could it be made?
        
         | tomash wrote:
         | kinda:
         | https://www.reddit.com/r/Amd/comments/hp88jb/is_there_any_ap...
        
           | my123 wrote:
           | My opinion about the subject: AMD actually does _not_ care
           | about GPU compute on consumer hardware.
           | 
           | If they did, they would not have:
           | 
           | - Supported only Linux, with no ROCm on Windows
           | 
           | - Made ROCm target specific GPUs, with no IR like PTX to make
           | your current *binary* run on future GPU generations
           | 
           | - Dropped support for GCN2/3
           | (https://github.com/RadeonOpenCompute/ROCm/issues/1353#issuec...),
           | making the _only_ supported consumer GPU generation Vega, with
           | no support for RDNA/RDNA2.
           | 
           | They obviously don't care about the market as they should,
           | despite anything they might or might not say. Nothing to
           | see here... It's only and solely their own fault that NVIDIA
           | is the only option.
           | 
           | Intel so far has a much more competent strategy around GPU
           | computing, and might prove to be an actual competitor. I've
           | written off AMD as a possible competitor to NVIDIA for GPU
           | computing a long time ago.
        
             | dogma1138 wrote:
             | They dropped GCN2/3 support before ROCm was even anywhere
             | near production ready, but the biggest issue is the lack of
             | cross-platform support, as you mentioned: you can't even use
             | it in Windows containers. They also never even attempted to
             | support APUs.
             | 
             | Intel not only has better OpenCL support but will come out
             | of the gate with oneAPI that will support all Intel GPUs.
             | This means productivity applications could use it, whether
             | it's for a laptop or for a future productivity workstation
             | with an Intel discrete GPU.
             | 
             | It's pretty much impossible to buy a laptop with an AMD GPU
             | and run ROCm on it. Even the discrete cards are not
             | officially supported, and since the ROCm binaries are
             | hardware specific, without official support things tend to
             | be even more broken than they are now.
             | 
             | I really can't understand how AMD could cock it up so
             | badly.
        
       | jonplackett wrote:
       | So how come this still can't be done with AMD GPUs?
        
         | vetinari wrote:
         | See this comment below:
         | https://news.ycombinator.com/item?id=26285380
        
       | martini333 wrote:
       | Eth?
        
         | krick wrote:
         | You wouldn't care about GPU API when mining crypto. You just
         | need more GPUs. There's nothing complicated about the software
         | you need for that.
        
       | teleforce wrote:
       | Fun fact: even Intel's latest lower-end CPUs (Atom, Pentium,
       | Celeron) will now feature the Gen 11 iGPU, much improved over
       | Intel's last-generation built-in GPU [1].
       | 
       | The latest iGPU supports more than 1 TFLOP in GPU performance.
       | This is apparently more than the performance of two years old
       | Nvidia GeForce GT 1030.
       | 
       | [1]https://www.notebookcheck.net/Intel-s-Elkhart-Lake-SoC-
       | will-...
        
         | TomVDB wrote:
         | >This is apparently more than the performance of two years old
         | Nvidia GeForce GT 1030.
         | 
         | Almost 4 years old...
         | 
         | https://www.techpowerup.com/gpu-specs/geforce-gt-1030.c2954
         | 
         | "The GeForce GT 1030 is an entry-level graphics card by NVIDIA,
         | launched in May 2017."
        
       | macksd wrote:
       | Love this concept and hope this does well or something like it
       | becomes a de facto standard. There's an explosion of software
       | using GPUs. In the DL space alone, PyTorch and TensorFlow and
       | many frameworks that compete with them, plug-in to them, etc.
       | There's also an explosion of DL hardware coming. We have NVIDIA's
       | whole product line, competing GPUs, TPUs, Inferentia, a bunch of
       | start ups... That compatibility matrix is going to be insane. You
       | need a good integration layer between them. If everyone can
       | support CUDA, great. But there's also OpenCL and other competing
       | standards. They're all going to have some performance trade-offs,
       | but some of that must be worth it to keep some semblance of a
       | common framework, otherwise small and new players don't stand
       | much of a chance.
        
         | tachyonbeam wrote:
         | I wish PyTorch would utilize CPUs more effectively out of the
         | box. The last time I tried to run some DNN training on CPU
         | (last summer), I was disappointed to find that only one single
         | core was used on my Ryzen machine. Yes, CPUs don't have as much
         | memory bandwidth or compute as GPUs, but this is still leaving
         | a lot of performance on the table.
        
           | rrss wrote:
           | pytorch can use more than one core.
        
       | keyle wrote:
       | On top of HN... I get excited to use it with Blender then scroll
       | through a plethora of "awesomeness" text to find this
       | 
       | > Warning: this is a very incomplete proof of concept. It's
       | probably not going to work with your application. ZLUDA currently
       | works only with applications which use CUDA Driver API or
       | statically-linked CUDA Runtime API - dynamically-linked CUDA
       | Runtime API is not supported at all.
       | 
       | Come on, man. You can't call it a drop in replacement if it's an
       | incomplete proof of concept.
        
         | JosephRedfern wrote:
         | I don't think it's unfair to call it a drop in replacement,
         | though the current limitations should perhaps be caveated a
         | little more clearly.
         | 
         | I think calling it a drop in replacement highlights that the
         | aim of Zluda is to not require recompilation of the software
         | being run. Maybe "Proof of concept drop in replacement" would
         | be a better description?
        
         | rvz wrote:
         | > Come on, man. You can't call it a drop in replacement if it's
         | an incomplete proof of concept.
         | 
         | I was nearly put off by that but maybe it should be given a
         | chance with some funding or what not. All great and interesting
         | things come from proofs of concept that actually work.
         | 
         | At least it isn't vapourware. Unlike some 'projects' I see on
         | GitHub.
        
           | rvz wrote:
           | Downvoters: So this 'proof of concept' on GitHub is in fact
           | somehow _'vapourware'_, and it shows an empty repository
           | with zero functioning working code? I'm looking at the same
           | repository as everyone else in this HN post.
           | 
           | Literally the top comment is even suggesting that Intel
           | _could_ be interested in funding this, so surely it deserves
           | a chance with some backing, even if it is _'incomplete'_.
           | 
           | Care to explain the downvotes?
        
         | syspec wrote:
         | That's the goal, even if not there yet
        
           | bottled_poe wrote:
           | What does this mean? Is the statement in the article true
           | or not?
        
       | en4bz wrote:
       | > Tying to the previous point, currently ZLUDA does not support
       | asynchronous execution. This gives us an unfair advantage in a
       | benchmark like GeekBench. GeekBench exclusively uses CUDA
       | synchronous APIs
       | 
       | Any "professional" application solely uses async APIs, so while
       | these numbers may look impressive something like tensorflow or
       | pytorch would either not run or be incredibly slow.
        
       | anovikov wrote:
       | can i use ffmpeg with h264_nvenc to encode videos with it?
        
         | wmf wrote:
         | Video encoding uses different hardware and different APIs than
         | GPGPU. FFMPEG can use Intel, AMD, and Nvidia hardware encoding;
         | see https://trac.ffmpeg.org/wiki/HWAccelIntro
        
         | anovikov wrote:
         | what's wrong about my question? yes i know it makes little
         | sense because every motherboard with Intel GPU probably has a
         | CPU with QuickSync Video support and that will most certainly
         | work better for encoding (h264_qsv), but in theory?
        
       | dogma1138 wrote:
       | How does this work with CUDA libraries? Does it have an
       | implementation of the standard libraries at least?
        
       | trevortheblack wrote:
       | I was working at Intel on integrated graphics less than a year
       | ago.
       | 
       | Geekbench could, at best, be described as a _fine_ benchmark. But
       | it's pretty terrible in a bunch of respects.
       | 
       | I would love to see more benchmarks run through with this.
        
       | maleadt wrote:
       | Impressive! The PTX to SPIR-V compiler must have been quite a bit
       | of work; what's the coverage of the ISA like?
       | 
       | With oneAPI I had hoped to get the inverse, a oneAPI
       | implementation for NVIDIA hardware, but I don't think the CUDA
       | driver API is low-level enough to do so (e.g. explicit vs global
       | contexts). And yes, I know of Codeplay's implementation of DPC++
       | for NVIDIA GPUs, but that doesn't implement oneAPI Level0 APIs so
       | is not usable for other languages.
        
       | villgax wrote:
       | Intel should back the crap outta this project/dev
        
       | sciprojguy wrote:
       | I'm stoked for this to be usable on things like PyTorch.
        
         | rrss wrote:
         | The author has a comment here describing what that would take:
         | https://github.com/vosen/ZLUDA/issues/17#issuecomment-735403...
         | 
         | tl;dr: someone would need to re-implement cuDNN
        
       | varispeed wrote:
       | This looks like something Intel should be paying good money for,
       | but I feel like they are just going to "embrace" open source and
       | snatch it without giving a penny to the author.
        
         | jcranmer wrote:
         | Disclosure: I work at Intel.
         | 
         | At one of the internal Q&As, there was a question as to why we
         | didn't just implement CUDA. One of the reasons given was that
         | the lawyers looked at the license of CUDA and decided that
         | Intel could not legally implement CUDA for Intel's GPU devices.
         | I don't know the details, but quite frankly, it wouldn't
         | surprise me if Nvidia didn't somehow put a poison pill in there
         | to prevent Intel or AMD from implementing it for their own GPUs
         | (note that AMD also doesn't provide an implementation of CUDA
         | for its own GPUs).
         | 
         | Instead, the strategy Intel pursued was to develop a migration
         | tool from CUDA to Sycl:
         | https://software.intel.com/content/www/us/en/develop/tools/o...
        
           | The_rationalist wrote:
           | Why not collaborate with AMD for once and improve HIP? If AMD
           | and Intel stay divided all hope is lost and nvidia will stay
           | the main target
        
             | my123 wrote:
             | Because HIP/ROCm is awful.
             | 
             | Instead of targeting an IR, it directly targets a given
             | GPU's ISA, so that your existing binary will not run on
             | future hardware. That's a total no-go for basically every
             | non-HPC use case.
             | 
             | Intel is much better off building a sound technical
             | foundation from scratch.
        
               | The_rationalist wrote:
               | You better make Zluda work well on AMD gpus then
               | otherwise it will be irrelevant
        
               | randomNumber7 wrote:
               | And it doesn't even work on most consumer AMD graphic
               | cards^^
        
           | eslaught wrote:
           | AMD implemented HIP, which is nearly CUDA (if not identical).
           | There is an implementation for Intel too though it is third-
           | party:
           | 
           | https://github.com/cpc/hipcl
        
             | lumost wrote:
             | The problem here seems to be that everyone kinda wants
             | Nvidia's market capture. Intel didn't contribute to AMD's
             | project - they started their own.
        
               | stefan_ wrote:
               | And they are all approaching it the wrong way, the
               | typical hubris of a hardware company that thinks they can
               | just have some interns make a software solution for their
               | problem.
               | 
               | Here is how you beat the CUDA lock-in: _consistently make
               | better performing GPUs so not using them is a liability_.
               | Instead, buying AMD, you not only get a worse GPU but also
               | the intern software solution, and that is just not
               | compelling.
        
               | tal8d wrote:
               | Or AMD could simply knock off the binary blobs. The DRM
               | excuse has always been weak, because it is a tiny
               | fraction of the total firmware blob - and it'd be easy to
               | make it so that the legally hobbled hardware decoder
               | simply errors out in cases where the end user chooses not
               | to load the DRM blob. Boom, no more dependence on AMD
               | interns. I've been following their commit logs pretty
               | closely for a year now, and I frequently see some amazing
               | accidental admissions about the left hand (software team)
               | not knowing what the right hand (hardware team) is doing.
               | During one very frustrating series of patches it was
               | difficult resisting the impulse to say "Go get your dad."
        
         | numlock86 wrote:
         | You make it sound like that's an issue with Intel while the
         | license is to blame.
        
           | krick wrote:
           | So, basically this project is illegal and waiting to be
           | removed from Github then?
        
             | qiqitori wrote:
             | (IANAL.)
             | 
             | Without looking into the CUDA licenses (who knows, they
             | might even expressly allow this kind of thing, but seems
             | pretty unlikely to me), I'd expect this to be a case of
             | whether "APIs are copyrightable" or not, same as the famous
             | and sort of still ongoing https://en.wikipedia.org/wiki/Goo
             | gle_LLC_v._Oracle_America,_....
             | 
             | The US courts said "yes" in this case (note: 100% stupid
             | IMO), but I'm not confident that nVidia'd have an easy win
             | if they decided to sue the developer, and I'm also
             | relatively sure they wouldn't send a DMCA request at this
             | stage (and that if they did, their request would probably
             | be reviewed harder than normal).
        
               | jcranmer wrote:
               | That's not the most accurate summary of Google v Oracle.
               | The case has been tried twice, Google has been found in
               | the clear twice at the district court by the jury, and
               | the Court of Appeals for the Federal Circuit has twice
               | overturned the jury result, and the case is now at the
               | Supreme Court awaiting a decision as to whether or not
               | CAFC's decision is off its rocker.
               | 
               | It is not usual for CAFC to hear copyright disputes; that
               | it was appealed to CAFC instead of the 9th Circuit is
               | because there was a patent claim at one point, and CAFC
               | should have followed 9th Circuit precedent on the matter.
               | Google contends that 9th Circuit precedent holds that the
               | API is not copyrightable, which means that CAFC erred in
               | ignoring precedent. Most software companies ultimately
               | agree with Google here, not Oracle: it's telling that
               | most of the amici who side with Oracle are _not_ software
               | companies but media publishers (e.g., MPAA, RIAA).
        
               | iforgotpassword wrote:
               | I think you could argue from the interop side which would
               | make such a project legal in Europe. So Intel or AMD
               | could funnel money into this via their European branches
               | or some subsidiary?
        
               | sodality2 wrote:
               | That link is missing a trailing period by the way
               | 
               | https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_Americ
               | a,_....
               | 
               | Edit: huh? I pasted it in with the period and it gets
               | removed. Do trailing periods get removed because it
               | thinks it's the end of a sentence?
               | 
               | Anyway, this redirects to the right URL:
               | https://en.wikipedia.org/wiki/Google_v_Oracle
        
               | monocasa wrote:
               | Yeah, HN strips trailing periods for the reason you
               | stated. You can normally get around that by adding a
               | hash/pound sign.
               | 
               | https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_Americ
               | a,_...
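
The trailing-period behaviour discussed above can be illustrated with a small sketch. This is a hypothetical linkifier, not Hacker News's actual code: the regex, the `TRAILING_PUNCT` set, and the `extract_url` helper are assumptions made for illustration, and the example.org URLs are placeholders. It assumes the linkifier matches a run of non-space characters and then drops sentence punctuation from the end, which would explain why appending `#` preserves the period.

```python
import re

# Hypothetical linkifier sketch; NOT Hacker News's real implementation.
# Assumption: match a URL greedily, then strip trailing sentence punctuation.
URL_RE = re.compile(r"https?://\S+")
TRAILING_PUNCT = ".,;:!?"


def extract_url(text):
    """Return the first URL in text with trailing punctuation stripped, or None."""
    match = URL_RE.search(text)
    if match is None:
        return None
    return match.group(0).rstrip(TRAILING_PUNCT)


# A URL that legitimately ends in a period loses it...
plain = extract_url("See https://example.org/wiki/Foo_v._Bar.")
# ...but an empty fragment ("#") after the period shields it from stripping.
with_hash = extract_url("See https://example.org/wiki/Foo_v._Bar.#")

print(plain)
print(with_hash)
```

An empty fragment does not change which page is requested, which is why the workaround is harmless.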
        
             | barkingcat wrote:
             | yes. I am not a lawyer but nvidia's lawyers would be
             | interested I bet.
        
         | krick wrote:
         | This makes me think: why doesn't Intel actually care? There's a
         | lot of money poured into marketing, and today it's always about
         | "AI-optimized multi-core nano-whateverer GPU", so isn't it
         | obvious making your shitty GPUs compatible with software
         | written for competition's less shitty GPUs is probably even
         | more profitable in the 5-year run than trying to make your GPUs
         | a bit less trash? Yes, Nvidia wouldn't want that, so it won't
         | be easy to make your product CUDA-compatible, but not
         | impossible. Is it illegal for them to do, or what? We all are
         | kind of accustomed to the idea that Intel doesn't care, but why
         | don't they care? It is money lying around waiting to be taken,
         | and they just ignore it.
        
       | pjmlp wrote:
       | > Is ZLUDA a drop-in replacement for CUDA?
       | >
       | > Yes, but certain applications use CUDA in ways which make it
       | > incompatible with ZLUDA
       | 
       | I fail to see how that is a drop-in replacement.
       | 
       | Plus apparently it doesn't support the polyglot CUDA ecosystem.
        
       | Abishek_Muthian wrote:
       | Excellent effort. Nvidia has become the de facto GPGPU hardware
       | vendor due to CUDA, but I wish it was OpenCL or another general
       | API instead. Even Raspberry Pi's VideoCore has OpenCL support[1].
       | 
       | But a look at the HW acceleration support table at FFmpeg[2] shows
       | why GPGPU platform APIs are such a mess. But the performance
       | benefits are incredible: using VAAPI for FFmpeg to encode a 1080p
       | 2560x1080 screen capture at 60fps reduces CPU usage from 90% to
       | 10% on an old Core i5 with Intel HD 3000. An old laptop could be
       | perfectly used as an encoding machine for streaming just by using
       | HW acceleration.
       | 
       | What's funny is that the laptop also has Radeon HD 6490M with 1GB
       | GDDR5 dedicated memory and it's not supported by VAAPI for
       | encoding! GPGPU API/Platform Support are astonishingly messy.
       | 
       | [1]https://github.com/doe300/VC4CL
       | 
       | [2]https://trac.ffmpeg.org/wiki/HWAccelIntro
        
         | zamadatix wrote:
         | GPU accelerated video encoding/decoding is done largely via
         | special fixed function hardware not GPGPU so there really isn't
         | a relation between the links and the statements. The reason the
         | Radeon HD 6000 is not supported for hardware accelerated
         | encoding is simply because it did not have a video encoding
         | ASIC to use. The HD 7000 series introduced it to the family.
         | https://en.wikipedia.org/wiki/Radeon_HD_6000_series#Radeon_F...
        
           | marmaduke wrote:
           | > The reason the Radeon HD 6000 is not supported for hardware
           | accelerated encoding is simply because it did not have a
           | video encoding ASIC to use.
           | 
           | Doesn't this prove the broader point that GPGPU cross
           | platform would benefit everyone? A new codec is written.. and
           | everyone gets to use it not just those with fixed function
           | support.
        
             | zamadatix wrote:
             | No, it proves fixed function hardware outperforms general
             | purpose hardware for the task it's made for. Without the
             | fixed function hardware the GPU is worse suited for the
             | task than a general purpose CPU.
        
       | Const-me wrote:
       | > Authors of CUDA benchmarks used CUDA functions atomicInc and
       | atomicDec which have direct hardware support on NVIDIA cards, but
       | no hardware support on Intel cards.
       | 
       | Then how does the InterlockedAdd HLSL intrinsic work on Intel?
       | https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/...
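
For context on the question above: CUDA documents atomicInc as a wrapping increment (the new value is 0 if old >= val, otherwise old + 1) and atomicDec as a wrapping decrement, so neither is a plain atomic add. That may be why an intrinsic like InterlockedAdd is not a direct substitute. The README does not say how ZLUDA emulates them; the sketch below shows only one common approach, a compare-and-swap retry loop, modeled in Python. `AtomicU32` and its methods are hypothetical stand-ins for a memory cell with hardware CAS and are not part of any real GPU API.

```python
import threading


class AtomicU32:
    """Toy stand-in for a 32-bit memory cell that only offers compare-and-swap."""

    def __init__(self, value=0):
        self._value = value & 0xFFFFFFFF
        self._lock = threading.Lock()  # models the hardware's atomicity

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Store new only if the cell holds expected; return the value seen."""
        with self._lock:
            seen = self._value
            if seen == expected:
                self._value = new & 0xFFFFFFFF
            return seen


def atomic_inc(cell, limit):
    """Wrapping increment: new = 0 if old >= limit else old + 1. Returns old."""
    while True:
        old = cell.load()
        new = 0 if old >= limit else old + 1
        # Retry if another thread changed the cell between load and CAS.
        if cell.compare_and_swap(old, new) == old:
            return old


def atomic_dec(cell, limit):
    """Wrapping decrement: new = limit if old == 0 or old > limit else old - 1."""
    while True:
        old = cell.load()
        new = limit if (old == 0 or old > limit) else old - 1
        if cell.compare_and_swap(old, new) == old:
            return old


counter = AtomicU32(0)
seen = [atomic_inc(counter, 3) for _ in range(6)]
print(seen)  # values returned before each increment, wrapping after 3
```

The retry loop is the likely cost the README alludes to: under contention each thread may need several load/CAS round trips where native hardware support would need one operation.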
        
       | xena wrote:
       | Yes please take business away from Nvidia and AMD. I just want to
       | buy a GPU.
        
       | wmu wrote:
       | Trivia: the Polish word "cuda" means "miracles", "zluda"
       | ("złuda") means "a delusion". Nice pun.
        
         | antjanus wrote:
         | I'm Czech and under the dialect where I grew up, "Zluda" could
         | be translated to "evil person/monster" or "mischievous
         | person/monster".
         | 
         | Growing up, people sometimes called their kids "zludy" (plural
         | of "zluda" in my language) -- or trouble-makers.
        
           | hackyhacky wrote:
           | Interesting. Possibly a variant of the standard word zruda?
        
             | antjanus wrote:
             | Yeah, could be! I grew up close to the polish/slovak/czech
             | border. I bet some of those words got mixed together.
        
         | neolog wrote:
         | Off topic...I heard that a common Polish swear-word is
         | "cholera", because the disease is so bad. Is that true?
        
           | ols wrote:
           | It is.
        
       ___________________________________________________________________
       (page generated 2021-02-27 23:01 UTC)