[HN Gopher] C-for-Metal: High Performance SIMD Programming on In...
___________________________________________________________________
C-for-Metal: High Performance SIMD Programming on Intel GPUs
Author : lelf
Score : 53 points
Date : 2021-01-29 09:27 UTC (13 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| raphlinus wrote:
| Another interesting reference from a few years ago:
| http://www.joshbarczak.com/blog/?p=1028
|
| Also read the followups (1120 and 1197), as they go into
| considerably more detail about the SPMD programming model and
| some use cases.
|
| The author is now at Intel working on ray tracing.
| astrange wrote:
| Intel has a previous SPMD compiler here: https://ispc.github.io
|
| Although the author seems to have fled Intel soon after
| releasing it, and apparently spent the whole development
| process terrified that corporate politics would make him cancel
| it.
| einpoklum wrote:
| > The SIMT execution model is commonly used for general GPU
| development. CUDA and OpenCL developers write scalar code that is
| implicitly parallelized by compiler and hardware. On Intel GPUs,
| however, this abstraction has profound performance implications
| as the underlying ISA is SIMD and important hardware capabilities
| cannot be fully utilized
|
| What? That makes no sense.
|
| GPU processor cores are basically just SIMD with a different
| color hat. The SASS assembly simply has _only_ SIMD instructions
| - and with the full instruction set being SIMD'ized, it can drop
| the mention of "this is SIMD" and just pretend individual lanes
| are instruction-locked threads.
|
| So, an OpenCL compiler would do very similar parallelization on a
| GPU and on an Intel CPU. (It's obviously not exactly the same
| since the instruction sets do differ, the widths are not the
| same, and Intel CPUs support several vector widths which could
| all be in flight at the same time, etc.)
|
| So, the hardware capabilities can be utilized just fine.
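|
| To make the "different color hat" point concrete, here is a
| rough sketch (mine, not from the paper) of the same SAXPY
| written once as the scalar per-lane code a SIMT programmer
| sees, and once as the explicit 8-wide SIMD loop the hardware
| effectively runs (AVX2 here as a stand-in for a GPU ISA):
|
|   #include <immintrin.h>
|   #include <cstddef>
|
|   // What a CUDA/OpenCL work-item looks like: scalar code.
|   inline float saxpy_lane(float a, float x, float y) {
|       return a * x + y;
|   }
|
|   // What actually executes: one SIMD instruction per 8 lanes.
|   void saxpy_simd(float a, const float* x, const float* y,
|                   float* out, std::size_t n) {
|       const __m256 va = _mm256_set1_ps(a);
|       std::size_t i = 0;
|       for (; i + 8 <= n; i += 8) {
|           __m256 vx = _mm256_loadu_ps(x + i);
|           __m256 vy = _mm256_loadu_ps(y + i);
|           __m256 vr = _mm256_fmadd_ps(va, vx, vy);
|           _mm256_storeu_ps(out + i, vr);
|       }
|       for (; i < n; ++i)  // scalar tail
|           out[i] = saxpy_lane(a, x[i], y[i]);
|   }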
| 37ef_ced3 wrote:
| Domain-specific compilers that generate explicit SIMD code from a
| high-level specification are even nicer. These can fully exploit
| the capabilities of the instruction set (e.g., fast permutes,
| masking, reduced precision floats, large register file, etc.) for
| a particular domain
|
| For example, generating AVX-512 code for convnet inference:
| https://NN-512.com
|
| NN-512 does four simultaneous 8x8 Winograd tiles (forward and
| backward) in the large AVX-512 register file, accelerates strided
| convolutions by interleaving Fourier transforms (again, with
| knowledge of the large register file), makes heavy use of the
| two-input VPERMI2PS permutation instruction, generates simplified
| code with precomputed masks around tensor edges, uses
| irregular/arbitrary tiling patterns, etc. It generates code like
| this:
|
| https://nn-512.com/example/11
|
| This kind of compiler can be written for any important domain
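|
| As a rough illustration of the precomputed-mask idea (a sketch
| of mine, not NN-512's actual output): AVX-512 mask registers
| let the ragged edge of a tensor row be handled without a
| scalar tail loop.
|
|   #include <immintrin.h>
|   #include <cstddef>
|
|   // Scale one row whose width w need not be a multiple of 16.
|   void scale_row(float* row, std::size_t w, float s) {
|       const __m512 vs = _mm512_set1_ps(s);
|       std::size_t i = 0;
|       for (; i + 16 <= w; i += 16) {
|           __m512 v = _mm512_loadu_ps(row + i);
|           _mm512_storeu_ps(row + i, _mm512_mul_ps(v, vs));
|       }
|       // Edge mask, precomputable from w: low (w - i) bits set.
|       const __mmask16 m = (__mmask16)((1u << (w - i)) - 1u);
|       __m512 v = _mm512_maskz_loadu_ps(m, row + i);
|       _mm512_mask_storeu_ps(row + i, m, _mm512_mul_ps(v, vs));
|   }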
| the_optimist wrote:
| This is great, but it doesn't address GPUs. If you built it for
| GPUs, from what I understand, that outcome would basically look
| like tensorflow, or maybe tensorflow XLA. Is that right?
| 37ef_ced3 wrote:
| My point is that a less general compiler can yield better
| SIMD code for a particular domain, and be easier to use for a
| particular domain. And I gave a concrete illustration
| (NN-512) to support that claim
|
| Consider NN-512 (less general) versus Halide (more general).
| Imagine how hard it would be to make Halide generate the
| programs that NN-512 generates. It would be a very
| challenging problem
| the_optimist wrote:
| Understood: NN-512 is a local optimum in an optimization of
| hardware and problem structure.
| the_optimist wrote:
| Compiling from high-level lang to GPU is a huge problem, and we
| greatly appreciate efforts to solve it.
|
| If I understand correctly, this (CM) allows for C-style fine-
| level control over a GPU device as though it were a CPU.
|
| However, it does not appear to address data transit (critical for
| performance). Compilation and operator fusion to minimize transit
| are possibly more important. See Graphcore Poplar, TensorFlow XLA,
| Arrayfire, Pytorch Glow, etc.
|
| Further, this obviously only applies to Intel GPUs, so investing
| time in utilizing low-level control is possibly a hardware dead-
| end.
|
| The dream world for programmers is one where data transit and
| hardware architecture are taken into account without living
| inside a proprietary DSL. Conversely, it is obviously against
| hardware manufacturers' interests to create this.
|
| Is MLIR / LLVM going to solve this? This list has been
| interesting to consider:
|
| https://github.com/merrymercy/awesome-tensor-compilers
| baybal2 wrote:
| > is possibly a hardware dead-end.
|
| I'm thinking the opposite: there has been an unending
| succession of different accelerators for doing this, which
| were eventually obsoleted and forgotten once general-purpose
| CPUs caught up to them in performance, or once comp-sci
| learned how to do the calculations more efficiently on
| mainstream hardware.
|
| Just from seeing how moribund sales of the new "NPUs" are, I
| can guess it's already happening.
|
| A number of cellphone brands experimented with them to run
| selfie filters or do speech recognition, but later found that
| those workloads run no worse on CPUs if competent programmers
| are hired, and then threw the NPU hardware out or stopped
| using it.
| banachtarski wrote:
| I'm not a hardware engineer, but I am a GPU-focused graphics
| engineer.
|
| > C-style fine-level control over a GPU device as though it
| were a CPU.
|
| Personally, I think this is a fool's errand, and this has
| nothing to do with my desire for job security or anything. When
| I look at how code in the ML world is written for a GPU for
| example, it's really easy to see why it's so slow. The CPU and
| GPU architectures are fundamentally different. Different
| pipelining architecture, scalar instead of vector, 32/64-wide
| instruction dispatches, etc. HLSL/GLSL and other such shader
| languages are perfectly "high level", with the intrinsics needed
| to perform warp-level barriers, wave broadcasts/ballots/queries,
| use LDS storage, execute device-level barriers, etc. This isn't
| to say that high-level shader
| language improvements aren't welcome, but that trying to
| emulate a CPU is an unfortunate goal.
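|
| For the C++ crowd, a rough sketch of what those operations look
| like when surfaced in SYCL/DPC++ (the kernel and its name are
| mine; the calls are standard SYCL 2020): even a "single-source
| C++" model has to expose LDS, group barriers and wave-level
| reductions explicitly. Assumes n is a multiple of the
| work-group size wg.
|
|   #include <sycl/sycl.hpp>
|
|   void group_sums(sycl::queue& q, const float* in, float* out,
|                   size_t n, size_t wg) {
|     q.submit([&](sycl::handler& h) {
|       // Work-group local memory: the LDS/groupshared analogue.
|       sycl::local_accessor<float, 1> lds({wg}, h);
|       h.parallel_for(sycl::nd_range<1>(n, wg),
|                      [=](sycl::nd_item<1> it) {
|         auto sg = it.get_sub_group();
|         // Wave ("sub-group") reduction, no LDS traffic.
|         float s = sycl::reduce_over_group(
|             sg, in[it.get_global_id(0)], sycl::plus<float>());
|         if (sg.leader())              // one lane per wave
|           lds[sg.get_group_id()[0]] = s;
|         // Work-group barrier before reading LDS.
|         sycl::group_barrier(it.get_group());
|         if (it.get_local_id(0) == 0) {
|           float total = 0.f;
|           for (size_t j = 0; j < sg.get_group_range()[0]; ++j)
|             total += lds[j];
|           out[it.get_group(0)] = total;
|         }
|       });
|     });
|   }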
| mpweiher wrote:
| What kinds of improvements would you like to see?
| moonbug wrote:
| ain't no one gonna use that.
| skavi wrote:
| Why are Intel GPUs designed in such a way that typical GPU
| languages don't fully exploit it? Is the new Xe architecture
| still SIMD?
| dragontamer wrote:
| OpenCL works on Intel GPUs, while CUDA doesn't because CUDA is
| an NVidia technology.
|
| > Is the new Xe architecture still SIMD?
|
| SIMD is... pretty much all GPUs do. There's a few scalar bits
| here and there to speed up if-statements and the like, but the
| entire point of a GPU is to build a machine for SIMD.
| astrange wrote:
| GPUs don't need to have SIMD instructions; if you give one a
| fully scalar program it just needs to run a lot of copies of
| it at once. Every architecture is different here, including
| within the same vendor.
| dragontamer wrote:
| > GPUs don't need to have SIMD instructions;
|
| Except NVidia Ampere (RTX 3xxx series) and AMD RDNA2 (Navi
| / 6xxx series) are both SIMD architectures with SIMD-
| instructions.
|
| And the #3 company: Intel, also has SIMD instructions. I
| know that some GPUs out there are VLIW or other weird
| architectures, but... the "big 3" are SIMD-based.
|
| > if you give one a fully scalar program it just needs to
| run a lot of copies of it at once.
|
| It's emulated on a SIMD processor. That SIMD processor will
| suffer branch divergence as you traverse through if-
| statements and while-statements, because it's physically
| SIMD.
|
| The compiler / programming model is scalar. But the
| assembly instructions are themselves vector. Yeah, NVidia
| now has per-SIMD core instruction pointers. But that
| doesn't mean that the hardware can physically execute
| different instructions: they're still all locked together,
| SIMD-style, at the physical level.
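|
| A rough sketch of what that lockstep execution means for an
| if/else (AVX2 used as a stand-in; GPU ISAs do the same thing
| with execution masks): both sides run for all lanes, and a
| per-lane mask selects the result.
|
|   #include <immintrin.h>
|
|   // The scalar-looking  y = (x > 0.f) ? sqrtf(x) : -x;
|   // becomes: evaluate BOTH sides for all 8 lanes, then blend.
|   __m256 branchy(__m256 x) {
|       __m256 m = _mm256_cmp_ps(x, _mm256_setzero_ps(),
|                                _CMP_GT_OQ);
|       __m256 t = _mm256_sqrt_ps(x);          // "then" side
|       __m256 e = _mm256_sub_ps(_mm256_setzero_ps(), x); // else
|       // Inactive-lane results (NaNs from sqrt) are discarded.
|       return _mm256_blendv_ps(e, t, m);
|   }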
| raphlinus wrote:
| That's partly true, but there are exceptions, of which the
| subgroup operations are the most obvious. These are roughly
| similar to broadcast and permute SIMD instructions, and in
| some cases can lead to dramatic speedups.
| oivey wrote:
| OpenCL is basically dead at this point, too. The de facto
| standard is CUDA and there aren't currently any real
| challengers. Maybe eventually AMD's ROCm or Intel's oneAPI
| will get traction.
| pjmlp wrote:
| For them to get traction, they need to invest in debugger
| tooling that allows the same productivity as on CPUs, and to
| help language communities other than C and C++ to target
| GPGPUs.
|
| NVidia started doing both around CUDA 3.0, whereas Khronos,
| AMD and Intel only noticed that not everyone wanted to do
| printf()-style debugging with a C dialect when it was already
| too late to win people's attention back.
| profquail wrote:
| oneAPI uses DPC++ (Data-Parallel C++), which is pretty much
| just SYCL, which itself is a C++ abstraction layer over OpenCL.
|
| From my understanding, the Khronos group realized OpenCL
| 2.x was much too complicated so vendors just weren't
| implementing it, or only implementing parts of it, so they
| came up with OpenCL 3.0 which is slimmed-down and much more
| modular. It's hard to say how much adoption it'll get, but
| with Intel focused on DPC++ and oneAPI now, there will
| definitely be more numerical software coming out in the
| next few years that compiles down to and runs on OpenCL.
|
| For example, Intel engineers are building a numpy clone on
| top of DPC++, so unlike regular numpy it'll take advantage
| of multiple CPU cores: https://github.com/IntelPython/dpnp
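|
| For anyone who hasn't seen it, a minimal DPC++/SYCL sketch
| (assumptions: a SYCL 2020 compiler such as Intel's dpcpp, and
| whatever default device the runtime picks) of what this
| "just C++" model looks like:
|
|   #include <sycl/sycl.hpp>
|   #include <cstdio>
|
|   int main() {
|     sycl::queue q;                    // default device
|     constexpr size_t n = 1 << 20;
|     float* a = sycl::malloc_shared<float>(n, q);
|     float* b = sycl::malloc_shared<float>(n, q);
|     for (size_t i = 0; i < n; ++i) { a[i] = 1.f; b[i] = 2.f; }
|     // The kernel is an ordinary C++ lambda.
|     q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
|       a[i] += b[i];
|     }).wait();
|     std::printf("a[0] = %f\n", a[0]);
|     sycl::free(a, q);
|     sycl::free(b, q);
|   }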
| astrange wrote:
| > From my understanding, the Khronos group realized
| OpenCL 2.x was much too complicated so vendors just
| weren't implementing it, or only implementing parts of
| it, so they came up with OpenCL 3.0 which is slimmed-down
| and much more modular.
|
| Something like this also happened to OpenGL 4.3. It added
| a compute shader extension which was essentially all of
| OpenCL again, except different, so you had 2x the
| implementation work. This is about when some people
| stopped implementing OpenGL.
| TazeTSchnitzel wrote:
| OpenGL compute shaders are a natural step if you have
| unified programmable shader cores, and less complicated
| than adding new pipeline stages for everything
| (tessellation shaders, geometry shaders, ...).
|
| Khronos could have chosen only to add OpenCL integration,
| but OpenCL C is a very different language to GLSL, the
| memory model (among other things) is different, and so
| on. I don't see why video game developers should be
| forced to use OpenCL when they want to work with the
| outputs of OpenGL rendering passes, produce inputs to
| OpenGL rendering passes, schedule the work in OpenGL, and
| do things that don't fit neatly into vertex or fragment
| shaders.
| pjmlp wrote:
| Kind of right.
|
| DPC++ has more stuff than just SYCL; some of it might
| find its way back into SYCL standardization, some of it
| might remain Intel only.
|
| OpenCL 3.0 is basically OpenCL 1.2 with a new name.
|
| Meanwhile people are busy waiting for Vulkan compute to
| take off; got to love Khronos standards.
| TazeTSchnitzel wrote:
| Some people are working on being able to run OpenCL
| kernels on Vulkan: https://github.com/google/clspv
| pjmlp wrote:
| Sure, but will it take off enough to actually matter?
|
| So far I am only aware of Adobe using it to port their
| shaders to Vulkan on Android.
___________________________________________________________________
(page generated 2021-01-29 23:00 UTC)