https://discourse.llvm.org/t/rfc-add-xegpu-dialect-for-intel-gpus/75723
LLVM Discussion Forums
[RFC] Add XeGPU dialect for Intel GPUs
MLIR

Jianhui-Li December 16, 2023, 1:11am #1

Motivation

To support high-performance GEMM code generation on Intel GPUs, we propose the XeGPU dialect. The XeGPU dialect provides an abstraction that closely models Xe instructions. XeGPU ops are introduced when a special Xe instruction can't be expressed by the LLVM/SPIR-V dialects, for example the matrix instruction (a.k.a. DPAS) and the 2D block load. The dialect matches the hardware instructions' semantics, including the matrix sizes. The XeGPU dialect is similar to the NVGPU and AMDGPU dialects and works as a bridge dialect providing target-specific operations on MLIR memref and vector data types.

The XeGPU dialect models a subset of the Xe GPU's unique features, focusing on GEMM performance. The operations include 2D load, dpas, atomic, scattered load, 1D load, named barrier, mfence, and compile-hint. These operations provide a minimal set to support a high-performance MLIR GEMM implementation for a wide range of GEMM shapes. The XeGPU dialect complements the Arith, Math, Vector, and Memref dialects, which allows an XeGPU-based MLIR GEMM implementation to be fused with other operations lowered through existing MLIR dialects.

Example

Below is a short example of what it looks like. It creates three tensor descriptors for matrices A, B, and C, followed by a K loop that iteratively loads a block from matrix A and a block from matrix B, does the DPAS, and accumulates into a result vector. After the loop, the result vector is stored to a block of matrix C. The "vc" mode allows the XeGPU op to be lowered to a SPIR-V VC intrinsic with "Intel Vector Compute" mode.

```mlir
%4 = xegpu.create_nd_tdesc %arg2[%2, %3] {mode = vc} : memref<1024x1024xf32> -> !xegpu.tensor_desc<8x16xf32>
%5 = xegpu.load_nd %4 {mode = vc} : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
%7 = xegpu.create_nd_tdesc %arg0[%2, %c0] {mode = vc} : memref<1024x1024xf16> -> !xegpu.tensor_desc<8x16xf16>
%8 = xegpu.create_nd_tdesc %arg1[%c0, %3] {mode = vc} : memref<1024x1024xf16> -> !xegpu.tensor_desc<16x16xf16>
%6:3 = scf.for %arg3 = %c0 to %c1024 step %c16
    iter_args(%arg4 = %5, %subA = %7, %subB = %8)
    -> (vector<8x16xf32>, !xegpu.tensor_desc<8x16xf16>, !xegpu.tensor_desc<16x16xf16>) {
  %9 = xegpu.load_nd %subA {mode = vc, vnni_axis = 1} : !xegpu.tensor_desc<8x16xf16> -> vector<8x8x2xf16>
  %10 = xegpu.load_nd %subB {mode = vc, vnni_axis = 0} : !xegpu.tensor_desc<16x16xf16> -> vector<8x16x2xf16>
  %11 = xegpu.dpas %9, %10, %arg4 {mode = vc} : vector<8x8x2xf16>, vector<8x16x2xf16>, vector<8x16xf32> -> vector<8x16xf32>
  %12 = xegpu.update_nd_offset %subA, [%c0, %c16] {mode = vc} : !xegpu.tensor_desc<8x16xf16> -> !xegpu.tensor_desc<8x16xf16>
  %13 = xegpu.update_nd_offset %subB, [%c16, %c0] {mode = vc} : !xegpu.tensor_desc<16x16xf16> -> !xegpu.tensor_desc<16x16xf16>
  scf.yield %11, %12, %13 : vector<8x16xf32>, !xegpu.tensor_desc<8x16xf16>, !xegpu.tensor_desc<16x16xf16>
}
xegpu.store_nd %6#0, %4 {mode = vc} : vector<8x16xf32>, !xegpu.tensor_desc<8x16xf32>
```
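To make the fusion point above concrete, here is a minimal sketch, not part of the original post, of an element-wise epilogue fused onto the accumulator before the final store. The ReLU and the %zero constant are made up for illustration; the other SSA names refer to the example above.

```mlir
// Hypothetical epilogue: because the accumulator %6#0 is an ordinary vector
// value, it can be post-processed with existing dialects (here a ReLU via
// arith) before the XeGPU store, instead of being stored directly.
%zero = arith.constant dense<0.0> : vector<8x16xf32>
%relu = arith.maximumf %6#0, %zero : vector<8x16xf32>
xegpu.store_nd %relu, %4 {mode = vc} : vector<8x16xf32>, !xegpu.tensor_desc<8x16xf32>
```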
Reference

XeGPU has been implemented in the Intel Extension for MLIR GitHub repo. The high-performance XeGPU-based GEMM implementation can be found there, and the test case demonstrated close-to-peak GEMM performance on the Intel Max series. See the XeGPU op definitions for details.

mehdi_amini December 16, 2023, 2:03am #2

Jianhui-Li:
> XeGPU ops are introduced when a special Xe instruction can't be expressed by the LLVM/SPIR-V dialects, for example the matrix instruction (a.k.a. DPAS) and the 2D block load. The dialect matches the hardware instructions' semantics, including the matrix sizes. The XeGPU dialect is similar to the NVGPU and AMDGPU dialects and works as a bridge dialect providing target-specific operations on MLIR memref and vector data types.

I'm not familiar with Xe: is there a set of intrinsics in LLVM like NVVM and AMDGPU? The lowering path isn't clear to me from your description.

JackW December 16, 2023, 4:24am #3

mehdi_amini:
> I'm not familiar with Xe: is there a set of intrinsics in LLVM like NVVM and AMDGPU? The lowering path isn't clear to me from your description.

From the RFC on Intel's MLIR Extensions repo:

> Proposal
>
> The XeGPU dialect models a subset of the Xe GPU's ISA. It is the counterpart of the NVGPU and AMDGPU dialects, which provide a bridge dialect in the MLIR gradual lowering. The XeGPU dialect works with MLIR memref and vector types and complements the Arith, Math, Vector, and Memref dialects. XeGPU operations are introduced when there is a special Xe instruction not modeled by the LLVM/SPIR-V dialects, for example DPAS and the 2D block load. In some cases, one XeGPU op may lower to a sequence of instructions for a dedicated and performance-critical function. For example, create_tdesc is mapped to a fixed sequence of instructions to create an address description.
>
> ...
>
> Notes
>
> Currently, there is no lower-level dialect for the Intel GPU compiler toolchain to represent GPU ops with values based on LLVM data types, the way the NVVM dialect does for the Nvidia GPU compiler toolchain. The XeGPU dialect uses LLVM or SPIR-V intrinsics to access advanced Intel GPU instructions. When the lower-level software changes, we expect the XeGPU lowering passes to change accordingly.

mehdi_amini December 16, 2023, 8:01am #4

Thanks, I am still not sure about:

> ... for the Nvidia GPU compiler toolchain. The XeGPU dialect uses LLVM or SPIR-V intrinsics to access advanced Intel GPU instructions.

Does this mean that LLVM already has intrinsics for XeGPU? (I am trying to picture what this dialect is lowered to upstream.)

banach-space December 16, 2023, 3:59pm #5

Jianhui-Li:
> The XeGPU dialect complements the Arith, Math, Vector, and Memref dialects.

Can you elaborate? How would this dialect compose with other upstream dialects? In particular:

Jianhui-Li:
> The XeGPU dialect models a subset of the Xe GPU's unique features, focusing on GEMM performance.

Presumably you'd like things like linalg.matmul to be lowered to XeGPU? What's the roadmap for that? And would it be possible to have end-to-end tests upstream?

-Andrzej

Jianhui-Li December 16, 2023, 6:44pm #6

Upstream LLVM doesn't have intrinsics for the Xe GPU yet. A XeGPU op will first be lowered to the LLVM dialect as an external function call to an Intel-specific function name, then lowered to LLVM bitcode and translated to a SPIR-V binary. These external function names are recognized by Intel's low-level SW stack (IGC) as intrinsics. The current implementation in the Intel Extension for MLIR GitHub repo is lowered through the SPIR-V dialect, which generates SPIR-V IR directly. But when we upstream, we plan to upstream the LLVM dialect lowering path.

Jianhui-Li December 16, 2023, 7:05pm #7

XeGPU ops interact with the memref and vector data types. Once a tensor address description has been set up from a memref, a 2D block can be loaded from the memref into a vector. With the data loaded into a vector, it can be processed by any other dialect that accepts the vector type.

linalg.matmul would be able to be lowered to XeGPU. The lowering could be gradual: first to a larger-size 2D submatrix, and then down to the 2D block size at the XeGPU level.
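As a rough illustration of that gradual lowering, and not taken from the actual passes, an intermediate step might look like the sketch below. The tile sizes (a 32x64 tile of C per subgroup with a 16-wide K step) are hypothetical, chosen only for the sketch.

```mlir
// Illustration only: a per-subgroup tile of the GEMM expressed with memref
// subviews and a tile-level linalg.matmul, before decomposition to XeGPU.
func.func @tiled_step(%A: memref<1024x1024xf16>, %B: memref<1024x1024xf16>,
                      %C: memref<1024x1024xf32>, %i: index, %j: index) {
  %c0 = arith.constant 0 : index
  %c16 = arith.constant 16 : index
  %c1024 = arith.constant 1024 : index
  scf.for %k = %c0 to %c1024 step %c16 {
    %sa = memref.subview %A[%i, %k] [32, 16] [1, 1]
        : memref<1024x1024xf16> to memref<32x16xf16, strided<[1024, 1], offset: ?>>
    %sb = memref.subview %B[%k, %j] [16, 64] [1, 1]
        : memref<1024x1024xf16> to memref<16x64xf16, strided<[1024, 1], offset: ?>>
    %sc = memref.subview %C[%i, %j] [32, 64] [1, 1]
        : memref<1024x1024xf32> to memref<32x64xf32, strided<[1024, 1], offset: ?>>
    // A later pass would decompose this 32x64x16 tile into 8x16x16 blocks,
    // each mapping onto xegpu.load_nd + xegpu.dpas as in the example above.
    linalg.matmul ins(%sa, %sb : memref<32x16xf16, strided<[1024, 1], offset: ?>>,
                                 memref<16x64xf16, strided<[1024, 1], offset: ?>>)
                  outs(%sc : memref<32x64xf32, strided<[1024, 1], offset: ?>>)
  }
  return
}
```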
Internally we are experimenting with the gradual lowering, and eventually we would like to upstream the dialect/passes out of that experiment. We have an end-to-end XeGPU-based GEMM implementation for 4Kx4K here, and we can upstream that test case to llvm-project/mlir/test/Integration/Dialect as part of the XeGPU lowering pass.