https://discourse.llvm.org/t/rfc-add-xegpu-dialect-for-intel-gpus/75723
LLVM Discussion Forums
[RFC] Add XeGPU dialect for Intel GPUs
MLIR

Jianhui-Li December 16, 2023, 1:11am #1

Motivation

To support high-performance GEMM code generation on Intel GPUs, we propose the XeGPU dialect. The XeGPU dialect provides an abstraction that closely models Xe instructions. XeGPU ops are introduced when a special Xe instruction can't be expressed by the LLVM/SPIR-V dialects, for example the matrix instruction (a.k.a. DPAS) and the 2D block load. The dialect matches the hardware instructions' semantics, including the matrix sizes. The XeGPU dialect is similar to the NVGPU and AMDGPU dialects and works as a bridge dialect providing target-specific operations on MLIR memref and vector data types.

The XeGPU dialect models a subset of the Xe GPU's unique features, focusing on GEMM performance. The operations include 2D load, dpas, atomic, scattered load, 1D load, named barrier, mfence, and compile-hint. These operations provide a minimal set to support a high-performance MLIR GEMM implementation for a wide range of GEMM shapes. The XeGPU dialect complements the Arith, Math, Vector, and Memref dialects, which allows an XeGPU-based MLIR GEMM implementation to be fused with other operations lowered through existing MLIR dialects.

Example

Below is a short example of what it looks like. It creates three tensor descriptors for matrices A, B, and C, followed by a K loop that iteratively loads a block from matrix A and a block from matrix B, does the DPAS, and accumulates into a result vector. After the loop, the result vector is stored to a block of matrix C. The "vc" mode allows the XeGPU op to be lowered to a SPIR-V VC intrinsic with "Intel Vector Compute" mode.

```mlir
%4 = xegpu.create_nd_tdesc %arg2[%2, %3] {mode = vc} : memref<1024x1024xf32> -> !xegpu.tensor_desc<8x16xf32>
%5 = xegpu.load_nd %4 {mode = vc} : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
%7 = xegpu.create_nd_tdesc %arg0[%2, %c0] {mode = vc} : memref<1024x1024xf16> -> !xegpu.tensor_desc<8x16xf16>
%8 = xegpu.create_nd_tdesc %arg1[%c0, %3] {mode = vc} : memref<1024x1024xf16> -> !xegpu.tensor_desc<16x16xf16>
%6:3 = scf.for %arg3 = %c0 to %c1024 step %c16
    iter_args(%arg4 = %5, %subA = %7, %subB = %8)
    -> (vector<8x16xf32>, !xegpu.tensor_desc<8x16xf16>, !xegpu.tensor_desc<16x16xf16>) {
  %9 = xegpu.load_nd %subA {mode = vc, vnni_axis = 1} : !xegpu.tensor_desc<8x16xf16> -> vector<8x8x2xf16>
  %10 = xegpu.load_nd %subB {mode = vc, vnni_axis = 0} : !xegpu.tensor_desc<16x16xf16> -> vector<8x16x2xf16>
  %11 = xegpu.dpas %9, %10, %arg4 {mode = vc} : vector<8x8x2xf16>, vector<8x16x2xf16>, vector<8x16xf32> -> vector<8x16xf32>
  %12 = xegpu.update_nd_offset %subA, [%c0, %c16] {mode = vc} : !xegpu.tensor_desc<8x16xf16> -> !xegpu.tensor_desc<8x16xf16>
  %13 = xegpu.update_nd_offset %subB, [%c16, %c0] {mode = vc} : !xegpu.tensor_desc<16x16xf16> -> !xegpu.tensor_desc<16x16xf16>
  scf.yield %11, %12, %13 : vector<8x16xf32>, !xegpu.tensor_desc<8x16xf16>, !xegpu.tensor_desc<16x16xf16>
}
xegpu.store_nd %6#0, %4 {mode = vc} : vector<8x16xf32>, !xegpu.tensor_desc<8x16xf32>
```
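To make the fusion point above concrete, here is a minimal sketch, not part of the original post, of an element-wise epilogue fused onto the accumulator before the final store. The ReLU and the %zero constant are made up for illustration; the other SSA names refer to the example above.

```mlir
// Hypothetical epilogue: because the accumulator %6#0 is an ordinary vector
// value, it can be post-processed with existing dialects (here a ReLU via
// arith) before the XeGPU store, instead of being stored directly.
%zero = arith.constant dense<0.0> : vector<8x16xf32>
%relu = arith.maximumf %6#0, %zero : vector<8x16xf32>
xegpu.store_nd %relu, %4 {mode = vc} : vector<8x16xf32>, !xegpu.tensor_desc<8x16xf32>
```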
Reference

XeGPU has been implemented in the Intel Extension for MLIR GitHub repo. The high-performance XeGPU-based GEMM implementation can be found there, and the test case demonstrated close-to-peak GEMM performance on the Intel Max series. See the XeGPU op definitions for details.

mehdi_amini December 16, 2023, 2:03am #2

Jianhui-Li:
> XeGPU ops are introduced when a special Xe instruction can't be expressed by the LLVM/SPIR-V dialects, for example the matrix instruction (a.k.a. DPAS) and the 2D block load. The dialect matches the hardware instructions' semantics, including the matrix sizes. The XeGPU dialect is similar to the NVGPU and AMDGPU dialects and works as a bridge dialect providing target-specific operations on MLIR memref and vector data types.

I'm not familiar with Xe: is there a set of intrinsics in LLVM like NVVM and AMDGPU? The lowering path isn't clear to me from your description.

JackW December 16, 2023, 4:24am #3

mehdi_amini:
> I'm not familiar with Xe: is there a set of intrinsics in LLVM like NVVM and AMDGPU? The lowering path isn't clear to me from your description.

From the RFC on Intel's MLIR Extensions repo:

> Proposal
>
> The XeGPU dialect models a subset of the Xe GPU's ISA. It is the counterpart of the NVGPU and AMDGPU dialects, which provide a bridge dialect in the MLIR gradual lowering. The XeGPU dialect works with MLIR memref and vector types and complements the Arith, Math, Vector, and Memref dialects. XeGPU operations are introduced when there is a special Xe instruction not modeled by the LLVM/SPIR-V dialects, for example DPAS and the 2D block load. In some cases, one XeGPU op may lower to a sequence of instructions for a dedicated and performance-critical function. For example, create_tdesc is mapped to a fixed sequence of instructions to create an address description.
>
> ...
>
> Notes
>
> Currently, there is no lower-level dialect for the Intel GPU compiler toolchain to represent GPU ops with values based on LLVM data types, the way the NVVM dialect does for the Nvidia GPU compiler toolchain. The XeGPU dialect uses LLVM or SPIR-V intrinsics to access advanced Intel GPU instructions. When the lower-level software changes, we expect the XeGPU lowering passes to change accordingly.

mehdi_amini December 16, 2023, 8:01am #4

Thanks, I am still not sure about:

> ... for the Nvidia GPU compiler toolchain. The XeGPU dialect uses LLVM or SPIR-V intrinsics to access advanced Intel GPU instructions.

Does this mean that LLVM already has intrinsics for XeGPU? (I am trying to picture what this dialect is lowered to upstream.)

banach-space December 16, 2023, 3:59pm #5

Jianhui-Li:
> The XeGPU dialect complements the Arith, Math, Vector, and Memref dialects.

Can you elaborate? How would this dialect compose with other upstream dialects? In particular:

Jianhui-Li:
> The XeGPU dialect models a subset of the Xe GPU's unique features, focusing on GEMM performance.

Presumably you'd like things like linalg.matmul to be lowered to XeGPU? What's the roadmap for that? And would it be possible to have end-to-end tests upstream?

-Andrzej

Jianhui-Li December 16, 2023, 6:44pm #6

Upstream LLVM doesn't have intrinsics for the Xe GPU yet. A XeGPU op will first be lowered to the LLVM dialect as an external function call to an Intel-specific function name, then lowered to LLVM bitcode and translated to a SPIR-V binary. These external function names are recognized by Intel's low-level SW stack (IGC) as intrinsics. The current implementation in the Intel Extension for MLIR GitHub repo is lowered through the SPIR-V dialect, which generates SPIR-V IR directly. But when we upstream, we plan to upstream the LLVM dialect lowering path.

Jianhui-Li December 16, 2023, 7:05pm #7

XeGPU ops interact with the memref and vector data types. Once a tensor address description has been set up from a memref, a 2D block can be loaded from the memref into a vector. With the data loaded into a vector, it can be processed by any other dialect that accepts the vector type.

linalg.matmul would be able to be lowered to XeGPU. The lowering could be gradual: first to a larger-size 2D submatrix, and then down to the 2D block size at the XeGPU level.
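As a rough illustration of that gradual lowering, and not taken from the actual passes, an intermediate step might look like the sketch below. The tile sizes (a 32x64 tile of C per subgroup with a 16-wide K step) are hypothetical, chosen only for the sketch.

```mlir
// Illustration only: a per-subgroup tile of the GEMM expressed with memref
// subviews and a tile-level linalg.matmul, before decomposition to XeGPU.
func.func @tiled_step(%A: memref<1024x1024xf16>, %B: memref<1024x1024xf16>,
                      %C: memref<1024x1024xf32>, %i: index, %j: index) {
  %c0 = arith.constant 0 : index
  %c16 = arith.constant 16 : index
  %c1024 = arith.constant 1024 : index
  scf.for %k = %c0 to %c1024 step %c16 {
    %sa = memref.subview %A[%i, %k] [32, 16] [1, 1]
        : memref<1024x1024xf16> to memref<32x16xf16, strided<[1024, 1], offset: ?>>
    %sb = memref.subview %B[%k, %j] [16, 64] [1, 1]
        : memref<1024x1024xf16> to memref<16x64xf16, strided<[1024, 1], offset: ?>>
    %sc = memref.subview %C[%i, %j] [32, 64] [1, 1]
        : memref<1024x1024xf32> to memref<32x64xf32, strided<[1024, 1], offset: ?>>
    // A later pass would decompose this 32x64x16 tile into 8x16x16 blocks,
    // each mapping onto xegpu.load_nd + xegpu.dpas as in the example above.
    linalg.matmul ins(%sa, %sb : memref<32x16xf16, strided<[1024, 1], offset: ?>>,
                                 memref<16x64xf16, strided<[1024, 1], offset: ?>>)
                  outs(%sc : memref<32x64xf32, strided<[1024, 1], offset: ?>>)
  }
  return
}
```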
Internally we are experimenting with the gradual lowering, and eventually we would like to upstream the dialect/passes out of that experiment. We have an end-to-end XeGPU-based GEMM implementation for 4Kx4K here, and we can upstream that test case to llvm-project/mlir/test/Integration/Dialect as part of the XeGPU lowering pass.