
Create RFC for XeTile and XeGPU Dialect #655

Merged: 11 commits into intel:main, Nov 21, 2023
Conversation

@Jianhui-Li (Contributor, Author):

This is the RFC for the XeTile and XeGPU dialects. The XeTile dialect supports the tile-based programming model and decomposes the GEMM kernel into sufficiently large tiles at the subgroup level. The XeGPU dialect models Xe instructions such as DPAS and 2D block load.
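For concreteness, here is a hypothetical sketch of one K-step of such a subgroup-level GEMM, written with ops named in the RFC (`init_tile`, `load_tile`); the `tile_mma` spelling, the shapes, and the exact syntax are illustrative assumptions, not the normative definition:

```mlir
// Hypothetical sketch (not normative): one K-step of a subgroup-level GEMM.
// Shapes, the padding/tile_mma spellings, and type syntax are assumptions.
%a_tile = XeTile.init_tile %A, %m, %k : memref<4096x4096xbf16> into tile<64x32xbf16>
%b_tile = XeTile.init_tile %B, %k, %n : memref<4096x4096xbf16> into tile<32x64xbf16>
%a = XeTile.load_tile %a_tile padding = 0 : tile<64x32xbf16> into vector<64x32xbf16>
%b = XeTile.load_tile %b_tile padding = 0 : tile<32x64xbf16> into vector<32x64xbf16>
// The accumulator tile stays in registers across the K loop.
%acc1 = XeTile.tile_mma %a, %b, %acc0
  : vector<64x32xbf16>, vector<32x64xbf16>, vector<64x64xf32> into vector<64x64xf32>
```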

@kurapov-peter left a comment:

I think having two separate dialects is a good decision, as it would enable more use cases and separate the optimizations from the hardware-feature abstraction. Analytical tools might use the XeGPU dialect directly to perform appropriate lowering, so the independence is nice. The dialect should probably also include specific instructions like barriers.

Having a single entry point for any GPGPU workload through the XeGPU dialect seems like a reasonable architectural solution to me. We'd control the lowering to either VC intrinsics or SPIRV extensions directly and avoid duplicating this functionality in different tools. Ideally, there should be a path to lower this to LLVM and then use the native SPIRV backend. That would require the SPIRV extensions to be part of vanilla LLVM, though.

As a side note, I'd also consider the runtime part of the code: kernel scheduling and launch, memory allocation and movement. I'm not sure if these can/should be a part of the very same XeGPU dialect, but such primitives do arise in some scenarios when a complete graph for running a workload in a heterogeneous environment is built. In general, I'd prefer to have a simple dialect with intuitive and predictable lowering (to genx code) behavior, though.

@Jianhui-Li (Author) replied:

> As a side note, I'd also consider the runtime part of the code: kernel scheduling and launch, memory allocation and movement. I'm not sure if these can/should be a part of the very same XeGPU dialect, but such primitives do arise in some scenarios when a complete graph for running a workload in a heterogeneous environment is built. In general, I'd prefer to have a simple dialect with intuitive and predictable lowering (to genx code) behavior, though.

The runtime is defined inside the GPUX dialect (https://github.com/intel/mlir-extensions/tree/refactor/include/imex/Dialect/GPUX), which serves as an extension of the GPU dialect. Does that fit your need? It supports kernel launch, memory allocation, etc.

@kurapov-peter replied:

> > As a side note, I'd also consider the runtime part of the code: kernel scheduling and launch, memory allocation and movement. I'm not sure if these can/should be a part of the very same XeGPU dialect, but such primitives do arise in some scenarios when a complete graph for running a workload in a heterogeneous environment is built. In general, I'd prefer to have a simple dialect with intuitive and predictable lowering (to genx code) behavior, though.

> The runtime is defined inside the GPUX dialect (https://github.com/intel/mlir-extensions/tree/refactor/include/imex/Dialect/GPUX), which serves as an extension of the GPU dialect. Does that fit your need? It supports kernel launch, memory allocation, etc.

Yes, it seems to close the gap; I saw that one after posting the comment :)

Commits:
- Refresh the document with the latest refinement on XeTile and XeGPU dialect.
- improve the documentation layout
- Improve documentation layout
- minor layout change for the table
XeTile provides a middle-level abstraction for the matmul operation, sitting between the Linalg matmul named op and the XeGPU DPAS op. It is not tied to a specific Xe architecture. The XeTile dialect design facilitates optimization using hardware auto-padding, which generates simpler and more efficient code than software padding. Using the tile dialect, the user doesn't need to detect out-of-boundary cases; the dialect takes care of unaligned shapes, so the same code runs for the unaligned use case. Users can focus on high-level optimizations like software pipelining, cooperative prefetch, and K-slicing.
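As a hedged illustration of the auto-padding point (the 127x127 shape and the `padding` spelling are assumptions): the same two-op sequence works for an unaligned base matrix, with out-of-boundary rows and columns filled by hardware rather than by software peeling.

```mlir
// Hypothetical: identical code for an unaligned 127x127 base matrix.
// Out-of-boundary elements are filled with the padding value in hardware;
// user code carries no boundary checks.
%t = XeTile.init_tile %base, %off0, %off1 : memref<127x127xbf16> into tile<64x32xbf16>
%v = XeTile.load_tile %t padding = 0 : tile<64x32xbf16> into vector<64x32xbf16>
```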

| Ops | Syntax | Example |
| :--- | :---- | :--- |
Review comment from a Contributor:

Examples could go into code blocks, not plain text.


| Ops | Syntax | Example |
| :--- | :---- | :--- |
|init_tile | operation ::= `XeTile.init_tile `$base_memref `$offset0 `, `$offset1 `:` type($base_memref) `,` index `,` index `->` type($tile, attr-dict) | %block = XeTile.init_tile %base_memref, %tile_offset:2 memref<128x128xbf16> into tile<8x16xbf16> |
|load_tile | operation ::= `XeTile.load_tile` $tile attr-dict `:` type($tile) `->` type($res) | %vector_a = XeTile.load_tile %tile_a transpose = [1,0] padding=0 tile<64x32xbf16> into vector<32x64xbf16> |
Review comment from a Contributor:

The use of inline code is not consistent in the syntax column; double-check against the synthesized markdown doc here:
https://github.com/intel/mlir-extensions/blob/5f4727fb1b086d9f60c41ee9b3137cdfa91031da/docs/rfcs/XeTileandXeGPUDialect.md

Init_tile with a memref of dynamic shape: since the memref has a dynamic shape, its shape and strides have to be passed as runtime parameters to init_tile.

```mlir
%block = XeTile.init_tile %base_memref, [%tile_offset:2], [%base_shape:2[, [%base_strides:2]:
```
@charithaintc commented (Nov 21, 2023):

Typo here: should be `[%base_shape:2],`

The XeGPU dialect models a subset of the Xe GPU ISA. It is the counterpart of the NVGPU and AMDGPU dialects, which provide bridge dialects in the MLIR gradual lowering. The XeGPU dialect works with the MLIR memref and vector types and complements the Arith/Math/Vector/Memref dialects. XeGPU operations are introduced only when a special Xe instruction is not modeled by the LLVM/SPIRV dialects. In some cases, one XeGPU op maps to multiple hardware instructions when there is no performance disadvantage to grouping them. For example, create_tdesc maps to a fixed sequence of instructions that creates the 32-byte address description.
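To make the grouping point concrete, a hypothetical sketch (operand and result types are assumptions for illustration): a single `create_tdesc` op stands in for the whole fixed instruction sequence that assembles the 32-byte address description, while `dpas` models the DPAS instruction directly.

```mlir
// Hypothetical sketch: one create_tdesc covers the fixed instruction
// sequence building the 32-byte address description for a scattered
// access; dpas maps to the Xe DPAS instruction. Types are assumed.
%tdesc = XeGPU.create_tdesc %base_addr, %offsets
  : ui64, vector<16xindex> into tensor_desc<16xf32, #scattered>
%d = XeGPU.dpas %a, %b, %c
  : vector<8x8x2xbf16>, vector<8x16x2xbf16>, vector<8x16xf32> into vector<8x16xf32>
```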
Below is a summary.

| Ops | Syntax | Example |
Review comment from a Contributor:

Same comments as for the previous table.

nbarrier, mfence, and compile_hint work in both VC mode and SIMT mode, since they access uniform values.


## Alternative
Review comment from a Contributor:

"alternative design considerations" would be better title


The alternative design of the tile data type is to reuse the memref data type. The memref data type would need to be enhanced to allow attributes, so that XeTile's tile data type could be expressed as a memref carrying tile attributes. XeTile.wg_map and XeTile.sg_map are examples of such attributes.
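A hedged sketch of the two options side by side (the `sg_map` parameter spellings are assumptions for illustration):

```mlir
// Current design: a dedicated tile type carrying layout attributes.
//   tile<64x64xbf16, #xetile.sg_map<wi_layout = [2, 8], wi_data = [1, 2]>>
// Alternative: memref enhanced to carry the same attributes directly.
//   memref<64x64xbf16, #xetile.sg_map<wi_layout = [2, 8], wi_data = [1, 2]>>
```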

## Questions
Review comment from a Contributor:

maybe "Notes" would be a better title?


## Questions

Currently there is no NVVM counterpart. The XeGPU dialect uses the SPIRV Intel extension to access joint matrix, or SPIRV external functions to access Intel GPU VC intrinsics. This may change in the future, so we expect the XeGPU lowering to change accordingly.
Review comment from a Contributor:

Currently there is no lower-level GPU IR like NVVM available for the Intel GPU compiler toolchain.


To create a 2D tile memory descriptor, the user needs to set up a tile (init_tile) describing a 2D region within global memory. Setting up a tile requires the shape of the parent tile and the size of the underlying physical memory buffer, known as the base matrix. The base matrix must be 2D and contiguous. init_tile takes the base matrix's address pointer, shape, and strides, plus the tile's offsets and shape. Offsets, strides, and shapes cover two dimensions and are given in numbers of elements. %base_strides[0] describes the number of elements between two rows, i.e., the width of the underlying physical memory buffer, and %base_strides[1] must be 1, as the innermost dimension of the base matrix must be contiguous. The current version only supports 2D memrefs with a row-major layout.

Init_tile takes a memref as the description of the base matrix, together with the offsets of the specific tile. The tile shape and element data type are specified in the output tile data type, and they must be known at compile time.
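A hedged example of the static case, mirroring the table syntax above (the shapes are illustrative): the base matrix's shape and strides come from the memref type, only the offsets are runtime values, and the 8x16 tile shape is fixed in the result type.

```mlir
// Hypothetical: static-shape init_tile. Base shape/strides come from the
// memref type; only the tile offsets are runtime values.
%blk = XeTile.init_tile %base_memref, %off0, %off1
  : memref<128x128xbf16>, index, index into tile<8x16xbf16>
```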
Review comment from a Contributor:
I would use inline code, like `init_tile`, when referring to XeTile/XeGPU ops or attributes, for readability.

@charithaintc self-requested a review on November 21, 2023 at 22:16
@charithaintc left a comment:
LGTM

@silee2 self-requested a review on November 21, 2023 at 22:31
@silee2 left a comment:

LGTM

@silee2 merged commit bef54a9 into intel:main on Nov 21, 2023
0 of 2 checks passed