Make use of GPU shared memory to fuse a reduce operator with its consumers into one kernel.
It helps accommodate complex memory-intensive computations (e.g., LayerNorm, SoftMax) in a single kernel,
reducing off-chip memory traffic and the overhead of kernel scheduling and launching.
It implements part of the functionality described in the AStitch paper.
It is currently being refactored to improve robustness, and is therefore not enabled by default.
Users of BladeDISC can enable it by setting the environment variable DISC_ENABLE_STITCH=true.
Note that the CPU stitch optimization was already released when we open-sourced the BladeDISC project and is enabled by default. Refer to the materials for more details about the CPU stitch technique.
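A minimal sketch of opting in from a Python process; the DISC_ENABLE_STITCH variable is the documented switch above, while the surrounding BladeDISC setup is elided and only indicated in comments:

```python
import os

# Opt in to GPU stitch fusion (off by default while the pass is being refactored).
# Set it before the BladeDISC-enabled workload starts so the compiler sees it.
os.environ["DISC_ENABLE_STITCH"] = "true"

# ... then continue with the usual BladeDISC-enabled TensorFlow/PyTorch workflow,
# e.g. enabling the compiler and running the model as described in the README.
```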
Support two types of GEMM merging optimization: merging two GEMMs that share a common operand into a single GEMM, and merging two GEMMs with the same shape into a batched GEMM. GEMM merging helps increase hardware utilization and reduce kernel launch overhead.
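The merging itself is performed by the compiler on the graph IR; the NumPy sketch below only illustrates the algebraic identities behind the two patterns:

```python
import numpy as np

A = np.random.rand(64, 128).astype(np.float32)
B = np.random.rand(128, 256).astype(np.float32)
C = np.random.rand(128, 256).astype(np.float32)

# Pattern 1: two GEMMs sharing operand A become one wider GEMM.
y1, y2 = A @ B, A @ C
merged = A @ np.concatenate([B, C], axis=1)
assert np.allclose(np.concatenate([y1, y2], axis=1), merged, atol=1e-4)

# Pattern 2: two GEMMs with identical shapes become one batched GEMM.
X1 = np.random.rand(64, 128).astype(np.float32)
X2 = np.random.rand(64, 128).astype(np.float32)
batched = np.matmul(np.stack([X1, X2]), np.stack([B, C]))
assert np.allclose(batched[0], X1 @ B, atol=1e-4)
assert np.allclose(batched[1], X2 @ C, atol=1e-4)
```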
Support weight pre-packing optimization for convolution (calling the onednn library) and GEMM (calling the mkl/onednn/acl libraries) operations.
Support transforming the layout of the convolution operator to the most efficient format on the target device (i.e., CPU or GPU). Most of the introduced transpose operators can then be eliminated by a subsequent transpose-simplifier pass.
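As an illustration (not the actual compiler pass), a layout transformation wraps the convolution's operands in transposes to the device-preferred format, and a transpose followed by its inverse is a no-op, which is the kind of pair the transpose-simplifier removes:

```python
import numpy as np

x_nchw = np.random.rand(1, 3, 8, 8).astype(np.float32)

# Layout transformation inserts a transpose to the device-preferred format,
# e.g. NCHW -> NHWC for many convolution kernels.
to_nhwc = (0, 2, 3, 1)
to_nchw = (0, 3, 1, 2)
x_nhwc = np.transpose(x_nchw, to_nhwc)

# A transpose immediately followed by its inverse is a no-op; pairs like this,
# produced between adjacent layout-transformed ops, can be eliminated.
roundtrip = np.transpose(x_nhwc, to_nchw)
assert np.array_equal(roundtrip, x_nchw)
```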
- Optimize the schedule selection strategy for the reduce operator on GPU to enhance thread-level parallelism.
- Algebraic simplification for operators like power (see the sketch after this list).
- Support fusing the splat constant operator with its consumers, reducing memory access overhead. Refer to the issue.
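A minimal sketch of the kind of rewrite rules meant by the power simplification above; the real pass works on the compiler IR, and the helper below is purely illustrative:

```python
import numpy as np

def simplify_power(x, exponent):
    """Rewrite pow(x, e) into cheaper ops for common exponents (illustrative only)."""
    if exponent == 0:
        return np.ones_like(x)      # x^0   -> 1
    if exponent == 1:
        return x                    # x^1   -> x
    if exponent == 2:
        return x * x                # x^2   -> x * x
    if exponent == 0.5:
        return np.sqrt(x)           # x^0.5 -> sqrt(x)
    return np.power(x, exponent)    # fall back to the generic power op

x = np.random.rand(4).astype(np.float32)
assert np.allclose(simplify_power(x, 2), np.power(x, 2))
assert np.allclose(simplify_power(x, 0.5), np.power(x, 0.5))
```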
Support end-to-end optimization for X86 and AArch64 CPUs.
Cluster sub-graphs according to the operators supported by TensorRT and apply TensorRT optimization for both TensorFlow and PyTorch models.
Release a PoC version for accelerating PyTorch training via Disc + Lazy Tensor Core; refer to the related issue and design doc.
Enhance the shape equality analysis according to dimension values. Add a function to analyze the collapse and expand relationships between dimensions, which helps identify the dimension mapping between the input and output of the reshape operator. This is a basic building block for GPU stitch fusion.
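A simplified sketch of what the collapse analysis computes for static shapes: group contiguous input dimensions whose product equals an output dimension (the expand case is the same analysis with the arguments swapped, and the real analysis also reasons about symbolic dimension values):

```python
def reshape_dim_mapping(src_shape, dst_shape):
    """Greedily map groups of contiguous src dims to dst dims by matching products.

    Returns a list of (src_dims, dst_dim) pairs, e.g. reshaping (8, 16, 32) into
    (128, 32) yields [([0, 1], 0), ([2], 1)]: input dims 0 and 1 collapse into
    output dim 0. Illustrative only; assumes static shapes and a clean grouping.
    """
    mapping, i = [], 0
    for j, d in enumerate(dst_shape):
        group, prod = [], 1
        while prod < d and i < len(src_shape):
            group.append(i)
            prod *= src_shape[i]
            i += 1
        if prod != d:
            return None  # no clean collapse relationship
        mapping.append((group, j))
    return mapping if i == len(src_shape) else None

assert reshape_dim_mapping((8, 16, 32), (128, 32)) == [([0, 1], 0), ([2], 1)]
```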
Support the int8 data type in code generation for memory-intensive operators (e.g., element-wise and reduce operators).
Support dumping clusters and the corresponding input data, with which developers can replay the execution. This is helpful for debugging and tuning. Refer to the issue.
Enhance the CI process of the BladeDISC repo, which helps community members contribute to BladeDISC more conveniently and efficiently.
Migrate TorchBlade's compilation toolchain from the original CMake to Bazel, enhancing maintainability.
Prepare a set of commonly used models as examples for BladeDISC, and compare the performance of BladeDISC with TensorRT, XLA, and ONNX Runtime (ORT) on these examples.
Rebase the TensorFlow codebase used by BladeDISC onto the newest community code.
Continuous bug fixing and code refactoring.