Releases · ROCm/Tensile

Tensile 4.41.0 for ROCm 6.2.0

Additions

new tuning script to summarize rocBLAS log file
new environment variable to test fixed grid size with Stream-K kernels
new Stream-K dynamic mode to run large problems at slightly reduced CU count if it improves work division and power
add reject conditions for SourceKernel + PrefetchGlobalRead/LoopDoWhile
add reject condition for PreloadKernelArguments (PreloadKernelArguments is disabled when it is not supported, instead of rejecting kernel generation)
support NT flag for global load and store for gfx94x
new Kernarg preloading feature (DelayRemainingArgument initiates the load of the remaining, non-preloaded arguments; AsmCaps and AsmRegisterPool updated to track registers for arguments and preload)
add option for rotating buffers timing with cache eviction
add predicate for arithmetic intensity
add DirectToVgpr + packing for f8/f16 + TLU cases
enable negative values for ExtraLatencyForLR to reduce interval of local read and wait for DTV
add test cases for DirectToVgpr + packing
add batch support for Stream-K kernels and new test cases
new tuning scripts to analyze rocblas-bench results and remove tuned sizes from liblogic
enable VgprForLocalReadPacking + PrefetchLocalRead=1 (removed the reject condition for VFLRP + PLR=1, added test cases for VFLRP + PLR=1)
support VectorWidthB (new parameter VectorWidthB)
support VectorWidth + non SourceSwap
add test cases for VectorWidthB, VectorWidth + non SourceSwap
add code owners file
new environment variables to dynamically adjust number of CUs used in Stream-K
add new parameters to specify global load width for A and B separately (GlobalLoadVectorWidthA and GlobalLoadVectorWidthB, effective with GlobalReadVectorWidth=-1)
add xf32 option to rocblas-bench input creator
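
The dynamic Stream-K mode listed above trades a few compute units for a better work division. The sketch below illustrates that idea only. It assumes a toy cost model in which leftover tiles (those not covered by full waves across all CUs) are what we want to minimize. The function `pick_cu_count`, the `max_drop` limit, and the cost model are hypothetical and are not Tensile's actual heuristic or environment variables.

```python
def pick_cu_count(num_tiles: int, num_cus: int, max_drop: int = 8) -> int:
    """Return a CU count in [num_cus - max_drop, num_cus] with the fewest leftover tiles.

    Hypothetical cost model: leftover = num_tiles % cus. Ties keep the larger CU count.
    """
    lowest = max(1, num_cus - max_drop)
    best_cus = num_cus
    best_leftover = num_tiles % num_cus
    # Walk downwards so that a smaller CU count is chosen only when it is strictly better.
    for cus in range(num_cus - 1, lowest - 1, -1):
        leftover = num_tiles % cus
        if leftover < best_leftover:
            best_cus, best_leftover = cus, leftover
    return best_cus


if __name__ == "__main__":
    # 600 tiles on a 304-CU device: 300 CUs divide the work evenly.
    print(pick_cu_count(600, 304))  # 300
```

A real selection would also weigh fix-up overhead and power, which this toy model leaves out.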

Optimizations

initialization optimizations (reordered init code for PreloadKernelArguments opt, used s_mov_b64 for 64 bit address copy, used v_mov_b64/ds_read_b64 for C register initialization, added undefine AddressC/D with PreloadKernelArguments, optimized waitcnt for prefetch global read with DirectToVgpr, refactored waitcnt code for DTV and moved all asm related code to KernelWriterAssembly.py)
optimize temp vgpr allocation for ClusterLocalRead (added if condition to allocate temp vgpr only for 8bit datatype)
reverse MFMA order in inner loop for odd outer iteration
optimize waitcnt lgkmcnt for 1LDSBuffer + PGR>1 (removed redundant waitcnt lgkmcnt after 1LDSBuffer sync)
raise the maximum value of DepthU to 1024 (globalParameters MaxDepthU now defines the maximum value of DepthU)
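
As an illustration of the reversed MFMA order mentioned above, the following sketch generates a serpentine issue order: forward through the inner loop on even outer iterations, backward on odd ones. Which tile dimension plays the outer or inner role, and the function name `mfma_issue_order`, are assumptions made for this example and are not taken from the Tensile source.

```python
def mfma_issue_order(num_outer: int, num_inner: int) -> list[tuple[int, int]]:
    """Return (outer, inner) index pairs, reversing the inner order on odd outer iterations."""
    order = []
    for outer in range(num_outer):
        inner_indices = list(range(num_inner))
        if outer % 2 == 1:
            # Odd outer iteration: walk the inner loop backwards.
            inner_indices.reverse()
        order.extend((outer, inner) for inner in inner_indices)
    return order


if __name__ == "__main__":
    print(mfma_issue_order(2, 3))
    # [(0, 0), (0, 1), (0, 2), (1, 2), (1, 1), (1, 0)]
```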

Changes

update rocBLAS-bench-input-create script (added number of iteration based on performance, rotating buffer flag)
limit build threads based on CPUs/RAM available on system (for tests)
update required workspace size for Stream-K, skip kernel initialization when possible
use fallback libraries for archs without optimized logic
use hipMemcpyAsync for validation (replace hipMemcpy with hipMemcpyAsync + hipStreamSynchronize in ReferenceValidator)
remove OCL tests
disable HostLibraryTests
reduce extended test time by removing extra parameters in the test config files
disable InitAccVgprOpt for Stream-K
skip sgemm 64bit offset tests for gfx94x
skip DTV, DTL, LSU+MFMA tests for gfx908
increase extended test timeout to 720 min
update xfail test (1sum tests only failing on gfx90a)
update lib logic convertor script
test limiting CI threads for only gfx11
remove WGM-related kernargs when they are not needed (WGM=-1,0,1)
clean up unused old code, mostly related to the old client
change GSUA to SingleBuffer if GlobalSplitU=1 + MultipleBuffer, instead of rejecting it
update efficiency script for new architecture and xf32 datatype
re-enable negative values for WorkGroupMapping (asm kernel only)
disable HW monitor for aquavanjaram941
pre-apply offsets for strided batch kernels
update Tensile build to use 16 threads
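
Limiting build threads by available CPUs and RAM, as described in the list above, can be sketched as follows. The 2 GiB-per-thread budget and the function names are assumptions chosen for illustration; the release notes do not state the values Tensile uses.

```python
import os


def build_thread_limit(cpu_count: int, ram_bytes: int,
                       bytes_per_thread: int = 2 * 1024**3) -> int:
    """Cap the thread count by both CPU count and RAM (assumed budget per thread)."""
    by_ram = ram_bytes // bytes_per_thread
    return max(1, min(cpu_count, by_ram))


def detect_thread_limit(bytes_per_thread: int = 2 * 1024**3) -> int:
    """Query the host; fall back to the CPU count when the RAM size is unavailable."""
    cpus = os.cpu_count() or 1
    try:
        ram = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return cpus  # e.g. platforms without os.sysconf
    return build_thread_limit(cpus, ram, bytes_per_thread)


if __name__ == "__main__":
    # 64 cores but only 32 GiB of RAM: the RAM budget wins.
    print(build_thread_limit(64, 32 * 1024**3))  # 16
```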

Fixes

fix WorkspaceCheck implementation when used in rocBLAS
ignore asm cap check for kernel arg preload for rocm6.0 and older
fix Stream-K partials cache behavior
fix MasterSolutionLibrary indexing for multiple architecture build
fix memory allocation fail with FlushMemorySize + StridedBatched/Batched cases (multiply batch count size when calculating array size)
fix BufferLoad=False with Stream-K
fix mismatch issue with GlobalReadCoalesceGroup
fix rocblas build fail on gfx11 (used state["ISA"] for reject conditions instead of globalParameters["CurrentISA"])
fix for LdsPad auto (fixed incorrect value assignment for autoAdjusted, set LdsBlockSizePerPadA or B = 0 if stride is not power of 2)
fix inaccurate vgpr allocation for ClusterLocalRead
fix mismatch issue with LdsBlockSizePerPad + MT1(or 0) not power of 2
fix mismatch issue with InitAccOpt + InnerUnroll (use const 0 for src1 of MFMA only if the index of InnerUnroll (iui) is 0)
fix HostLibraryTests on gfx942 and gfx941
fix LLVM crash issue
fix for newer windows vcpkg msgpack and vcpkg version package name
fix an error with DisableKernelPieces + 32bit ShadowLimit
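
Several fixes above depend on a power-of-2 test, for example the LdsPad auto fix that sets LdsBlockSizePerPadA or B to 0 when the stride is not a power of 2. A minimal sketch of that rule follows. The function names and the way the stride is passed in are hypothetical and do not mirror the Tensile code.

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ...; False for zero, negatives, and everything else."""
    return n > 0 and (n & (n - 1)) == 0


def auto_lds_block_size_per_pad(stride: int, requested: int) -> int:
    """Keep the requested block size only when the stride is a power of 2, else use 0."""
    return requested if is_power_of_two(stride) else 0


if __name__ == "__main__":
    print(auto_lds_block_size_per_pad(128, 256))  # 256
    print(auto_lds_block_size_per_pad(96, 256))   # 0
```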

16 Apr 19:07

rocm-ci

rocm-6.1.0

be9f7da

Tensile 4.40.0 for ROCm 6.1.0

Additions

new DisableKernelPieces values to invalidate local read, local write, and global read
stream-K kernel generation, including two-tile stream-k algorithm by setting StreamK=3
feature to allow testing stream-k grid multipliers
debug output to check occupancy for Stream-K
reject condition for FractionalLoad + DepthU!=power of 2
new TENSILE_DB debugging value to dump the common kernel parameters
predicate for APU libs
new parameter (ClusterLocalRead) to turn on/off wider local read opt for TileMajorLDS
new parameter (ExtraLatencyForLR) to add extra interval between local read and wait
new logic to check LDS size with auto LdsPad(=1) and change LdsPad to 0 if LDS overflows
initialization type and general batched options to the rocblas-bench input creator script
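
The LDS size check for auto LdsPad listed above can be pictured with the sketch below: estimate the LDS footprint with the automatically chosen padding and fall back to a padding of 0 if it overflows. The footprint formula, the 64 KiB limit, and all names are simplifying assumptions for illustration; Tensile's real calculation involves more parameters.

```python
def lds_bytes(mt0: int, mt1: int, depth_u: int, pad: int, bytes_per_element: int) -> int:
    """Assumed layout: A and B tiles stored as rows of (depth_u + pad) elements."""
    return (mt0 + mt1) * (depth_u + pad) * bytes_per_element


def resolve_auto_lds_pad(mt0: int, mt1: int, depth_u: int, auto_pad: int,
                         bytes_per_element: int, max_lds_bytes: int = 64 * 1024) -> int:
    """Return auto_pad if the padded tiles fit in LDS, otherwise 0."""
    if lds_bytes(mt0, mt1, depth_u, auto_pad, bytes_per_element) > max_lds_bytes:
        return 0
    return auto_pad


if __name__ == "__main__":
    # 256x256 macro tile, DepthU=32, 4-byte elements: padding of 4 overflows 64 KiB.
    print(resolve_auto_lds_pad(256, 256, 32, 4, 4))  # 0
    # A 128x128 macro tile leaves room for the padding.
    print(resolve_auto_lds_pad(128, 128, 32, 4, 4))  # 4
```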

Optimizations

enabled MFMA + LocalSplitU=4 for MT16x16
enabled (DirectToVgpr + MI4x4) and supported skinny MacroTile
optimized postGSU kernel: separate postGSU kernels for different GSU values, loop unroll for GSU loop, wider global load depending on array size, and parallel reduction depending on array size
auto LdsPad calculation for TileMajorLds + MI16x16
auto LdsPad calculation for UnrollMajorLds + MI16x16 + VectorWidth
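
To make the postGSU optimization above more concrete, here is a host-side sketch of what a post-GSU reduction computes: with GlobalSplitU greater than 1, each split produces a partial result buffer, and the partials are summed element by element. This is plain Python for illustration, not the GPU kernel, and the function name `post_gsu_reduce` is hypothetical.

```python
def post_gsu_reduce(partials: list[list[float]]) -> list[float]:
    """Sum GSU partial buffers element-wise into one output buffer."""
    if not partials:
        raise ValueError("at least one partial buffer is required")
    size = len(partials[0])
    if any(len(p) != size for p in partials):
        raise ValueError("all partial buffers must have the same length")
    out = [0.0] * size
    # Output elements are independent, so a GPU kernel can assign them to separate threads.
    for idx in range(size):
        acc = 0.0
        # GSU loop: a kernel specialized for one GSU value can unroll this loop fully.
        for gsu in range(len(partials)):
            acc += partials[gsu][idx]
        out[idx] = acc
    return out


if __name__ == "__main__":
    print(post_gsu_reduce([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))  # [9.0, 12.0]
```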

Changes

cleared hipErrorNotFound error since it is an expected part of the search
modified hipcc search path for Linux
changed PCI ID from 32bit to 64bit for ROCm SMI HW monitor
changed LdsBlockSizePerPad to LdsBlockSizePerPadA, B to specify LBSPP separately
changed the default value of LdsPadA, B, LdsBlockSizePerPadA, B from 0 to -1
updated test cases according to parameter changes for LdsPad, LBSPP and ClusterLocalRead
replaced std::regex with fnmatch()/PathMatchSpec to work around a known std::regex stack overflow bug
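
The change from regular expressions to shell-style wildcard matching can be illustrated with Python's standard `fnmatch` module, which offers the same kind of glob patterns as the C `fnmatch()` function. The file names and pattern below are made up for the example, and the sketch does not reproduce the C++ code in Tensile.

```python
from fnmatch import fnmatchcase


def select_files(names: list[str], pattern: str) -> list[str]:
    """Return the names that match a shell-style wildcard pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    # Hypothetical file names, used only to show the matching behaviour.
    names = ["TensileLibrary_gfx90a.dat", "TensileLibrary_gfx942.dat", "Kernels.so"]
    print(select_files(names, "TensileLibrary_gfx9*.dat"))
```

Wildcard patterns cover the common "prefix, anything, suffix" case without the recursion that regex engines may use.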

Fixes

hipcc compile now appends the parallel-jobs=4 flag
race condition in Stream-K that appeared with large grids and small sizes
mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and TailLoop
mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and SplitLds
incorrect reject condition check for DirectToLds + LdsBlockSizePerPad=-1 case
small fix for LdsPad optimization (LdsElement calculation)

31 Jan 20:12

rocm-ci

rocm-6.0.2

17df881

Tensile 4.39.0 for ROCm 6.0.2

Tensile code for ROCm 6.0.2 did not change. The library was rebuilt for the updated ROCm 6.0.2 stack.

15 Dec 18:30

rocm-ci

rocm-6.0.0

17df881

Tensile 4.39.0 for ROCm 6.0.0

Added

Added aquavanjaram support: gfx940/gfx941/gfx942, fp8/bf8 datatype, xf32 datatype, and stochastic rounding for various datatypes
Added/updated tuning scripts
Added DirectToLds support for larger data types with 32bit global load (old parameter DirectToLds is replaced with DirectToLdsA and DirectToLdsB), and the corresponding test cases
Added the average of frequency, power consumption, and temperature information for the winner kernels to the CSV file
Added asmcap check for MFMA + const src
Added support for wider local read + pack with v_perm (with VgprForLocalReadPacking=True)
Added a new parameter to increase miLatencyLeft
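
The CSV addition above (average frequency, power consumption, and temperature for winner kernels) amounts to averaging hardware-monitor samples and writing them as extra columns. A minimal sketch follows. The column names, sample layout, and kernel name are hypothetical and do not reflect the actual Tensile CSV schema.

```python
import csv
import io
from statistics import mean


def winner_row(kernel_name: str, gflops: float, samples: list[dict]) -> list:
    """Build one CSV row with averaged monitor readings (hypothetical sample keys)."""
    return [
        kernel_name,
        gflops,
        mean(s["freq_mhz"] for s in samples),
        mean(s["power_w"] for s in samples),
        mean(s["temp_c"] for s in samples),
    ]


if __name__ == "__main__":
    samples = [
        {"freq_mhz": 1700.0, "power_w": 300.0, "temp_c": 60.0},
        {"freq_mhz": 1500.0, "power_w": 500.0, "temp_c": 70.0},
    ]
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["kernel", "gflops", "avg_freq_mhz", "avg_power_w", "avg_temp_c"])
    writer.writerow(winner_row("example_kernel", 100.0, samples))
    print(buffer.getvalue())
```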

Optimizations

Enabled InitAccVgprOpt for MatrixInstruction cases
Implemented local read related parameter calculations with DirectToVgpr
Adjusted miIssueLatency for gfx940
Enabled dedicated vgpr allocation for local read + pack
Optimized code initialization
Optimized sgpr allocation
Supported DGEMM TLUB + RLVW=2 for odd N (edge shift change)
Enabled miLatency optimization for (gfx940/gfx941 + MFMA) for specific data types, and fixed instruction scheduling

Changed

Removed old code for DTL + (bpe * GlobalReadVectorWidth > 4)
Changed/updated failed CI tests for gfx11xx, InitAccVgprOpt, and DTLds
Removed unused CustomKernels and ReplacementKernels
Added a reject condition for DTVB + TransposeLDS=False (not supported so far)
Removed unused code for DirectToLds
Updated test cases for DTV + TransposeLDS=False
Moved parameter MinKForGSU from globalparameter to BenchmarkCommonParameter to support smaller K
Changed how to calculate latencyForLR for miLatency
Set minimum value of latencyForLRCount for 1LDSBuffer to avoid getting rejected by overflowedResources=5 (related to miLatency)
Refactored allowLRVWBforTLUandMI and renamed it as VectorWidthB
Supported multi-gpu for different architectures in lazy library loading
Enabled dtree library for batch > 1
Added problem scale feature for dtree selection
Enabled ROCm SMI for gfx940/941
Modified non-lazy load build to skip experimental logic

Fixed

Fixed predicate ordering for fp16alt impl round near zero mode to unbreak distance modes
Fixed boundary check for mirror dims and re-enable disabled mirror dims test cases
Fixed merge error affecting i8 with wmma
Fixed mismatch issue with DTLds + TSGR + TailLoop
Fixed a bug with InitAccVgprOpt + GSU>1 and a mismatch issue with PGR=0
Fixed override for unloaded solutions when lazy loading
Fixed some build errors (added missing headers)
Fixed boost link for a clean build on ubuntu22
Fixed bug in forcestoresc1 arch selection
Fixed compiler directive for gfx941 and gfx942
Fixed formatting for DecisionTree_test.cpp

13 Oct 18:57

rocm-ci

rocm-5.7.1

97e0cfc

Tensile 4.38.0 for ROCm 5.7.1

Tensile code for ROCm 5.7.1 did not change. The library was rebuilt for the updated ROCm 5.7.1 stack.

15 Sep 17:29

rocm-ci

rocm-5.7.0

97e0cfc

Tensile 4.38.0 for ROCm 5.7.0

Added

Added support for FP16 Alt Round Near Zero Mode (this feature allows the generation of alternate kernels with intermediate rounding instead of truncation)
Added user-driven solution selection feature
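
The difference between truncating and rounding an intermediate value, which the FP16 Alt entry above refers to, can be shown with a small numeric sketch. It assumes the intermediates are narrowed from float32 to bfloat16 and uses round-to-nearest-even as the rounding rule. Both are assumptions made for illustration: the release note names neither the format nor the exact rounding rule. Only finite inputs are handled.

```python
import struct


def f32_to_bits(x: float) -> int:
    """Reinterpret a float32 value as its 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_bf16_truncate(x: float) -> float:
    """Drop the low 16 bits of the float32 pattern (truncation)."""
    return bits_to_f32(f32_to_bits(x) & 0xFFFF0000)


def to_bf16_round_nearest_even(x: float) -> float:
    """Round the float32 pattern to 16 bits, ties to even (finite inputs only)."""
    bits = f32_to_bits(x)
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias; the kept LSB breaks ties
    return bits_to_f32(bits & 0xFFFF0000)


if __name__ == "__main__":
    x = 1.005859375  # 1 + 0.75 * 2**-7, not representable in bfloat16
    print(to_bf16_truncate(x))            # 1.0
    print(to_bf16_round_nearest_even(x))  # 1.0078125
```

Truncation always moves toward zero, so errors accumulate in one direction, while rounding picks the closer representable value.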

Optimizations

Enabled LocalSplitU with MFMA for I8 data type
Optimized K mask code in mfmaIter
Enabled TailLoop code in NoLoadLoop to prefetch global/local read
Enabled DirectToVgpr in TailLoop for NN, TN, and TT matrix orientations
Optimized DirectToLds test cases to reduce the test duration

Changed

Removed DGEMM NT custom kernels and related test cases
Changed noTailLoop logic to apply noTailLoop only for NT
Changed the range of AssertFree0ElementMultiple and Free1
Unified aStr, bStr generation code in mfmaIter

Fixed

Fixed LocalSplitU mismatch issue for SGEMM
Fixed BufferStore=0 and Ldc != Ldd case
Fixed mismatch issue with TailLoop + MatrixInstB > 1


Releases: ROCm/Tensile

Tensile 4.40.0 for ROCm 6.1.2
Tensile 4.40.0 for ROCm 6.1.1
Tensile 4.41.0 for ROCm 6.2.2
Tensile 4.41.0 for ROCm 6.2.1
Tensile 4.41.0 for ROCm 6.2.0
Tensile 4.40.0 for ROCm 6.1.0
Tensile 4.39.0 for ROCm 6.0.2
Tensile 4.39.0 for ROCm 6.0.0
Tensile 4.38.0 for ROCm 5.7.1
Tensile 4.38.0 for ROCm 5.7.0