Does memory efficient attention cutlass kernel support various seq len inputs for q/k/v + tensor bias? #1120

Open
ShijunK opened this issue Oct 4, 2024 · 0 comments


ShijunK commented Oct 4, 2024

❓ Questions and Help

More specifically, I am interested in combining BlockDiagonalMask with a tensor bias.

I hacked something together by creating a BlockDiagonalMaskWithTensorBias, but got RuntimeError: CUDA error: an illegal memory access was encountered with a large per-block kv length.

What is interesting is that with a small per-block kv length (768, 512, or 8) the kernel runs successfully, without a CUDA error, but with 1024, or a very large value such as 13824, I get the illegal memory access error.

With cuda-gdb, I traced the crash to this line:

mma_pv(gemm_k_iterations, accum_o, iterator_V, accum_o);

and the memory checker reports:

========= Invalid __global__ read of size 16 bytes
=========     at 0x2405c0 in /proc/self/cwd/external/cutlass/include/cutlass/arch/memory_sm80.h:369:cp_async_zfill
=========     by thread (18,0,0) in block (0,7,0)
=========     Address 0x7fe58f446de0 is out of bounds
=========     and is 32 bytes before the nearest allocation at 0x7fe58f446e00 of size 37888 bytes
=========     Device Frame:/proc/self/cwd/xxx/xformers/csrc/attention/cuda/fmha/gemm/mma_from_smem.h:1000:_prologue [0x2405d0]
=========     Device Frame:/proc/self/cwd/xxx/xformers/csrc/attention/cuda/fmha/iterators/predicated_tile_access_iterator_residual_last.h:1040:operator() [0x24d0e0]
=========     Device Frame:/proc/self/cwd/xxx/xformers/csrc/attention/cuda/fmha/kernel_forward.h:1046:attention_kernel [0x2b9fb0]

I am not familiar with CUTLASS, and the C++ templates also somehow confuse cuda-gdb, so I cannot set a breakpoint inside the kernel.

I wonder if someone familiar with the kernel could help me understand the root cause.

Is it fundamentally impossible to combine variable-sequence-length inputs (q/k/v) with a tensor bias, or is there some limitation I need to pay attention to?

Here is the test I used to play around with the customized bias:

@cuda_only
def test_attn_bias_blockdiag_with_tensor_bias() -> None:
    """IMPORTANT:
    This is the example in the doc for `BlockDiagonalMask`.
    If this example needs to be updated, please also update the doc
    """
    import torch

    from xformers.ops import fmha

    K = 16
    dtype = torch.float16
    device = "cuda"
    q_seqlen = [3, 6, 2]
    kv_seqlen = [8, 8, 8]

    B, H, M, K = 1, 8, 11, 16
    # per_block_kv_len = 13824 # failed: RuntimeError: CUDA error: an illegal memory access was encountered
    per_block_kv_len = 1024 # failed: RuntimeError: CUDA error: an illegal memory access was encountered
    # per_block_kv_len = 768 # passed
    # per_block_kv_len = 512 # passed
    # per_block_kv_len = 8 # passed
    M_kv, K_kv = per_block_kv_len*3, 16
    q = torch.randn([B, M, H, K], dtype=dtype, device=device)
    k = v = torch.randn([B, M_kv, H, K_kv], dtype=dtype, device=device)
    attn_mask = torch.rand([B, H, M, M_kv], dtype=dtype, device=device)

    attn_bias = fmha.BlockDiagonalMaskWithTensorBias.from_seqlens(tensor_bias=attn_mask, q_seqlen=q_seqlen, kv_seqlen=kv_seqlen)

    print(q.shape, k.shape, v.shape)
    out = fmha.memory_efficient_attention(q, k, v, attn_bias=attn_bias)
    print(out.shape, out)
    # list_out = attn_bias.split(out)
    # assert tuple(list_out[0].shape) == (1, 3, 1, K)
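
For the smaller per_block_kv_len values where the kernel does run, I also sanity-check the output against a plain PyTorch reference that just materializes the combined block-diagonal mask + tensor bias into a dense tensor. This is only my own checking helper (reference_attention is not an xformers function), written for the default 1/sqrt(K) scaling:

def reference_attention(q, k, v, attn_bias):
    # Dense reference: materialize the mask + tensor bias and run plain softmax attention.
    # q/k/v are [B, M, H, K] as passed to memory_efficient_attention; output has the same layout.
    B, M, H, K = q.shape
    M_kv = k.shape[1]
    bias = attn_bias.materialize((B, H, M, M_kv), dtype=torch.float32, device=q.device)
    q_, k_, v_ = (t.permute(0, 2, 1, 3).float() for t in (q, k, v))  # heads first: [B, H, M, K]
    scores = q_ @ k_.transpose(-2, -1) / (K ** 0.5) + bias
    out = scores.softmax(dim=-1) @ v_
    return out.permute(0, 2, 1, 3).to(q.dtype)

# e.g. torch.testing.assert_close(out, reference_attention(q, k, v, attn_bias), atol=2e-2, rtol=2e-2)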

the bias:

from dataclasses import dataclass
from typing import Sequence, Tuple

import torch

from xformers.ops.fmha.attn_bias import BlockDiagonalMask


@dataclass
class BlockDiagonalMaskWithTensorBias(BlockDiagonalMask):
    _bias: torch.Tensor | None = None

    def materialize(
        self,
        shape: tuple[int, ...],
        dtype: torch.dtype = torch.float32,
        device: str | torch.device = "cpu",
    ) -> torch.Tensor:
        return super().materialize(shape, dtype=dtype, device=device) + self._bias

    @classmethod
    def from_seqlens(
        cls,
        tensor_bias: torch.Tensor,
        q_seqlen: Sequence[int],
        kv_seqlen: Sequence[int] | None = None,
    ) -> "BlockDiagonalMaskWithTensorBias":
        bdm = BlockDiagonalMask.from_seqlens(q_seqlen, kv_seqlen)
        return cls(q_seqinfo=bdm.q_seqinfo, k_seqinfo=bdm.k_seqinfo, _bias=tensor_bias)

    @classmethod
    def from_tensor_list(
        cls,
        tensor_bias: torch.Tensor,
        tensors: Sequence[torch.Tensor],
    ) -> Tuple["BlockDiagonalMaskWithTensorBias", torch.Tensor]:
        batch_sizes = [tensor.shape[0] for tensor in tensors]
        seqlens = []
        for x in tensors:
            for _ in range(x.shape[0]):
                seqlens.append(x.shape[1])
        block_diag = cls.from_seqlens(tensor_bias, seqlens)
        block_diag._batch_sizes = batch_sizes
        tensors_bs1 = tuple(x.reshape([1, -1, *x.shape[2:]]) for x in tensors)
        concat_tensors = torch.cat(tensors_bs1, dim=1)
        return block_diag, concat_tensors
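
For completeness, here is how I expect from_tensor_list to be used (purely illustrative; the shapes below are made up, with per-sample tensors of shape [batch_i, seqlen_i, H, K] and a dense bias covering the concatenated sequences):

# Hypothetical usage of from_tensor_list; shapes are made up for illustration.
q_list = [
    torch.randn([1, 3, 8, 16], dtype=torch.float16, device="cuda"),
    torch.randn([1, 6, 8, 16], dtype=torch.float16, device="cuda"),
    torch.randn([1, 2, 8, 16], dtype=torch.float16, device="cuda"),
]
total_len = sum(t.shape[0] * t.shape[1] for t in q_list)  # 3 + 6 + 2 = 11
bias = torch.rand([1, 8, total_len, total_len], dtype=torch.float16, device="cuda")
attn_bias, q_cat = BlockDiagonalMaskWithTensorBias.from_tensor_list(bias, q_list)
# q_cat has shape [1, 11, 8, 16]; attn_bias carries both the block layout and the dense bias.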

cutlass.py is updated to support this new bias:

def _get_tensor_bias(
    attn_bias: Optional[Union[torch.Tensor, AttentionBias]]
) -> Optional[torch.Tensor]:
    if isinstance(attn_bias, torch.Tensor):
        return attn_bias
    elif isinstance(attn_bias, (LowerTriangularMaskWithTensorBias, BlockDiagonalMaskWithTensorBias)):
        return attn_bias._bias
    return None

# SUPPORTED_ATTN_BIAS_TYPES for fw/bw added with BlockDiagonalMaskWithTensorBias
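
In my branch this is an in-place edit of SUPPORTED_ATTN_BIAS_TYPES in cutlass.py; as a rough illustration of what gets registered (assuming FwOp/BwOp keep their accepted bias classes in that collection, which is how I read the code), it amounts to:

# Illustration only: register the new bias class with the cutlass forward/backward ops.
# In my branch this is done by editing SUPPORTED_ATTN_BIAS_TYPES in cutlass.py directly;
# the monkey-patch form below is just to show which collections are touched.
from xformers.ops.fmha import cutlass

for op in (cutlass.FwOp, cutlass.BwOp):
    op.SUPPORTED_ATTN_BIAS_TYPES = set(op.SUPPORTED_ATTN_BIAS_TYPES) | {
        BlockDiagonalMaskWithTensorBias
    }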