Add support for left padding and masking in forward() and generate() #70

normster · 2023-12-20T08:23:01Z

This PR implements masking for left contiguous pad tokens by zeroing out intermediate state values, per the discussion at #66, for all three code paths: non-fused, fused without CUDA graph, and fused with CUDA graph. I'm not sure if this implementation is the best approach, so let me know if there's a better way to do things.

I've included a simple testing script at tests/test_padding.py which can be run with python tests/test_padding.py to compare prefill logits + generation outputs with and without left padding.

I also evaluated the models with/without batching + left-padding + masking on a question answering dataset and found nearly identical accuracies. Batching + left-padding + no masking hurts accuracy by a couple percentage points.

normster · 2023-12-20T09:55:31Z

I tried validating the masking implementation with lm-eval-harness. On HellaSwag, mamba-1.4b with right padding still achieves the reported 59.1% accuracy. Switching to left padding drops this to 55.8% accuracy, and changing lm-eval-harness to 1) construct padding masks and 2) use padding masks does not recover any performance (still 55.8%). I might be using the padding masks incorrectly but it's pretty straightforward so I suspect the issue might lie in my mamba masking change in this PR.

Edit: I found an error in my evaluation of lm-eval-harness with left padding. After fixing, I get 59.1% with left padding + masking, the same as with right-padding. But left padding + no masking also gives 59.1% since lm-eval-harness collates prompts by length which minimizes the number of padding tokens. I found no difference in pythia-1.4b performance left padding with/without masking (52.1% as reported with right padding). Switching to a fixed, random collate function exposes a difference in performance on pythia-1.4b: 43.4% without masking and 52.1% with masking. But mamba-1.4b is virtually unchanged (59.0%). Maybe it's just more robust to long runs of unmasked padding tokens?

TL;DR: I think my proposed padding + masking works, though it's not clear mamba even really needs the masking.

pjsample · 2023-12-21T22:47:04Z

I just want to express my interest in left padding and masking. Thanks for the effort.

thistleknot · 2023-12-22T02:56:16Z

Curious how to mask to train on outputs only if you think masking isn't needed.

pjsample · 2023-12-22T18:28:04Z

My understanding is that the output of the model (token logits) is causal by default, so there is no masking when the model is being trained autoregressively. For an idea on how to train the model look here: https://github.com/havenhq/mamba-chat/blob/main/train_mamba.py

sentialx · 2023-12-24T11:58:47Z

I tried using this branch but got an error about not getting expected number of gradients during backward (15 vs 16)

junphine · 2024-01-02T03:58:52Z

I think if you want pad+mask to be effective, you need to do pre-training without using a full sentence in chunk

sunningmbzuai · 2024-01-08T16:46:43Z

Yeah, I got the same error. Does anyone know how to solve it?

xtwigs · 2024-01-11T23:10:16Z

I think this error happens because Pytorch expects a gradient for the mask to be returned, which isn't the case here. Adding a None in mamba_ssm/ops/selective_scan_interface.py#301 fixes this issue.

On the other hand to bypass the recalculation of conv1d in the backward pass we can set checkpoint_level to 0 (?).

Can someone verify if my thought process is accurate?

sunningmbzuai · 2024-01-18T17:50:17Z

Could you provide your change in mamba_ssm/ops/selective_scan_interface.py#301 ? That may help a lot.

abdulfatir · 2024-02-17T22:13:17Z

mamba_ssm/models/mixer_seq_simple.py

+        config = MambaConfig(**config_data, **kwargs)
+        model = cls(config, device=device, dtype=dtype)


Why did you make this change with the kwargs?

abdulfatir · 2024-02-18T10:59:41Z

@normster @tridao @albertfgu I believe this feature would be very nice to have in a stable release. Can we work towards merging this into main and have it in the next stable release? I am happy to help in any way.

zigzagcai · 2024-02-28T11:00:25Z

mamba_ssm/ops/selective_scan_interface.py

@@ -301,11 +305,11 @@ def mamba_inner_fn(
    xz, conv1d_weight, conv1d_bias, x_proj_weight, delta_proj_weight,
    out_proj_weight, out_proj_bias,
    A, B=None, C=None, D=None, delta_bias=None, B_proj_bias=None,
-    C_proj_bias=None, delta_softplus=True
+    C_proj_bias=None, mask=None, delta_softplus=True
 ):
    return MambaInnerFn.apply(xz, conv1d_weight, conv1d_bias, x_proj_weight, delta_proj_weight,


I have some confusion about this line of code. Since MambaInnerFn doesn't provides parameters option for mask, it seems that it has no effect one the fwd and bwd pass.
Hence, how can the mask be applied to mark the sequence boundaries?

I might be wrong, but the PR modified Line 162 in the very same file, which is the definition of MambaInnerFn.forward method. Also, multiplications on Line 181 and 220 used mask.

Thank you! You are right. I might missed this line of code on L162.
But I see in this PR, attention_mask is only used in the forward pass, and seem not to be used in the backward pass. So when I tried to feed batch data with left padding and masking (batch_size, seq_len, hidden_dim) into mamba block , it reported error. Has anyone encountered a similar error?

Err Msg:

File "/blahblah/miniconda3/envs/dev/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward torch.autograd.backward( File "/blahblah/miniconda3/envs/dev/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass RuntimeError: function MambaInnerFnBackward returned an incorrect number of gradients (expected 16, got 15)

@xtwigs pointed out that one can add None in the returned gradients to fix the issue. (For fellows above who wonder where to put None, I did it at the end of the tuple.) I agree with @xtwigs as we don't calculate gradient on the mask tensor.

I tried using this branch but got an error about not getting expected number of gradients during backward (15 vs 16)

Yeah, I got the same error. Does anyone know how to solve it?

I think this error happens because Pytorch expects a gradient for the mask to be returned, which isn't the case here. Adding a None in mamba_ssm/ops/selective_scan_interface.py#301 fixes this issue.

On the other hand to bypass the recalculation of conv1d in the backward pass we can set checkpoint_level to 0 (?).

Can someone verify if my thought process is accurate?

@xtwigs pointed out that one can add None in the returned gradients to fix the issue. (For fellows above who wonder where to put None, I did it at the end of the tuple.) I agree with @xtwigs as we don't calculate gradient on the mask tensor.

I tried using this branch but got an error about not getting expected number of gradients during backward (15 vs 16)

Yeah, I got the same error. Does anyone know how to solve it?

I think this error happens because Pytorch expects a gradient for the mask to be returned, which isn't the case here. Adding a None in mamba_ssm/ops/selective_scan_interface.py#301 fixes this issue.
On the other hand to bypass the recalculation of conv1d in the backward pass we can set checkpoint_level to 0 (?).
Can someone verify if my thought process is accurate?

I have tried this approach but encountered with CUDA OOM, even with much more GPUs and much smaller seq_len. (8x nodes, 64x A100 GPUs, and seq_len=512 for 1.4B mamba model)

zigzagcai · 2024-02-29T07:42:11Z

Hi, I also meet this error. Could you please provide some insight about how to fix this issue on the backward pass?

zigzagcai · 2024-02-29T08:15:42Z

I have verified your idea and it seems not to work and causes CUDA OOM.
Err Msg:

  File "/some_path/miniconda3/envs/final_dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/some_path/zigzagcai/devel/mamba/mamba_ssm/modules/mamba_simple.py", line 146, in forward
    out = mamba_inner_fn(
  File "/some_path/zigzagcai/devel/mamba/mamba_ssm/ops/selective_scan_interface.py", line 310, in mamba_inner_fn
    return MambaInnerFn.apply(xz, conv1d_weight, conv1d_bias, x_proj_weight, delta_proj_weight,
  File "/some_path/miniconda3/envs/final_dev/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 97, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/some_path/zigzagcai/devel/mamba/mamba_ssm/ops/selective_scan_interface.py", line 221, in forward
    out, scan_intermediates, out_z = selective_scan_cuda.fwd(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB (GPU 0; 79.33 GiB total capacity; 74.83 GiB already allocated; 313.81 MiB free; 77.13 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

My code changes:

diff --git a/mamba_ssm/ops/selective_scan_interface.py b/mamba_ssm/ops/selective_scan_interface.py
index 35143ad..6792749 100644
--- a/mamba_ssm/ops/selective_scan_interface.py
+++ b/mamba_ssm/ops/selective_scan_interface.py
@@ -159,7 +159,7 @@ class MambaInnerFn(torch.autograd.Function):
     def forward(ctx, xz, conv1d_weight, conv1d_bias, x_proj_weight, delta_proj_weight,
                 out_proj_weight, out_proj_bias,
                 A, B=None, C=None, D=None, delta_bias=None, B_proj_bias=None,
-                C_proj_bias=None, mask=None, delta_softplus=True, checkpoint_lvl=1):
+                C_proj_bias=None, mask=None, delta_softplus=True, checkpoint_lvl=0):
         """
              xz: (batch, dim, seqlen)
         """
@@ -298,7 +298,7 @@ class MambaInnerFn(torch.autograd.Function):
                 dout_proj_weight, dout_proj_bias,
                 dA, dB, dC, dD,
                 ddelta_bias if delta_bias is not None else None,
-                dB_proj_bias, dC_proj_bias, None)
+                dB_proj_bias, dC_proj_bias, None, None)

enneamer · 2024-02-29T08:39:33Z

I have verified your idea and it seems not to work and causes CUDA OOM. Err Msg:

  File "/some_path/miniconda3/envs/final_dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/hwfile/caizheng/devel/mamba/mamba_ssm/modules/mamba_simple.py", line 146, in forward
    out = mamba_inner_fn(
  File "/mnt/hwfile/caizheng/devel/mamba/mamba_ssm/ops/selective_scan_interface.py", line 310, in mamba_inner_fn
    return MambaInnerFn.apply(xz, conv1d_weight, conv1d_bias, x_proj_weight, delta_proj_weight,
  File "/some_path/miniconda3/envs/final_dev/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 97, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/mnt/hwfile/caizheng/devel/mamba/mamba_ssm/ops/selective_scan_interface.py", line 221, in forward
    out, scan_intermediates, out_z = selective_scan_cuda.fwd(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB (GPU 0; 79.33 GiB total capacity; 74.83 GiB already allocated; 313.81 MiB free; 77.13 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

My code changes:

diff --git a/mamba_ssm/ops/selective_scan_interface.py b/mamba_ssm/ops/selective_scan_interface.py
index 35143ad..6792749 100644
--- a/mamba_ssm/ops/selective_scan_interface.py
+++ b/mamba_ssm/ops/selective_scan_interface.py
@@ -159,7 +159,7 @@ class MambaInnerFn(torch.autograd.Function):
     def forward(ctx, xz, conv1d_weight, conv1d_bias, x_proj_weight, delta_proj_weight,
                 out_proj_weight, out_proj_bias,
                 A, B=None, C=None, D=None, delta_bias=None, B_proj_bias=None,
-                C_proj_bias=None, mask=None, delta_softplus=True, checkpoint_lvl=1):
+                C_proj_bias=None, mask=None, delta_softplus=True, checkpoint_lvl=0):
         """
              xz: (batch, dim, seqlen)
         """
@@ -298,7 +298,7 @@ class MambaInnerFn(torch.autograd.Function):
                 dout_proj_weight, dout_proj_bias,
                 dA, dB, dC, dD,
                 ddelta_bias if delta_bias is not None else None,
-                dB_proj_bias, dC_proj_bias, None)
+                dB_proj_bias, dC_proj_bias, None, None)

Is it using the default configuration defined in MambaConfig? The default values specify a huge network with 64 layers and embedding size of 2560. And checkpoint_lvl=0 disables checkpoint and then asks the forward pass to keep convolution and delta results in the GPU memory.

zigzagcai · 2024-02-29T08:54:24Z

I have verified your idea and it seems not to work and causes CUDA OOM. Err Msg:

  File "/some_path/miniconda3/envs/final_dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/some_path/zigzagcai/devel/mamba/mamba_ssm/modules/mamba_simple.py", line 146, in forward
    out = mamba_inner_fn(
  File "/some_path/zizgagcai/devel/mamba/mamba_ssm/ops/selective_scan_interface.py", line 310, in mamba_inner_fn
    return MambaInnerFn.apply(xz, conv1d_weight, conv1d_bias, x_proj_weight, delta_proj_weight,
  File "/some_path/miniconda3/envs/final_dev/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 97, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/some_path/zigzagcai/devel/mamba/mamba_ssm/ops/selective_scan_interface.py", line 221, in forward
    out, scan_intermediates, out_z = selective_scan_cuda.fwd(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB (GPU 0; 79.33 GiB total capacity; 74.83 GiB already allocated; 313.81 MiB free; 77.13 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

My code changes:

diff --git a/mamba_ssm/ops/selective_scan_interface.py b/mamba_ssm/ops/selective_scan_interface.py
index 35143ad..6792749 100644
--- a/mamba_ssm/ops/selective_scan_interface.py
+++ b/mamba_ssm/ops/selective_scan_interface.py
@@ -159,7 +159,7 @@ class MambaInnerFn(torch.autograd.Function):
     def forward(ctx, xz, conv1d_weight, conv1d_bias, x_proj_weight, delta_proj_weight,
                 out_proj_weight, out_proj_bias,
                 A, B=None, C=None, D=None, delta_bias=None, B_proj_bias=None,
-                C_proj_bias=None, mask=None, delta_softplus=True, checkpoint_lvl=1):
+                C_proj_bias=None, mask=None, delta_softplus=True, checkpoint_lvl=0):
         """
              xz: (batch, dim, seqlen)
         """
@@ -298,7 +298,7 @@ class MambaInnerFn(torch.autograd.Function):
                 dout_proj_weight, dout_proj_bias,
                 dA, dB, dC, dD,
                 ddelta_bias if delta_bias is not None else None,
-                dB_proj_bias, dC_proj_bias, None)
+                dB_proj_bias, dC_proj_bias, None, None)

Is it using the default configuration defined in MambaConfig? The default values specify a huge network with 64 layers and embedding size of 2560. And checkpoint_lvl=0 disables checkpoint and then asks the forward pass to keep convolution and delta results in the GPU memory.

No. I am evaluating mamba model with 1.4B parameter size, where layers = 48 and model dimension = 2048 are equivalent to the size on the repo page.
As we know from the mamba paper, when I set checkpoint_lvl=0, it will disable the recomputation of conv1d_out, delta in backward pass and store those values in GPU memory, which leads to much more memory usage.

Here is my experiment details:

original 1.4B mamba model: works well with 1x node with 8x A100 GPUs
1.4B mamba model patched with this PR: encounters OOM even with 4x nodes with 32x A100 GPUs

zigzagcai · 2024-02-29T09:02:57Z

We are trying this PR because we want mamba to process packed sequence like what has been done in transformer-based models.
If we directly pad the sequence with zero, then a lot of computation will be wasted on meaningless padded tokens.
We just want to use mask to mark the meaningless padded tokens and let the computation focus on regular tokens.

xtwigs · 2024-02-29T12:23:48Z

Hi! Sorry, but I haven't run into the OOM issue. I keep a fork of this code here in case it might help. (I also run this with checkpoint_lvl=0)

zigzagcai · 2024-03-04T03:33:47Z

Hello, thanks for the sharing! :D I might be wrong. But when I tried this branch and found that OOM still appeared in the backward pass, meanwhile python tests/test_padding.py leads to huge max L2 errors (padded): tensor(3442.7573) in the forward pass, which did not appear in Norman’s original PR and might indicated the left padding and masking not work in the forward pass.

Is there any reproducible test code snippet indicating left padding+ masking works in both fwd and bwd pass?
Thanks!

xtwigs · 2024-03-04T15:44:02Z

Can you try this while disabling the dropout module added in the mamba simple code? (default was set to 0.1)

zigzagcai · 2024-03-11T10:05:44Z

Hi xtwigs, many thanks for your reply!
I tried the latest code of your branch and found mamba block runnable without error, but when I run the test code, the max L2 errors are still relatively large.

laulampaul · 2024-04-01T04:04:38Z

Hi zigzagcai , have you solved the problem of masking in the mamba block ?

Add support for left padding and masking in forward() and generate()

9f75b2b

abdulfatir reviewed Feb 17, 2024

View reviewed changes

zigzagcai reviewed Feb 28, 2024

View reviewed changes

albertfgu force-pushed the main branch 2 times, most recently from 6d45666 to 41d30ce Compare June 3, 2024 12:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for left padding and masking in forward() and generate() #70

Add support for left padding and masking in forward() and generate() #70

normster commented Dec 20, 2023

normster commented Dec 20, 2023 •

edited

Loading

pjsample commented Dec 21, 2023

thistleknot commented Dec 22, 2023

pjsample commented Dec 22, 2023

sentialx commented Dec 24, 2023

junphine commented Jan 2, 2024

sunningmbzuai commented Jan 8, 2024

xtwigs commented Jan 11, 2024 •

edited

Loading

sunningmbzuai commented Jan 18, 2024

abdulfatir Feb 17, 2024

abdulfatir commented Feb 18, 2024

zigzagcai Feb 28, 2024 •

edited

Loading

enneamer Feb 28, 2024

zigzagcai Feb 29, 2024 •

edited

Loading

enneamer Feb 29, 2024

zigzagcai Feb 29, 2024 •

edited

Loading

zigzagcai commented Feb 29, 2024 •

edited

Loading

zigzagcai commented Feb 29, 2024 •

edited

Loading

enneamer commented Feb 29, 2024

zigzagcai commented Feb 29, 2024 •

edited

Loading

zigzagcai commented Feb 29, 2024

xtwigs commented Feb 29, 2024 •

edited

Loading

zigzagcai commented Mar 4, 2024 •

edited

Loading

xtwigs commented Mar 4, 2024

zigzagcai commented Mar 11, 2024 •

edited

Loading

laulampaul commented Apr 1, 2024

		config = MambaConfig(config_data, kwargs)
		model = cls(config, device=device, dtype=dtype)

Add support for left padding and masking in forward() and generate() #70

Are you sure you want to change the base?

Add support for left padding and masking in forward() and generate() #70

Conversation

normster commented Dec 20, 2023

normster commented Dec 20, 2023 • edited Loading

pjsample commented Dec 21, 2023

thistleknot commented Dec 22, 2023

pjsample commented Dec 22, 2023

sentialx commented Dec 24, 2023

junphine commented Jan 2, 2024

sunningmbzuai commented Jan 8, 2024

xtwigs commented Jan 11, 2024 • edited Loading

sunningmbzuai commented Jan 18, 2024

abdulfatir Feb 17, 2024

Choose a reason for hiding this comment

abdulfatir commented Feb 18, 2024

zigzagcai Feb 28, 2024 • edited Loading

Choose a reason for hiding this comment

enneamer Feb 28, 2024

Choose a reason for hiding this comment

zigzagcai Feb 29, 2024 • edited Loading

Choose a reason for hiding this comment

enneamer Feb 29, 2024

Choose a reason for hiding this comment

zigzagcai Feb 29, 2024 • edited Loading

Choose a reason for hiding this comment

zigzagcai commented Feb 29, 2024 • edited Loading

zigzagcai commented Feb 29, 2024 • edited Loading

enneamer commented Feb 29, 2024

zigzagcai commented Feb 29, 2024 • edited Loading

zigzagcai commented Feb 29, 2024

xtwigs commented Feb 29, 2024 • edited Loading

zigzagcai commented Mar 4, 2024 • edited Loading

xtwigs commented Mar 4, 2024

zigzagcai commented Mar 11, 2024 • edited Loading

laulampaul commented Apr 1, 2024

normster commented Dec 20, 2023 •

edited

Loading

xtwigs commented Jan 11, 2024 •

edited

Loading

zigzagcai Feb 28, 2024 •

edited

Loading

zigzagcai Feb 29, 2024 •

edited

Loading

zigzagcai Feb 29, 2024 •

edited

Loading

zigzagcai commented Feb 29, 2024 •

edited

Loading

zigzagcai commented Feb 29, 2024 •

edited

Loading

zigzagcai commented Feb 29, 2024 •

edited

Loading

xtwigs commented Feb 29, 2024 •

edited

Loading

zigzagcai commented Mar 4, 2024 •

edited

Loading

zigzagcai commented Mar 11, 2024 •

edited

Loading