Pad audio instead of mel features to reduce word error rates #1084
Conversation
Cool, it's really important!
To highlight the changes, do you have a specific test file where the difference is visible in the output (hallucination/repetition)? I would also like to see whether it somehow affects the timestamps at padding boundaries.
```diff
@@ -20,7 +20,7 @@ def __init__(
         self.hop_length = hop_length
         self.chunk_length = chunk_length
         self.n_samples = chunk_length * sampling_rate
-        self.nb_max_frames = self.n_samples // hop_length
+        self.nb_max_frames = (30 * sampling_rate) // hop_length
```
Can't you use `self.chunk_length` instead of 30 (a hard-coded value)?
This MUST be hard-coded: the Whisper encoder expects the input sequence length to be 3000 regardless of chunk size, and anything less than that will reduce performance. You can test this change alone by using any distil model with a 25s chunk length and benchmarking WER.
for reference:
huggingface/transformers#31991 (comment)
Ah, it's the input to the Whisper encoder. Yes, it should be fixed at 30 seconds.
```diff
@@ -82,7 +82,6 @@ def __call__(self, waveform, padding=True, chunk_length=None, to_cpu=False):

         if chunk_length is not None:
```
`self.chunk_length` should be reused instead of `chunk_length` wherever possible.
`FeatureExtractor` is initialized with a 30s chunk length when the model instance is created; the actual inference chunk length is passed to the `__call__` method when transcribing, so `chunk_length` here can be different from `self.chunk_length`.
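That distinction can be sketched with a simplified, hypothetical class (not the real `FeatureExtractor` implementation, just an illustration of the two chunk lengths):

```python
# Minimal sketch, assuming the constructor/__call__ split described above.
class FeatureExtractorSketch:
    def __init__(self, sampling_rate=16000, hop_length=160, chunk_length=30):
        self.sampling_rate = sampling_rate
        self.hop_length = hop_length
        self.chunk_length = chunk_length  # fixed at model creation (30 s)
        # Frame count is derived from 30 s, so it is always 3000:
        self.nb_max_frames = (30 * sampling_rate) // hop_length

    def __call__(self, waveform, chunk_length=None):
        # The per-call chunk_length (e.g. 25 s) may differ from the
        # constructor's self.chunk_length, so it cannot simply be reused.
        effective = chunk_length if chunk_length is not None else self.chunk_length
        return effective * self.sampling_rate  # n_samples for this call

fe = FeatureExtractorSketch()
print(fe(None))                   # 480000 samples: constructor default (30 s)
print(fe(None, chunk_length=25))  # 400000 samples: per-call override (25 s)
```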
a483549 to 548a7a3
After testing all the models and updating the results above, it seems that the problem might be more complicated than I initially thought: this change has mixed results, with improvements in some models and regressions in others, but they are all minimal compared to the improvement in
Features are not essentially the same when you pad the audio to 30 s as when you add 30 s of zeros and then reintroduce zeros for the features. The former will have non-zero mel feature values even when the audio is silent. This was causing trouble in the HF Whisper implementation and they are currently fixing it; see this PR.
The features resulting from the two padding strategies should be identical:

```python
from faster_whisper.feature_extractor import FeatureExtractor
import torch

fe = FeatureExtractor()

SAMPLING_RATE = 16000
N_SAMPLES = SAMPLING_RATE * 30
N_FRAMES = 3000
HOP_LENGTH = 160

for i in range(100):
    audio = torch.rand(SAMPLING_RATE * 10)
    features_pad_to_30s = fe(
        torch.nn.functional.pad(audio, (0, HOP_LENGTH + N_SAMPLES - audio.size(0))),
        padding=False,
    )[:, :N_FRAMES]
    features_pad_30s_zeros = fe(
        torch.nn.functional.pad(audio, (0, N_SAMPLES)), padding=False
    )[:, :N_FRAMES]
    assert torch.allclose(features_pad_to_30s, features_pad_30s_zeros)
```

What the mentioned PR does is remove the features corresponding to the padding and then pad the features again with zeros, because the zero padding of the audio produces Mel features that are filled with non-zero values. What they are trying to do is already implemented here. One can argue that, for efficiency, if we are going to remove the features of the zero padding anyway, then there is no need to calculate them in the first place; padding with zeros directly would be cheaper.
The solution to get the improvements of

> faster-whisper/faster_whisper/transcribe.py, line 1108 in c2a1da1

which means this will affect all models when
Closed in favor of #1101
This PR pads the audio before feature extraction instead of padding the features. This is in line with how the Whisper model was trained, because reverting the zero-padded Mel spectrogram features back to the time domain results in white noise with moderate amplitude, which causes hallucinations and wrong transcriptions.
The `distil-large-v3` WER dropped from 26.04 at 83a368e to 14.472. This figure can be reproduced by running `benchmarks/yt_commons.py` and switching the batched inference to sequential. These are the WER comparisons before and after.
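The audio-side padding this PR describes can be sketched roughly like this (a minimal illustration under my own assumptions and naming, not the PR's actual code):

```python
import torch

SAMPLING_RATE = 16000
N_SAMPLES = 30 * SAMPLING_RATE  # the encoder always sees 30 s of audio

def pad_audio(audio: torch.Tensor) -> torch.Tensor:
    """Zero-pad raw audio up to 30 s *before* feature extraction,
    instead of computing features first and padding those."""
    pad = N_SAMPLES - audio.size(0)
    if pad > 0:
        audio = torch.nn.functional.pad(audio, (0, pad))
    return audio

chunk = torch.rand(25 * SAMPLING_RATE)   # e.g. a 25 s chunk
padded = pad_audio(chunk)
assert padded.size(0) == N_SAMPLES
# the appended tail is true digital silence in the time domain:
assert torch.equal(padded[25 * SAMPLING_RATE:], torch.zeros(5 * SAMPLING_RATE))
```

Padding in the time domain keeps the appended region as genuine silence, matching training conditions, whereas zero-padding the Mel features corresponds to a signal the model never saw during training.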
References:
openai/whisper#730 (comment)
openai/whisper#838 (comment)