Pad audio instead of mel features to reduce word error rates #1084

Closed

Conversation

@MahmoudAshraf97 (Collaborator) commented Oct 24, 2024

This PR pads the audio before feature extraction instead of padding the features. This is in line with how the Whisper model was trained: reverting a zero-padded Mel spectrogram back to the time domain yields white noise of moderate amplitude, which causes hallucinations and wrong transcriptions.
The distil-large-v3 WER dropped from 26.04 at 83a368e to 14.472.
This figure can be reproduced by running benchmarks/yt_commons.py and switching the batched inference to sequential.
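
As a rough sketch of the contrast (assuming faster_whisper's FeatureExtractor with its default 16 kHz sampling rate and 160-sample hop at the version this PR targets; variable names are only illustrative):

import torch
from faster_whisper.feature_extractor import FeatureExtractor

fe = FeatureExtractor()  # defaults: 16 kHz sampling rate, hop length 160, 30 s chunks
audio = torch.rand(16000 * 10)  # a 10 s segment

# This PR: pad the waveform with silence up to 30 s, then extract features
padded_audio = torch.nn.functional.pad(audio, (0, 16000 * 30 - audio.size(0)))
features_audio_pad = fe(padded_audio, padding=False)[:, :3000]

# Previous behaviour: extract features of the short segment, then pad the
# feature matrix itself with zeros up to 3000 frames
features = fe(audio, padding=False)
features_feature_pad = torch.nn.functional.pad(features, (0, 3000 - features.shape[-1]))

# The padded region is a constant negative (silence-like) value in the first case,
# but exactly 0.0 in the second, which does not correspond to silence
print(features_audio_pad[:, -10:].mean(), features_feature_pad[:, -10:].mean())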

These are the WER comparisons before and after, using the following transcription settings:

word_timestamps=False,
without_timestamps=True,
vad_filter=True,
| Model | Before WER | After WER |
| --- | --- | --- |
| tiny.en | 17.936 | 18.701 |
| tiny | 20.056 | 21.085 |
| base.en | 18.002 | 19.286 |
| base | 18.366 | 18.921 |
| small.en | 20.743 | 19.557 |
| small | 18.853 | 19.177 |
| medium.en | 21.755 | 20.323 |
| medium | 21.833 | 21.979 |
| large-v1 | 22.319 | 21.195 |
| large-v2 | 17.083 | 16.960 |
| large-v3 | 20.946 | 19.961 |
| distil-large-v2 | 71.848 | 85.386 |
| distil-medium.en | 68.044 | 68.556 |
| distil-small.en | 68.719 | 67.453 |
| distil-large-v3 | 26.277 | 14.520 |
| large-v3-turbo | 23.999 | 22.679 |

References:
openai/whisper#730 (comment)
openai/whisper#838 (comment)

@Jiltseb (Contributor) left a comment

Cool, it's really important!
In order to highlight the changes, do you have a specific test file where the difference is visible in the output (hallucination/repetition)? I would also like to see whether it somehow affects the timestamps at the padding boundaries.

@@ -20,7 +20,7 @@ def __init__(
         self.hop_length = hop_length
         self.chunk_length = chunk_length
         self.n_samples = chunk_length * sampling_rate
-        self.nb_max_frames = self.n_samples // hop_length
+        self.nb_max_frames = (30 * sampling_rate) // hop_length
Contributor:
Can't you use self.chunk_length instead of 30 (hard-coded value)?

@MahmoudAshraf97 (Collaborator, author):

This MUST be hard-coded: the Whisper encoder expects the input sequence length to be 3000 regardless of chunk size, and anything less than that will reduce performance. You can test this change alone by using any distil model with a 25 s chunk length and benchmarking the WER.

For reference:
huggingface/transformers#31991 (comment)
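
As a rough illustration of the arithmetic (default 16 kHz sampling rate and 160-sample hop assumed):

sampling_rate, hop_length = 16000, 160

nb_max_frames = (30 * sampling_rate) // hop_length  # always 3000, the encoder's expected input length
frames_from_25s_chunk = (25 * sampling_rate) // hop_length  # 2500, what chunk_length-based padding would produce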

Contributor:

Ah, it's the input to the Whisper encoder. Yes, it should be fixed at 30 sec.

@@ -82,7 +82,6 @@ def __call__(self, waveform, padding=True, chunk_length=None, to_cpu=False):

if chunk_length is not None:
Contributor:

self.chunk_length should be reused instead of chunk_length wherever possible.

@MahmoudAshraf97 (Collaborator, author):

The FeatureExtractor is initialized with a 30 s chunk length when the model instance is created; the actual inference chunk length is passed to the __call__ method when transcribing, so chunk_length here can be different from self.chunk_length.
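
A minimal sketch of the distinction (argument names follow the signatures shown in the diffs above; the values are only illustrative):

import torch
from faster_whisper.feature_extractor import FeatureExtractor

fe = FeatureExtractor(chunk_length=30)  # set once when the model instance is created
waveform = torch.rand(16000 * 60)

# At transcription time a different chunk length can be passed per call, so
# chunk_length inside __call__ is not interchangeable with self.chunk_length
features = fe(waveform, chunk_length=25)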

@MahmoudAshraf97 (Collaborator, author) commented Oct 28, 2024

After testing all the models and updating the results above, it seems that the problem might be more complicated than I initially thought. This change has mixed results, with improvements in some models and regressions in others, but they are all minimal compared to the improvement in distil-large-v3. Thoughts are welcome.

@Jiltseb (Contributor) commented Oct 28, 2024

> After testing all the models and updating the results above, it seems that the problem might be more complicated than I initially thought. This change has mixed results, with improvements in some models and regressions in others, but they are all minimal compared to the improvement in distil-large-v3. Thoughts are welcome.

Features are not essentially the same when you pad the audio to 30 s as when you add 30 s of 0.0s and then reintroduce zeros for the features. The former will have non-zero Mel feature values even where the audio is silent. This was causing trouble in the HF Whisper implementation and they are currently fixing it; see this PR.

@MahmoudAshraf97 (Collaborator, author) commented Oct 28, 2024

> Features are not essentially the same when you pad the audio to 30 s as when you add 30 s of 0.0s. The former will have non-zero Mel feature values even where the audio is silent. This was causing trouble in the HF Whisper implementation and they are currently fixing it; see this PR.

The features resulting from the two padding strategies should be identical:

from faster_whisper.feature_extractor import FeatureExtractor
import torch

fe = FeatureExtractor()
SAMPLING_RATE = 16000
N_SAMPLES = SAMPLING_RATE * 30
N_FRAMES = 3000
HOP_LENGTH = 160

for i in range(100):
    audio = torch.rand(SAMPLING_RATE * 10)
    
    features_pad_to_30s = fe(
        torch.nn.functional.pad(audio, (0, HOP_LENGTH + N_SAMPLES - audio.size(0))),
        padding=False,
    )[:, :N_FRAMES]
    
    features_pad_30s_zeros = fe(
        torch.nn.functional.pad(audio, (0, N_SAMPLES)), padding=False
    )[:, :N_FRAMES]
    
    assert torch.allclose(features_pad_to_30s, features_pad_30s_zeros)

What the mentioned PR does is remove the features corresponding to the padding and then pad the features again with zeros, because zero-padding the audio produces Mel features that are filled with -1.5.

What they are trying to do is already implemented in faster-whisper.

One can argue that, for efficiency, if we are going to remove the features of the zero padding anyway, there is no need to compute them in the first place: padding with HOP_LENGTH is sufficient to get the exact numerical values, and we compute 1 extra value instead of 3000.
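
In numbers, a rough sketch for a 10 s segment at the default 16 kHz sampling rate and 160-sample hop:

SAMPLING_RATE, HOP_LENGTH = 16000, 160

content_frames = (SAMPLING_RATE * 10) // HOP_LENGTH  # 1000 frames of real audio
# appending a full 30 s of zeros computes features over pure padding:
extra_frames_full_pad = (SAMPLING_RATE * 30) // HOP_LENGTH  # 3000 extra frames
# appending a single hop of zeros reproduces the same numerical values:
extra_frames_hop_pad = HOP_LENGTH // HOP_LENGTH  # 1 extra frame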

@MahmoudAshraf97 (Collaborator, author):
The solution to get the improvements of distil-large-v3 while keeping the other models as they are is to pad the features to 3000 instead of to feature_extractor.nb_max_frames, which is dependent on chunk_length, here:

segment = pad_or_trim(segment, self.feature_extractor.nb_max_frames)

which makes faster-whisper compliant with the original OpenAI implementation.

This will affect all models whenever a chunk_length other than 30 s is used.
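
Roughly, the change sketched against the line above (N_FRAMES = 3000 being the fixed encoder input length; this is a sketch, not the final patch):

- segment = pad_or_trim(segment, self.feature_extractor.nb_max_frames)
+ segment = pad_or_trim(segment, N_FRAMES)  # N_FRAMES == 3000 regardless of chunk_length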

@MahmoudAshraf97 (Collaborator, author):
Closed in favor of #1101

@MahmoudAshraf97 deleted the fix_sequential_wer branch on October 30, 2024 at 11:54.