Pad audio instead of mel features to reduce word error rates #1084
Conversation
Cool, it's really important!
To highlight the changes, do you have a specific test file where the difference is visible in the output (hallucination/repetition)? I would also like to see whether it somehow affects the timestamps at padding boundaries.
```diff
@@ -20,7 +20,7 @@ def __init__(
         self.hop_length = hop_length
         self.chunk_length = chunk_length
         self.n_samples = chunk_length * sampling_rate
-        self.nb_max_frames = self.n_samples // hop_length
+        self.nb_max_frames = (30 * sampling_rate) // hop_length
```
Can't you use `self.chunk_length` instead of 30 (a hard-coded value)?
This MUST be hard-coded: the Whisper encoder expects the input sequence length to be 3000 regardless of chunk size, and anything less than that will reduce performance. You can test this change alone by using any distil model with a 25s chunk length and benchmarking WER.
for reference:
huggingface/transformers#31991 (comment)
Ah, it's the input to the Whisper encoder. Yes, it should be fixed at 30 seconds.
```diff
@@ -82,7 +82,6 @@ def __call__(self, waveform, padding=True, chunk_length=None, to_cpu=False):

         if chunk_length is not None:
```
`self.chunk_length` should be reused instead of `chunk_length` wherever possible.
`FeatureExtractor` is initialized with a 30s chunk length when the model instance is created; the actual inference chunk length is passed to the `__call__` method when transcribing, so `chunk_length` here can be different from `self.chunk_length`.
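That distinction can be sketched with a simplified, hypothetical class (not the real `FeatureExtractor` implementation, just an illustration of the two chunk lengths):

```python
# Minimal sketch, assuming the constructor/__call__ split described above.
class FeatureExtractorSketch:
    def __init__(self, sampling_rate=16000, hop_length=160, chunk_length=30):
        self.sampling_rate = sampling_rate
        self.hop_length = hop_length
        self.chunk_length = chunk_length  # fixed at model creation (30 s)
        # Frame count is derived from 30 s, so it is always 3000:
        self.nb_max_frames = (30 * sampling_rate) // hop_length

    def __call__(self, waveform, chunk_length=None):
        # The per-call chunk_length (e.g. 25 s) may differ from the
        # constructor's self.chunk_length, so it cannot simply be reused.
        effective = chunk_length if chunk_length is not None else self.chunk_length
        return effective * self.sampling_rate  # n_samples for this call

fe = FeatureExtractorSketch()
print(fe(None))                   # 480000 samples: constructor default (30 s)
print(fe(None, chunk_length=25))  # 400000 samples: per-call override (25 s)
```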
a483549 to 548a7a3
After testing all the models and updating the results above, it seems that the problem might be more complicated than I initially thought: this change has mixed results, with improvements in some models and regressions in others, but they are all minimal compared to the improvement in
Features are not essentially the same when you pad the audio to 30 s as when you add 30 s of zeros and then reintroduce zeros for the features. The former will have non-zero mel feature values even when the audio is silent. This was causing trouble in the HF Whisper implementation and they are currently fixing it; see this PR.
The features resulting from the two padding strategies should be identical:

```python
from faster_whisper.feature_extractor import FeatureExtractor
import torch

fe = FeatureExtractor()

SAMPLING_RATE = 16000
N_SAMPLES = SAMPLING_RATE * 30
N_FRAMES = 3000
HOP_LENGTH = 160

for i in range(100):
    audio = torch.rand(SAMPLING_RATE * 10)
    features_pad_to_30s = fe(
        torch.nn.functional.pad(audio, (0, HOP_LENGTH + N_SAMPLES - audio.size(0))),
        padding=False,
    )[:, :N_FRAMES]
    features_pad_30s_zeros = fe(
        torch.nn.functional.pad(audio, (0, N_SAMPLES)), padding=False
    )[:, :N_FRAMES]
    assert torch.allclose(features_pad_to_30s, features_pad_30s_zeros)
```

What the mentioned PR does is remove the features corresponding to the padding and then pad the features again with zeros, because the zero padding of the audio produces Mel features that are filled with non-zero values. What they are trying to do is already implemented here. One can argue that, for efficiency, if we are going to remove the features of the zero padding anyway, then there is no need to calculate them in the first place; padding with zeros directly would be cheaper.
The solution to get the improvements of

> faster-whisper/faster_whisper/transcribe.py, line 1108 in c2a1da1

which means this will affect all models when
Closed in favor of #1101
This PR pads the audio before feature extraction instead of padding the features. This is in line with how the Whisper model was trained, because reverting the zero-padded Mel spectrogram features back to the time domain results in white noise with moderate amplitude, which causes hallucinations and wrong transcriptions.
The `distil-large-v3` WER dropped from 26.04 at 83a368e to 14.472. This figure can be reproduced by running `benchmarks/yt_commons.py` and switching the batched inference to sequential. These are the WER comparisons before and after.
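The audio-side padding this PR describes can be sketched roughly like this (a minimal illustration under my own assumptions and naming, not the PR's actual code):

```python
import torch

SAMPLING_RATE = 16000
N_SAMPLES = 30 * SAMPLING_RATE  # the encoder always sees 30 s of audio

def pad_audio(audio: torch.Tensor) -> torch.Tensor:
    """Zero-pad raw audio up to 30 s *before* feature extraction,
    instead of computing features first and padding those."""
    pad = N_SAMPLES - audio.size(0)
    if pad > 0:
        audio = torch.nn.functional.pad(audio, (0, pad))
    return audio

chunk = torch.rand(25 * SAMPLING_RATE)   # e.g. a 25 s chunk
padded = pad_audio(chunk)
assert padded.size(0) == N_SAMPLES
# the appended tail is true digital silence in the time domain:
assert torch.equal(padded[25 * SAMPLING_RATE:], torch.zeros(5 * SAMPLING_RATE))
```

Padding in the time domain keeps the appended region as genuine silence, matching training conditions, whereas zero-padding the Mel features corresponds to a signal the model never saw during training.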
References:
openai/whisper#730 (comment)
openai/whisper#838 (comment)