
Use correct features padding for encoder input #1101

Merged · 2 commits · Oct 29, 2024

Conversation

@MahmoudAshraf97 (Collaborator) commented Oct 28, 2024

This aligns the faster-whisper implementation with the original OpenAI implementation, as discussed in #1084.

These are WER comparisons before and after the change.
The figures can be reproduced by running benchmarks/yt_commons.py and switching batched inference to sequential, with the following settings:

word_timestamps=False,
without_timestamps=True,
vad_filter=True,
| Model | Before WER | After WER |
| --- | --- | --- |
| distil-large-v3 | 26.277 | 14.762 |
| distil-large-v2 | 71.848 | 81.456 |
| distil-medium.en | 68.044 | 66.565 |
| distil-small.en | 68.719 | 67.220 |

The performance regression of distil-large-v2 can be ignored, because distil-large-v3 should be used as a drop-in replacement.

This should also affect all use cases where the chunk length is less than 30 seconds, for all models.
There is also an average relative WER improvement of around 2% for batched inference across all models:

| Model | Before WER | After WER |
| --- | --- | --- |
| tiny.en | 15.437 | 15.063 |
| tiny | 21.765 | 21.390 |
| base.en | 14.300 | 13.816 |
| base | 17.709 | 17.251 |
| small.en | 13.054 | 12.617 |
| small | 16.413 | 16.088 |
| medium.en | 13.299 | 12.894 |
| medium | 15.991 | 15.593 |
| large-v1 | 19.458 | 19.590 |
| large-v2 | 15.237 | 15.148 |
| large-v3 | 16.514 | 15.997 |
| large-v3-turbo | 14.576 | 14.044 |
| distil-small.en | 14.004 | 13.918 |
| distil-medium.en | 14.074 | 13.972 |
| distil-large-v2 | 13.419 | 13.574 |
| distil-large-v3 | 13.688 | 13.533 |

@MahmoudAshraf97 changed the title from "Improve WER of distil models" to "Use correct features padding for encoder input" on Oct 29, 2024
@MahmoudAshraf97 merged commit 2386843 into SYSTRAN:master on Oct 29, 2024
3 checks passed
@MahmoudAshraf97 deleted the fix_features_padding branch on October 30, 2024
@MahmoudAshraf97 (Collaborator, Author) commented:

This PR disabled the ability to change the encoder input and output dimensions (audio_ctx): any chunk_length passed to transcribe will be respected, but the encoder input will be padded to the 30-second equivalent regardless of the chunk length.
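The padding behavior described above can be sketched as follows. This is a minimal illustration, not the actual faster-whisper internals: it assumes 80 mel bins and a 10 ms hop (so 30 s ≈ 3000 frames), and `pad_to_30s` is a hypothetical helper name.

```python
import numpy as np

# 30 s of audio at 100 mel frames per second (10 ms hop) -- the fixed
# encoder input length the features are padded to, regardless of chunk_length.
N_FRAMES = 3000

def pad_to_30s(features: np.ndarray, target: int = N_FRAMES) -> np.ndarray:
    """Zero-pad (or truncate) mel features of shape (n_mels, n_frames)
    along the time axis to exactly `target` frames."""
    n_mels, n_frames = features.shape
    if n_frames >= target:
        return features[:, :target]
    padded = np.zeros((n_mels, target), dtype=features.dtype)
    padded[:, :n_frames] = features
    return padded
```

With this scheme, a 15-second chunk (about 1500 frames) still produces a (80, 3000) encoder input, with the trailing half zero-padded, matching the original OpenAI behavior.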

#171 (comment)
