Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove torch dependency, Faster numpy Feature extraction #1106

Merged
merged 8 commits into from
Nov 14, 2024

Conversation

MahmoudAshraf97
Copy link
Collaborator

@MahmoudAshraf97 MahmoudAshraf97 commented Oct 31, 2024

Hi all,

This PR aims to remove the torch dependency as it's only used for feature extraction, there are 2 options:

  1. use NumPy only and scrap GPU acceleration completely, this will slightly simplify the FE code by removing the device handling and framework handling code
  2. use NumPy + CuPy (the current implementation), this will allow the usage of GPU acceleration for those who want it, we still need to decide whether CuPy will be included in the base requirements or maybe as an optional requirement or none at all

These are performance figures on a 30s segment using this script

OMP_NUM_THREADS=4 OPENBLAS_NUM_THREADS=4 MKL_NUM_THREADS=4 VECLIB_MAXIMUM_THREADS=4 NUMEXPR_NUM_THREADS=4 taskset -c 0-3 python benchmark_fe.py
Cores: {0, 1, 2, 3}
Torch on CPU:
5.66 ms
Torch on GPU:
0.71 ms
New numpy on CPU:
10.36 ms
New CuPY on GPU:
2.19 ms
Old numpy on CPU:
39.55 ms

edit: Decided to remove CuPy as the speed difference is not worth the extra code

@MahmoudAshraf97
Copy link
Collaborator Author

Mentioning community reviewers, feel free to ignore this
@Purfview @Jiltseb @ozancaglayan @jordimas @hoonlight @zh-plus

@ozancaglayan
Copy link
Contributor

ozancaglayan commented Oct 31, 2024

Hi,

  • What's the value for MKL_NUM_THREADS and OMP_NUM_THREADS here for the CPU tests? Or are they not set?
  • Is this for 80 mels or 128 mels case?

@ozancaglayan
Copy link
Contributor

ozancaglayan commented Oct 31, 2024

I have a librosa.stft based implementation for this. I checked that your stft and librosa.stft is producing the same results and more or less performing at the same speed (6ms for unpadded 30s of audio). Unfortunately, these implementations does not seem to be using multi-threaded computation whereas torch.stft efficiently uses all cores which make the difference: ~1ms

Another main bottleneck here is the padding with 30 seconds.

  • I get 32ms on a sample of 30s of audio with n_mels=128 and 27ms with n_mels=80
  • With padding=False, 32ms goes down to 16ms obviously

But essentially the effect of padding the input is padding the output with some constant vector. So one could mitigate this doubling by just padding the stft/mel outputs cleverly.

We lose almost the same amount of time as STFT to np.log10(np.clip(mels, a_min=1e-10, a_max=None)) as np.log10 is horribly slow. I think they fixed this in numpy 2 numpy/numpy#19478

@ozancaglayan
Copy link
Contributor

I think doing this on GPU with torch is a bit overkill as feature extraction is seldom the bottleneck in the whisper pipeline. Simplicity, dependency management is more important here. And when things move towards numpy2, it'll be not that bad.

@ozancaglayan
Copy link
Contributor

This looks interesting as well, should be quite fast
https://github.com/IntelPython/mkl_fft

@MahmoudAshraf97
Copy link
Collaborator Author

MahmoudAshraf97 commented Oct 31, 2024

Hi,

  • What's the value for MKL_NUM_THREADS and OMP_NUM_THREADS here for the CPU tests? Or are they not set?
  • Is this for 80 mels or 128 mels case?

I updated the figures along with the benchmarking script and limiting it to 4 cores for reproducability
I'm using 128 Mels

I think doing this on GPU with torch is a bit overkill as feature extraction is seldom the bottleneck in the whisper pipeline. Simplicity, dependency management is more important here. And when things move towards numpy2, it'll be not that bad.

We're removing torch that's for sure, the choice here is between both cupy and numpy or one of them

I have a librosa.stft based implementation for this. I checked that your stft and librosa.stft is producing the same results and more or less performing at the same speed (6ms for unpadded 30s of audio). Unfortunately, these implementations does not seem to be using multi-threaded computation whereas torch.stft efficiently uses all cores which make the difference: ~1ms

This implementation is ported from C++ pytorch implementation, it's very similar to the librosa implementation but has some extra functionality that I'm not sure we need here, the reason I went with this is that the librosa implementation needs an extra dependency, and copying the function over isn't simple because it uses many librosa internal functions that will be a massive increase to the amount of code needed

Another main bottleneck here is the padding with 30 seconds.

  • I get 32ms on a sample of 30s of audio with n_mels=128 and 27ms with n_mels=80
  • With padding=False, 32ms goes down to 16ms obviously

But essentially the effect of padding the input is padding the output with some constant vector. So one could mitigate this doubling by just padding the stft/mel outputs cleverly.

I'm working on removing the padding completely while still generating identical results but I'm leaving it for another PR, Check here it does reduce the FE time to half indeed

We lose almost the same amount of time as STFT to np.log10(np.clip(mels, a_min=1e-10, a_max=None)) as np.log10 is horribly slow. I think they fixed this in numpy 2 numpy/numpy#19478

I have no idea how to calculate log10 using another way that is faster unfortunately

@jordimas
Copy link
Contributor

jordimas commented Oct 31, 2024

Thanks for your work @MahmoudAshraf97, it's really impressive.

My thinking is inline with @ozancaglayan "Simplicity, dependency management is more important here. And when things move towards numpy2, it'll be not that bad."

All implementations are good, depend on what you optimize for. I will optimize for simplicity and less dependencies.

@ozancaglayan
Copy link
Contributor

I'm working on removing the padding completely while still generating identical results but I'm leaving it for another PR, huggingface/transformers#34111 (comment) it does reduce the FE time to half indeed

Interesting. I'm always confused about one thing. Given that whisper is trained with 30 secs segments, the above PR and your thinking is not about running Whisper freely on arbitrarily sized audio files but rather removing the always add 30 secs more of 0.0s logic in FE right?

I thought that to do proper seeking and timestamping, Whisper requires that +30 secs of zeros all the time but isn't that the case then? i.e. would the output of Whisper be exactly the same if we just pad the inputs to 30 secs at most?

@MahmoudAshraf97
Copy link
Collaborator Author

MahmoudAshraf97 commented Oct 31, 2024

Interesting. I'm always confused about one thing. Given that whisper is trained with 30 secs segments, the above PR and your thinking is not about running Whisper freely on arbitrarily sized audio files but rather removing the always add 30 secs more of 0.0s logic in FE right?

I thought that to do proper seeking and timestamping, Whisper requires that +30 secs of zeros all the time but isn't that the case then? i.e. would the output of Whisper be exactly the same if we just pad the inputs to 30 secs at most?

We don't even need to pad it to 30s, adding hop_length samples is enough since we remove the padding and then pad it again with zeros, that means we remove only the last frame in the features instead of 3000+ samples
I've verified that it's numerically identical but I'm still thoroughly investigating if further effects exist

Edit: it produces exactly the same encoder outputs for NumPy, while CuPy needs atleast 17s of zero padding to achieve the same numerical accuracy for audios less than 30s, that closes the speed gap even more between the two

@MahmoudAshraf97 MahmoudAshraf97 changed the title Remove torch dependency, use CuPy for feature extraction Remove torch dependency, Faster numpy Feature extraction Oct 31, 2024
@MahmoudAshraf97
Copy link
Collaborator Author

I implemented the reduced padding in bc86503, everyone feel free to test it and report if there are any issues

@MahmoudAshraf97 MahmoudAshraf97 marked this pull request as ready for review October 31, 2024 22:27
@ozancaglayan
Copy link
Contributor

Thanks! I wont probably have time next week as I'll be on annual leave. One thing though, does this mean that you'll no longer be padding let's say an audio file of 10 seconds to 30 seconds? People were saying that Whisper has a severe hallucination problem if you let it transcribe signals less than 30 seconds because its not trained with an attention mask. There are even papers which finetunes it to mitigate this.

But I'm not 100% sure whether its the same thing or not, need to test it properly.

@MahmoudAshraf97
Copy link
Collaborator Author

Thanks! I wont probably have time next week as I'll be on annual leave. One thing though, does this mean that you'll no longer be padding let's say an audio file of 10 seconds to 30 seconds? People were saying that Whisper has a severe hallucination problem if you let it transcribe signals less than 30 seconds because its not trained with an attention mask. There are even papers which finetunes it to mitigate this.

But I'm not 100% sure whether its the same thing or not, need to test it properly.

In the original whisper implementation, the features are padded with 30s of zeros, converted to log mel spectrogram, and then the features that correspond to the padding are removed, and the features are then padded again with zeros until they are equivalent to 30s.

What I did here is that I found that padding with 10ms instead of 30s is mathematically equivalent and produces the same encoder input and output within the numerical tolerance of the data type used.

Why does this works? because in the STFT calculation, each window is independent and has an overlap of hop_length, so padding with hop_length is enough to simulate that the last frame has zeros after it, any more padding does not overlap at all with the audio and it's discarded later, so we can count that as redundant calculation.

@MahmoudAshraf97
Copy link
Collaborator Author

I want to merge this before Friday so we can release a new version at 15th of November, so if anyone has any comments or reviews please let me know

@MahmoudAshraf97 MahmoudAshraf97 merged commit 3e0ba86 into SYSTRAN:master Nov 14, 2024
3 checks passed
mludolph added a commit to mludolph/faster-whisper that referenced this pull request Nov 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants