Whisper-Based Automatic Speech Recognition (ASR) with improved timestamp accuracy using forced alignment.
This repository refines the timestamps of OpenAI's Whisper model via forced alignment with phoneme-based ASR models (e.g. wav2vec2.0), including for multilingual use cases.
Whisper is an ASR model developed by OpenAI, trained on a large dataset of diverse audio. Whilst it does produce highly accurate transcriptions, the corresponding timestamps are at the utterance level, not per word, and can be inaccurate by several seconds.
Phoneme-Based ASR: a suite of models fine-tuned to recognise the smallest unit of speech distinguishing one word from another, e.g. the element p in "tap". A popular example model is wav2vec2.0.
Forced Alignment refers to the process by which orthographic transcriptions are aligned to audio recordings to automatically generate phone-level segmentation.
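To make the idea of forced alignment concrete, here is a minimal sketch in plain Python. It assumes you already have a per-frame table of log-probabilities for each label (the kind of output a phoneme-based ASR model produces) and a known token sequence, and it finds the best monotonic assignment of frames to tokens with dynamic programming. The function name `force_align` is hypothetical, and the sketch deliberately ignores the CTC blank token, so it is a simplification of the idea rather than the alignment code used in this repository.

```python
import math

def force_align(emission, tokens):
    """Assign every frame to one token, in order, maximising total log-probability.

    emission: list of T frames, each a list of log-probabilities per label.
    tokens:   list of N label indices (the known transcription), with N <= T.
    Returns a list of (token_position, start_frame, end_frame), frames inclusive.
    Simplification (assumption): no CTC blank label is modelled.
    """
    T, N = len(emission), len(tokens)
    if N == 0 or N > T:
        raise ValueError("need at least one token and at least one frame per token")

    neg_inf = -math.inf
    score = [[neg_inf] * N for _ in range(T)]
    advanced = [[False] * N for _ in range(T)]  # True if token j starts at frame t
    score[0][0] = emission[0][tokens[0]]

    for t in range(1, T):
        for j in range(N):
            stay = score[t - 1][j]                          # keep emitting token j
            move = score[t - 1][j - 1] if j > 0 else neg_inf  # step from token j-1
            best = max(stay, move)
            if best == neg_inf:
                continue  # token j is not reachable by frame t
            advanced[t][j] = move > stay
            score[t][j] = best + emission[t][tokens[j]]

    # Backtrack from the last frame and last token to recover the spans.
    spans, j, end = [], N - 1, T - 1
    for t in range(T - 1, 0, -1):
        if advanced[t][j]:
            spans.append((j, t, end))
            end, j = t - 1, j - 1
    spans.append((j, 0, end))
    spans.reverse()
    return spans

# Toy example: 4 frames, 2 labels; the first two frames favour label 0.
hi, lo = math.log(0.9), math.log(0.1)
toy_emission = [[hi, lo], [hi, lo], [lo, hi], [lo, hi]]
toy_spans = force_align(toy_emission, [0, 1])
```

Multiplying each frame index by the model's frame duration would turn these spans into timestamps.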
- Batch processing: add --vad_filter --parallel_bs [int] to transcribe long audio files in batches (only supported with VAD filtering). Replace [int] with a batch size that fits your GPU memory, e.g. --parallel_bs 16.
- VAD filtering: Voice Activity Detection (VAD) from Pyannote.audio is used as a preprocessing step to remove reliance on whisper timestamps and only transcribe audio segments containing speech. Add the --vad_filter flag to enable it; this increases timestamp accuracy and robustness (requires more GPU memory due to 30s inputs in wav2vec2).
- Character-level timestamps (see the *.char.ass file output)
- Diarization (still in beta, add --diarize)
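As a rough illustration of how VAD output and batching could fit together, the sketch below merges detected speech intervals into chunks of at most 30 seconds and then groups the chunks into batches. Both helper names (`merge_speech_segments`, `batched`) and the greedy merging strategy are assumptions made for this example; they are not taken from the WhisperX source.

```python
def merge_speech_segments(segments, max_len=30.0):
    """Greedily merge (start, end) speech intervals into chunks of at most max_len seconds.

    A single interval longer than max_len is kept as-is in this sketch.
    """
    merged = []
    for start, end in sorted(segments):
        if merged and end - merged[-1][0] <= max_len:
            # Extending the current chunk keeps it within the length limit.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def batched(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Hypothetical VAD output, in seconds.
speech = [(0.0, 10.0), (12.0, 25.0), (28.0, 40.0), (41.0, 50.0)]
chunks = merge_speech_segments(speech)
batches = batched(chunks, batch_size=16)
```

A larger batch size processes more chunks per forward pass at the cost of GPU memory, which is why the value passed to --parallel_bs should be tuned to your hardware.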
pip install git+https://github.com/m-bain/whisperx.git
If already installed, update package to most recent commit
pip install git+https://github.com/m-bain/whisperx.git --upgrade
If wishing to modify this package, clone and install in editable mode:
$ git clone https://github.com/m-bain/whisperX.git
$ cd whisperX
$ pip install -e .
You may also need to install ffmpeg, rust, etc. Follow the OpenAI instructions here: https://github.com/openai/whisper#setup.
To enable VAD filtering and diarization, pass your Hugging Face access token (which you can generate in your Hugging Face account settings) after the --hf_token argument, and accept the user agreement for the following models: Segmentation, Voice Activity Detection (VAD), and Speaker Diarization.
Run whisper on example segment (using default params)
whisperx examples/sample01.wav
For increased timestamp accuracy, at the cost of higher GPU memory usage, use bigger models and VAD filtering, e.g.
whisperx examples/sample01.wav --model large-v2 --vad_filter --align_model WAV2VEC2_ASR_LARGE_LV60K_960H
Result using WhisperX with forced alignment to wav2vec2.0 large:
sample01.mp4
Compare this to the original whisper out of the box, where many transcriptions are out of sync:
sample_whisper_og.mov
The phoneme ASR alignment model is language-specific; for tested languages, these models are automatically picked from torchaudio pipelines or Hugging Face.
Just pass in the --language code, and use the whisper --model large.
Currently, default models are provided for {en, fr, de, es, it, ja, zh, nl, uk, pt}. If the detected language is not in this list, you need to find a phoneme-based ASR model from the Hugging Face model hub and test it on your data.
whisperx --model large --language de examples/sample_de_01.wav
sample_de_01_vis.mov
See more examples in other languages here.
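If you script the CLI, a small guard can catch unsupported languages before launching a long transcription. The sketch below assembles the command shown above and refuses to proceed when the language has no default alignment model and none was supplied. The helper name `build_whisperx_command` is hypothetical, and the sketch assumes a custom model can be passed through the --align_model flag used earlier in this README.

```python
# Languages with a default alignment model, as listed in this README.
DEFAULT_ALIGN_LANGUAGES = {"en", "fr", "de", "es", "it", "ja", "zh", "nl", "uk", "pt"}

def build_whisperx_command(audio_path, language, align_model=None):
    """Return the whisperx CLI invocation as an argument list (hypothetical helper)."""
    cmd = ["whisperx", "--model", "large", "--language", language]
    if align_model is not None:
        # Assumption: a user-chosen phoneme model is supplied via --align_model.
        cmd += ["--align_model", align_model]
    elif language not in DEFAULT_ALIGN_LANGUAGES:
        raise ValueError(
            f"no default alignment model for '{language}'; pass align_model explicitly"
        )
    cmd.append(audio_path)
    return cmd

german_cmd = build_whisperx_command("examples/sample_de_01.wav", "de")
```

The resulting list can be handed to `subprocess.run` directly, which avoids shell quoting issues with file names.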
import whisperx
device = "cuda"
audio_file = "audio.mp3"
# transcribe with original whisper
model = whisperx.load_model("large", device)
result = model.transcribe(audio_file)
print(result["segments"]) # before alignment
# load alignment model and metadata
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
# align whisper output
result_aligned = whisperx.align(result["segments"], model_a, metadata, audio_file, device)
print(result_aligned["segments"]) # after alignment
print(result_aligned["word_segments"]) # after alignment
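Once you have word-level output, a common next step is writing it to a subtitle file. The sketch below converts a list of word entries into SubRip (SRT) text, one cue per word. It assumes each entry is a dict with "text", "start", and "end" keys (times in seconds); check the actual structure of result_aligned["word_segments"] in your installed version, since the key names here are an assumption. Both function names are hypothetical.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def words_to_srt(word_segments):
    """Build SRT text with one numbered cue per word entry."""
    blocks = []
    for index, word in enumerate(word_segments, start=1):
        start = format_timestamp(word["start"])
        end = format_timestamp(word["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{word['text']}\n")
    return "\n".join(blocks)

# Hypothetical word entries standing in for real aligned output.
example_words = [
    {"text": "Hello", "start": 0.5, "end": 0.9},
    {"text": "world", "start": 1.0, "end": 1.25},
]
srt_text = words_to_srt(example_words)
```

Writing `srt_text` to a file with a .srt extension gives a subtitle track that most video players can load.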
In addition to forced alignment, the following two modifications have been made to the whisper transcription method:
- --condition_on_prev_text is set to False by default (reduces hallucination)
- Clamping segment end_time to be at least 0.02s (one time precision) later than start_time (prevents segments with negative duration)
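The clamping step can be sketched in a few lines. This version assumes segments are dicts with "start" and "end" keys in seconds, as in Whisper's segment output; the function name `clamp_segment_ends` is hypothetical and the code is an illustration of the idea rather than the repository's implementation.

```python
def clamp_segment_ends(segments, min_duration=0.02):
    """Return copies of the segments whose end is at least min_duration after start."""
    clamped = []
    for segment in segments:
        fixed = dict(segment)  # copy so the caller's data is left untouched
        fixed["end"] = max(fixed["end"], fixed["start"] + min_duration)
        clamped.append(fixed)
    return clamped

# The first segment has a negative duration and gets clamped; the second is unchanged.
raw_segments = [
    {"start": 1.0, "end": 0.9, "text": "hi"},
    {"start": 2.0, "end": 3.0, "text": "there"},
]
clamped_segments = clamp_segment_ends(raw_segments)
```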
- Not thoroughly tested, especially for non-English audio, so results may vary. Please open an issue to let me know the results on your data.
- Whisper normalises spoken numbers, e.g. "fifty seven" to Arabic numerals "57". This normalization needs to be performed after alignment so that the phonemes can be aligned. Currently, numbers are just ignored.
- If not using the VAD filter, whisperx assumes the initial whisper timestamps are accurate to some degree (within a margin of 2 seconds; adjust if needed, though bigger margins are more prone to alignment errors)
- Overlapping speech is not handled particularly well by whisper or whisperx
- Diarization is far from perfect.
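To illustrate the margin mentioned in the list above for the non-VAD case, the sketch below widens a segment's time window by a fixed margin on each side, clipped to the bounds of the audio, so that alignment can search slightly beyond the original whisper timestamps. The function name `alignment_window` and its parameters are hypothetical; only the 2-second default comes from this README.

```python
def alignment_window(start, end, audio_duration, margin=2.0):
    """Widen [start, end] by margin seconds on each side, clipped to the audio bounds."""
    window_start = max(0.0, start - margin)
    window_end = min(audio_duration, end + margin)
    return window_start, window_end

# A segment in the middle of a 60 s file, and one near the edges of a 6 s file.
middle_window = alignment_window(10.0, 12.0, audio_duration=60.0)
edge_window = alignment_window(1.0, 5.0, audio_duration=6.0)
```

A wider window tolerates larger timestamp errors but also gives the aligner more audio in which to latch onto the wrong speech, which matches the trade-off noted above.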
If you are multilingual, a major way you can contribute to this project is to find phoneme models on Hugging Face (or train your own) and test them on speech in the target language. If the results look good, send a merge request with some examples showing its success.
The next major upgrade we are working on is whisper with speaker diarization, so if you have any experience on this please share.
- Multilingual init
- Subtitle .ass output
- Automatic align model selection based on language detection
- Python usage
- Character level timestamps
- Incorporating speaker diarization
- Inference speedup with batch processing
- Improve diarization (word level). Harder than first thought...
Contact maxbain[at]robots[dot]ox[dot]ac[dot]uk for queries
This work, and my PhD, is supported by the VGG (Visual Geometry Group) and University of Oxford.
Of course, this builds on OpenAI's whisper, and borrows important alignment code from the PyTorch tutorial on forced alignment.
If you use this in your research, for now just cite the repo,
@misc{bain2022whisperx,
author = {Bain, Max and Han, Tengda},
title = {WhisperX},
year = {2022},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/m-bain/whisperX}},
}
as well as the whisper paper,
@article{radford2022robust,
title={Robust speech recognition via large-scale weak supervision},
author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
journal={arXiv preprint arXiv:2212.04356},
year={2022}
}
and any alignment model used, e.g. wav2vec2.0.
@article{baevski2020wav2vec,
title={wav2vec 2.0: A framework for self-supervised learning of speech representations},
author={Baevski, Alexei and Zhou, Yuhao and Mohamed, Abdelrahman and Auli, Michael},
journal={Advances in Neural Information Processing Systems},
volume={33},
pages={12449--12460},
year={2020}
}