Positive Transfer Of The Whisper Speech Transformer To Human And Animal Voice Activity Detection

We propose WhisperSeg, which adapts the Whisper Transformer, pre-trained for Automatic Speech Recognition (ASR), to Voice Activity Detection (VAD) for both human and animal vocalizations. For more details, please refer to our paper:

Positive Transfer of the Whisper Speech Transformer to Human and Animal Voice Activity Detection

Nianlong Gu, Kanghwi Lee, Maris Basha, Sumit Kumar Ram, Guanghao You, Richard H. R. Hahnloser
University of Zurich and ETH Zurich

Accepted to the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024)

Install Environment

conda create -n wseg python=3.10 -y
conda activate wseg
pip install -r requirements.txt
conda install conda-forge::cudnn==8.9.7.29 -y
## This assumes CUDA 12.1; for other versions, please refer to https://pytorch.org/get-started/locally/
pip install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

NOTE for Windows: for methods 1 and 2, if you run WhisperSeg on Windows, you also need to uninstall 'bitsandbytes':

pip uninstall bitsandbytes

and then install 'bitsandbytes-windows==0.37.5'

pip install bitsandbytes-windows==0.37.5

Then open a new terminal and activate the 'wseg' environment:

conda activate wseg
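
After installation, a quick sanity check can confirm that PyTorch is importable and can see a GPU. The snippet below is a minimal sketch rather than part of the repository; `check_torch_cuda` is a hypothetical helper name, and it relies only on the standard `torch.cuda.is_available()` call.

```python
import importlib.util


def check_torch_cuda():
    """Report whether PyTorch is importable and whether it can see a CUDA device."""
    report = {"torch_installed": False, "torch_version": None, "cuda_available": False}
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed in the active environment
        return report
    import torch  # imported lazily so the check never fails on a missing package

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["cuda_available"] = bool(torch.cuda.is_available())
    return report


print(check_torch_cuda())
```

If `cuda_available` is reported as False on a machine with a GPU, the installed PyTorch wheel probably does not match the local CUDA version.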

Documentation

Model Training and Evaluation

The pretrained WhisperSeg may not work well on your own dataset, in which case finetuning is necessary. We prepared a Jupyter notebook that provides a comprehensive walkthrough of WhisperSeg finetuning, including steps for data processing, training, and evaluation. You can access this notebook at docs/WhisperSeg_Training_Pipeline.ipynb, or run it in Google Colab.

Please refer to the following documents for the complete guideline of training WhisperSeg, including 1) dataset processing, 2) model training and 3) model evaluation.

Update 01.07.2024: We provide a web interface where users can finetune WhisperSeg and run segmentation by simply dragging and dropping files and clicking start.

In the 'wseg' python environment, start the web service:

streamlit run scripts/service.py --server.maxUploadSize 2000 -- --backend_dataset_base_folder data/datasets/ --backend_model_base_folder model/

In the browser, open http://localhost:8501 and use the web interface for finetuning and segmentation

How To Use The Trained Model

Use WhisperSeg in command line

Activate the "wseg" Anaconda environment:

conda activate wseg

Segment given the file path to the .wav audio file

python scripts/segment.py --model_path nccratliri/whisperseg-animal-vad-ct2 --audio_path data/example_subset/Marmoset/test/marmoset_pair4_animal1_together_A_0.wav --csv_save_path ./out.csv

The out.csv contains the segmentation results:

	onset	offset	cluster
0	15.585	15.682	vocal
1	15.777	15.837	vocal
2	15.883	15.922	vocal
3	16.007	16.047	vocal
4	16.132	16.157	vocal
...	...	...	...
192	61.167	61.293	vocal
193	61.410	61.448	vocal
194	61.502	61.538	vocal
195	61.727	61.867	vocal
196	61.953	61.995	vocal
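
To post-process the results, the CSV can be read back with the Python standard library. This is a minimal sketch, assuming the file has named `onset`, `offset` and `cluster` columns as in the table above (an extra index column, if present, is ignored); `load_segments` and `summarize` are hypothetical helper names, not functions from this repository.

```python
import csv


def load_segments(csv_path):
    """Read a segmentation CSV and return a list of dicts with float onset/offset."""
    segments = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            onset = float(row["onset"])
            offset = float(row["offset"])
            segments.append(
                {
                    "onset": onset,
                    "offset": offset,
                    "duration": offset - onset,  # segment length in seconds
                    "cluster": row["cluster"],
                }
            )
    return segments


def summarize(segments):
    """Return the number of segments and the total segmented time in seconds."""
    total = sum(s["duration"] for s in segments)
    return {"num_segments": len(segments), "total_duration": total}
```

For example, `summarize(load_segments("out.csv"))` gives the segment count and the total vocal duration for the file segmented above.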

Segment given the path to the folder that contains multiple .wav files

python scripts/segment.py --model_path nccratliri/whisperseg-animal-vad-ct2 --audio_folder data/example_subset/Zebra_finch/test_juveniles/ --csv_save_path out.csv

	filename	onset	offset	cluster
0	zebra_finch_R3428_40932.67397799_1_24_18_43_17...	0.008	0.043	vocal
1	zebra_finch_R3428_40932.67397799_1_24_18_43_17...	0.458	0.578	vocal
2	zebra_finch_R3428_40932.67397799_1_24_18_43_17...	1.122	1.318	vocal
3	zebra_finch_R3428_40932.67397799_1_24_18_43_17...	2.093	2.117	vocal
4	zebra_finch_R3428_40932.67397799_1_24_18_43_17...	2.162	2.207	vocal
...	...	...	...	...
269	zebra_finch_R3428_40932.31154143_1_24_8_39_14.wav	2.253	2.372	vocal
270	zebra_finch_R3428_40932.31154143_1_24_8_39_14.wav	2.615	2.727	vocal
271	zebra_finch_R3428_40932.31154143_1_24_8_39_14.wav	2.888	2.972	vocal
272	zebra_finch_R3549_40999.66669408_3_31_18_31_9.wav	0.010	0.110	vocal
273	zebra_finch_R3549_40999.66669408_3_31_18_31_9.wav	1.742	1.843	vocal

274 rows × 4 columns
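
When a whole folder is segmented, it is often useful to aggregate the rows per file. The sketch below groups rows by the `filename` column and reports the segment count and total duration for each file. It assumes the CSV header matches the table above; `per_file_summary` is a hypothetical helper, and the sample rows are copied from the tail of that table.

```python
import csv
import io
from collections import defaultdict


def per_file_summary(csv_text):
    """Group segments by filename; return {filename: (count, total_duration)}."""
    counts = defaultdict(int)
    durations = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["filename"]
        counts[name] += 1
        durations[name] += float(row["offset"]) - float(row["onset"])
    return {name: (counts[name], round(durations[name], 3)) for name in counts}


# Three rows copied from the end of the table above
sample = """filename,onset,offset,cluster
zebra_finch_R3428_40932.31154143_1_24_8_39_14.wav,2.888,2.972,vocal
zebra_finch_R3549_40999.66669408_3_31_18_31_9.wav,0.010,0.110,vocal
zebra_finch_R3549_40999.66669408_3_31_18_31_9.wav,1.742,1.843,vocal
"""
for name, (n, total) in per_file_summary(sample).items():
    print(name, n, total)
```

To run it on real output, pass the file contents instead of the sample string, for instance `per_file_summary(open("out.csv").read())`.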

Use WhisperSeg in Python code

Please refer to the section Use WhisperSeg in Python Code below.

Run WhisperSeg as a Web Service, and call it via API

Please refer to the tutorial: Run WhisperSeg as a Web Service
This allows running WhisperSeg on a web server and calling the segmentation service from clients in different environments, such as Python or MATLAB. It is the best way to incorporate WhisperSeg into your existing workflow.

Try WhisperSeg on a GUI (Webpage)

Please refer to the tutorial: Run WhisperSeg via GUI

Use WhisperSeg in Python Code

We demonstrate here how to use a WhisperSeg model trained on multi-species data to segment audio files of different species.

Note: If you are using your custom models, replace the model's name ("nccratliri/whisperseg-large-ms" or "nccratliri/whisperseg-large-ms-ct2") with your own trained model's name.

Load the pretrained multi-species WhisperSeg

Huggingface model

from model import WhisperSegmenter
segmenter = WhisperSegmenter( "nccratliri/whisperseg-large-ms", device="cuda" )

CTranslate2 version for faster inference

Alternatively, we provide a CTranslate2-converted version, which enables 4x faster inference.

To use the CTranslate2-converted model (checkpoint name ending with "-ct2"), import the "WhisperSegmenterFast" class instead.

from model import WhisperSegmenterFast
segmenter = WhisperSegmenterFast( "nccratliri/whisperseg-large-ms-ct2", device="cuda" )

Illustration of segmentation parameters

The following parameters need to be configured for different species when calling the segment function.

sr: sampling rate $f_s$ of the audio when loading
spec_time_step: Spectrogram Time Resolution. By default, one single input spectrogram of WhisperSeg contains 1000 columns. 'spec_time_step' represents the time difference between two adjacent columns in the spectrogram. It is equal to FFT_hop_size / sampling_rate: $\frac{L_\text{hop}}{f_s}$ .
min_frequency: (Optional) The minimum frequency when computing the Log Melspectrogram. Frequency components below min_frequency will not be included in the input spectrogram. Default: 0
min_segment_length: (Optional) The minimum allowed length of predicted segments. The predicted segments whose length is below 'min_segment_length' will be discarded. Default: spec_time_step * 2
eps: (Optional) The threshold $\epsilon_\text{vote}$ during the multi-trial majority voting when processing long audio files. Default: spec_time_step * 8
num_trials: (Optional) The number of segmentation variants produced during the multi-trial majority voting process. Set num_trials to 1 for noisy data with long segment durations, such as the human AVA-Speech dataset, and to 3 when segmenting animal vocalizations. Default: 3

Recommended settings for these parameters are available at config/segment_config.json. More details are given in Table 1 of the paper.
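
The relations between these parameters can be made concrete with a little arithmetic. The sketch below derives the FFT hop size from `spec_time_step = L_hop / f_s`, the audio duration covered by one 1000-column spectrogram, and the documented defaults for `min_segment_length` and `eps`. `derived_settings` is a hypothetical helper written for illustration; the (sr, spec_time_step) pairs are the ones used in the segmentation examples of this README.

```python
def derived_settings(sr, spec_time_step, num_columns=1000):
    """Derive quantities implied by sr and spec_time_step.

    num_columns=1000 is the default spectrogram width described above.
    """
    return {
        # hop size in samples, from spec_time_step = L_hop / f_s
        "hop_size": round(spec_time_step * sr),
        # audio duration covered by one input spectrogram, in seconds
        "window_seconds": num_columns * spec_time_step,
        # documented defaults of the optional parameters
        "min_segment_length": spec_time_step * 2,
        "eps": spec_time_step * 8,
    }


# (sr, spec_time_step) pairs taken from the segmentation examples in this README
examples = {
    "zebra_finch": (32000, 0.0025),
    "marmoset": (48000, 0.0025),
    "mouse": (300000, 0.0005),
    "human": (16000, 0.01),
}
for name, (sr, step) in examples.items():
    print(name, derived_settings(sr, step))
```

A smaller `spec_time_step` gives finer time resolution but shortens the stretch of audio that fits into a single spectrogram.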

Segmentation Examples

import librosa
import json
from audio_utils import SpecViewer
### SpecViewer is a customized class for interactive spectrogram viewing
spec_viewer = SpecViewer()

Zebra finch (adults)

sr = 32000
spec_time_step = 0.0025  

audio, _ = librosa.load( "data/example_subset/Zebra_finch/test_adults/zebra_finch_g17y2U-f00007.wav",
                         sr = sr )
## Note if spec_time_step is not provided, a default value will be used by the model.
prediction = segmenter.segment(  audio, sr = sr, spec_time_step = spec_time_step )
print(prediction)

{'onset': [0.01, 0.38, 0.603, 0.758, 0.912, 1.813, 1.967, 2.073, 2.838, 2.982, 3.112, 3.668, 3.828, 3.953, 5.158, 5.323, 5.467], 'offset': [0.073, 0.447, 0.673, 0.83, 1.483, 1.882, 2.037, 2.643, 2.893, 3.063, 3.283, 3.742, 3.898, 4.523, 5.223, 5.393, 6.043], 'cluster': ['zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0', 'zebra_finch_0']}
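
The prediction is a column-style dictionary with parallel `onset`, `offset` and `cluster` lists. For downstream analysis it can be convenient to iterate over it segment by segment. The following is a small sketch with a hypothetical helper `prediction_to_rows`; the example values are the first three segments of the output printed above.

```python
def prediction_to_rows(prediction):
    """Turn the column-style prediction dict into a list of per-segment tuples."""
    return [
        (onset, offset, cluster, round(offset - onset, 3))  # last item: duration in s
        for onset, offset, cluster in zip(
            prediction["onset"], prediction["offset"], prediction["cluster"]
        )
    ]


# First three segments of the output printed above
example_prediction = {
    "onset": [0.01, 0.38, 0.603],
    "offset": [0.073, 0.447, 0.673],
    "cluster": ["zebra_finch_0"] * 3,
}
for row in prediction_to_rows(example_prediction):
    print(row)
```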

spec_viewer.visualize( audio = audio, sr = sr, prediction = prediction,
                       window_size=8, precision_bits=1
                     )

Let's load the human-annotated segments and compare them with WhisperSeg's prediction.

label = json.load( open("data/example_subset/Zebra_finch/test_adults/zebra_finch_g17y2U-f00007.json") )
spec_viewer.visualize( audio = audio, sr = sr, prediction = prediction, label=label,
                       window_size=8, precision_bits=1
                     )
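
Besides the visual comparison, the agreement between prediction and annotation can be quantified. The sketch below computes a simple frame-level precision, recall and F1 score. It assumes that the label JSON holds `onset` and `offset` lists in the same layout as the prediction, and it ignores the `cluster` field. It is an illustrative simplification with hypothetical helper names, not the evaluation implemented in evaluate.py.

```python
def to_frames(onsets, offsets, duration, frame_step):
    """Rasterize segments into a list of booleans, one per frame of length frame_step."""
    n_frames = int(round(duration / frame_step))
    frames = [False] * n_frames
    for onset, offset in zip(onsets, offsets):
        start = max(0, int(round(onset / frame_step)))
        end = min(n_frames, int(round(offset / frame_step)))
        for i in range(start, end):
            frames[i] = True
    return frames


def frame_f1(prediction, label, duration, frame_step=0.001):
    """Frame-level precision, recall and F1 between two onset/offset dicts."""
    pred = to_frames(prediction["onset"], prediction["offset"], duration, frame_step)
    true = to_frames(label["onset"], label["offset"], duration, frame_step)
    tp = sum(p and t for p, t in zip(pred, true))      # voiced in both
    fp = sum(p and not t for p, t in zip(pred, true))  # predicted only
    fn = sum(t and not p for p, t in zip(pred, true))  # annotated only
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

With the variables from the example above, a call would look like `frame_f1(prediction, label, duration=len(audio) / sr)`.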

Zebra finch (juveniles)

sr = 32000
spec_time_step = 0.0025

audio_file = "data/example_subset/Zebra_finch/test_juveniles/zebra_finch_R3428_40932.29996086_1_24_8_19_56.wav"
label_file = audio_file[:-4] + ".json"
audio, _ = librosa.load( audio_file, sr = sr )
label = json.load( open(label_file) )

prediction = segmenter.segment(  audio, sr = sr, spec_time_step = spec_time_step )
spec_viewer.visualize( audio = audio, sr = sr, prediction = prediction, label=label,
                       window_size=15, precision_bits=1 )

Bengalese finch

sr = 32000
spec_time_step = 0.0025

audio_file = "data/example_subset/Bengalese_finch/test/bengalese_finch_bl26lb16_190412_0721.20144_0.wav"
label_file = audio_file[:-4] + ".json"
audio, _ = librosa.load( audio_file, sr = sr )
label = json.load( open(label_file) )

prediction = segmenter.segment(  audio, sr = sr, spec_time_step = spec_time_step )
spec_viewer.visualize( audio = audio, sr = sr, prediction = prediction, label=label,
                       window_size=3 )

Marmoset

sr = 48000
spec_time_step = 0.0025

audio_file = "data/example_subset/Marmoset/test/marmoset_pair4_animal1_together_A_0.wav"
label_file = audio_file[:-4] + ".json"
audio, _ = librosa.load( audio_file, sr = sr )
label = json.load( open(label_file) )

prediction = segmenter.segment(  audio, sr = sr, spec_time_step = spec_time_step )
spec_viewer.visualize( audio = audio, sr = sr, prediction = prediction, label=label )

Mouse

sr = 300000
spec_time_step = 0.0005
"""Since mice produce high-frequency vocalizations, we set min_frequency to a large value (instead of 0)
   so that the Mel spectrogram's frequency range matches that of the mouse vocalizations"""
min_frequency = 35000  

audio_file = "data/example_subset/Mouse/test/mouse_Rfem_Afem01_0.wav"
label_file = audio_file[:-4] + ".json"
audio, _ = librosa.load( audio_file, sr = sr )
label = json.load( open(label_file) )

prediction = segmenter.segment(  audio, sr = sr, min_frequency = min_frequency, spec_time_step = spec_time_step )
spec_viewer.visualize( audio = audio, sr = sr, min_frequency= min_frequency, prediction = prediction, label=label )
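
When choosing `min_frequency`, keep in mind that a signal sampled at `sr` cannot represent frequencies above the Nyquist frequency `sr / 2`, so `min_frequency` has to stay below it. The sketch below checks this for the mouse settings. `usable_band` is a hypothetical helper, and whether WhisperSeg uses exactly `sr / 2` as the upper edge of its Mel spectrogram is an assumption that is not stated here.

```python
def usable_band(sr, min_frequency=0):
    """Return (min_frequency, nyquist) after checking that the band is non-empty."""
    nyquist = sr / 2  # highest frequency representable at sampling rate sr
    if not 0 <= min_frequency < nyquist:
        raise ValueError(
            f"min_frequency={min_frequency} must lie in [0, {nyquist}) for sr={sr}"
        )
    return min_frequency, nyquist


# Mouse settings from the example above
low, high = usable_band(sr=300000, min_frequency=35000)
print(f"usable band: {low} Hz to {high} Hz")
```

The same check applied to a 32000 Hz recording would reject `min_frequency = 35000`, since the Nyquist frequency there is only 16000 Hz.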

Human (AVA-Speech)

sr = 16000
spec_time_step = 0.01
"""For human speech the multi-trial voting is not so effective, so we set num_trials=1 instead of the default value (3)"""
num_trials = 1

audio_file = "data/example_subset/Human_AVA_Speech/test/human_xO4ABy2iOQA_clip.wav"
label_file = audio_file[:-4] + ".json"
audio, _ = librosa.load( audio_file, sr = sr )
label = json.load( open(label_file) )

prediction = segmenter.segment(  audio, sr = sr, spec_time_step = spec_time_step, num_trials = num_trials )
spec_viewer.visualize( audio = audio, sr = sr, prediction = prediction, label=label,
                       window_size=20, precision_bits=0, xticks_step_size = 2 )

Citation

When using our code or models for your work, please cite the following paper:

@inproceedings{gu2024positive,
  title={Positive Transfer of the Whisper Speech Transformer to Human and Animal Voice Activity Detection},
  author={Gu, Nianlong and Lee, Kanghwi and Basha, Maris and Ram, Sumit Kumar and You, Guanghao and Hahnloser, Richard HR},
  booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={7505--7509},
  year={2024},
  organization={IEEE}
}

Contact

Nianlong Gu [email protected]
