This repository contains the codebase of my thesis on detecting, segmenting and classifying Lemur catta vocalisations in outdoor audio recordings. It is a fork of the original WhisperSeg model by Gu et al., as proposed in their recent paper.
The environment can be created directly from the provided environment file:
conda env create -f environment.yml
Alternatively, install the dependencies manually:
Linux:
conda create -n wseg python=3.10 -y
conda activate wseg
pip install -r requirements.txt
conda install -c conda-forge cudnn -y
Windows:
conda create -n wseg python=3.10 -y
conda activate wseg
pip install -r requirements_windows.txt
conda install -c conda-forge cudnn -y
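As an optional sanity check (assuming PyTorch is installed by the requirements file, which WhisperSeg depends on), verify that the GPU is visible from Python before continuing:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"  # should print True on a working CUDA setup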
In order to train a new model, a set of data files must first be prepared. To start, split the .wav audio files into halves using split_wavs.py and move the resulting splits into a new directory (they are currently saved in the same directory as the originals).
python util/split_wavs.py --path path/to/data
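A minimal sketch for collecting the splits in a separate folder; the split file names are an assumption here, so adjust the glob to whatever suffix split_wavs.py actually appends:
# hypothetical split suffix -- check the actual output names of split_wavs.py
mkdir -p path/to/splits
mv path/to/data/*_split*.wav path/to/splits/
In the commands below, path/to/data then refers to wherever the annotated splits end up.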
Next, annotate the split recordings using Raven Pro/Lite and place the Raven selection tables into the same folder. The naming convention assumed for further processing is <.wav_file>.Table.1.selections.txt.
Then run the code for table cleaning, conversion to .json, and trimming of unannotated audio from the beginning and end of the files. This also splits the results into separate pretraining and finetuning folders.
python util/clean_tables.py --path path/to/data/
# duplicate data, as trimming alters files in-place
mkdir -p pretrain finetune
cp -r path/to/data/* pretrain
cp -r path/to/data/* finetune
python util/make_json.py --file_path ./pretrain --output_path ./pretrain
python util/make_json.py --file_path ./finetune --output_path ./finetune
python util/trim_wavs.py --file_path ./pretrain
python util/trim_wavs.py --file_path ./finetune
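At this point, each of the two folders should hold paired audio and label files. The listing below is only an illustration, assuming make_json.py follows the usual WhisperSeg convention of one .json label file per .wav:
ls ./finetune
# recording_01.wav  recording_01.json  recording_02.wav  recording_02.json  ...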
Refer to make_json.py for switches that modify the tolerance or clip_duration values, and for ways to filter and convert annotations (e.g. single calls, merging targets).
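For example, a run with explicit values could look as follows. The switch names --tolerance and --clip_duration are assumptions inferred from the parameter names above, and the values are purely illustrative; check make_json.py (or its --help output) for the actual interface:
# assumed switch names and illustrative values -- verify with: python util/make_json.py --help
python util/make_json.py --file_path ./finetune --output_path ./finetune --tolerance 0.5 --clip_duration 10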
Using the prepared data, models can now be trained. This process consists of a pretraining and a finetuning step. Gu et al. recommend starting from their model checkpoints, which come with a multi-species pretraining history, for better performance. These checkpoints are available for the Whisper-Base and Whisper-Large model architectures and are downloaded automatically when a model is referenced by its Hugging Face identifier. To train, execute the following commands one after the other. Each step may take some time, depending on your compute resources.
python train.py \
--initial_model_path nccratliri/whisperseg-base-animal-vad \
--train_dataset_folder path/to/pretrain_data \
--model_folder path/to/save_trained_model \
--gpu_list 0 \
--max_num_epochs 10
python train.py \
--initial_model_path path/to/<pretrained_model>/final_checkpoint \
--train_dataset_folder path/to/finetune_data \
--model_folder path/to/save_trained_model \
--gpu_list 0 \
--max_num_epochs 10
If you have file(s) set aside for testing, you can evaluate the model's segmentation performance by running:
python evaluate.py \
--dataset_path path/to/test_data \
--model_path path/to/<finetuned_model>/final_checkpoint_ct2 \
--output_dir path/to/results_dir
For shell scripts that run these steps for you, refer to train_base.sh, evaluate_large.sh and infer_large.sh. These scripts were written for a SLURM-controlled HPC environment and handle moving data to a working directory, fully training a model, evaluating it, and cleaning up after themselves. If you do not have access to such an environment, they are nonetheless helpful for understanding the training process.
More in-depth explanations of data processing, model training and evaluation can be found in the documentation of the original WhisperSeg implementation by Gu et al. (here and here).
The code to reproduce all experiments in the thesis can be found in jobs. Experiments that rely on a specific preparation of the data come with a prepare_<exp>.sh script that processes the data into the required state. Otherwise, each experiment consists of one or more job_<exp>.sh files (e.g. for rtx5000 vs. v100) and a run_<exp>.sh file that sends a number of these jobs to the HPC controller.
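As a rough sketch, a run script boils down to submitting every job variant of one experiment to SLURM; the real scripts in jobs/ may set additional options, so treat this only as an illustration:
#!/bin/bash
# sketch of a run_<exp>.sh: submit each job variant of one experiment
for job in jobs/job_<exp>*.sh; do
    sbatch "$job"
done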
Experiments:
- Baseline-Pre: Link
- Batchsize & Learning rate: Link
- Patience: Link
- Validation ratio: Link
- Tolerance: Link
- Clip duration: Link
- Call / no-call: Link
- Single call: Link
- 9 calls: Link
- Additional pretraining: Link
- Class balancing: Link
- Strategy augmentation: Link
- 7+3 Un/Curated: Link
- 7+150: Link
Acknowledgements:
- Nianlong Gu, for his kind assistance with questions about the original codebase