This repository implements the audio-only part of our DCASE 2024 challenge submission.
```bibtex
@techreport{Berg_LU_task3_report,
  author      = {Berg, Axel and Engman, Johanna and Gulin, Jens and {\AA}str{\"o}m, Karl and Oskarsson, Magnus},
  title       = {The LU System for DCASE 2024 Sound Event Localization and Detection Challenge},
  institution = {DCASE2024 Challenge},
  year        = {2024}
}
```
A more detailed description of the method can be found in the arXiv paper:

```bibtex
@article{berg2024learning,
  title   = {Learning Multi-Target TDOA Features for Sound Event Localization and Detection},
  author  = {Berg, Axel and Engman, Johanna and Gulin, Jens and {\AA}str{\"o}m, Karl and Oskarsson, Magnus},
  journal = {arXiv preprint arXiv:2408.17166},
  year    = {2024}
}
```
This repository is based on the DCASE 2024 SELD challenge baseline implementation.
Please visit the official webpage of the DCASE 2024 Challenge for any details not covered in this repo.
Download the MIC dataset from here: Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23)
To train the models with (optional) additional simulated recordings, download the official synthetic dataset from here: [DCASE2024 Task 3] Synthetic SELD mixtures for baseline training
For our challenge submission, we also generated extra synthetic data; for this, we refer to Spatial Scaper.
In addition, we use channel swapping for data augmentation. We used SoundQ to apply this to the real recordings, as sketched below.
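Channel swapping exploits the symmetry of the microphone array: permuting the channels corresponds to a rotation or reflection of the scene, so the DOA labels have to be transformed to match. The sketch below is a minimal NumPy illustration of the idea; the permutation and angle map shown are assumptions for illustration, and the full set of valid transforms for the tetrahedral array is what SoundQ implements.

```python
import numpy as np

def swap_channels(audio, azi, ele):
    """Illustrative channel-swap augmentation.

    audio: (n_channels, n_samples) microphone-array recording
    azi, ele: source azimuth/elevation labels in degrees
    """
    # Hypothetical channel permutation and matching label transform;
    # the real transforms depend on the array geometry (see SoundQ).
    perm = [1, 0, 3, 2]
    audio_aug = audio[perm]   # swap the channels
    azi_aug = -azi            # matching reflection of the azimuth
    ele_aug = ele             # elevation unchanged in this example
    return audio_aug, azi_aug, ele_aug

# usage: x has shape (4, n_samples) for the 4-channel MIC format
x = np.random.randn(4, 24000)
x_aug, azi_aug, ele_aug = swap_channels(x, azi=30.0, ele=10.0)
```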
This repository consists of multiple Python scripts:

- The `batch_feature_extraction.py` script is a standalone wrapper that extracts the features and labels and normalizes the training and test split features for a given dataset. Make sure you update the location of the downloaded datasets first.
- The `parameter.py` script contains all the training, model, and feature parameters. If you want to change some parameters, create a sub-task with a unique id here; check the code for examples, and see the sketch after this list.
- The `cls_feature_class.py` script has routines for label creation, feature extraction, and normalization.
- The `cls_vid_features.py` script extracts video features for the audio-visual task from a pretrained ResNet model.
- The `cls_data_generator.py` script provides feature + label data in generator mode for training.
- The `seldnet_model.py` script implements the SELDnet baseline architecture.
- The CST-Former architecture is found in `cst_former/`.
- The NGCC-PHAT architecture is found in `ngcc/`.
- The `SELD_evaluation_metrics.py` script implements the metrics for joint evaluation of detection and localization.
- The `train_seldnet.py` script is a wrapper that performs pre-training of NGCC-PHAT and end-to-end training of SELDnet/CST-Former.
- The `cls_compute_seld_results.py` script computes the metric results on your DCASE output format files.
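As an illustration, adding a new sub-task in `parameter.py` could look roughly like the sketch below. The defaults and parameter names here are assumptions for illustration; check the actual `parameter.py` for the real ones.

```python
def get_params(argv='1'):
    # Default parameters (truncated, illustrative sketch; the real
    # parameter.py defines many more entries).
    params = dict(
        dataset='mic',       # audio format
        batch_size=128,
    )

    # Each sub-task overrides the defaults under its own unique id.
    if argv == '999':        # hypothetical new sub-task id
        params['batch_size'] = 64

    return params
```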
- For the MIC dataset, download the zip files. These contain both the audio files and the respective metadata.
- Unzip the files under `data_2024/`.
- Extract features from the downloaded dataset by running the `batch_feature_extraction.py` script as shown below. This will dump the normalized features and labels in the `feat_label_dir` folder. The script can compute all possible features and labels; you control which by editing the `parameter.py` script before running `batch_feature_extraction.py`.
```
# GCC + MS
python batch_feature_extraction.py 6

# SALSA-Lite
python batch_feature_extraction.py 7

# NGCC + MS
python batch_feature_extraction.py 9
```
NGCC-PHAT computes features from raw audio on the fly. However, setting `params['saved_chunks'] = True` will convert the `.wav` files to `.npy`, which makes them faster to load.
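For reference, the classical GCC-PHAT cross-correlation that the GCC+MS features are built on, and that NGCC-PHAT replaces with a learned version, can be sketched in a few lines of NumPy. This is a minimal sketch for intuition, not the repo's implementation:

```python
import numpy as np

def gcc_phat(x1, x2, n_fft=1024):
    """GCC-PHAT cross-correlation between two microphone signals.

    The cross-power spectrum is whitened by its magnitude (the PHAT
    weighting), so the inverse FFT has a sharp peak at the time
    delay between the two channels.
    """
    X1 = np.fft.rfft(x1, n=n_fft)
    X2 = np.fft.rfft(x2, n=n_fft)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-8   # PHAT whitening
    cc = np.fft.irfft(cross, n=n_fft)
    # reorder so that lag 0 sits at the center of the vector
    return np.concatenate((cc[-(n_fft // 2):], cc[:n_fft // 2 + 1]))

# usage: TDOA estimate (in samples) for one microphone pair
rng = np.random.default_rng(0)
x1 = rng.standard_normal(1024)
x2 = np.roll(x1, 5)                  # x2 is x1 delayed by 5 samples
cc = gcc_phat(x1, x2)
print(np.argmax(cc) - len(cc) // 2)  # peak at lag -5 in this convention
```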
- We provide a pre-trained NGCC-PHAT feature extractor in `models/9_tdoa-3tracks-16channels.h5`. If you want to perform ADPIT pre-training of the NGCC-PHAT network yourself, run

```
python train_seldnet.py 9 my_pretraining
```
- Finally, you can now train the SELDnet baseline or CST-Former using

```
python3 train_seldnet.py <task-id> <job-id> <seed>
```
The default training setups are:

```
# Baseline w/ NGCC+MS
python3 train_seldnet.py 10 my_experiment

# Baseline w/ GCC+MS
python3 train_seldnet.py 6 my_experiment

# Baseline w/ SALSA-Lite
python3 train_seldnet.py 7 my_experiment

# CST-Former w/ NGCC+MS
python3 train_seldnet.py 333 my_experiment

# CST-Former w/ GCC+MS
python3 train_seldnet.py 33 my_experiment

# CST-Former w/ SALSA-Lite
python3 train_seldnet.py 34 my_experiment
```
The evaluation metric scores for the test split of the development dataset with the provided pre-trained models are given below.
| Model | F<sub>20°</sub> | DOAE<sub>CD</sub> | RDE<sub>CD</sub> | weights |
|---|---|---|---|---|
| CST-Former Small w/ NGCC | 28.3% | 23.7° | 0.42 | `models/333_cst-3t16c.h5` |
| CST-Former Large w/ NGCC | 31.3% | 21.6° | 0.48 | `models/333_cst-3t16c-large.h5` |
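If you want to inspect a provided checkpoint outside of `train_seldnet.py`, a minimal sketch is shown below, assuming the `.h5` files are PyTorch state dicts saved with `torch.save` as in the DCASE baseline; the model construction is a hypothetical placeholder.

```python
import torch

# Minimal sketch, assuming the checkpoint is a PyTorch state dict
# (as in the DCASE baseline). Model construction is a placeholder;
# see train_seldnet.py for how the network is actually built.
state_dict = torch.load('models/333_cst-3t16c.h5', map_location='cpu')
print(f'checkpoint contains {len(state_dict)} parameter tensors')

# model = build_model(params)      # hypothetical constructor
# model.load_state_dict(state_dict)
# model.eval()
```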
This repository is based on the DCASE 2024 SELD challenge baseline implementation.
The original NGCC implementation can be found here.
The CST-Former implementation is borrowed from this repository. Please cite their work as well:
```bibtex
@inproceedings{shul2024cstformer,
  author    = {Shul, Yusun and Choi, Jung-Woo},
  title     = {CST-Former: Transformer with Channel-Spectro-Temporal Attention for Sound Event Localization and Detection},
  booktitle = {ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year      = {2024},
  pages     = {8686--8690},
  doi       = {10.1109/ICASSP48485.2024.10447181}
}
```
This repo and its contents are released under the MIT License.