Alexis Plaquet and Hervé Bredin
Proc. Interspeech 2023.
Since its introduction in 2019, the whole end-to-end neural diarization (EEND) line of work has been addressing speaker diarization as a frame-wise multi-label classification problem with permutation-invariant training. Despite EEND showing great promise, a few recent works took a step back and studied the possible combination of (local) supervised EEND diarization with (global) unsupervised clustering. Yet, these hybrid contributions did not question the original multi-label formulation. We propose to switch from multi-label (where any two speakers can be active at the same time) to powerset multi-class classification (where dedicated classes are assigned to pairs of overlapping speakers). Through extensive experiments on 9 different benchmarks, we show that this formulation leads to significantly better performance (mostly on overlapping speech) and robustness to domain mismatch, while eliminating the detection threshold hyperparameter, critical for the multi-label formulation.
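To make the powerset formulation concrete, here is a minimal illustrative sketch (ours, not the paper's code) of how a multi-label frame target is mapped to a single powerset class when chunks contain at most 4 speakers and at most 2 of them may be active at the same time, matching the configuration used later in this README. The helper name and class ordering are assumptions made for the example.

```python
from itertools import combinations
import numpy as np

# with at most 4 speakers per chunk and at most 2 active at once,
# the powerset has 1 (silence) + 4 (single speakers) + 6 (pairs) = 11 classes
K, MAX_SIMULTANEOUS = 4, 2
powerset = [
    frozenset(c)
    for n in range(MAX_SIMULTANEOUS + 1)
    for c in combinations(range(K), n)
]

def multilabel_to_powerset(frame_labels: np.ndarray) -> int:
    """Map one multi-label frame (binary vector of length K) to its powerset class index."""
    active = frozenset(np.flatnonzero(frame_labels).tolist())
    return powerset.index(active)

print(len(powerset))                                    # 11 classes
print(multilabel_to_powerset(np.array([0, 1, 1, 0])))   # class where speakers {1, 2} overlap
```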
@inproceedings{Plaquet2023,
title={Powerset multi-class cross entropy loss for neural speaker diarization},
author={Plaquet, Alexis and Bredin, Herv\'{e}},
year={2023},
booktitle={Proc. Interspeech 2023},
}
@inproceedings{Bredin2023,
title={pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe},
author={Bredin, Herv\'{e}},
year={2023},
booktitle={Proc. Interspeech 2023},
}
Performance obtained with a model pretrained on AISHELL, AliMeeting, AMI, Ego4D, MSDWild, REPERE, and VoxConverse (see the paper for more details).
The pretrained model checkpoint and corresponding pipeline hyperparameters used to obtain these results are available.
Dataset | DER (%) | False alarm (%) | Missed detection (%) | Speaker confusion (%) | Output | Metrics |
---|---|---|---|---|---|---|
AISHELL-4 (channel 1) | 16.85 | 3.25 | 6.29 | 7.30 | 📋 | 📈 |
AliMeeting (channel 1) | 23.30 | 3.69 | 10.38 | 9.22 | 📋 | 📈 |
AMI (headset mix) | 19.71 | 4.51 | 8.54 | 6.66 | 📋 | 📈 |
AMI (array1, channel 1) | 21.96 | 5.06 | 9.65 | 7.26 | 📋 | 📈 |
Ego4D v1 (validation) | 57.25 | 6.33 | 30.75 | 20.16 | 📋 | 📈 |
MSDWild | 29.17 | 6.64 | 8.80 | 13.73 | 📋 | 📈 |
REPERE (phase 2) | 8.35 | 2.10 | 2.36 | 3.89 | 📋 | 📈 |
VoxConverse (v0.3) | 11.56 | 4.74 | 2.68 | 4.14 | 📋 | 📈 |
DIHARD-3 (Full) | 29.90 | 14.40 | 7.18 | 8.33 | 📋 | 📈 |
This American Life | 21.83 | 2.25 | 12.85 | 6.74 | 📋 | 📈 |
AVA-AVD | 60.60 | 18.54 | 16.19 | 25.87 | 📋 | 📈 |
Performance obtained after further training (fine-tuning) the pretrained model on each dataset's domain.
Dataset | DER (%) | False alarm (%) | Missed detection (%) | Speaker confusion (%) | Output | Metrics | Checkpoint | Hyperparams |
---|---|---|---|---|---|---|---|---|
AISHELL-4 (channel 1) | 13.21 | 4.42 | 3.29 | 5.50 | 📋 | 📈 | 💾 | 🔧 |
AliMeeting (channel 1) | 24.49 | 4.62 | 8.80 | 11.07 | 📋 | 📈 | 💾 | 🔧 |
AMI (headset mix) | 17.98 | 4.34 | 8.21 | 5.43 | 📋 | 📈 | 💾 | 🔧 |
AMI (array1, channel 1) | 22.90 | 4.81 | 9.76 | 8.33 | 📋 | 📈 | 💾 | 🔧 |
Ego4D v1 (validation) | 48.16 | 8.88 | 21.35 | 17.93 | 📋 | 📈 | 💾 | 🔧 |
MSDWild | 28.51 | 6.10 | 8.07 | 14.34 | 📋 | 📈 | 💾 | 🔧 |
REPERE (phase 2) | 8.16 | 1.92 | 2.64 | 3.60 | 📋 | 📈 | 💾 | 🔧 |
VoxConverse (v0.3) | 10.35 | 3.86 | 2.77 | 3.72 | 📋 | 📈 | 💾 | 🔧 |
DIHARD-3 (Full) | 21.31 | 4.77 | 8.72 | 7.82 | 📋 | 📈 | 💾 | 🔧 |
AVA-AVD | 46.45 | 6.71 | 17.75 | 21.98 | 📋 | 📈 | 💾 | 🔧 |
The pyannote.audio version used to train these models is commit e3dc7d6 (more precisely, commit 1f83e0b with the changes from commit e3dc7d6 cherry-picked, although this should not matter for training).
More recent versions should also work. You only need to clone/download and install pyannote.audio and its dependencies. See pyannote.audio's README for more details.
An example notebook is available; it shows how to load a powerset model (for example, one from models/), how to train it further on a toy dataset, and how to obtain both its local segmentation output and the final diarization output.
from pyannote.audio.pipelines import SpeakerDiarization as SpeakerDiarizationPipeline

# constants (params from the pyannote/speaker-diarization huggingface pipeline)
WAV_FILE = "../pyannote-audio/tutorials/assets/sample.wav"
MODEL_PATH = "models/powerset/powerset_pretrained.ckpt"

PIPELINE_PARAMS = {
    "clustering": {
        "method": "centroid",
        "min_cluster_size": 15,
        "threshold": 0.7153814381597874,
    },
    "segmentation": {
        "min_duration_off": 0.5817029604921046,
        # "threshold": 0.4442333667381752,  # does not apply to powerset
    },
}

# create, instantiate and apply the pipeline
pipeline = SpeakerDiarizationPipeline(
    segmentation=MODEL_PATH,
    embedding="speechbrain/spkrec-ecapa-voxceleb",
    embedding_exclude_overlap=True,
    clustering="AgglomerativeClustering",
)
pipeline.instantiate(PIPELINE_PARAMS)
diarization = pipeline(WAV_FILE)
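The pipeline call returns a pyannote.core Annotation. As a quick follow-up (reusing the `diarization` variable from the snippet above), you can iterate over the predicted speaker turns or save them in RTTM format:

```python
# print one line per predicted speaker turn
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")

# save the diarization output in RTTM format
with open("sample.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
```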
from pyannote.audio import Model
from pyannote.audio.core.inference import Inference

MODEL_PATH = "models/powerset/powerset_pretrained.ckpt"
WAV_FILE = "../pyannote-audio/tutorials/assets/sample.wav"

# load the pretrained powerset segmentation model and run sliding-window inference
model = Model.from_pretrained(MODEL_PATH)
inference = Inference(model, step=5.0)
segmentation = inference(WAV_FILE)
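As a quick sanity check, you can inspect the local segmentation scores. This sketch assumes the default sliding-window mode of Inference, which wraps the scores in a pyannote.core.SlidingWindowFeature; `segmentation` is the variable assigned above.

```python
# the exact output layout depends on Inference's aggregation settings;
# in the default sliding-window mode the scores come back as a
# pyannote.core.SlidingWindowFeature wrapping a numpy array of per-frame values
print(type(segmentation))
print(segmentation.data.shape)
```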
You can train your own version of the model using the pyannote.audio develop branch (installation instructions in pyannote.audio's README), or pyannote.audio v3.x once it is released.
The SpeakerDiarization task can be set to use the powerset or multi-label representation through its max_speakers_per_frame constructor parameter: "Maximum number of (overlapping) speakers per frame. Setting this value to 1 or more enables powerset multi-class training."
In the following example, the model is ready to be trained with the powerset multi-class setting described in the paper: it handles at most 4 speakers per 5-second chunk and at most 2 simultaneously active speakers.
from pyannote.database import registry
from pyannote.audio.tasks import SpeakerDiarization
from pyannote.audio.models.segmentation import PyanNet

# load your own pyannote.database protocol
my_protocol = registry.get_protocol('MyProtocol.SpeakerDiarization.Custom')

# powerset multi-class training: at most 4 speakers per 5-second chunk,
# at most 2 simultaneously active speakers per frame
seg_task = SpeakerDiarization(
    my_protocol,
    duration=5.0,
    max_speakers_per_chunk=4,
    max_speakers_per_frame=2,
)
model = PyanNet(task=seg_task)
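A minimal training sketch could then look as follows, assuming a standard pytorch-lightning setup (the trainer settings below are illustrative, not the configuration used in the paper). To fine-tune the released checkpoint instead of training from scratch, the usual pyannote.audio recipe of loading it with Model.from_pretrained and assigning the task also applies.

```python
import pytorch_lightning as pl

from pyannote.audio import Model

# option 1: train the freshly created model from scratch
trainer = pl.Trainer(max_epochs=1, devices=1)  # illustrative settings only
trainer.fit(model)

# option 2: fine-tune the released powerset checkpoint on your own data
pretrained = Model.from_pretrained("models/powerset/powerset_pretrained.ckpt")
pretrained.task = seg_task
trainer = pl.Trainer(max_epochs=1, devices=1)
trainer.fit(pretrained)
```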