Alexis Plaquet and Hervé Bredin
Proc. Interspeech 2023.
Since its introduction in 2019, the whole end-to-end neural diarization (EEND) line of work has been addressing speaker diarization as a frame-wise multi-label classification problem with permutation-invariant training. Despite EEND showing great promise, a few recent works took a step back and studied the possible combination of (local) supervised EEND diarization with (global) unsupervised clustering. Yet, these hybrid contributions did not question the original multi-label formulation. We propose to switch from multi-label (where any two speakers can be active at the same time) to powerset multi-class classification (where dedicated classes are assigned to pairs of overlapping speakers). Through extensive experiments on 9 different benchmarks, we show that this formulation leads to significantly better performance (mostly on overlapping speech) and robustness to domain mismatch, while eliminating the detection threshold hyperparameter, critical for the multi-label formulation.
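To make the powerset formulation concrete, here is a minimal illustrative sketch (ours, not the paper's code) of how a multi-label frame target is mapped to a single powerset class when chunks contain at most 4 speakers and at most 2 of them may be active at the same time, matching the configuration used later in this README. The helper name and class ordering are assumptions made for the example.

```python
from itertools import combinations
import numpy as np

# with at most 4 speakers per chunk and at most 2 active at once,
# the powerset has 1 (silence) + 4 (single speakers) + 6 (pairs) = 11 classes
K, MAX_SIMULTANEOUS = 4, 2
powerset = [
    frozenset(c)
    for n in range(MAX_SIMULTANEOUS + 1)
    for c in combinations(range(K), n)
]

def multilabel_to_powerset(frame_labels: np.ndarray) -> int:
    """Map one multi-label frame (binary vector of length K) to its powerset class index."""
    active = frozenset(np.flatnonzero(frame_labels).tolist())
    return powerset.index(active)

print(len(powerset))                                    # 11 classes
print(multilabel_to_powerset(np.array([0, 1, 1, 0])))   # class where speakers {1, 2} overlap
```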
@inproceedings{Plaquet2023,
title={Powerset multi-class cross entropy loss for neural speaker diarization},
author={Plaquet, Alexis and Bredin, Herv\'{e}},
year={2023},
booktitle={Proc. Interspeech 2023},
}
@inproceedings{Bredin2023,
title={pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe},
author={Bredin, Herv\'{e}},
year={2023},
booktitle={Proc. Interspeech 2023},
}
Performance obtained with a model pretrained on AISHELL, AliMeeting, AMI, Ego4D, MSDWild, REPERE, and VoxConverse (see the paper for more details).
The pretrained model checkpoint and corresponding pipeline hyperparameters used to obtain these results are available.
Dataset | DER (%) | False alarm (%) | Missed detection (%) | Speaker confusion (%) | Output | Metrics |
---|---|---|---|---|---|---|
AISHELL-4 (channel 1) | 16.85 | 3.25 | 6.29 | 7.30 | 📋 | 📈 |
AliMeeting (channel 1) | 23.30 | 3.69 | 10.38 | 9.22 | 📋 | 📈 |
AMI (headset mix) | 19.71 | 4.51 | 8.54 | 6.66 | 📋 | 📈 |
AMI (array1, channel 1) | 21.96 | 5.06 | 9.65 | 7.26 | 📋 | 📈 |
Ego4D v1 (validation) | 57.25 | 6.33 | 30.75 | 20.16 | 📋 | 📈 |
MSDWild | 29.17 | 6.64 | 8.80 | 13.73 | 📋 | 📈 |
REPERE (phase 2) | 8.35 | 2.10 | 2.36 | 3.89 | 📋 | 📈 |
VoxConverse (v0.3) | 11.56 | 4.74 | 2.68 | 4.14 | 📋 | 📈 |
DIHARD-3 (Full) | 29.90 | 14.40 | 7.18 | 8.33 | 📋 | 📈 |
This American Life | 21.83 | 2.25 | 12.85 | 6.74 | 📋 | 📈 |
AVA-AVD | 60.60 | 18.54 | 16.19 | 25.87 | 📋 | 📈 |
Performance obtained after further training (fine-tuning) the pretrained model on each dataset's domain.
Dataset | DER (%) | False alarm (%) | Missed detection (%) | Speaker confusion (%) | Output | Metrics | Checkpoint | Hyperparams |
---|---|---|---|---|---|---|---|---|
AISHELL-4 (channel 1) | 13.21 | 4.42 | 3.29 | 5.50 | 📋 | 📈 | 💾 | 🔧 |
AliMeeting (channel 1) | 24.49 | 4.62 | 8.80 | 11.07 | 📋 | 📈 | 💾 | 🔧 |
AMI (headset mix) | 17.98 | 4.34 | 8.21 | 5.43 | 📋 | 📈 | 💾 | 🔧 |
AMI (array1, channel 1) | 22.90 | 4.81 | 9.76 | 8.33 | 📋 | 📈 | 💾 | 🔧 |
Ego4D v1 (validation) | 48.16 | 8.88 | 21.35 | 17.93 | 📋 | 📈 | 💾 | 🔧 |
MSDWild | 28.51 | 6.10 | 8.07 | 14.34 | 📋 | 📈 | 💾 | 🔧 |
REPERE (phase 2) | 8.16 | 1.92 | 2.64 | 3.60 | 📋 | 📈 | 💾 | 🔧 |
VoxConverse (v0.3) | 10.35 | 3.86 | 2.77 | 3.72 | 📋 | 📈 | 💾 | 🔧 |
DIHARD-3 (Full) | 21.31 | 4.77 | 8.72 | 7.82 | 📋 | 📈 | 💾 | 🔧 |
AVA-AVD | 46.45 | 6.71 | 17.75 | 21.98 | 📋 | 📈 | 💾 | 🔧 |
The pyannote.audio version used to train these models is commit e3dc7d6 (more precisely, commit 1f83e0b with the changes from commit e3dc7d6 cherry-picked, although this should not matter for training).
More recent versions should also work. You only need to clone/download and install pyannote.audio and its dependencies. See pyannote.audio's README for more details.
An example notebook is available; it shows how to load a powerset model (for example, one from models/), how to train it further on a toy dataset, and how to obtain both its local segmentation output and the final diarization output.
from pyannote.audio.pipelines import SpeakerDiarization as SpeakerDiarizationPipeline

# constants (params from the pyannote/speaker-diarization huggingface pipeline)
WAV_FILE = "../pyannote-audio/tutorials/assets/sample.wav"
MODEL_PATH = "models/powerset/powerset_pretrained.ckpt"

PIPELINE_PARAMS = {
    "clustering": {
        "method": "centroid",
        "min_cluster_size": 15,
        "threshold": 0.7153814381597874,
    },
    "segmentation": {
        "min_duration_off": 0.5817029604921046,
        # "threshold": 0.4442333667381752,  # does not apply to powerset
    },
}

# create, instantiate and apply the pipeline
pipeline = SpeakerDiarizationPipeline(
    segmentation=MODEL_PATH,
    embedding="speechbrain/spkrec-ecapa-voxceleb",
    embedding_exclude_overlap=True,
    clustering="AgglomerativeClustering",
)
pipeline.instantiate(PIPELINE_PARAMS)
diarization = pipeline(WAV_FILE)
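The pipeline call returns a pyannote.core Annotation. As a quick follow-up (reusing the `diarization` variable from the snippet above), you can iterate over the predicted speaker turns or save them in RTTM format:

```python
# print one line per predicted speaker turn
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")

# save the diarization output in RTTM format
with open("sample.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
```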
from pyannote.audio import Model
from pyannote.audio.core.inference import Inference

MODEL_PATH = "models/powerset/powerset_pretrained.ckpt"
WAV_FILE = "../pyannote-audio/tutorials/assets/sample.wav"

# load the pretrained powerset segmentation model and run sliding-window inference
model = Model.from_pretrained(MODEL_PATH)
inference = Inference(model, step=5.0)
segmentation = inference(WAV_FILE)
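As a quick sanity check, you can inspect the local segmentation scores. This sketch assumes the default sliding-window mode of Inference, which wraps the scores in a pyannote.core.SlidingWindowFeature; `segmentation` is the variable assigned above.

```python
# the exact output layout depends on Inference's aggregation settings;
# in the default sliding-window mode the scores come back as a
# pyannote.core.SlidingWindowFeature wrapping a numpy array of per-frame values
print(type(segmentation))
print(segmentation.data.shape)
```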
You can train your own version of the model using the pyannote.audio develop branch (installation instructions in pyannote.audio's README), or pyannote.audio v3.x once it is released.
The SpeakerDiarization task can be set to use the powerset or multi-label representation through its max_speakers_per_frame constructor parameter: "Maximum number of (overlapping) speakers per frame. Setting this value to 1 or more enables powerset multi-class training."
In the following example, the model is ready to be trained with the powerset multi-class setting described in the paper: it handles at most 4 speakers per 5-second chunk and at most 2 simultaneously active speakers.
from pyannote.database import registry
from pyannote.audio.tasks import SpeakerDiarization
from pyannote.audio.models.segmentation import PyanNet

# load your own pyannote.database protocol
my_protocol = registry.get_protocol('MyProtocol.SpeakerDiarization.Custom')

# powerset multi-class training: at most 4 speakers per 5-second chunk,
# at most 2 simultaneously active speakers per frame
seg_task = SpeakerDiarization(
    my_protocol,
    duration=5.0,
    max_speakers_per_chunk=4,
    max_speakers_per_frame=2,
)
model = PyanNet(task=seg_task)
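A minimal training sketch could then look as follows, assuming a standard pytorch-lightning setup (the trainer settings below are illustrative, not the configuration used in the paper). To fine-tune the released checkpoint instead of training from scratch, the usual pyannote.audio recipe of loading it with Model.from_pretrained and assigning the task also applies.

```python
import pytorch_lightning as pl

from pyannote.audio import Model

# option 1: train the freshly created model from scratch
trainer = pl.Trainer(max_epochs=1, devices=1)  # illustrative settings only
trainer.fit(model)

# option 2: fine-tune the released powerset checkpoint on your own data
pretrained = Model.from_pretrained("models/powerset/powerset_pretrained.ckpt")
pretrained.task = seg_task
trainer = pl.Trainer(max_epochs=1, devices=1)
trainer.fit(pretrained)
```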