Powerset multi-class cross entropy loss for neural speaker diarization

Alexis Plaquet and Hervé Bredin
Proc. InterSpeech 2023.

Since its introduction in 2019, the whole end-to-end neural diarization (EEND) line of work has been addressing speaker diarization as a frame-wise multi-label classification problem with permutation-invariant training. Despite EEND showing great promise, a few recent works took a step back and studied the possible combination of (local) supervised EEND diarization with (global) unsupervised clustering. Yet, these hybrid contributions did not question the original multi-label formulation. We propose to switch from multi-label (where any two speakers can be active at the same time) to powerset multi-class classification (where dedicated classes are assigned to pairs of overlapping speakers). Through extensive experiments on 9 different benchmarks, we show that this formulation leads to significantly better performance (mostly on overlapping speech) and robustness to domain mismatch, while eliminating the detection threshold hyperparameter, critical for the multi-label formulation.
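
For intuition, here is a minimal illustration (not from the paper's codebase) of what the powerset classes look like for 3 speakers with at most 2 of them active in the same frame: one class for silence, one per single speaker, and one per speaker pair.

from itertools import combinations

# Illustrative only: enumerate powerset classes for `num_speakers` speakers
# with at most `max_simultaneous` of them active in the same frame.
def powerset_classes(num_speakers=3, max_simultaneous=2):
    classes = []
    for set_size in range(max_simultaneous + 1):  # size 0 = silence
        classes.extend(combinations(range(num_speakers), set_size))
    return classes

print(powerset_classes())
# [(), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]  -> 7 mutually exclusive classes

Each frame is assigned exactly one of these mutually exclusive classes, which is why a standard cross entropy loss can replace the per-speaker binary losses and the detection threshold they require.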

Read the paper

Citations

@inproceedings{Plaquet2023,
  title={Powerset multi-class cross entropy loss for neural speaker diarization},
  author={Plaquet, Alexis and Bredin, Herv\'{e}},
  year={2023},
  booktitle={Proc. Interspeech 2023},
}

@inproceedings{Bredin2023,
  title={pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe},
  author={Bredin, Herv\'{e}},
  year={2023},
  booktitle={Proc. Interspeech 2023},
}

Benchmark (and checkpoints)

Pretrained model

Performance obtained with a model pretrained on AISHELL, AliMeeting, AMI, Ego4D, MSDWild, REPERE, and VoxConverse (see the paper for more details).

The pretrained model checkpoint and corresponding pipeline hyperparameters used to obtain these results are available.

| Dataset | DER% | FA% | Miss% | Conf% | Output | Metrics |
| --- | --- | --- | --- | --- | --- | --- |
| AISHELL-4 (channel 1) | 16.85 | 3.25 | 6.29 | 7.30 | 📋 | 📈 |
| AliMeeting (channel 1) | 23.30 | 3.69 | 10.38 | 9.22 | 📋 | 📈 |
| AMI (headset mix) | 19.71 | 4.51 | 8.54 | 6.66 | 📋 | 📈 |
| AMI (array1, channel 1) | 21.96 | 5.06 | 9.65 | 7.26 | 📋 | 📈 |
| Ego4D v1 (validation) | 57.25 | 6.33 | 30.75 | 20.16 | 📋 | 📈 |
| MSDWild | 29.17 | 6.64 | 8.80 | 13.73 | 📋 | 📈 |
| REPERE (phase 2) | 8.35 | 2.10 | 2.36 | 3.89 | 📋 | 📈 |
| VoxConverse (v0.3) | 11.56 | 4.74 | 2.68 | 4.14 | 📋 | 📈 |
| DIHARD-3 (Full) | 29.90 | 14.40 | 7.18 | 8.33 | 📋 | 📈 |
| This American Life | 21.83 | 2.25 | 12.85 | 6.74 | 📋 | 📈 |
| AVA-AVD | 60.60 | 18.54 | 16.19 | 25.87 | 📋 | 📈 |

Finetuned models

Performance obtained after training the pretrained model further on one domain.

| Dataset | DER% | FA% | Miss% | Conf% | Output | Metrics | Ckpt | Hparams |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AISHELL-4 (channel 1) | 13.21 | 4.42 | 3.29 | 5.50 | 📋 | 📈 | 💾 | 🔧 |
| AliMeeting (channel 1) | 24.49 | 4.62 | 8.80 | 11.07 | 📋 | 📈 | 💾 | 🔧 |
| AMI (headset mix) | 17.98 | 4.34 | 8.21 | 5.43 | 📋 | 📈 | 💾 | 🔧 |
| AMI (array1, channel 1) | 22.90 | 4.81 | 9.76 | 8.33 | 📋 | 📈 | 💾 | 🔧 |
| Ego4D v1 (validation) | 48.16 | 8.88 | 21.35 | 17.93 | 📋 | 📈 | 💾 | 🔧 |
| MSDWild | 28.51 | 6.10 | 8.07 | 14.34 | 📋 | 📈 | 💾 | 🔧 |
| REPERE (phase 2) | 8.16 | 1.92 | 2.64 | 3.60 | 📋 | 📈 | 💾 | 🔧 |
| VoxConverse (v0.3) | 10.35 | 3.86 | 2.77 | 3.72 | 📋 | 📈 | 💾 | 🔧 |
| DIHARD-3 (Full) | 21.31 | 4.77 | 8.72 | 7.82 | 📋 | 📈 | 💾 | 🔧 |
| AVA-AVD | 46.45 | 6.71 | 17.75 | 21.98 | 📋 | 📈 | 💾 | 🔧 |

Reproducibility

Reproducing the paper results

The pyannote.audio version used to train these models is commit e3dc7d6 (to be precise, commit 1f83e0b with the changes from commit e3dc7d6 cherry-picked), although the exact commit should not matter much for this training.

More recent versions should also work. You only need to clone/download and install pyannote.audio and its dependencies. See pyannote.audio's README for more details.

Example notebook: Adapting a powerset model and looking at its outputs

Open In Colab

In-repository Notebook

An example notebook is available; it shows how to load a powerset model (for example, one of those available in models/), how to train it further on a toy dataset, and finally how to get its local segmentation output and final diarization output.

Using checkpoints in a pipeline

from pyannote.audio.pipelines import SpeakerDiarization as SpeakerDiarizationPipeline

# constants (params from the pyannote/speaker-diarization huggingface pipeline)
WAV_FILE = "../pyannote-audio/tutorials/assets/sample.wav"
MODEL_PATH = "models/powerset/powerset_pretrained.ckpt"
PIPELINE_PARAMS = {
    "clustering": {
        "method": "centroid",
        "min_cluster_size": 15,
        "threshold": 0.7153814381597874,
    },
    "segmentation": {
        "min_duration_off": 0.5817029604921046,
        # "threshold": 0.4442333667381752,  # does not apply to powerset
    },
}

# create, instantiate and apply the pipeline
pipeline = SpeakerDiarizationPipeline(
    segmentation=MODEL_PATH,
    embedding="speechbrain/spkrec-ecapa-voxceleb",
    embedding_exclude_overlap=True,
    clustering="AgglomerativeClustering",
)
pipeline.instantiate(PIPELINE_PARAMS)
diarization = pipeline(WAV_FILE)
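
The pipeline returns a pyannote.core.Annotation. If you want to inspect or save the result, here is a short illustrative follow-up (assuming the diarization variable from the snippet above):

# iterate over speaker turns
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{segment.start:.1f}s - {segment.end:.1f}s: {speaker}")

# save to RTTM for scoring
with open("sample.rttm", "w") as rttm:
    diarization.write_rttm(rttm)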

Using checkpoints for the segmentation model only

from pyannote.audio import Model
from pyannote.audio.core.inference import Inference

MODEL_PATH = "models/powerset/powerset_pretrained.ckpt"
WAV_FILE = "../pyannote-audio/tutorials/assets/sample.wav"

model: Model = Model.from_pretrained(MODEL_PATH)
inference = Inference(model, step=5.0)
segmentation = inference(WAV_FILE)
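
Depending on your pyannote.audio version, the inference output may already be converted from powerset posteriors back to per-speaker activities. If you do get raw powerset scores instead, the conversion is just an argmax per frame followed by the class-to-speakers mapping. A minimal, illustrative sketch (the exact class ordering used by pyannote.audio may differ, so treat this as a sketch rather than a drop-in replacement):

import numpy as np
from itertools import combinations

# Illustrative only: map frame-wise powerset probabilities
# (num_frames, num_classes) back to per-speaker binary activities.
def powerset_to_multilabel(powerset_probs, num_speakers=3, max_simultaneous=2):
    # same class enumeration as the sketch at the top of this README
    classes = [c for k in range(max_simultaneous + 1)
               for c in combinations(range(num_speakers), k)]
    best = np.argmax(powerset_probs, axis=-1)  # one class per frame
    multilabel = np.zeros((len(best), num_speakers))
    for frame, class_idx in enumerate(best):
        for speaker in classes[class_idx]:
            multilabel[frame, speaker] = 1.0
    return multilabel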

Training your own powerset segmentation model

You can train your own version of the model by using the pyannote.audio develop branch (instructions in pyannote.audio's readme), or pyannote.audio v3.x when released.

The SpeakerDiarization task can be set to use the powerset or multilabel representation through its max_speakers_per_frame constructor parameter: "Maximum number of (overlapping) speakers per frame. Setting this value to 1 or more enables powerset multi-class training."

In the following example, the model is ready to be trained with the powerset multi-class setting described in the paper: it will handle at most 4 speakers per 5-second chunk and at most 2 simultaneously active speakers.

from pyannote.database import registry
from pyannote.audio.tasks import SpeakerDiarization
from pyannote.audio.models.segmentation import PyanNet

my_protocol = registry.get_protocol('MyProtocol.SpeakerDiarization.Custom')

seg_task = SpeakerDiarization(
    my_protocol,
    duration=5.0,              # 5-second training chunks
    max_speakers_per_chunk=4,  # at most 4 speakers per chunk
    max_speakers_per_frame=2,  # at most 2 overlapping speakers -> powerset training
)
model = PyanNet(task=seg_task)
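
From there, training follows the usual pyannote.audio / PyTorch Lightning recipe. A minimal sketch, with placeholder Trainer settings to adjust to your hardware:

import pytorch_lightning as pl

# pyannote.audio models are LightningModules, so a standard Lightning Trainer works;
# accelerator/devices/max_epochs below are placeholders, not recommended settings.
trainer = pl.Trainer(accelerator="auto", devices=1, max_epochs=20)
trainer.fit(model)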