Pix2Seq for Speaker Diarization.
pip install git+https://github.com/mogwai/nanodrz.git#egg=nanodrz
You'll need to do this before you can run inference in 3_inference.ipynb.
python scripts/preprocess.py
train configs/medium_mel.yaml
These notebooks contain inference code.
Download the checkpoint
notebooks/3_inference.ipynb
notebooks/4_real_inference.ipynb
https://wandb.ai/harrycblum/nano-diarization?workspace=user-harrycblum
https://fluxions.notion.site/nanodrz-Experiment-Log-acea3d5f436949b68e1f5a520c8cfdbc
Diarization systems in the wild currently chain together a variety of models and are quite complex. This project uses synthetic data to experiment with harder conditions such as overlapping speech.
At the highest level, the idea is this: if we as humans listened to random people talking, we might not know who they are, but we could still tell the different speakers apart. This approach tries to replicate that ability. Popular diarization pipelines are currently a collation of many different models; this approach attempts to combine them into a single one.
The basic idea is that we take a heavily compressed version of the audio and predict 3 tokens for each section of speech. The first two are the quantized start and end time tokens, and the third is the speaker who spoke (A, B, C, ...).
Let's say we have 512 possible tokens. The first 2 are reserved for end-of-sequence and padding, and the last set is reserved for the speakers we can predict. To keep this simple, we generate synthetic samples with 2 speakers and clips shorter than 30 seconds. This means the tokens between index 2 and 509 are time tokens: we quantize continuous timestamps (0.2, 2.5, etc.) into fixed integers. (Look at the Coordinate Quantization notebook.)
So the sequence for our decoder-only transformer is:
audio token, audio token, ... , start_diairise_cmd_token, start, end, label, start, end, label, ..., eos
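To make that concrete, here is a small illustrative sketch of the token layout and target construction. The constants and names (`VOCAB_SIZE`, `START_DRZ`, etc.) are placeholders chosen to match the description above, not the repo's actual definitions:

```python
# Toy token layout mirroring the prose: 512-token vocab, 30 s clips, 2 speakers.
VOCAB_SIZE = 512
EOS, PAD = 0, 1                              # first two special tokens
NUM_SPEAKERS = 2                             # speakers A and B
TIME_TOKENS = VOCAB_SIZE - 2 - NUM_SPEAKERS  # indices 2..509 quantize time
MAX_SECONDS = 30.0
START_DRZ = VOCAB_SIZE                       # stand-in id for start_diairise_cmd_token

def time_to_token(t: float) -> int:
    """Quantize a continuous timestamp (seconds) into a time-token index."""
    bucket = int(round(t / MAX_SECONDS * (TIME_TOKENS - 1)))
    return 2 + min(max(bucket, 0), TIME_TOKENS - 1)

def speaker_to_token(speaker: int) -> int:
    """Map speaker 0 (A), 1 (B), ... onto the reserved tail of the vocabulary."""
    return 2 + TIME_TOKENS + speaker

def build_target(segments: list[tuple[float, float, int]]) -> list[int]:
    """Build the target tail that follows the audio tokens:
    start_diairise_cmd_token, (start, end, speaker) * N, eos."""
    seq = [START_DRZ]
    for start, end, speaker in segments:
        seq += [time_to_token(start), time_to_token(end), speaker_to_token(speaker)]
    seq.append(EOS)
    return seq

# Example: speaker A talks from 0.0-5.0 s, speaker B from 5.2-10.2 s.
print(build_target([(0.0, 5.0, 0), (5.2, 10.2, 1)]))
```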
How do we know who A and B actually are? We'll still need a second phase for that. But speaker-ID models are quite good now, so giving one a few diarized sections to identify who A is should hopefully work.
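As a rough illustration of that second phase (not part of this codebase), you could embed a short clip per diarized label and match it against enrolled speakers by cosine similarity; `embed` stands in for whichever speaker-embedding model you use:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(label_clips: dict, enrolled: dict, embed) -> dict:
    """Map anonymous labels (A, B, ...) to enrolled speaker names.

    label_clips: {"A": waveform, "B": waveform} taken from diarized sections
    enrolled:    {"alice": embedding, "bob": embedding}
    embed:       callable waveform -> embedding (any speaker-ID model)
    """
    mapping = {}
    for label, clip in label_clips.items():
        emb = embed(clip)
        mapping[label] = max(enrolled, key=lambda name: cosine(emb, enrolled[name]))
    return mapping
```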
Why synthetic data? Transformers need a lot of data, and this way we can make datasets on the fly without our models being able to overfit. It also makes it possible to research how approaches perform on harder problems (like overlapping speech), for which I don't think there is enough good real-world data to train a transformer on.
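A minimal sketch of what generating a mixture on the fly could look like, assuming single-speaker utterances are already available as numpy arrays at a 16 kHz sample rate; the repo's actual data generation is more involved, and the names here are illustrative:

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed sample rate

def make_mixture(utterances, max_seconds=30.0, overlap_prob=0.3, rng=None):
    """Lay single-speaker utterances onto a silent timeline, sometimes
    overlapping them, and return the mixture plus (start, end, speaker) labels.

    utterances: list of (speaker_index, waveform) pairs
    """
    if rng is None:
        rng = np.random.default_rng()
    mix = np.zeros(int(max_seconds * SAMPLE_RATE), dtype=np.float32)
    labels, cursor = [], 0
    for speaker, wav in utterances:
        if labels and rng.random() < overlap_prob:
            cursor = max(0, cursor - int(0.5 * SAMPLE_RATE))  # step back to overlap
        end = min(cursor + len(wav), len(mix))
        if end <= cursor:
            break
        mix[cursor:end] += wav[: end - cursor]
        labels.append((cursor / SAMPLE_RATE, end / SAMPLE_RATE, speaker))
        cursor = end + int(rng.uniform(0.1, 0.5) * SAMPLE_RATE)  # short gap
    return mix, labels
```

Tuning `overlap_prob`, the gap lengths, or the number of speakers is one way to scale the difficulty of the generated data.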
We'll be limited to processing chunks of audio; however, we can prompt the model with some audio from the previous chunk:
Say we complete inference on our first chunk and identify two speakers, A and B. For the second chunk, we take 5 seconds of audio for each speaker and prompt the model with it so that it knows who they are. This prevents the model from swapping the speakers' labels between chunks.
Two-chunk inference example:
5 seconds of A, 200 ms of silence, 5 seconds of B, 200 ms of silence, {audio we want to diarize}, start_diairise_cmd_token, 0, 5, A, 5.2, 10.2, B, ..., eos
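A sketch of how the second chunk's prompt audio could be assembled; the constants (16 kHz sample rate, 5 s prompts, 200 ms gaps) mirror the example above, and the function name is illustrative rather than taken from the repo:

```python
import numpy as np

SAMPLE_RATE = 16_000
PROMPT_SECONDS = 5.0
GAP_SECONDS = 0.2

def build_prompt_audio(speaker_clips, next_chunk):
    """Concatenate ~5 s of audio per known speaker, separated by short
    silences, then append the audio we actually want to diarize."""
    gap = np.zeros(int(GAP_SECONDS * SAMPLE_RATE), dtype=np.float32)
    pieces = []
    for clip in speaker_clips:  # one clip per known speaker (A, B, ...)
        pieces.append(clip[: int(PROMPT_SECONDS * SAMPLE_RATE)])
        pieces.append(gap)
    pieces.append(next_chunk)
    return np.concatenate(pieces)
```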
Yes, we lose some capacity, but this saves us from having to identify each chunk's speaker labels at the end.
- Can we get transformers doing diarisation?
- Can we create synthetic combinations of TTS datasets to experiment with diarisation difficulty?
- Will this help generalise to different domains?
- What happens as we increase the difficulty of the data with interruptions, noise, more speakers, and varying speaker loudness?
I've chosen a strong copyleft license for now because I want to encourage people to give back to this project.
https://arxiv.org/pdf/2310.13025.pdf
James Parsloe for the project structure and many of the utils found here.
Hervé Bredin for his amazing research on pyannote.audio.
Andrej Karpathy - NanoGPT, for lots of great explanations of transformers and inspiration for this codebase's style.
@software{nanodrz_2024,
author = {Coultas Blum, Harry},
month = {01},
title = {{nanodrz}},
url = {https://github.com/mogwai/nanodrz},
version = {1.0.0},
year = {2024}
}