
How to define a speaker per segment from overlapped windows of frames? Prediction on real-time data? #43

Closed
alamnasim opened this issue Apr 4, 2019 · 5 comments

@alamnasim

Describe the question

Summary of work:
The audio signal is transformed into frames (log-mel-filterbank energy features) with a frame width of 25ms and a step of 10ms. The frames are then grouped into overlapping windows of size 240ms with 50% overlap. A window-level d-vector is computed for each window, and the d-vectors are then aggregated into segments of 400ms or more, so that each segment contains a single speaker's d-vectors.
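As a rough illustration of the above (not code from this repo), the framing/windowing/segmentation could look like the sketch below; librosa is used for the log-mel features, and `window_dvector()` is a placeholder for whatever speaker encoder produces the window-level d-vectors:

```python
import numpy as np
import librosa


def window_dvector(window_frames):
    # Placeholder: stands in for a real d-vector speaker encoder (e.g. an LSTM).
    # Here it just averages the frames so the sketch runs end to end.
    return window_frames.mean(axis=0)


def extract_segment_dvectors(wav_path, sr=16000):
    """Sketch: frames -> 240ms windows (50% overlap) -> ~400ms segments."""
    signal, sr = librosa.load(wav_path, sr=sr)

    # 25ms frames with a 10ms step, 40 log-mel-filterbank energies per frame.
    mel = librosa.feature.melspectrogram(
        y=signal, sr=sr,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr), n_mels=40)
    frames = librosa.power_to_db(mel).T          # shape: (num_frames, 40)

    # 240ms windows = 24 frames; 50% overlap = hop of 12 frames (120ms).
    win_len, win_hop = 24, 12
    dvectors = []
    for start in range(0, len(frames) - win_len + 1, win_hop):
        dvectors.append(window_dvector(frames[start:start + win_len]))
    dvectors = np.stack(dvectors)

    # Aggregate window d-vectors into ~400ms segments by L2-normalized averaging;
    # roughly 3 consecutive windows per segment at this hop (an assumption here).
    windows_per_segment = 3
    segments = []
    for start in range(0, len(dvectors), windows_per_segment):
        seg = dvectors[start:start + windows_per_segment].mean(axis=0)
        segments.append(seg / np.linalg.norm(seg))
    return np.stack(segments)                    # shape: (num_segments, dvector_dim)
```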

Questions:
During testing, since each audio file contains utterances from different speakers, if we build the overlapping windows of frames:

  1. How can we be sure that a 400 ms segment represents a single speaker?
  2. If we fix each segment at 400 ms, won't that affect accuracy?
  3. How do I perform real-time prediction? Given an audio file, how do I get speaker-wise timestamps for each utterance?

Help appreciated.

My background

Have I read the README.md file?
  • yes

Have I searched for similar questions from closed issues?
  • yes

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?
  • yes

Have I tried to find the answers in the reference Speaker Diarization with LSTM?
  • yes

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?
  • yes
@alamnasim alamnasim added the question Further information is requested label Apr 4, 2019
@wq2012
Member

wq2012 commented Apr 4, 2019

  1. There is NO guarantee that each 400ms segment is from a single speaker. It's an approximation. But since most single-speaker sessions are much longer than 400ms, the errors in a few of these segments won't affect the final performance significantly.
  2. 400ms is the maximum. It can be smaller if the Voice Activity Detector (VAD) detected a boundary. Also, 400ms is NOT a golden value that everyone should use. We swept this parameter on our dev dataset and found that 400ms produces the best performance. Each application may use a different value, or even a different strategy here. In most real applications, the temporal resolution of the diarization results is not important, so 400ms may be too small. But if you really care about the DER number and want to publish your results, you may further tune this parameter on your own domain's dataset.
  3. This library on GitHub does not provide an online API. This is described in Add a online_predict() API for streaming input #28, but we don't have the bandwidth to work on it. Also, you don't directly get the timestamps. You process the audio into frames, then feed these frames to the speaker encoder to get embeddings. Then you use UIS-RNN to get a speaker label for each embedding. The temporal resolution is limited by the segment size (400ms). A rough sketch of this offline flow is below.
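A minimal sketch of that offline flow, with the segment embeddings stubbed out as random data (in practice they come from your speaker encoder plus the 400ms segmentation), and using the uisrnn API the same way the repo's demo does (parse_arguments, UISRNN, load, predict):

```python
import numpy as np
import uisrnn

# Placeholder segment embeddings: in practice these come from your speaker
# encoder + 400ms segmentation. Shape is (num_segments, observation_dim),
# and observation_dim must match the trained model. Values here are fake.
segment_embeddings = np.random.randn(50, 256).astype(float)

model_args, _, inference_args = uisrnn.parse_arguments()
model_args.observation_dim = 256
model = uisrnn.UISRNN(model_args)
model.load('saved_uisrnn_model.uisrnn')  # example path; load your own trained model

# UIS-RNN assigns one speaker label per segment embedding.
predicted_labels = model.predict(segment_embeddings, inference_args)

# Timestamps are only as fine as the segment size: segment i covers roughly
# [i * 0.4, (i + 1) * 0.4) seconds of speech (ignoring VAD-trimmed silence).
for i, label in enumerate(predicted_labels):
    print(f'{i * 0.4:5.1f}s - {(i + 1) * 0.4:5.1f}s  speaker {label}')
```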

@wq2012 wq2012 closed this as completed Apr 4, 2019
@wq2012 wq2012 self-assigned this Apr 4, 2019
@alamnasim
Author

Thanks for your reply.

I have one more question:
I have around 21k speakers and one utterance per speaker, each in a separate file (as in the VCTK dataset).
I performed all the necessary steps to convert each utterance into segment embeddings.

  1. I got around 58% accuracy and a minimum loss of -330 (average loss around -300), which remained constant. I have doubts about whether the labels of the training segments are correct.

  2. Previously I used around 1700 calls, each with 2 different speakers. I am not sure whether both speakers are unique in each call (but one speaker is always unique). I computed window-level embeddings with labels, ran uisrnn over them, and got around 79% accuracy, and the loss changed frequently (it did not remain constant as in the case above).

I am not able to understand why this happens, or whether I am doing it correctly. Please suggest.

The train labels for question 1 are as follows:

[u'0_0' u'0_0' u'0_0' u'0_0' u'0_0' u'0_0' u'1_1' u'1_1' u'1_1' u'1_1'
u'1_1' u'1_1' u'2_2' u'2_2' u'2_2' u'2_2' u'2_2' u'2_2' u'3_3' u'3_3'
u'3_3' u'3_3' u'3_3' u'3_3' u'4_4' u'4_4' u'4_4' u'4_4' u'4_4' u'4_4'
u'5_5' u'5_5' u'5_5' u'5_5' u'5_5' u'5_5' u'6_6' u'6_6' u'6_6' u'6_6'
u'6_6' u'6_6' u'7_7' u'7_7' u'7_7' u'7_7' u'7_7' u'7_7' u'8_8' u'8_8'
u'8_8' u'8_8' u'8_8' u'8_8' u'9_9' u'9_9' u'9_9' u'9_9' u'9_9' u'9_9'
u'10_10' u'10_10' u'10_10' u'10_10' u'10_10' u'10_10' u'11_11' u'11_11'
u'11_11' u'11_11' u'11_11' u'11_11' u'12_12' u'12_12' u'12_12' u'12_12'
u'12_12' u'12_12' u'13_13' u'13_13' u'13_13' u'13_13' u'13_13' u'13_13'
u'14_14' u'14_14' u'14_14' u'14_14' u'14_14' u'14_14' u'15_15' u'15_15'
u'15_15' u'15_15' ...........................................................................
.........u'21460_21460' u'21460_21460' u'21460_21460' u'21460_21460'
u'21460_21460' u'21460_21460' u'21480_21480' u'21480_21480'
u'21480_21480' u'21480_21480' u'21480_21480' u'21480_21480']

@wq2012
Member

wq2012 commented Apr 6, 2019

@alamnasim You mean each of your training speakers has only one single utterance, and you concatenated all of them into a single utterance?

If I understood your setup correctly (sorry if I got it wrong), you are making a completely fake problem.

UIS-RNN is a supervised learning technique that tries to learn the following information from training data:

  1. Dialogue styles.
  2. Speaker turn frequency.
  3. Domain-specific hints for speaker turns.

Your training data have zero information about "dialogues". I don't think UIS-RNN is going to learn anything here. I suspect you could simply use some unsupervised clustering method and likely get the same results.

I explained it here: https://www.youtube.com/watch?v=pGkqwRPzx9U&t=24m1s
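To illustrate the unsupervised-clustering alternative mentioned above, here is a minimal sketch using scikit-learn's SpectralClustering over the segment d-vectors (my own choice of tooling, not something this repo ships; it also assumes the number of speakers is known):

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity


def cluster_segments(segment_embeddings, num_speakers):
    """Unsupervised alternative: cluster the segment d-vectors directly.

    segment_embeddings: (num_segments, dvector_dim) array, ideally L2-normalized.
    Returns one integer speaker label per segment.
    """
    # Cosine similarity mapped to [0, 1] so it can serve as an affinity matrix.
    affinity = (cosine_similarity(segment_embeddings) + 1.0) / 2.0
    clusterer = SpectralClustering(
        n_clusters=num_speakers,
        affinity='precomputed',
        assign_labels='kmeans',
        random_state=0)
    return clusterer.fit_predict(affinity)


# Example: two-speaker call, labels[i] is the speaker of segment i.
# labels = cluster_segments(embeddings, num_speakers=2)
```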

@alamnasim
Author

Thanks a lot, I understood where I was wrong.

@rohithkodali

Hi @alamnasim, did your data train without any memory error? I have a similar number of speakers, but it always throws a memory error.
