
How to define a speaker per segment from overlapped windows of frames? Prediction on real-time data? #43

Closed
alamnasim opened this issue Apr 4, 2019 · 5 comments

@alamnasim

Describe the question

Summary of work:
The audio signal is transformed into frames (log-mel-filterbank energy features) with a frame width of 25ms and a step of 10ms. The frames are then grouped into overlapping windows of size 240ms with 50% overlap. A window-level d-vector is computed for each window, and the d-vectors are then aggregated into segments of 400ms or more, so that each segment contains a single speaker's d-vectors.
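As a rough illustration of the above (not code from this repo), the framing/windowing/segmentation could look like the sketch below; librosa is used for the log-mel features, and `window_dvector()` is a placeholder for whatever speaker encoder produces the window-level d-vectors:

```python
import numpy as np
import librosa


def window_dvector(window_frames):
    # Placeholder: stands in for a real d-vector speaker encoder (e.g. an LSTM).
    # Here it just averages the frames so the sketch runs end to end.
    return window_frames.mean(axis=0)


def extract_segment_dvectors(wav_path, sr=16000):
    """Sketch: frames -> 240ms windows (50% overlap) -> ~400ms segments."""
    signal, sr = librosa.load(wav_path, sr=sr)

    # 25ms frames with a 10ms step, 40 log-mel-filterbank energies per frame.
    mel = librosa.feature.melspectrogram(
        y=signal, sr=sr,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr), n_mels=40)
    frames = librosa.power_to_db(mel).T          # shape: (num_frames, 40)

    # 240ms windows = 24 frames; 50% overlap = hop of 12 frames (120ms).
    win_len, win_hop = 24, 12
    dvectors = []
    for start in range(0, len(frames) - win_len + 1, win_hop):
        dvectors.append(window_dvector(frames[start:start + win_len]))
    dvectors = np.stack(dvectors)

    # Aggregate window d-vectors into ~400ms segments by L2-normalized averaging;
    # roughly 3 consecutive windows per segment at this hop (an assumption here).
    windows_per_segment = 3
    segments = []
    for start in range(0, len(dvectors), windows_per_segment):
        seg = dvectors[start:start + windows_per_segment].mean(axis=0)
        segments.append(seg / np.linalg.norm(seg))
    return np.stack(segments)                    # shape: (num_segments, dvector_dim)
```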

Questions:
During testing, since each audio file contains utterances from different speakers, if we build the overlapping windows of frames:

  1. How can we be sure that a 400 ms segment represents a single speaker?
  2. If we fix each segment at 400 ms, won't that affect accuracy?
  3. How do I perform real-time prediction? Given an audio file, how do I get speaker-wise timestamps for each utterance?

Help appreciated.

My background

Have I read the README.md file?
  • yes

Have I searched for similar questions from closed issues?
  • yes

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?
  • yes

Have I tried to find the answers in the reference Speaker Diarization with LSTM?
  • yes

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?
  • yes
@alamnasim alamnasim added the question Further information is requested label Apr 4, 2019
@wq2012
Member

wq2012 commented Apr 4, 2019

  1. There is NO guarantee that each 400ms segment is from a single speaker. It's an approximation. But since most single-speaker sessions are much longer than 400ms, the errors in a few of these segments won't affect the final performance significantly.
  2. 400ms is the maximum. It can be smaller if the Voice Activity Detector (VAD) detected a boundary. Also, 400ms is NOT a golden value that everyone should use. We swept this parameter on our dev dataset and found that 400ms produces the best performance. Each application may use a different value, or even a different strategy here. In most real applications, the temporal resolution of the diarization results is not important, so 400ms may be too small. But if you really care about the DER number and want to publish your results, you may further tune this parameter on your own domain's dataset.
  3. This library on GitHub does not provide an online API. This is described in Add a online_predict() API for streaming input #28, but we don't have the bandwidth to work on it. Also, you don't directly get the timestamps. You process the audio into frames, then feed these frames to the speaker encoder to get embeddings. Then you use UIS-RNN to get a speaker label for each embedding. The temporal resolution is limited by the segment size (400ms). A rough sketch of this offline flow is below.
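A minimal sketch of that offline flow, with the segment embeddings stubbed out as random data (in practice they come from your speaker encoder plus the 400ms segmentation), and using the uisrnn API the same way the repo's demo does (parse_arguments, UISRNN, load, predict):

```python
import numpy as np
import uisrnn

# Placeholder segment embeddings: in practice these come from your speaker
# encoder + 400ms segmentation. Shape is (num_segments, observation_dim),
# and observation_dim must match the trained model. Values here are fake.
segment_embeddings = np.random.randn(50, 256).astype(float)

model_args, _, inference_args = uisrnn.parse_arguments()
model_args.observation_dim = 256
model = uisrnn.UISRNN(model_args)
model.load('saved_uisrnn_model.uisrnn')  # example path; load your own trained model

# UIS-RNN assigns one speaker label per segment embedding.
predicted_labels = model.predict(segment_embeddings, inference_args)

# Timestamps are only as fine as the segment size: segment i covers roughly
# [i * 0.4, (i + 1) * 0.4) seconds of speech (ignoring VAD-trimmed silence).
for i, label in enumerate(predicted_labels):
    print(f'{i * 0.4:5.1f}s - {(i + 1) * 0.4:5.1f}s  speaker {label}')
```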

@wq2012 wq2012 closed this as completed Apr 4, 2019
@wq2012 wq2012 self-assigned this Apr 4, 2019
@alamnasim
Author

Thanks for your reply.

I have one more question:
I have around 21k speakers and one utterance per speaker, each in a separate file (as in the VCTK dataset).
I performed all the necessary steps to convert each utterance into segment embeddings.

  1. I got around 58% accuracy and a minimum loss of -330 (average loss around -300), which remained constant. I have doubts about whether the labels of the training segments are correct.

  2. Previously I used around 1700 calls, each with 2 different speakers. I am not sure whether both speakers are unique in each call (but one speaker is always unique). I computed window-level embeddings with labels, ran uisrnn over them, and got around 79% accuracy, and the loss changed frequently (it did not remain constant as in the case above).

I am not able to understand why this happens, or whether I am doing it correctly. Please suggest.

The train labels for question 1 are as follows:

[u'0_0' u'0_0' u'0_0' u'0_0' u'0_0' u'0_0' u'1_1' u'1_1' u'1_1' u'1_1'
u'1_1' u'1_1' u'2_2' u'2_2' u'2_2' u'2_2' u'2_2' u'2_2' u'3_3' u'3_3'
u'3_3' u'3_3' u'3_3' u'3_3' u'4_4' u'4_4' u'4_4' u'4_4' u'4_4' u'4_4'
u'5_5' u'5_5' u'5_5' u'5_5' u'5_5' u'5_5' u'6_6' u'6_6' u'6_6' u'6_6'
u'6_6' u'6_6' u'7_7' u'7_7' u'7_7' u'7_7' u'7_7' u'7_7' u'8_8' u'8_8'
u'8_8' u'8_8' u'8_8' u'8_8' u'9_9' u'9_9' u'9_9' u'9_9' u'9_9' u'9_9'
u'10_10' u'10_10' u'10_10' u'10_10' u'10_10' u'10_10' u'11_11' u'11_11'
u'11_11' u'11_11' u'11_11' u'11_11' u'12_12' u'12_12' u'12_12' u'12_12'
u'12_12' u'12_12' u'13_13' u'13_13' u'13_13' u'13_13' u'13_13' u'13_13'
u'14_14' u'14_14' u'14_14' u'14_14' u'14_14' u'14_14' u'15_15' u'15_15'
u'15_15' u'15_15' ...........................................................................
.........u'21460_21460' u'21460_21460' u'21460_21460' u'21460_21460'
u'21460_21460' u'21460_21460' u'21480_21480' u'21480_21480'
u'21480_21480' u'21480_21480' u'21480_21480' u'21480_21480']

@wq2012
Member

wq2012 commented Apr 6, 2019

@alamnasim You mean each of your training speakers has only one single utterance, and you concatenated all of them into a single utterance?

If I understood your setup correctly (sorry if I got it wrong), you are making a completely fake problem.

UIS-RNN is a supervised learning technique that tries to learn the following information from training data:

  1. Dialogue styles.
  2. Speaker turn frequency.
  3. Domain-specific hints for speaker turns.

Your training data have zero information about "dialogues". I don't think UIS-RNN is going to learn anything here. I suspect you could simply use some unsupervised clustering method and likely get the same results.

I explained it here: https://www.youtube.com/watch?v=pGkqwRPzx9U&t=24m1s
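To illustrate the unsupervised-clustering alternative mentioned above, here is a minimal sketch using scikit-learn's SpectralClustering over the segment d-vectors (my own choice of tooling, not something this repo ships; it also assumes the number of speakers is known):

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity


def cluster_segments(segment_embeddings, num_speakers):
    """Unsupervised alternative: cluster the segment d-vectors directly.

    segment_embeddings: (num_segments, dvector_dim) array, ideally L2-normalized.
    Returns one integer speaker label per segment.
    """
    # Cosine similarity mapped to [0, 1] so it can serve as an affinity matrix.
    affinity = (cosine_similarity(segment_embeddings) + 1.0) / 2.0
    clusterer = SpectralClustering(
        n_clusters=num_speakers,
        affinity='precomputed',
        assign_labels='kmeans',
        random_state=0)
    return clusterer.fit_predict(affinity)


# Example: two-speaker call, labels[i] is the speaker of segment i.
# labels = cluster_segments(embeddings, num_speakers=2)
```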

@alamnasim
Author

Thanks a lot, I understood where I was wrong.

@rohithkodali

Hi @alamnasim, did your data train without any memory error? I have a similar number of speakers, but it always throws a memory error.
