How to define a speaker per segment from overlapped windows of frames? Prediction on real-time data? #43
Comments
Thanks for your reply. I have one more question:
I am not able to understand why this happens, or whether I am correct. Please suggest. The training labels for question 1 are as follows: [u'0_0' u'0_0' u'0_0' u'0_0' u'0_0' u'0_0' u'1_1' u'1_1' u'1_1' u'1_1'
@alamnasim You mean each of your training speakers has only one single utterance, and you concatenated all of them into a single utterance? If I understood your setup correctly (sorry if I got it wrong), you are posing a completely artificial problem. UIS-RNN is a supervised learning technique that tries to learn this kind of structure from training data.
Your training data has zero information about "dialogues". I don't think UIS-RNN is going to learn anything here. I suspect you could simply use some unsupervised clustering method and likely get the same results. I explained this here: https://www.youtube.com/watch?v=pGkqwRPzx9U&t=24m1s
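To illustrate the difference: UIS-RNN expects each training sequence to be a conversation with speaker turns inside it. A minimal sketch with the uisrnn API might look like the following (the random d-vectors, dimensions, and label layout are placeholders, not real data):

```python
import numpy as np
import uisrnn

# Placeholder d-vectors; in real use these come from a speaker encoder
# (e.g. a GE2E-trained d-vector model), one row per segment, in time order.
train_sequence = np.random.rand(100, 256)  # 100 segments, 256-dim d-vectors

# Labels of the form '<utterance>_<speaker>'. Note the speaker turns
# WITHIN the same utterance -- this is the "dialogue" information that is
# missing when every speaker contributes one contiguous block.
train_cluster_id = ['0_0'] * 30 + ['0_1'] * 25 + ['0_0'] * 20 + ['0_2'] * 25

model_args, training_args, inference_args = uisrnn.parse_arguments()
training_args.train_iteration = 50  # keep the sketch fast; real training needs far more
model = uisrnn.UISRNN(model_args)
model.fit(train_sequence, train_cluster_id, training_args)
```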
Thanks a lot, I understood where I was wrong.
Hi @alamnasim, did your data train without any memory errors? I have a similar number of speakers, but it always throws a memory error.
Describe the question
Summary of work:
The audio signal is transformed into frames of log-mel-filterbank energy features, with a frame width of 25 ms and a step of 10 ms. The frames are then grouped into overlapped windows of size 240 ms with 50% overlap. A window-level d-vector is computed for each window, and the d-vectors are then grouped into segments of 400 ms or more, so that each segment contains a single speaker's d-vectors.
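(Roughly, that aggregation might look like the sketch below; `encoder` is a stand-in for whatever network produces the window-level d-vector, and the exact bookkeeping is my reading of the numbers above, not code from this project.)

```python
import numpy as np

FRAME_STEP_MS = 10                 # one frame every 10 ms (25 ms frame width)
WINDOW_MS, WINDOW_OVERLAP = 240, 0.5
SEGMENT_MS = 400

frames_per_window = WINDOW_MS // FRAME_STEP_MS               # 24 frames per window
window_hop = int(frames_per_window * (1 - WINDOW_OVERLAP))   # 12 frames = 120 ms hop

def window_dvectors(frames, encoder):
    """Slide a 240 ms window with 50% overlap over frame-level features
    and run each window through the d-vector encoder."""
    dvecs = [encoder(frames[start:start + frames_per_window])
             for start in range(0, len(frames) - frames_per_window + 1, window_hop)]
    return np.stack(dvecs)

def segment_dvectors(dvecs, hop_ms=window_hop * FRAME_STEP_MS):
    """Average window-level d-vectors into ~400 ms segments and
    L2-normalize, assuming each segment covers a single speaker."""
    per_seg = max(1, SEGMENT_MS // hop_ms)   # ~3 windows per 400 ms segment
    segs = np.stack([dvecs[i:i + per_seg].mean(axis=0)
                     for i in range(0, len(dvecs), per_seg)])
    return segs / np.linalg.norm(segs, axis=1, keepdims=True)
```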
Questions:
During testing, since each audio file contains utterances from different speakers, if we make the overlapped windows of frames, how do we define a speaker for each segment? And how can prediction be done on real-time data?
Help appreciated.
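For context, one possible shape of the prediction step with the uisrnn API is sketched below. The checkpoint path and d-vector values are placeholders, and the released predict() decodes a complete sequence from left to right, so true streaming prediction would need extra plumbing:

```python
import numpy as np
import uisrnn

model_args, _, inference_args = uisrnn.parse_arguments()
model = uisrnn.UISRNN(model_args)
model.load('saved_uisrnn.model')   # hypothetical checkpoint path

# One segment-level d-vector per row, in time order, built as in the
# summary of work above (placeholder values here).
test_sequence = np.random.rand(50, 256)

# predict() returns one integer speaker label per observation, so the
# "speaker per segment" question falls out directly: segment i is
# assigned speaker predicted_labels[i].
predicted_labels = model.predict(test_sequence, inference_args)
for i, spk in enumerate(predicted_labels):
    print(f'segment {i}: speaker {spk}')
```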
My background
Have I read the README.md file? Yes.
Have I searched for similar questions from closed issues? Yes.
Have I tried to find the answers in the paper Fully Supervised Speaker Diarization? Yes.
Have I tried to find the answers in the reference Speaker Diarization with LSTM? Yes.
Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification? Yes.