Livestream Speaker Diarization not distinguishing different speakers consistently #283
-
Which Deepgram product are you using?
Deepgram API

Details
I've been testing Deepgram's Livestream Speaker Diarization, and I've noticed that it struggles to distinguish different speakers, especially when their voices sound similar. I can't find a fix, and I'm not sure whether I'm doing something wrong on my end or whether the issue is on Deepgram's end. I'm using test_suite.py (https://github.com/deepgram/streaming-test-suite) with some added code to make each speaker more visible when printed. The following is a transcription attempt of the conversation between two people at the beginning of this sample YouTube video (Daily English Conversation Practice).

If you are making a request to the Deepgram API, what is the full Deepgram URL you are making a request to?
wss://api.deepgram.com/v1/listen?tier=enhanced&model=meeting&punctuate=true&diarize=true

If you are making a request to the Deepgram API and have a request ID, please paste it below:
No response

If possible, please attach your code or paste it into the text box.
If possible, please attach an example audio file to reproduce the issue.
I played the following YouTube video (Daily English Conversation Practice) on my phone and held it up to my mic to simulate a meeting room with one microphone.
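For context, here is a minimal sketch of the kind of per-speaker printing added on top of test_suite.py. This is an illustration rather than the exact code used, and it assumes the documented `diarize=true` response shape, where each entry in `channel.alternatives[0].words` carries a `speaker` index:

```python
import json

def format_by_speaker(message: str) -> str:
    """Group the words of one streaming result by speaker label."""
    result = json.loads(message)
    alternatives = result.get("channel", {}).get("alternatives", [])
    words = alternatives[0].get("words", []) if alternatives else []

    lines, current_speaker, current_words = [], None, []
    for w in words:
        speaker = w.get("speaker")
        token = w.get("punctuated_word", w.get("word", ""))
        # Start a new line whenever the speaker index changes.
        if current_words and speaker != current_speaker:
            lines.append(f"[Speaker {current_speaker}] {' '.join(current_words)}")
            current_words = []
        current_speaker = speaker
        current_words.append(token)
    if current_words:
        lines.append(f"[Speaker {current_speaker}] {' '.join(current_words)}")
    return "\n".join(lines)
```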
-
Hey @ali-rafiei, if a mic is held up to your computer's speaker, the audio will be extremely low quality. Most meeting recording software records the audio within the app, even for a multi-person meeting with one mic. When a phone is held up to another speaker, the audio goes into the original microphone that the speakers used, comes out of your computer's speaker, then into your phone's mic, whereas a meeting room with multiple people only has the original input from the speakers.

Transcribing multi-person meetings is difficult because speakers talk at the same time, are different distances away from the mic, may not be looking at the mic when speaking, and all speak on the same channel. These are difficult problems to overcome, even for a human listening to the conversation, and they are different from the "mic -> speaker -> mic" situation you are using.

Every time audio goes into a mic or out of a speaker, the sound wave gets degraded. The better the mic/speaker, the less this happens, but our phone mics and computer speakers are rarely high quality. So your sound wave is being degraded three times, which is likely worse (when performing transcription) than multiple people talking in the same room. Even if the audio sounds okay to your ear, the sound wave itself will be significantly transformed. This poor sound quality is likely one reason for the poor diarization.

In addition, diarization improves the longer the audio. If you have a 30 second audio file, the diarization results will be much worse than for a 30 minute audio file.

The code in our streaming test suite works well, so I'm guessing your code is good; it's likely the audio that is causing the poor results. Also, how often is the transcription identifying the speakers incorrectly?
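As a rough way to answer that last question, here is a hedged sketch that counts how often the speaker label flips between adjacent words in a diarized result. The field names are assumed from the `diarize=true` response shape; this isn't an official metric, just one simple way to put a number on label instability:

```python
def speaker_switch_rate(words: list[dict]) -> float:
    """Fraction of adjacent word pairs where the speaker label changes."""
    if len(words) < 2:
        return 0.0
    switches = sum(
        1 for a, b in zip(words, words[1:]) if a.get("speaker") != b.get("speaker")
    )
    return switches / (len(words) - 1)
```

In a two-person conversation with clean turn-taking, this rate should stay low; a high value suggests the model is flipping labels mid-sentence rather than tracking speakers.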
-
Thanks for the reply @jjmaldonis. If you could provide an ETA, that would be super.
@ali-rafiei I have good news: we have a new live-streaming diarization model currently in development, and we’re looking for beta testers! If you’d be interested in joining the beta program, please email me with your project ID: shir(dot)goldberg(at)deepgram(dot)com.
Any other information you’re willing to share about your diarization use case and how the feature currently performs for you would be greatly appreciated as well.