Diarization for live transcription with multichannel to enable attribution #814
-
I collect audio from video meetings (Zoom, Teams, Meet, etc.) in separate streams. In other words, I have a unique byte stream per speaker/participant. These streams are not contiguous, since there is nothing to collect while a participant is muted.

I am dumping everything into files, filling the gaps with zeroes and interleaving them to get n channels, where n = the total number of participants. That way, by using the multichannel feature with prerecorded transcription, I can be sure who said what and attribute each channel to a particular user with confidence, even when speakers overlap. This is a bit inefficient, since filling the gaps with zeroes means larger files and more usage, and the larger n is, the worse it gets. (Normally 1 user is speaking at a time, so efficiency would be 1 / n.)

I want to switch over to live transcription, especially since I can collect the audio in order and stream it almost instantaneously as the meeting progresses. The problem I'm facing is that, to keep audio in separate streams, I find myself opening n websocket connections, essentially opening multiple streaming transcription sessions per video call, which seems a bit excessive. Ideally, everything could be sent in one websocket session, using the first byte or two of each chunk to tell Deepgram which channel it belongs to.

Has anyone faced this? Is there an elegant solution? I feel like a caveman pumping bytes left and right.
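For readers following along, the zero-fill-and-interleave step described above could look roughly like the sketch below. It assumes each participant delivers mono 16-bit little-endian PCM at the same sample rate; the function name `interleave_frames` and its parameters are hypothetical and not part of any Deepgram SDK.

```python
import struct

def interleave_frames(chunks, samples):
    """Interleave per-speaker mono int16 PCM into one multichannel buffer.

    chunks:  list of bytes objects, one per participant (assumed layout:
             little-endian 16-bit mono). A muted participant passes b"".
    samples: number of samples per channel in this time window.
    """
    width = samples * 2  # bytes per channel for this window
    decoded = []
    for chunk in chunks:
        # Truncate to the window, then fill the gap with zeroes (silence).
        padded = chunk[:width].ljust(width, b"\x00")
        decoded.append(struct.unpack(f"<{samples}h", padded))
    # zip(*decoded) yields one frame per sample index: (ch0, ch1, ..., chN-1).
    flat = [sample for frame in zip(*decoded) for sample in frame]
    return struct.pack(f"<{len(flat)}h", *flat)
```

With two participants where the second is muted, the output alternates real samples with zeroes, which is exactly where the 1 / n efficiency comes from.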
Replies: 4 comments
-
Thanks for asking your question about Deepgram! If you didn't already include it in your post, please be sure to add as much detail as possible so we can assist you efficiently, such as:
-
I have no answer, but I am trying to do the same thing: merge audio device streams into a multichannel stream. Has anyone ever done this? Something like this.....
-
Hi @aldofunes and @wapdat,

It's possible you do not need to go to these lengths to achieve what you want with Deepgram's livestream STT API.

Many meeting applications like Zoom, Teams, and Meet can provide all participant audio in a single stream, where each participant is assigned a unique channel within the multi-channel audio stream. Deepgram supports multi-channel audio transcription, which can handle this efficiently without needing to manually interleave and pad audio data. In fact, our multi-channel feature was designed with this use case in mind.

Therefore, I recommend first checking whether your video meeting application can output a multi-channel audio stream directly, so that each participant's audio is on a separate channel in the same audio stream. You can then use Deepgram's multi-channel transcription feature to identify and transcribe each participant separately. This avoids the need for zero-padding and reduces file sizes significantly.

With the audio combined into a multi-channel format, you can use a single websocket connection for live transcription. This simplifies your implementation and avoids the need for multiple streaming sessions.

Hope this helps!
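As a rough illustration of the single-connection approach, the sketch below builds a live-transcription URL that requests multichannel processing and maps a result back to a participant. The query parameters (`encoding`, `sample_rate`, `channels`, `multichannel`) and the `channel_index` field in streaming results reflect my understanding of Deepgram's API and should be checked against the current documentation; the helper names `live_url` and `speaker_for` are hypothetical.

```python
from urllib.parse import urlencode

def live_url(num_channels, sample_rate=16000):
    """Build a websocket URL for one multichannel streaming session.

    Assumes raw interleaved 16-bit PCM (linear16) is sent on the socket.
    """
    params = {
        "encoding": "linear16",
        "sample_rate": sample_rate,
        "channels": num_channels,
        "multichannel": "true",
    }
    return "wss://api.deepgram.com/v1/listen?" + urlencode(params)

def speaker_for(message, participants):
    """Map a parsed result message to a participant name.

    Assumes the message carries channel_index as [channel, total_channels],
    and that participants is ordered the same way as the audio channels.
    """
    channel = message["channel_index"][0]
    return participants[channel]
```

The participant list has to stay in the same order as the interleaved channels for the whole session, so a participant who joins late still needs a channel reserved up front.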
-
Hey @aldofunes, did you figure anything out while keeping a stream per user? I'm receiving audio from Discord, and all they give is one stream in Opus format per participant.