Diarization for live transcription with multichannel to enable attribution #814

aldofunes · 2024-06-16T14:51:40Z

I collect audio from video meetings (Zoom, Teams, Meet, etc.) in separate streams. In other words, I have one byte stream per speaker/participant. These streams are not contiguous, since there is nothing to collect while a participant is muted.

I am dumping everything into files, filling the gaps with zeroes, and interleaving the streams into n channels, where n is the total number of participants. That way, by using the multichannel feature with prerecorded transcription, I can be sure who said what and attribute each channel to a particular user with confidence, even when speakers overlap.

This is a bit inefficient, since filling the gaps with zeroes means larger files and more usage, and the larger n is, the worse it gets. (Normally only one user is speaking at a time, so efficiency would be roughly 1/n.)
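The padding-and-interleaving approach described above can be sketched roughly as follows. This is only an illustration, not the poster's actual code: the `segments` shape (`{ startSample, samples }` per participant) and both function names are assumptions, and the audio is assumed to be 16-bit mono PCM at a shared sample rate.

```javascript
// Sketch: build an n-channel interleaved 16-bit PCM buffer from per-participant
// segments, leaving zeroes wherever a participant was muted.
// `segments` is a hypothetical shape: one array per participant, each entry
// being { startSample, samples: Int16Array }.
function interleaveWithSilence(segments, totalSamples) {
  const channels = segments.length;
  // Int16Array is zero-initialised, so untouched frames are already silence.
  const out = new Int16Array(totalSamples * channels);
  segments.forEach((participantSegments, channel) => {
    for (const { startSample, samples } of participantSegments) {
      for (let i = 0; i < samples.length; i++) {
        const frame = startSample + i;
        if (frame >= totalSamples) break; // ignore audio past the end
        out[frame * channels + channel] = samples[i];
      }
    }
  });
  return out;
}

// Fraction of the output that carries real audio rather than padding.
// Assumes no segment runs past totalSamples.
function usefulFraction(segments, totalSamples) {
  const spoken = segments.flat().reduce((sum, s) => sum + s.samples.length, 0);
  return spoken / (totalSamples * segments.length);
}
```

With two participants who each speak for half of the recording and never overlap, `usefulFraction` comes out at 0.5, which matches the 1/n estimate.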

I want to switch over to live transcription, especially since I can collect the audio in order and stream it almost instantaneously as the meeting progresses. The problem I'm facing is that to keep audio in separate streams, I'm finding myself opening n websocket connections, essentially opening multiple streaming transcript sessions per video call, which seems a bit excessive.

Ideally, everything could be sent in one websocket session, using the first byte or two of each chunk to tell Deepgram which channel it belongs to.

Has anyone faced this? Is there an elegant solution to this? I feel like a caveman pumping bytes left and right.

Answered by SandraRodgers


team-deepgram · 2024-06-16T14:51:49Z

Maintainer

Thanks for asking your question about Deepgram! If you didn't already include it in your post, please be sure to add as much detail as possible so we can assist you efficiently, such as:

The request_id if you have a question about your requests or transcription responses.
The features you used or the full api.deepgram.com URL you sent your request to, including parameters.
Any code snippets you can share.

0 replies

wapdat · 2024-06-25T15:50:30Z

I have no answer, but I am trying to do the same thing: merging audio device streams into a single multichannel stream.

Has anyone ever done this?

Something like this:

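    instance.data.createMultiChannelStream = async function createMultiChannelStream(deviceIds) {
        try {
            console.log("Merging stream ids: " + deviceIds);

            const audioContext = new AudioContext();
            // One merger input per device; each input becomes its own output channel.
            const channelMerger = audioContext.createChannelMerger(deviceIds.length);

            // Open every device in parallel and wrap each one in a source node.
            const audioPromises = deviceIds.map(async (deviceId) => {
                const stream = await navigator.mediaDevices.getUserMedia({ audio: { deviceId: { exact: deviceId } } });
                const source = audioContext.createMediaStreamSource(stream);
                return { source, stream };
            });

            const audioSources = await Promise.all(audioPromises);
            audioSources.forEach((audioSource, index) => {
                // Route output 0 of each source to merger input `index`.
                audioSource.source.connect(channelMerger, 0, index);
            });

            // Connect the merger to a MediaStream destination instead of
            // audioContext.destination, which would only play it on the speakers
            // and leave the returned track silent.
            const destination = audioContext.createMediaStreamDestination();
            // The destination defaults to 2 channels, so ask for one per device.
            // Browsers may cap or reject higher counts; check what the resulting
            // track actually carries.
            destination.channelCount = deviceIds.length;
            destination.channelInterpretation = "discrete";
            channelMerger.connect(destination);

            const outputTrack = destination.stream.getAudioTracks()[0];
            const multiChannelStream = new MediaStream([outputTrack]);

            return multiChannelStream;

        } catch (error) {
            console.error("Error creating multi-channel stream:", error);
            throw error;  // Propagate the error
        }
    }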

0 replies

SandraRodgers · 2024-06-26T19:20:56Z

Maintainer

Hi @aldofunes and @wapdat,

You may not need to go to these lengths to achieve what you want with Deepgram's livestream STT API.

Many meeting applications like Zoom, Teams, and Meet can provide all participant audio in a single stream, where each participant is assigned a unique channel within the multi-channel audio stream. Deepgram supports multi-channel audio transcription, which can handle this efficiently without needing to manually interleave and pad audio data. In fact, our multi-channel feature was designed with this use case in mind.

Therefore, I recommend first checking if your video meeting application can output a multi-channel audio stream directly. This way, each participant's audio is on a separate channel in the same audio stream.

With a multi-channel audio stream, you can leverage Deepgram's multi-channel transcription feature to identify and transcribe each participant separately. This avoids the need for zero-padding and reduces file sizes significantly.

By combining the audio into a multi-channel format, you can use a single websocket connection for live transcription. This simplifies your implementation and avoids the need for multiple streaming sessions.
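As a rough sketch of what the single-connection setup might look like on the client side, the helpers below build a streaming URL with multichannel enabled and attribute each result to a participant. Both function names and the `participants` array are hypothetical. The sketch assumes raw 16-bit PCM at 16 kHz, and assumes that each result message carries `channel_index` as `[channel, totalChannels]` with the text under `channel.alternatives[0].transcript`; check these against the current API reference before relying on them.

```javascript
// Build the live transcription URL for an n-channel stream.
// Assumed parameters: raw 16-bit PCM at 16 kHz; adjust to match your audio.
function buildListenUrl(channelCount, sampleRate = 16000) {
  const params = new URLSearchParams({
    encoding: "linear16",
    sample_rate: String(sampleRate),
    channels: String(channelCount),
    multichannel: "true",
  });
  return `wss://api.deepgram.com/v1/listen?${params}`;
}

// Map one parsed result message to { participant, text }, or null when the
// message is not a transcript or the transcript is empty.
// `participants[i]` is the (hypothetical) name of whoever is on channel i.
function attributeResult(message, participants) {
  if (message.type !== "Results") return null;
  const [channel] = message.channel_index ?? [];
  const text = message.channel?.alternatives?.[0]?.transcript ?? "";
  if (channel === undefined || text === "") return null;
  return { participant: participants[channel] ?? `channel ${channel}`, text };
}
```

In use, you would open one websocket with `buildListenUrl(n)`, send the interleaved audio as binary messages, and pass each `JSON.parse`d message to `attributeResult`.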

Hope this helps!

0 replies

NotDemonix · 2024-12-24T19:43:47Z

Hey @aldofunes, did you figure anything out while keeping a stream per user? I'm receiving audio from Discord, and all they give is a stream in Opus format per participant.
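For the stream-per-participant case, one possible approach is to interleave frames on a fixed tick so that a single multichannel stream can be sent live. The sketch below is an assumption-laden illustration, not a tested Discord integration: the `LiveInterleaver` class is hypothetical, Opus is assumed to have been decoded to 16-bit mono PCM elsewhere, and every frame is assumed to have the same number of samples.

```javascript
// Sketch: combine per-participant PCM frames into interleaved multichannel
// frames, one per tick. Participants with nothing queued contribute silence.
class LiveInterleaver {
  constructor(participantIds, frameSamples) {
    this.ids = participantIds;        // channel order follows this array
    this.frameSamples = frameSamples; // samples per frame, per participant
    this.pending = new Map();         // participant id -> queue of Int16Array
    for (const id of participantIds) this.pending.set(id, []);
  }

  // Call whenever a decoded frame arrives for a participant.
  push(id, frame) {
    const queue = this.pending.get(id);
    if (!queue) throw new Error(`unknown participant: ${id}`);
    if (frame.length !== this.frameSamples) throw new Error("unexpected frame size");
    queue.push(frame);
  }

  // Call once per frame interval (for example from a timer). Returns one
  // interleaved frame ready to send over the websocket.
  tick() {
    const n = this.ids.length;
    const out = new Int16Array(this.frameSamples * n); // zero-filled = silence
    this.ids.forEach((id, channel) => {
      const frame = this.pending.get(id).shift();
      if (!frame) return; // muted or late: leave this channel silent
      for (let i = 0; i < this.frameSamples; i++) out[i * n + channel] = frame[i];
    });
    return out;
  }
}
```

Note that this still pads silent channels, so it does not remove the 1/n overhead discussed at the top of the thread; it only replaces n connections with one.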

0 replies
