Help required in Deepgram Streaming Inference on Audio File #959

rkchamp25 · 2024-10-15T07:44:58Z

Hi

We are already using Deepgram and have it integrated into our app. Great product, by the way.

We want to transcribe an audio file in a streaming fashion using Deepgram, but we are unsure what the proper way to do that is.

We found this file in a repo linked from your documentation (https://github.com/deepgram/live-streaming-starter-kit/blob/main/test_suite.py); it streams audio and returns the transcription in a streaming fashion.
We also found this example (https://github.com/deepgram/deepgram-python-sdk/blob/main/examples/speech-to-text/rest/stream_file/main.py). It is named "stream file", but it returns the result only once, after completion.

We also notice differences in the transcription (and hence in WER) depending on whether we pass the complete MP3, pass 16 kHz mono WAV audio, use your stream file example, use the live-streaming starter kit example, transcribe through the website, or call the API.

Basically, we want to send chunks of the audio file to Deepgram and keep receiving the output in a streaming fashion. What is the correct and best way to do that?

Thank you


jkroll-deepgram · 2024-10-15T14:34:44Z

Collaborator

Hi @rkchamp25, we offer two transcription modes: streaming and pre-recorded. You will want streaming transcription, which uses a websocket connection rather than a REST API. That is what the first example you linked uses (the live-streaming starter kit).

The second example has a confusing name, but it is pre-recorded transcription (transcribes the whole file in one go, as you noted).

Here is our doc for getting started with streaming transcription. You can also look at our websocket directory within our code examples. The live-streaming starter kit is also a great resource.
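To make the difference between the two modes concrete, here is a minimal sketch of how the two endpoint URLs might be built. The helper name `listen_url` is hypothetical, and the host, the `/v1/listen` path, and the query parameters (`encoding`, `sample_rate`, `channels`, `interim_results`, `punctuate`) reflect my understanding of Deepgram's public API, so please check them against the current documentation before relying on them.

```python
from urllib.parse import urlencode

API_HOST = "api.deepgram.com"  # assumption: the public Deepgram API host


def listen_url(streaming: bool, **params: object) -> str:
    """Build a transcription URL (hypothetical helper, not part of any SDK).

    Streaming uses a WebSocket scheme (wss), pre-recorded uses plain HTTPS.
    """
    scheme = "wss" if streaming else "https"
    # Query-string booleans are written as lowercase "true"/"false".
    cleaned = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
    }
    base = f"{scheme}://{API_HOST}/v1/listen"
    query = urlencode(cleaned)
    return f"{base}?{query}" if query else base


# Streaming: raw 16 kHz mono 16-bit PCM, with interim results enabled.
streaming_url = listen_url(
    True, encoding="linear16", sample_rate=16000, channels=1, interim_results=True
)

# Pre-recorded: the whole file is sent in a single HTTPS POST request.
prerecorded_url = listen_url(False, punctuate=True)
```

Both requests would also need an `Authorization: Token <API key>` header, which is omitted here.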

4 replies

rkchamp25 Oct 15, 2024
Author

Thank you for the prompt reply, @jkroll-deepgram.

I am already aware of the things you have shared.

As I said, I am looking for a solution in which I take a pre-recorded audio file that I already have and get it transcribed in a streaming fashion by Deepgram. In other words, instead of sending audio from a remote URL/stream to Deepgram (as shown in your documentation/example), I will send chunks of that file and get a streaming response back from Deepgram.
What is the best way to do this?

You have an example for this (https://github.com/deepgram/live-streaming-starter-kit/blob/main/test_suite.py). I want to confirm whether this is the best way or whether we can do it some other way. Also, will sending a file this way be equivalent, in terms of accuracy, to sending audio from a remote stream?

jkroll-deepgram Oct 15, 2024
Collaborator

Hi @rkchamp25, yes, you can stream audio from any source - whether a user's microphone, a phone call, a file, etc. Deepgram's streaming transcription returns interim results every ~1 second, and final results every ~2-5 seconds. The live-streaming starter kit is a great starting point for how to stream a file to Deepgram using "plain Python" (no SDKs).
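As a rough illustration of streaming a file in chunks, the sketch below reads a local file in fixed-size pieces and hands each piece to a `send` callable, optionally pausing between chunks so the audio arrives at roughly real-time speed. It is transport-agnostic on purpose: `send` is a placeholder for whatever your WebSocket client exposes, and the names `iter_chunks`, `chunk_seconds`, and `stream_file` are hypothetical. The pacing math assumes raw 16-bit PCM, and the closing `{"type": "CloseStream"}` message is what I understand Deepgram's streaming API expects, so treat both as assumptions.

```python
import asyncio
import json
from typing import Awaitable, Callable, Iterator, Union


def iter_chunks(path: str, chunk_bytes: int = 3200) -> Iterator[bytes]:
    """Yield fixed-size byte chunks from a local audio file."""
    with open(path, "rb") as audio:
        while True:
            chunk = audio.read(chunk_bytes)
            if not chunk:
                break
            yield chunk


def chunk_seconds(chunk_bytes: int, sample_rate: int = 16000,
                  channels: int = 1, sample_width: int = 2) -> float:
    """Duration of one chunk of raw PCM audio (assumes uncompressed samples)."""
    return chunk_bytes / (sample_rate * channels * sample_width)


async def stream_file(send: Callable[[Union[bytes, str]], Awaitable[None]],
                      path: str, chunk_bytes: int = 3200,
                      realtime: bool = True) -> int:
    """Send a file chunk by chunk; return the number of audio chunks sent."""
    delay = chunk_seconds(chunk_bytes) if realtime else 0.0
    sent = 0
    for chunk in iter_chunks(path, chunk_bytes):
        await send(chunk)
        sent += 1
        if delay:
            # Pace the upload so it resembles a live audio source.
            await asyncio.sleep(delay)
    # Assumption: a JSON close message asks the server to flush final results.
    await send(json.dumps({"type": "CloseStream"}))
    return sent
```

With the defaults, each 3200-byte chunk holds 0.1 seconds of 16 kHz mono audio. In a real client, `send` would be the send method of an open WebSocket connection, and a second task would read transcription messages from the same connection while chunks are being sent.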

We do have a past version of that repository that uses the Python SDK, see here.

It would be your choice as to whether you prefer to work directly with websockets, or to use the Python SDK which offers some abstraction and conveniences.
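Whichever client is used, the interim and final results mentioned above have to be stitched together on the receiving side. The sketch below shows one way that might look, assuming each message is a JSON object with an `is_final` flag and the text under `channel.alternatives[0].transcript`, which is my understanding of the streaming response shape. The names `parse_result` and `TranscriptAssembler` are hypothetical.

```python
import json
from typing import List, Optional, Tuple


def parse_result(message: str) -> Optional[Tuple[bool, str]]:
    """Return (is_final, transcript), or None for messages without text."""
    data = json.loads(message)
    channel = data.get("channel")
    if not isinstance(channel, dict):
        return None  # e.g. a metadata message with no transcript payload
    alternatives = channel.get("alternatives") or []
    if not alternatives:
        return None
    transcript = alternatives[0].get("transcript", "")
    if not transcript:
        return None
    return bool(data.get("is_final", False)), transcript


class TranscriptAssembler:
    """Keep finalized segments and only the latest interim hypothesis."""

    def __init__(self) -> None:
        self.final_parts: List[str] = []
        self.interim: str = ""

    def feed(self, message: str) -> None:
        parsed = parse_result(message)
        if parsed is None:
            return
        is_final, transcript = parsed
        if is_final:
            # A final result replaces the interim guesses for that segment.
            self.final_parts.append(transcript)
            self.interim = ""
        else:
            self.interim = transcript

    def text(self) -> str:
        """Finalized text followed by the current interim hypothesis."""
        return " ".join(self.final_parts + ([self.interim] if self.interim else []))
```

Interim results are overwritten rather than appended because each one is a revised guess for the same stretch of audio. Only final results are accumulated.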

rkchamp25 Oct 18, 2024
Author

@jkroll-deepgram
Should the results of streaming a file be exactly the same as when we send the complete audio file at once?

jkroll-deepgram Oct 21, 2024
Collaborator

@rkchamp25, no, our streaming and pre-recorded endpoints are served by different models and may produce slightly different results. Pre-recorded transcription tends to have a word error rate (WER) about 2 percentage points lower in absolute terms, since it has greater context. For instance, we've benchmarked our English transcription at 8.4% WER for pre-recorded audio and 10.7% WER for streaming audio.
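For anyone who wants to compare the two modes on their own audio, WER is commonly computed as the word-level edit distance between a reference transcript and the hypothesis, divided by the number of reference words. The sketch below is a generic implementation of that definition, not Deepgram's benchmarking code. The lowercase-and-split normalization is a simplifying assumption, and real evaluations usually normalize punctuation and numbers as well.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# The quoted benchmark gap, in absolute percentage points and relative terms.
prerecorded_wer, streaming_wer = 8.4, 10.7
absolute_gap = streaming_wer - prerecorded_wer
relative_gap = absolute_gap / prerecorded_wer
```

With the quoted figures, the absolute gap is about 2.3 percentage points, which corresponds to roughly 27% more errors in relative terms. Small differences in normalization or input format (MP3 versus 16 kHz WAV, for example) can also shift a measured WER, so it helps to fix one reference transcript and one normalization when comparing.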

Answer selected by deepgram-community
