Real-time Speech To Text - Always Send Finalize Response #1035
-
Hey! I have a feature request for the real-time speech-to-text WebSocket API.

I'm working on a system that streams audio from a user and converts it to text. Sometimes the audio messages are long (30 seconds to 1 minute), but sometimes they are as short as one word. I stream the audio in real time, meaning I get back chunks of the transcription. When the audio clip is done, I send the Finalize message to ensure the audio stream is flushed.

My problem is that sending the Finalize message doesn't always return a finalize response. From looking at the docs, this looks like it's expected. If the Finalize message results in more transcript text coming back, the [...]

In my system, I want to return the text as fast as possible. However, since I don't always get a finalize response, I don't know if the text I already have is all of it or if a finalize response is coming with more text. So right now, I either have to return what I have or wait an arbitrary amount of time to ensure there's no final text coming back.

In summary, it would be fantastic if the server always acked the Finalize message so the client can know for sure that all audio has been flushed and the final text has been returned.

Thanks!
Replies: 8 comments 1 reply
-
Thanks for asking your question. Please be sure to reply with as much detail as possible so the community can assist you efficiently.
-
Hey there! It looks like you haven't connected your GitHub account to your Deepgram account. You can do this at https://community.deepgram.com. Being verified through this process will allow our team to help you in a much more streamlined fashion.
-
It looks like we're missing some important information to help debug your issue. Would you mind providing us with the following details in a reply?
-
Thanks for submitting this. I'm not sure we'd implement this in our product, but I can leave it open to see what others think. I think the key thing to remember here is what we call out in our Docs:

Upon receiving the Finalize message, the server will process all remaining audio data and return the final results. You may receive a response with the [...]

So you won't get a response if there is NOT a noticeable amount of audio buffered in the server.

In Simpler Terms: When the server gets the "Finalize" message, it wraps up any remaining audio processing and sends the final transcription or results back. The server lets the client know that this is complete by sending a response with the [...]

https://developers.deepgram.com/docs/finalize#what-is-the-finalize-message
-
Thanks! Yeah, so my problem is that I'm streaming audio up and getting text back at unpredictable times, with no relationship I can observe between the amount of audio I stream up and the text I get back. So I have no way of knowing how much unprocessed audio is on the server, and when I send the Finalize message, I don't know whether I will get a message back.

My goal is to return the full-text transcript as fast as possible when the audio is done streaming. The problem is that I don't know whether I need to wait for the Finalize response to flush out the remaining audio and return text, or whether I already have everything and the Finalize message will do nothing. Thus, there's no 100% correct way to implement my system.

The safe thing to do is send the Finalize message after the audio stream is done and then wait an arbitrary amount of time, like 100ms. That at least allows some time to get the final text from the Finalize if it's coming back. But it seems that 80% of the time, the Finalize message doesn't return any text for me, so 80% of the time I'm wasting 100ms. I'm using Deepgram over Google, Amazon, etc. because of its latency, so 100ms is a lot for me!

Anyway, it would be great if this gets fixed, but no problem if not. I just wanted to clarify my issue and explain why it's impossible to implement my scenario well with the current system. Thanks!
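
For illustration, the "send Finalize, then wait a fixed amount of time" workaround described here might look roughly like the sketch below. It is not Deepgram's client code: `send` and `recv` are hypothetical async callables standing in for a WebSocket connection, and the flat `transcript` field is a simplified, assumed message shape rather than the real response format. Only the `{"type": "Finalize"}` control message comes from the thread and the linked docs.

```python
import asyncio
import json


async def flush_with_timeout(send, recv, timeout_s=0.1):
    """Send Finalize, then collect text until `recv` stays quiet for `timeout_s`.

    `send` and `recv` are hypothetical async callables wrapping a WebSocket.
    The flat "transcript" key is an assumed, simplified message shape.
    """
    await send(json.dumps({"type": "Finalize"}))
    pieces = []
    while True:
        try:
            raw = await asyncio.wait_for(recv(), timeout=timeout_s)
        except asyncio.TimeoutError:
            # Nothing arrived in time: guess that the server buffer was empty.
            break
        text = json.loads(raw).get("transcript", "")
        if text:
            pieces.append(text)
    return " ".join(pieces)
```

The weakness is visible in the `except` branch: whenever no response is coming, the caller always pays the full `timeout_s` before it can return, which is the wasted 100ms described above.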
-
Thanks for the details. One thing you can try is using CloseStream instead of Finalize to close your audio stream; you'll get back a confirmation message every time you do this. Finalize is only for mid-stream finalization, like if you have your own client-side VAD.

Check out our Docs on this topic and let me know if this helps your situation:
https://developers.deepgram.com/docs/close-stream
-
Oh, that's a great tip! Originally I was going to keep the WebSockets open for multiple requests, but my request rate isn't very high, so for now I'm closing them. My main concern was the connection time, but the WebSocket connects fast enough at the start of the stream that it's ready before the stream ends. Thanks for the help!
-
Great to hear!
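
As a rough sketch of the CloseStream suggestion in this thread, a client can send `{"type": "CloseStream"}` and then simply read until the server stops sending, with no arbitrary timeout. The names here are assumptions for illustration: `send` is a hypothetical async callable, `messages` is any async iterable of raw JSON messages that ends when the connection closes, and the flat `transcript` field is a simplified stand-in for the real response shape.

```python
import asyncio
import json


async def close_and_collect(send, messages):
    """Send CloseStream, then read every remaining message until the stream ends.

    `send` is a hypothetical async callable; `messages` is an async iterable
    that is exhausted when the server closes the connection. The flat
    "transcript" key is an assumed, simplified message shape.
    """
    await send(json.dumps({"type": "CloseStream"}))
    pieces = []
    async for raw in messages:
        text = json.loads(raw).get("transcript", "")
        if text:
            pieces.append(text)
    # The loop ending is the signal that all buffered audio was flushed.
    return " ".join(pieces)
```

The design difference from a timed wait is that completion is signalled by the server rather than guessed by the client, at the cost of opening a new connection for the next audio stream.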