🐛 Fix request hanging in periodic concurrency greater than 4 #410
Conversation
@Tabrizian @rmccorm4 Is it reasonable to create a feature request for verbose mode on the server logging when a request without inputs is received? That would have helped save so much time debugging this.
Did you confirm the server is actually receiving the request?
I believe the server would return an error on reading the request, saying that the inputs didn't match the model config. So I suspect the server didn't even get this far.
@rmccorm4 No, this is just my guess of what happened after the request was sent. I was too tired to go further 😅 Given that the input wasn't properly generated for the request, I think there is a possibility that the request didn't even reach the server.
Creates parity with changes from triton-inference-server/client#410
While testing out the new periodic concurrency mode, @matthewkotila and I noticed that the 5th request would consistently hang forever, waiting for a response from the server.

The issue was that `max_threads_` defaulted to 4, and if this value was less than the concurrency value, any request created after the `max_threads_`-th one (e.g. the 5th one in our case) would be missing input data. This is because the `infer_data_manager` that prepares all the input data uses the `max_threads_` count to generate the input data and populate the requests with it. And I believe this is why the server, when it received the empty 5th request, did not pass it down to the model.
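A minimal sketch of the failure mode, assuming hypothetical names (`InferData`, `PrepareInferData`) modeled on the description above rather than the real perf_analyzer code: input data is generated for only `max_threads_` slots, so any request beyond that count has nothing prepared for it.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the prepared input tensors of one request.
struct InferData {};

// Inputs are generated for max_threads slots only, mirroring how the
// infer_data_manager sized its data by the max_threads_ count.
std::vector<InferData> PrepareInferData(std::size_t max_threads) {
  std::vector<InferData> data;
  for (std::size_t i = 0; i < max_threads; ++i) {
    data.emplace_back();  // populate inputs for slot i
  }
  return data;
}

int main() {
  const std::size_t max_threads = 4;  // the old default
  const std::size_t concurrency = 5;
  auto data = PrepareInferData(max_threads);
  for (std::size_t req = 0; req < concurrency; ++req) {
    if (req >= data.size()) {
      // With max_threads == 4 and concurrency 5, the 5th request is empty.
      std::printf("request %zu has no input data prepared\n", req + 1);
    }
  }
}
```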
The fix checks the `max_threads_` value against the concurrency range (similar to how regular concurrency mode does). I'm not quite sure why we have the `max_threads_` option in the first place, but I feel like we should eventually move away from using `max_threads_` and just deduce the thread count from the concurrency level.

Thanks @matthewkotila for drilling down and investigating the bug, and @Tabrizian & @rmccorm4 for tips and insights.
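A sketch of the kind of check the fix describes, again with assumed names (`ConcurrencyRange`, `AdjustedMaxThreads`); the actual change in the client repo may be structured differently:

```cpp
#include <algorithm>
#include <cstddef>

// Assumed shape of the concurrency range swept by periodic concurrency mode.
struct ConcurrencyRange {
  std::size_t start;
  std::size_t end;  // highest concurrency level the sweep will reach
};

// Raise max_threads to cover the peak concurrency, so input data gets
// prepared for every concurrent request (as regular concurrency mode does).
std::size_t AdjustedMaxThreads(std::size_t max_threads,
                               const ConcurrencyRange& range) {
  return std::max(max_threads, range.end);
}

int main() {
  const ConcurrencyRange range{1, 5};
  const std::size_t max_threads = AdjustedMaxThreads(4, range);  // yields 5
  (void)max_threads;
}
```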