BentoML Adaptive Microbatching Discussion thread #927
-
To enable this feature when deploying as a Kubernetes service, you need to pass an environment variable to the deployment. If you have not specified any AMB parameter values when defining the API, the defaults are used.
-
What I have not been able to figure out is how to change these parameters after the Docker image has been built. The reason I want to do this is that I have found, via experiments on my local machine, that certain (higher) values of the above parameters lead to a large number of failed requests. I am also concerned that the optimal parameters may be different in the cloud than on my local machine, and hence it will be extremely tedious for me to try various parameters by repeatedly rebuilding a Docker image and re-uploading it. At the moment I am implementing a retry on the sender side, but that is affecting throughput for obvious reasons. Any ideas?
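As an aside on the sender-side retry mentioned above, a minimal sketch of such a client might look like the following (the endpoint URL, payload shape, and retry settings are assumptions for illustration, not taken from the thread):

```python
import time

import requests

# Hypothetical endpoint of the deployed service; adjust to your deployment.
PREDICT_URL = "http://my-bento-service/predict"


def predict_with_retry(payload, max_retries=5, base_delay=0.1):
    """POST a payload and retry with exponential backoff when the request
    fails or is rejected (e.g. when the micro-batching queue is overloaded)."""
    for attempt in range(max_retries):
        try:
            resp = requests.post(PREDICT_URL, json=payload, timeout=5)
            if resp.ok:
                return resp.json()
        except requests.RequestException:
            pass  # treat connection errors and timeouts like failed attempts
        # The backoff sleep is exactly what hurts end-to-end throughput.
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"request failed after {max_retries} attempts")


# Usage:
# result = predict_with_retry({"feature_1": 0.3, "feature_2": 1.7})
```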
-
Hi @bakshienator77, great questions!

**Setting batching parameters after the docker image was built**

Right now there isn't a way to change these parameters once the docker image has been built. It is a great point that a user may want to set different values for those parameters when deploying to different hardware. I think we should definitely add support for this; I just created an issue for it here: #930

**Tuning the micro-batching parameters**

BentoML automatically adjusts the actual batch size and wait time (the throughput vs. latency tradeoff) in real time, via a regression model built from past inference requests. So these two parameters are not always the actual latency or batch size; they are an optimization target that the user sets for BentoML.

MAX_LATENCY is typically determined by your use case, e.g. your end users can wait up to 2 seconds for the page to load, or other SLA requirements. It gives BentoML a latency target to hit; otherwise BentoML would just keep waiting for new requests to come in, which increases the batch size and yields better throughput (if memory allows, increasing the batch size almost always brings better throughput).

MAX_BATCH_SIZE is typically determined by your model and your hardware (mostly memory, or GPU memory). Since BentoML adjusts the batch size in real time based on the compute time of past inference requests, users don't really need to set this parameter most of the time. The only time you may want to change it is when you know for certain that sending a batch larger than this amount would cause a problem.

Hope this helps, and apologies for the confusion; we are working on improving the related documentation.
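To make this concrete, here is a rough sketch of how these two targets are usually expressed in a BentoML 0.x service definition, assuming they correspond to the `mb_max_batch_size` and `mb_max_latency` keyword arguments of the API decorator (the service class, artifact, and values below are illustrative, not taken from this thread):

```python
import bentoml
from bentoml.adapters import DataframeInput
from bentoml.frameworks.sklearn import SklearnModelArtifact


@bentoml.env(infer_pip_packages=True)
@bentoml.artifacts([SklearnModelArtifact("model")])
class MyClassifier(bentoml.BentoService):
    # batch=True enables adaptive micro-batching for this endpoint.
    # mb_max_batch_size and mb_max_latency are the optimization *targets*
    # described above, not hard guarantees for every individual batch.
    @bentoml.api(
        input=DataframeInput(),
        batch=True,
        mb_max_batch_size=1000,  # cap driven by model + hardware (memory / GPU memory)
        mb_max_latency=2000,     # latency budget in milliseconds (e.g. a 2 s SLA)
    )
    def predict(self, df):
        # df holds rows micro-batched from multiple concurrent HTTP requests
        return self.artifacts.model.predict(df)
```

With this kind of setup, BentoML still picks the actual batch size and wait time adaptively at runtime; these two values only bound that search.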
-
I'm quite interested in how BentoML implemented this adaptive method. I checked the source code, but I'm confused about this part: BentoML/bentoml/marshal/dispatcher.py, line 175 (commit c7e6424).
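For readers trying to follow that code, here is a simplified, self-contained illustration of the general idea described earlier in the thread (a regression over past requests used to decide when to flush a batch). This is not the actual dispatcher.py implementation, just a sketch of the concept:

```python
class BatchTimeEstimator:
    """Simplified illustration: keep a running linear estimate of batch
    processing time, t(n) ~= a * n + b, learned from past observations,
    and use it to decide when the dispatcher should flush a batch."""

    def __init__(self, decay=0.95):
        self.a = 0.001   # estimated extra seconds per item in a batch
        self.b = 0.010   # estimated fixed overhead per batch, in seconds
        self.decay = decay

    def observe(self, batch_size, duration):
        # Exponentially weighted update toward the newest (size, duration) sample.
        a_obs = (duration - self.b) / max(batch_size, 1)
        self.a = self.decay * self.a + (1 - self.decay) * max(a_obs, 0.0)
        b_obs = max(duration - self.a * batch_size, 0.0)
        self.b = self.decay * self.b + (1 - self.decay) * b_obs

    def estimated_duration(self, batch_size):
        return self.a * batch_size + self.b

    def should_flush(self, queued, oldest_wait, max_latency, max_batch_size):
        # Flush when the size cap is reached, or when waiting any longer would
        # risk blowing the latency budget of the oldest queued request.
        if queued >= max_batch_size:
            return True
        return oldest_wait + self.estimated_duration(queued) >= max_latency


# Usage:
# est = BatchTimeEstimator()
# est.observe(batch_size=8, duration=0.05)
# est.should_flush(queued=4, oldest_wait=0.3, max_latency=1.0, max_batch_size=32)
```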
-
Hi @parano, I have a BentoService with a prediction API, and by decorating that API method with the AMB parameters (a max batch size of 32 and a max latency of 1 s), I am expecting that the service will accumulate inputs until it reaches a batch size of 32 or a latency of 1 s, right? Am I misunderstanding or doing something wrong here?
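A minimal sketch of the kind of service definition described here, assuming the BentoML 0.x `mb_max_batch_size` / `mb_max_latency` keyword arguments (class name and model logic are placeholders, not the poster's actual code):

```python
import bentoml
from bentoml.adapters import JsonInput


class MyService(bentoml.BentoService):
    @bentoml.api(
        input=JsonInput(),
        batch=True,
        mb_max_batch_size=32,  # accumulate at most 32 inputs per micro-batch
        mb_max_latency=1000,   # flush after roughly 1 s at most (milliseconds)
    )
    def predict(self, inputs):
        # 'inputs' is the list of payloads grouped into one micro-batch
        return [len(str(x)) for x in inputs]  # placeholder model logic
```

Per the explanation earlier in the thread, these two values act as targets and caps rather than exact behavior: BentoML may flush smaller batches sooner based on its runtime estimates.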
-
I would like to start this thread for people to discuss best practices regarding the AMB feature in BentoML. A few points I think would benefit many users: how to enable and configure the AMB parameters, how the adaptive batching behaves in practice, and how to tune the parameters for different deployment environments.
I hope this thread is useful for all.