BentoML Adaptive Microbatching Discussion thread #927
-
To enable this feature when deploying as a Kubernetes service, you need to pass an environment variable to the deployment. If you have not specified any AMB parameter values when defining the API, the defaults are used.
-
What I have not been able to figure out is how to change these parameters after the Docker image has been built. The reason I want to do this is that I have found, via experiments on my local machine, that certain (higher) values of the above parameters lead to a large number of failed requests. I am also concerned that the optimal parameters may be different in the cloud than on my local machine, and hence it will be extremely tedious for me to try various parameters by repeatedly rebuilding a Docker image and re-uploading it. At the moment I am implementing a retry on the sender side, but that is affecting throughput for obvious reasons. Any ideas?
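As an aside on the sender-side retry mentioned above, a minimal sketch of such a client might look like the following (the endpoint URL, payload shape, and retry settings are assumptions for illustration, not taken from the thread):

```python
import time

import requests

# Hypothetical endpoint of the deployed service; adjust to your deployment.
PREDICT_URL = "http://my-bento-service/predict"


def predict_with_retry(payload, max_retries=5, base_delay=0.1):
    """POST a payload and retry with exponential backoff when the request
    fails or is rejected (e.g. when the micro-batching queue is overloaded)."""
    for attempt in range(max_retries):
        try:
            resp = requests.post(PREDICT_URL, json=payload, timeout=5)
            if resp.ok:
                return resp.json()
        except requests.RequestException:
            pass  # treat connection errors and timeouts like failed attempts
        # The backoff sleep is exactly what hurts end-to-end throughput.
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"request failed after {max_retries} attempts")


# Usage:
# result = predict_with_retry({"feature_1": 0.3, "feature_2": 1.7})
```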
-
Hi @bakshienator77, great questions!

**Setting batching parameters after the docker image was built**

Right now there isn't a way to change these parameters once the docker image has been built. It is a great point that a user may want to set different values for those parameters when deploying to different hardware. I think we should definitely add support for this; I just created an issue for it here: #930

**Tuning the micro-batching parameters**

BentoML automatically adjusts the actual batch size and wait time (the throughput vs. latency tradeoff) in real time, via a regression model built from past inference requests. So these two parameters are not always the actual latency or batch size; they are an optimization target that the user sets for BentoML.

MAX_LATENCY is typically determined by your use case, e.g. your end users can wait up to 2 seconds for the page to load, or other SLA requirements. It gives BentoML a latency target to hit; otherwise BentoML would just keep waiting for new requests to come in, which increases the batch size and yields better throughput (if memory allows, increasing the batch size almost always brings better throughput).

MAX_BATCH_SIZE is typically determined by your model and your hardware (mostly memory, or GPU memory). Since BentoML adjusts the batch size in real time based on the compute time of past inference requests, users don't really need to set this parameter most of the time. The only time you may want to change it is when you know for certain that sending a batch larger than this amount would cause a problem.

Hope this helps, and apologies for the confusion; we are working on improving the related documentation.
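To make this concrete, here is a rough sketch of how these two targets are usually expressed in a BentoML 0.x service definition, assuming they correspond to the `mb_max_batch_size` and `mb_max_latency` keyword arguments of the API decorator (the service class, artifact, and values below are illustrative, not taken from this thread):

```python
import bentoml
from bentoml.adapters import DataframeInput
from bentoml.frameworks.sklearn import SklearnModelArtifact


@bentoml.env(infer_pip_packages=True)
@bentoml.artifacts([SklearnModelArtifact("model")])
class MyClassifier(bentoml.BentoService):
    # batch=True enables adaptive micro-batching for this endpoint.
    # mb_max_batch_size and mb_max_latency are the optimization *targets*
    # described above, not hard guarantees for every individual batch.
    @bentoml.api(
        input=DataframeInput(),
        batch=True,
        mb_max_batch_size=1000,  # cap driven by model + hardware (memory / GPU memory)
        mb_max_latency=2000,     # latency budget in milliseconds (e.g. a 2 s SLA)
    )
    def predict(self, df):
        # df holds rows micro-batched from multiple concurrent HTTP requests
        return self.artifacts.model.predict(df)
```

With this kind of setup, BentoML still picks the actual batch size and wait time adaptively at runtime; these two values only bound that search.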
-
I'm quite interested in how BentoML implemented this adaptive method. I checked the source code, but I'm confused about this part: BentoML/bentoml/marshal/dispatcher.py, line 175 (commit c7e6424).
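For readers trying to follow that code, here is a simplified, self-contained illustration of the general idea described earlier in the thread (a regression over past requests used to decide when to flush a batch). This is not the actual dispatcher.py implementation, just a sketch of the concept:

```python
class BatchTimeEstimator:
    """Simplified illustration: keep a running linear estimate of batch
    processing time, t(n) ~= a * n + b, learned from past observations,
    and use it to decide when the dispatcher should flush a batch."""

    def __init__(self, decay=0.95):
        self.a = 0.001   # estimated extra seconds per item in a batch
        self.b = 0.010   # estimated fixed overhead per batch, in seconds
        self.decay = decay

    def observe(self, batch_size, duration):
        # Exponentially weighted update toward the newest (size, duration) sample.
        a_obs = (duration - self.b) / max(batch_size, 1)
        self.a = self.decay * self.a + (1 - self.decay) * max(a_obs, 0.0)
        b_obs = max(duration - self.a * batch_size, 0.0)
        self.b = self.decay * self.b + (1 - self.decay) * b_obs

    def estimated_duration(self, batch_size):
        return self.a * batch_size + self.b

    def should_flush(self, queued, oldest_wait, max_latency, max_batch_size):
        # Flush when the size cap is reached, or when waiting any longer would
        # risk blowing the latency budget of the oldest queued request.
        if queued >= max_batch_size:
            return True
        return oldest_wait + self.estimated_duration(queued) >= max_latency


# Usage:
# est = BatchTimeEstimator()
# est.observe(batch_size=8, duration=0.05)
# est.should_flush(queued=4, oldest_wait=0.3, max_latency=1.0, max_batch_size=32)
```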
-
Hi @parano, I have a BentoService with a prediction API, and by decorating that API method with the AMB parameters (a max batch size of 32 and a max latency of 1 s), I am expecting that the service will accumulate inputs until it reaches a batch size of 32 or a latency of 1 s, right? Am I misunderstanding or doing something wrong here?
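A minimal sketch of the kind of service definition described here, assuming the BentoML 0.x `mb_max_batch_size` / `mb_max_latency` keyword arguments (class name and model logic are placeholders, not the poster's actual code):

```python
import bentoml
from bentoml.adapters import JsonInput


class MyService(bentoml.BentoService):
    @bentoml.api(
        input=JsonInput(),
        batch=True,
        mb_max_batch_size=32,  # accumulate at most 32 inputs per micro-batch
        mb_max_latency=1000,   # flush after roughly 1 s at most (milliseconds)
    )
    def predict(self, inputs):
        # 'inputs' is the list of payloads grouped into one micro-batch
        return [len(str(x)) for x in inputs]  # placeholder model logic
```

Per the explanation earlier in the thread, these two values act as targets and caps rather than exact behavior: BentoML may flush smaller batches sooner based on its runtime estimates.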
-
I would like to start this thread for people to discuss best practices regarding the AMB feature in BentoML. A few points I think would benefit many users: how to enable and configure the AMB parameters, how the adaptive batching behaves in practice, and how to tune the parameters for different deployment environments.
I hope this thread is useful for all.