Multi-thread inference #880

csxiang18 · 2021-04-20T03:13:34Z

csxiang18
Apr 20, 2021

Hi, I have a quick question about the multi-thread inference.

I have a tomcat server to provide the dnn prediction service. And I packaged the model like this:

public class DnnService {
    // load the model from a pretrained model.
    private static ZooModel<float[], float[]> model;
    static {
        model = loadPretrainedModel();
    }

    public float[] predict(float[] features) {
        Predictor<float[], float[] > predictor = model.newPredictor();
        try {
            float[] result = predictor.predict(features);
            return result;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return DEFAULT;
    }
}

Each tomcat thread will use dnnService.predict(features);. Will this cause any performance or GC issues? Or is this the correct way to use? I see from the doc that for each thread, we should use a newPredictor. But when I use this like this way, I see that the memory seems to increase a lot.

frankfliu · 2021-04-20T03:23:12Z

frankfliu
Apr 20, 2021

@csxiang18
If you want a model-server solution, I really recommend you to consider djl-serving: https://github.com/awslabs/djl/tree/master/serving/serving, it handles many complex cases for model serving.

Your code will cause memory leak, the predictor should be closed explicitly:

    public float[] predict(float[] features) {
        try (Predictor<float[], float[] > predictor = model.newPredictor()) {
            float[] result = predictor.predict(features);
            return result;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return DEFAULT;
    }

But Predictor creation may has it's own cost (depends on engine, MXNet engine has minimal cost of creating new Predictor). We recommend you to create a Predictor per thread, or use Object pool. The easiest way (not cleanest way) is to use ThreadLocal.

1 reply

csxiang18 Apr 20, 2021
Author

@frankfliu Thanks so much for answering this question! I think ThreadLocal is more convenient for me to use. By the way, is this the correct way for thread local?

private static ThreadLocal<Predictor<float[], float[]>> localPredictor = new ThreadLocal<>();
public float[] predict(float[] features) {
    if (localPredictor.get() == null) {
        localPredictor.set(model.newPredictor());
    } 
    try {
        float[] result = localPredictor.get().predict(features);
        return result;
    } catch (Exception e) {
        e.printStackTrace();
    }
    return DEFAULT;
}

frankfliu · 2021-04-20T15:43:48Z

frankfliu
Apr 20, 2021

The easiest way is:

private static ThreadLocal<Predictor<float[], float[]>> localPredictor = ThreadLocal.withInitial(model.newPredictor());
public float[] predict(float[] features) {
    return localPredictor.get().predict(features);
}

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multi-thread inference #880

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 1 reply

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Multi-thread inference #880

csxiang18 Apr 20, 2021

Replies: 2 comments · 1 reply

frankfliu Apr 20, 2021

csxiang18 Apr 20, 2021 Author

frankfliu Apr 20, 2021

csxiang18
Apr 20, 2021

Replies: 2 comments 1 reply

frankfliu
Apr 20, 2021

csxiang18 Apr 20, 2021
Author

frankfliu
Apr 20, 2021