Unable to query engine in offline mode to ask user whether to download #2875

petebankhead · 2023-11-26T12:57:16Z

Description

I maintain an open-source desktop application that uses DJL (which is great, thanks!).

Our complication is that we

don't want to inflate our downloads with the engines (which not every user will need)
don't want to download anything without user agreement

DJL's ability to download engines on demand solves 1.

What we need for 2. is a way to query whether an engine is available without prompting any download if it is not.

I've called this a 'bug' rather than 'enhancement' because it was possible to solve this in DJL v0.24.0 and earlier using System.setProperty("offline", "true");.

However since #2826 this strategy no longer works, so it fails in v0.25.0.

Related issue:

The security considerations of automatic library downloads #2632

How to Reproduce?

I've written a short demo of the issue at
https://gist.github.com/petebankhead/836c8e0db8a4e4d9f0d5ecdb032a08ba

I have focussed only on PyTorch.

Error Message

Current behavior, after #2826 (v0.25.0)

I get the following messages using the current master:

Creating cache dir /var/folders/5f/l578vlrn29gd42rdsqbsv1380000gn/T/djl-cache350144786515737871
Can't get engine offline, I'd now prompt the user now to get it online
Engine is null

Error getting engine offline: java.lang.ExceptionInInitializerError
Error getting engine online: java.lang.NoClassDefFoundError: Could not initialize class ai.djl.pytorch.engine.PtEngineProvider$InstanceHolder

Without #2826 (v0.24.0)

Replacing PtEngineProvider to remove the changes in #2826 it works as expected

Creating cache dir /var/folders/5f/l578vlrn29gd42rdsqbsv1380000gn/T/djl-cache6785785179087611189
Can't get engine offline, I'd now prompt the user now to get it online
Error getting engine offline: ai.djl.engine.EngineException: Failed to save pytorch index file
[main] INFO ai.djl.pytorch.jni.LibUtils - Downloading https://publish.djl.ai/pytorch/2.1.1/cpu/osx-aarch64/native/lib/libtorch_cpu.dylib.gz ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Downloading https://publish.djl.ai/pytorch/2.1.1/cpu/osx-aarch64/native/lib/libtorch.dylib.gz ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Downloading https://publish.djl.ai/pytorch/2.1.1/cpu/osx-aarch64/native/lib/libc10.dylib.gz ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting jnilib/osx-aarch64/cpu/libdjl_torch.dylib to cache ...
[main] INFO ai.djl.pytorch.engine.PtEngine - PyTorch graph executor optimizer is enabled, this may impact your inference latency and throughput. See: https://docs.djl.ai/docs/development/inference_performance_optimization.html#graph-executor-optimization
[main] INFO ai.djl.pytorch.engine.PtEngine - Number of inter-op threads is 10
[main] INFO ai.djl.pytorch.engine.PtEngine - Number of intra-op threads is 10
Engine is PyTorch:2.1.1, capabilities: [
]
PyTorch Library: /var/folders/5f/l578vlrn29gd42rdsqbsv1380000gn/T/djl-cache6785785179087611189/pytorch/2.1.1-cpu-osx-aarch64

Expected Behavior

It should be possible to query programmatically whether an engine exists without immediately prompting the download.

Any failure to download the engine (because of offline mode) shouldn't prevent requesting the engine later.

What have you tried to solve it?

I can't see a workaround in v0.25, so we will likely need to stay with v0.24.

It seems that InstanceHolder can't be initialized in offline mode if the engine isn't available, so the class then isn't available.

Whenever that happens, all subsequent attempts fail with NoClassDefFoundError even if System.setProperty("offline", "false"); has been called.

Restoring the behavior prior to #2826 would enable our previous workaround to work, although I think a cleaner solution would be to provide an API to query a model that doesn't prompt a download.

I am reluctant to rely upon "offline" as a system property, and share @docent's opinion at #2632 (comment) that an alternative name would help avoid potential clashes.

Environment Info

I'm using Java 17 on Apple Silicon & can provide further detail if needed, but I expect the issue to affect all platforms.

The text was updated successfully, but these errors were encountered:

Fixes: deepjavalibrary#2875

frankfliu · 2023-11-26T19:12:01Z

I create an PR to address your offline key concern: #2877

Fixes: #2875

petebankhead · 2023-11-27T05:33:24Z

Thanks @frankfliu for looking into this so quickly.

I should clarify that PyTorch isn't the only engine affected - I expect that all that support downloading are affected. If I switch PyTorch to TensorFlow in my example code the same problem occurs:

Creating cache dir /var/folders/5f/l578vlrn29gd42rdsqbsv1380000gn/T/djl-cache16194401268772114452
Error getting engine offline: java.lang.ExceptionInInitializerError
Error getting engine online: java.lang.NoClassDefFoundError: Could not initialize class ai.djl.tensorflow.engine.TfEngineProvider$InstanceHolder
Engine is null

Therefore this fix only helps with PyTorch, but it's still not possible to query all engines offline without risking that all access to the engine will be broken through NoClassDefFoundError.

Is it possible to roll back #2826 for all engines, or address this another way?

Necessary because of deepjavalibrary/djl#2875

zachgk · 2023-11-29T23:16:35Z

One thing you can take a look at is the DJL Engine.registerEngine(EngineProvider). This will allow you to manually register an engine. For a more complete usage of this, take a look at the DJL Serving Dependency Manager. The dependency manager is used to add engines (downloading the jars, loading them into the JVM, then registering the engine) as models are added into the model server. This sounds like it might do most of what you are looking for

petebankhead · 2023-11-30T07:08:09Z

Thanks @zachgk I hadn't seen DJL Serving Dependency Manager - this looks like it could be a good way for us to add additional engines in the future.

I don't think I can use it to solve the problem here. We provide PyTorch and TensorFlow engines (although developers can build from source with other engines if needed) using the recommended pytorch-engine and tensorflow-engine.

We then provide a dialog showing potentially-available engines, so the user can manually request them to be downloaded (which temporarily switches the off "offline" system property, then simply request the engine again):

In the screenshot, the TensorFlow native libraries have been downloaded by the PyTorch ones have not.

Our problem is that we can only know that an engine is really available (i.e. natives available) by querying it through Engine.getEngine(name).

In 0.24.0 we could query with offline=true, catch an exception if it fails. This meant we could update the status, and allow the user to explicitly request the engine is downloaded if they want.

In 0.25.0 we can't, because querying with offline=true results in InstanceLoader becoming unloadable if the native libraries are missing. Then the engine becomes permanently unavailable.

#2876 addresses this for PyTorch, but not for TensorFlow. I'm not sure if other engines are affected by this issue. I realize we're doing things in a non-standard way, but it feels like a bug for a failed request for Engine.getEngine(name) to leave things in an invalid state that prevents the engine from being loaded later.

frankfliu · 2023-12-03T18:44:59Z

@petebankhead

We can make changes for TensorFlow. but changing system properties at runtime isn't thread-safe and system properties are global, it may impact other external modules. The Engine is meant for singleton. Rely on a lazy load singleton implementation is not a good idea.

Instead of try and fail approach, can you check the cache folder before you load Engine. If the cache directory is not exsist, you prompt user to download. So you wan't run into issue.

petebankhead · 2023-12-03T20:33:49Z

Thanks @frankfliu I understand there are ways to break the use of the system property, but I couldn't find a better way. The calls in our app are through a method that uses try/finally and synchronisation to try to reduce the risk of things going wrong.

I don't know how to check for the correct cache directory existing, because I don't see a way to determine which version and flavor of any specific engine will be required in advance. And even if I knew the cache directory, I wouldn't know in advance that it contains all the required dependencies.

In fact, with TensorFlow it is especially challenging if an incompatible CUDA is found - even when the CPU engine has already been downloaded. This seems to be because a call to Engine.toString() fails when offline because it triggers a call to findLibraryInClasspath() with a placeholder platform for the installed CUDA version. This results in requesting a download, which throws an exception if offline. If online, it fails more quietly due to the CUDA incompatibility, and then reverts to recognising the CPU version. So really, I can see no way to determine the TensorFlow engine properly in advance; a download needs to be requested before it can be known what the correct version/flavor combination is - and therefore the cache directory. I encountered this problem in qupath/qupath-extension-djl#13 (comment) (having reverted to 0.24.0).

The ideal solution for us would be to have two Engine.getEngine(String) methods to

Request an engine 'quietly', returning null if it is not downloaded and available
Request an engine with permission to download if required

An option for progress monitoring any download would also be helpful.... but not essential. Having 1. would enable us to know whether calling 2. is likely to return quickly or slowly, and we wouldn't need to work with the system property at all.

If this isn't possible, we may be able to figure out some other approach in our app that informs our users that any use of DJL requires going online. But I think the requirement for 'check an engine and control whether it is downloaded' could be general enough that if it can be solved in DJL itself that would be great.

frankfliu · 2023-12-04T18:34:58Z

@petebankhead
#2884

petebankhead · 2023-12-04T20:18:13Z

Thanks @frankfliu !

fixes deepjavalibrary#2875 reverts deepjavalibrary#2876 This adds new support for DJL manual initialization of engines to support `DJL_ENGINE_MANUAL_INIT`. Once done, no engines providers will be found or loaded on startup. Instead, they can be added manually by: ```java PtEngineProvider provider = new PtEngineProvider(); provider.getEngine(); // Optional, throws exception if the provider can not load Engine.registerEngine(provider); Engine.setDefaultEngine(provider.getEngineName()); // Optional, sets as default ```

zachgk · 2023-12-05T20:58:26Z

@petebankhead I think the issue here is that the Engine was not really intended to be used that way. It started out as loading all engines strictly on startup and would throw exceptions immediately. Then we made it lazy for performance, which also accidentally means it could handle that case. But it wasn't really designed for it.

I opened up a PR #2885 to support it better. Basically, it takes out the manual initialization so you can intentionally trigger the engine registration. You can also verify that the engines can initialize properly and then only perform the engine registration afterwards

petebankhead · 2023-12-05T21:32:22Z

Excellent, thank you @zachgk!

* Creates DJL manual engine initialization fixes #2875 reverts #2876 This adds new support for DJL manual initialization of engines to support `DJL_ENGINE_MANUAL_INIT`. Once done, no engines providers will be found or loaded on startup. Instead, they can be added manually by: ```java PtEngineProvider provider = new PtEngineProvider(); provider.getEngine(); // Optional, throws exception if the provider can not load Engine.registerEngine(provider); Engine.setDefaultEngine(provider.getEngineName()); // Optional, sets as default ``` * Revert "[tensorflow] Revert InstanceHolder for TensorFlow engine (#2884)" This reverts commit 586bb07. * Revert "[api] Replace double-check singlton with lazy initialization (#2826)" This reverts commit 3927867. * Make engines initialized This makes several updates: - engines will now initialize once per instance of EngineProvider rather than re-attempt to initialize - Registering an engine can overwrite the existing one - All engines now use the synchronized form rather than the static instance holder. This allows them to have multiple versions and be local to the instance rather than global (but the instance is saved globally) * Throws Exception on bad getEngine * Removes unnecessary check

Fixes: #2875

* Creates DJL manual engine initialization fixes #2875 reverts #2876 This adds new support for DJL manual initialization of engines to support `DJL_ENGINE_MANUAL_INIT`. Once done, no engines providers will be found or loaded on startup. Instead, they can be added manually by: ```java PtEngineProvider provider = new PtEngineProvider(); provider.getEngine(); // Optional, throws exception if the provider can not load Engine.registerEngine(provider); Engine.setDefaultEngine(provider.getEngineName()); // Optional, sets as default ``` * Revert "[tensorflow] Revert InstanceHolder for TensorFlow engine (#2884)" This reverts commit 586bb07. * Revert "[api] Replace double-check singlton with lazy initialization (#2826)" This reverts commit 3927867. * Make engines initialized This makes several updates: - engines will now initialize once per instance of EngineProvider rather than re-attempt to initialize - Registering an engine can overwrite the existing one - All engines now use the synchronized form rather than the static instance holder. This allows them to have multiple versions and be local to the instance rather than global (but the instance is saved globally) * Throws Exception on bad getEngine * Removes unnecessary check

petebankhead added the bug Something isn't working label Nov 26, 2023

frankfliu mentioned this issue Nov 26, 2023

[pytorch] Revert InstanceHolder for PyTorch engine #2876

Merged

frankfliu added a commit to frankfliu/djl that referenced this issue Nov 26, 2023

[pytorch] Revert InstanceHolder for PyTorch engine

9e040de

Fixes: deepjavalibrary#2875

frankfliu closed this as completed in #2876 Nov 26, 2023

frankfliu added a commit that referenced this issue Nov 26, 2023

[pytorch] Revert InstanceHolder for PyTorch engine (#2876)

e45f344

Fixes: #2875

petebankhead added a commit to petebankhead/qupath that referenced this issue Nov 29, 2023

Downgrade DJL to 0.24.0

d4e94a9

Necessary because of deepjavalibrary/djl#2875

petebankhead mentioned this issue Nov 29, 2023

Downgrade DJL to 0.24.0 qupath/qupath#1428

Merged

zachgk mentioned this issue Dec 5, 2023

Creates DJL manual engine initialization #2885

Merged

frankfliu added a commit that referenced this issue Apr 26, 2024

[pytorch] Revert InstanceHolder for PyTorch engine (#2876)

f41be74

Fixes: #2875

alanocallaghan mentioned this issue Apr 29, 2024

Update for DJL > 0.24 qupath/qupath-extension-wsinfer#56

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unable to query engine in offline mode to ask user whether to download #2875

Unable to query engine in offline mode to ask user whether to download #2875

petebankhead commented Nov 26, 2023

frankfliu commented Nov 26, 2023

petebankhead commented Nov 27, 2023

zachgk commented Nov 29, 2023

petebankhead commented Nov 30, 2023

frankfliu commented Dec 3, 2023 •

edited

Loading

petebankhead commented Dec 3, 2023

frankfliu commented Dec 4, 2023

petebankhead commented Dec 4, 2023

zachgk commented Dec 5, 2023

petebankhead commented Dec 5, 2023

Unable to query engine in offline mode to ask user whether to download #2875

Unable to query engine in offline mode to ask user whether to download #2875

Comments

petebankhead commented Nov 26, 2023

Description

How to Reproduce?

Error Message

Current behavior, after #2826 (v0.25.0)

Without #2826 (v0.24.0)

Expected Behavior

What have you tried to solve it?

Environment Info

frankfliu commented Nov 26, 2023

petebankhead commented Nov 27, 2023

zachgk commented Nov 29, 2023

petebankhead commented Nov 30, 2023

frankfliu commented Dec 3, 2023 • edited Loading

petebankhead commented Dec 3, 2023

frankfliu commented Dec 4, 2023

petebankhead commented Dec 4, 2023

zachgk commented Dec 5, 2023

petebankhead commented Dec 5, 2023

frankfliu commented Dec 3, 2023 •

edited

Loading