Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to query engine in offline mode to ask user whether to download #2875

Closed
petebankhead opened this issue Nov 26, 2023 · 10 comments · Fixed by #2876 or #2885
Closed

Unable to query engine in offline mode to ask user whether to download #2875

petebankhead opened this issue Nov 26, 2023 · 10 comments · Fixed by #2876 or #2885
Labels
bug Something isn't working

Comments

@petebankhead
Copy link
Contributor

Description

I maintain an open-source desktop application that uses DJL (which is great, thanks!).

Our complication is that we

  1. don't want to inflate our downloads with the engines (which not every user will need)
  2. don't want to download anything without user agreement

DJL's ability to download engines on demand solves 1.

What we need for 2. is a way to query whether an engine is available without prompting any download if it is not.

I've called this a 'bug' rather than 'enhancement' because it was possible to solve this in DJL v0.24.0 and earlier using System.setProperty("offline", "true");.

However since #2826 this strategy no longer works, so it fails in v0.25.0.

Related issue:

How to Reproduce?

I've written a short demo of the issue at
https://gist.github.com/petebankhead/836c8e0db8a4e4d9f0d5ecdb032a08ba

I have focussed only on PyTorch.

Error Message

Current behavior, after #2826 (v0.25.0)

I get the following messages using the current master:

Creating cache dir /var/folders/5f/l578vlrn29gd42rdsqbsv1380000gn/T/djl-cache350144786515737871
Can't get engine offline, I'd now prompt the user now to get it online
Engine is null

Error getting engine offline: java.lang.ExceptionInInitializerError
Error getting engine online: java.lang.NoClassDefFoundError: Could not initialize class ai.djl.pytorch.engine.PtEngineProvider$InstanceHolder

Without #2826 (v0.24.0)

Replacing PtEngineProvider to remove the changes in #2826 it works as expected

Creating cache dir /var/folders/5f/l578vlrn29gd42rdsqbsv1380000gn/T/djl-cache6785785179087611189
Can't get engine offline, I'd now prompt the user now to get it online
Error getting engine offline: ai.djl.engine.EngineException: Failed to save pytorch index file
[main] INFO ai.djl.pytorch.jni.LibUtils - Downloading https://publish.djl.ai/pytorch/2.1.1/cpu/osx-aarch64/native/lib/libtorch_cpu.dylib.gz ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Downloading https://publish.djl.ai/pytorch/2.1.1/cpu/osx-aarch64/native/lib/libtorch.dylib.gz ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Downloading https://publish.djl.ai/pytorch/2.1.1/cpu/osx-aarch64/native/lib/libc10.dylib.gz ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting jnilib/osx-aarch64/cpu/libdjl_torch.dylib to cache ...
[main] INFO ai.djl.pytorch.engine.PtEngine - PyTorch graph executor optimizer is enabled, this may impact your inference latency and throughput. See: https://docs.djl.ai/docs/development/inference_performance_optimization.html#graph-executor-optimization
[main] INFO ai.djl.pytorch.engine.PtEngine - Number of inter-op threads is 10
[main] INFO ai.djl.pytorch.engine.PtEngine - Number of intra-op threads is 10
Engine is PyTorch:2.1.1, capabilities: [
]
PyTorch Library: /var/folders/5f/l578vlrn29gd42rdsqbsv1380000gn/T/djl-cache6785785179087611189/pytorch/2.1.1-cpu-osx-aarch64

Expected Behavior

It should be possible to query programmatically whether an engine exists without immediately prompting the download.

Any failure to download the engine (because of offline mode) shouldn't prevent requesting the engine later.

What have you tried to solve it?

I can't see a workaround in v0.25, so we will likely need to stay with v0.24.

It seems that InstanceHolder can't be initialized in offline mode if the engine isn't available, so the class then isn't available.

Whenever that happens, all subsequent attempts fail with NoClassDefFoundError even if System.setProperty("offline", "false"); has been called.

Restoring the behavior prior to #2826 would enable our previous workaround to work, although I think a cleaner solution would be to provide an API to query a model that doesn't prompt a download.

I am reluctant to rely upon "offline" as a system property, and share @docent's opinion at #2632 (comment) that an alternative name would help avoid potential clashes.

Environment Info

I'm using Java 17 on Apple Silicon & can provide further detail if needed, but I expect the issue to affect all platforms.

@petebankhead petebankhead added the bug Something isn't working label Nov 26, 2023
frankfliu added a commit to frankfliu/djl that referenced this issue Nov 26, 2023
@frankfliu
Copy link
Contributor

I create an PR to address your offline key concern: #2877

@petebankhead
Copy link
Contributor Author

Thanks @frankfliu for looking into this so quickly.

I should clarify that PyTorch isn't the only engine affected - I expect that all that support downloading are affected. If I switch PyTorch to TensorFlow in my example code the same problem occurs:

Creating cache dir /var/folders/5f/l578vlrn29gd42rdsqbsv1380000gn/T/djl-cache16194401268772114452
Error getting engine offline: java.lang.ExceptionInInitializerError
Error getting engine online: java.lang.NoClassDefFoundError: Could not initialize class ai.djl.tensorflow.engine.TfEngineProvider$InstanceHolder
Engine is null

Therefore this fix only helps with PyTorch, but it's still not possible to query all engines offline without risking that all access to the engine will be broken through NoClassDefFoundError.

Is it possible to roll back #2826 for all engines, or address this another way?

petebankhead added a commit to petebankhead/qupath that referenced this issue Nov 29, 2023
@zachgk
Copy link
Contributor

zachgk commented Nov 29, 2023

One thing you can take a look at is the DJL Engine.registerEngine(EngineProvider). This will allow you to manually register an engine. For a more complete usage of this, take a look at the DJL Serving Dependency Manager. The dependency manager is used to add engines (downloading the jars, loading them into the JVM, then registering the engine) as models are added into the model server. This sounds like it might do most of what you are looking for

@petebankhead
Copy link
Contributor Author

Thanks @zachgk I hadn't seen DJL Serving Dependency Manager - this looks like it could be a good way for us to add additional engines in the future.

I don't think I can use it to solve the problem here. We provide PyTorch and TensorFlow engines (although developers can build from source with other engines if needed) using the recommended pytorch-engine and tensorflow-engine.

We then provide a dialog showing potentially-available engines, so the user can manually request them to be downloaded (which temporarily switches the off "offline" system property, then simply request the engine again):

In the screenshot, the TensorFlow native libraries have been downloaded by the PyTorch ones have not.

Our problem is that we can only know that an engine is really available (i.e. natives available) by querying it through Engine.getEngine(name).

In 0.24.0 we could query with offline=true, catch an exception if it fails. This meant we could update the status, and allow the user to explicitly request the engine is downloaded if they want.

In 0.25.0 we can't, because querying with offline=true results in InstanceLoader becoming unloadable if the native libraries are missing. Then the engine becomes permanently unavailable.

#2876 addresses this for PyTorch, but not for TensorFlow. I'm not sure if other engines are affected by this issue. I realize we're doing things in a non-standard way, but it feels like a bug for a failed request for Engine.getEngine(name) to leave things in an invalid state that prevents the engine from being loaded later.

@frankfliu
Copy link
Contributor

frankfliu commented Dec 3, 2023

@petebankhead

We can make changes for TensorFlow. but changing system properties at runtime isn't thread-safe and system properties are global, it may impact other external modules. The Engine is meant for singleton. Rely on a lazy load singleton implementation is not a good idea.

Instead of try and fail approach, can you check the cache folder before you load Engine. If the cache directory is not exsist, you prompt user to download. So you wan't run into issue.

@petebankhead
Copy link
Contributor Author

Thanks @frankfliu I understand there are ways to break the use of the system property, but I couldn't find a better way. The calls in our app are through a method that uses try/finally and synchronisation to try to reduce the risk of things going wrong.

I don't know how to check for the correct cache directory existing, because I don't see a way to determine which version and flavor of any specific engine will be required in advance. And even if I knew the cache directory, I wouldn't know in advance that it contains all the required dependencies.

In fact, with TensorFlow it is especially challenging if an incompatible CUDA is found - even when the CPU engine has already been downloaded. This seems to be because a call to Engine.toString() fails when offline because it triggers a call to findLibraryInClasspath() with a placeholder platform for the installed CUDA version. This results in requesting a download, which throws an exception if offline. If online, it fails more quietly due to the CUDA incompatibility, and then reverts to recognising the CPU version. So really, I can see no way to determine the TensorFlow engine properly in advance; a download needs to be requested before it can be known what the correct version/flavor combination is - and therefore the cache directory. I encountered this problem in qupath/qupath-extension-djl#13 (comment) (having reverted to 0.24.0).

The ideal solution for us would be to have two Engine.getEngine(String) methods to

  1. Request an engine 'quietly', returning null if it is not downloaded and available
  2. Request an engine with permission to download if required

An option for progress monitoring any download would also be helpful.... but not essential. Having 1. would enable us to know whether calling 2. is likely to return quickly or slowly, and we wouldn't need to work with the system property at all.

If this isn't possible, we may be able to figure out some other approach in our app that informs our users that any use of DJL requires going online. But I think the requirement for 'check an engine and control whether it is downloaded' could be general enough that if it can be solved in DJL itself that would be great.

@frankfliu
Copy link
Contributor

@petebankhead
#2884

@petebankhead
Copy link
Contributor Author

Thanks @frankfliu !

zachgk added a commit to zachgk/djl that referenced this issue Dec 5, 2023
fixes deepjavalibrary#2875
reverts deepjavalibrary#2876

This adds new support for DJL manual initialization of engines to support
`DJL_ENGINE_MANUAL_INIT`. Once done, no engines providers will be found or
loaded on startup. Instead, they can be added manually by:

```java
PtEngineProvider provider = new PtEngineProvider();
provider.getEngine(); // Optional, throws exception if the provider can not load
Engine.registerEngine(provider);
Engine.setDefaultEngine(provider.getEngineName()); // Optional, sets as default
```
@zachgk
Copy link
Contributor

zachgk commented Dec 5, 2023

@petebankhead I think the issue here is that the Engine was not really intended to be used that way. It started out as loading all engines strictly on startup and would throw exceptions immediately. Then we made it lazy for performance, which also accidentally means it could handle that case. But it wasn't really designed for it.

I opened up a PR #2885 to support it better. Basically, it takes out the manual initialization so you can intentionally trigger the engine registration. You can also verify that the engines can initialize properly and then only perform the engine registration afterwards

@petebankhead
Copy link
Contributor Author

Excellent, thank you @zachgk!

zachgk added a commit that referenced this issue Jan 9, 2024
* Creates DJL manual engine initialization

fixes #2875
reverts #2876

This adds new support for DJL manual initialization of engines to support
`DJL_ENGINE_MANUAL_INIT`. Once done, no engines providers will be found or
loaded on startup. Instead, they can be added manually by:

```java
PtEngineProvider provider = new PtEngineProvider();
provider.getEngine(); // Optional, throws exception if the provider can not load
Engine.registerEngine(provider);
Engine.setDefaultEngine(provider.getEngineName()); // Optional, sets as default
```

* Revert "[tensorflow] Revert InstanceHolder for TensorFlow engine (#2884)"

This reverts commit 586bb07.

* Revert "[api] Replace double-check singlton with lazy initialization (#2826)"

This reverts commit 3927867.

* Make engines initialized

This makes several updates:
- engines will now initialize once per instance of EngineProvider rather than
re-attempt to initialize
- Registering an engine can overwrite the existing one
- All engines now use the synchronized form rather than the static instance
holder. This allows them to have multiple versions and be local to the instance
rather than global (but the instance is saved globally)

* Throws Exception on bad getEngine

* Removes unnecessary check
frankfliu pushed a commit that referenced this issue Apr 26, 2024
* Creates DJL manual engine initialization

fixes #2875
reverts #2876

This adds new support for DJL manual initialization of engines to support
`DJL_ENGINE_MANUAL_INIT`. Once done, no engines providers will be found or
loaded on startup. Instead, they can be added manually by:

```java
PtEngineProvider provider = new PtEngineProvider();
provider.getEngine(); // Optional, throws exception if the provider can not load
Engine.registerEngine(provider);
Engine.setDefaultEngine(provider.getEngineName()); // Optional, sets as default
```

* Revert "[tensorflow] Revert InstanceHolder for TensorFlow engine (#2884)"

This reverts commit 586bb07.

* Revert "[api] Replace double-check singlton with lazy initialization (#2826)"

This reverts commit 3927867.

* Make engines initialized

This makes several updates:
- engines will now initialize once per instance of EngineProvider rather than
re-attempt to initialize
- Registering an engine can overwrite the existing one
- All engines now use the synchronized form rather than the static instance
holder. This allows them to have multiple versions and be local to the instance
rather than global (but the instance is saved globally)

* Throws Exception on bad getEngine

* Removes unnecessary check
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants