[RFC] Switching to JNI for Tensorflow engine #625
Replies: 6 comments 9 replies
-
Hello, JavaCPP author here! I'm sorry you think of JavaCPP as a "black box", but I would be happy to support your needs in any way that is required. JavaCPP is basically to Java what Cython or pybind11 with setuptools is to Python. There is no reason to hack JNI manually in your build system. For reference, a minimally wrapped C API, in the case of MXNet for example, looks like what I did for this pull request: apache/mxnet#19797.
That's unrelated to JNI or JavaCPP and should be fixed with TF 2.4.x, see this pull request: tensorflow/java#212.
To disable using the GC, we can set the "org.bytedeco.javacpp.nopointergc" system property to "true", and that's it!
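A minimal sketch of setting that property from Java code (the class name here is illustrative; in practice you would more likely pass `-Dorg.bytedeco.javacpp.nopointergc=true` on the JVM command line, since the property must be read before any JavaCPP `Pointer` class is loaded):

```java
// Sketch: disable JavaCPP's deallocator GC thread via a system property.
// Must run before any JavaCPP Pointer class is initialized, so doing it
// at JVM startup (-D flag) is the safest option.
class DisableJavaCppGc {
    static void apply() {
        System.setProperty("org.bytedeco.javacpp.nopointergc", "true");
    }

    public static void main(String[] args) {
        apply();
        System.out.println(System.getProperty("org.bytedeco.javacpp.nopointergc"));
    }
}
```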
We're only having problems securing compute resources. Neither Google nor Oracle is being too helpful; our requests have gone nowhere for over a year now. If Amazon can help us get access to hardware more quickly, we would be glad to use it! /cc @karllessard
-
It's also worth noting that SIG-JVM has spent its time focusing on adding features and improving compatibility with TF 2 in Python: things like training, SavedModel, and function support. We've also been working on making it more type-safe and improving its usability from Java. We've not got around to the performance work yet, but as we start to reach feature parity we can move on to looking at performance.
-
TF Java in the main TF repository (the 1.x series) is no longer maintained and you should not rely on it. The new version maintained by SIG JVM is still in pre-release mode and, as @Craigacp pointed out, current efforts are aimed more at the API and functionality than at performance right now. But we know for certain that performance will need to be addressed before we can make any official release of the library. Since both TF Java and JavaCPP are projects open to contributions, it would be very valuable if we could merge our efforts and try to identify the problems at the source instead of writing workarounds in the libraries consuming them, like DJL (I personally do not recommend writing your own JNI layer, as TensorFlow's C ABI is becoming far more complex than it was in the 1.x days). @saudet already mentioned that we are currently upgrading to TF 2.4, which could be a good starting point for investigating these core issues more deeply, and we would certainly benefit from the help of external contributors as well. Still, I suspect there won't be a lot of work required to reach the performance we are looking for.
-
Thanks @saudet for being active and @karllessard for pointing out EagerSession in tensorflow/java#208 (comment). Here are the issues we are facing.
@saudet we are also interested in how JavaCPP does performance optimization such that it is better than hand-written JNI; our PyTorch JNI might benefit from it. Again, thanks for everyone's feedback. It is all valuable to the DJL team. We can also schedule a meeting. Let me know what time works best for you.
-
@saudet
-
We have identified a potential issue in TF Java that could result in OOM when eager sessions remain alive for a relatively long time. There is actually a PR open to fix it, and I think this solution could also resolve some issues that were previously observed when using DJL with TensorFlow. Now, TF Java offers very lightweight bindings for interacting with the TensorFlow runtime in a Java-idiomatic way. While nothing prevents users from directly calling the JavaCPP wrappers inside the library to access TensorFlow's raw ABI, TF Java is not designed nor maintained for this purpose, and the code generated by JavaCPP, always subject to change, is reserved for internal usage. I would like to reiterate my previous suggestion that we try to solve any problems with the integration of TF Java in DJL by investigating (and potentially fixing) the issues in both libraries first, before attempting a drastic change of design as described in the current proposal.
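To illustrate why long-lived eager sessions can accumulate native memory, here is a minimal sketch of deterministic resource release with try-with-resources. The `FakeTensor` class is hypothetical, not the real TF Java API; the point is that an `AutoCloseable` scope frees a native-backed resource at a known moment, instead of waiting for a GC-driven cleanup that may never run in time:

```java
// Sketch (hypothetical FakeTensor, not the real TF Java API): scoping a
// native-backed resource with try-with-resources releases it
// deterministically, independent of when the GC runs.
class FakeTensor implements AutoCloseable {
    static int live = 0;                        // counts unreleased tensors

    FakeTensor() { live++; }                    // stand-in for native allocation

    @Override
    public void close() { live--; }             // stand-in for native release
}

class EagerScopeSketch {
    static void run() {
        try (FakeTensor t = new FakeTensor()) {
            // ... use t inside the eager computation ...
        } // released here, at end of scope
    }
}
```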
-
Background
In the following doc, TensorFlow Java (or tf java) refers to the Java language binding in the TensorFlow repo, and sig-jvm refers to the new Java sub-repo outside TensorFlow that uses JavaCPP. TensorFlow supports two executors. TensorFlow 1.x only supports the symbolic graph executor, called GraphExecutor. Since TensorFlow 2.0 launched, the recommended executor is EagerExecutor, which runs operations imperatively. In terms of the bridge library, TensorFlow Java uses JNI while sig-jvm uses JavaCPP.
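The difference between the two executors can be sketched with toy classes (these are illustrative stand-ins, not the real TensorFlow executors): the graph executor records operations and runs them later as a batch, while the eager executor runs each operation immediately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntSupplier;

// Toy sketch of the two execution styles (not real TensorFlow classes).
class ToyGraphExecutor {
    private final List<IntSupplier> ops = new ArrayList<>();

    void add(IntSupplier op) { ops.add(op); }   // deferred: just record the op

    int run() {                                  // execute the whole graph later
        int last = 0;
        for (IntSupplier op : ops) last = op.getAsInt();
        return last;
    }
}

class ToyEagerExecutor {
    int run(IntSupplier op) { return op.getAsInt(); } // immediate execution
}
```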
Problem
As more and more customers onboard with DJL TensorFlow, we have run into several issues. The first is the build process for the TensorFlow native binaries. When upgrading to the latest TensorFlow version, it is hard for us to adjust the native code (such as the C APIs) with our limited TensorFlow expertise. JavaCPP is yet another black box for us; it has to build the Java and C++ code together due to its limitations. Similarly, the libtensorflow built by sig-jvm currently sits on top of an old MKL-DNN, which degrades performance by over 50% (see reference issue). We worked around that issue by removing MKL-DNN. The other issue is GC. Our customer Stan from Netflix reported a performance problem caused by JavaCPP. We have no control over JavaCPP and do not deeply understand how it works. All of these factors make it difficult for us to provide optimal TensorFlow libraries to customers. In this doc, we will revisit TensorFlow Java and propose several solutions.
Overview
TensorFlow Java
The TensorFlow Java package includes two parts: operators and the other sources. All operators are generated by JavaPoet and the op_gen tool. I will dive deeper into this in the section TensorFlow Java Operators Generator.
src/main/java/org/tensorflow contains the basic Java and utility classes, everything except the operators themselves. Example classes are Graph, Session, EagerOperation, Tensor, DataType, and NativeLibrary, which loads the JNI library. The entry points into C++ code (function signatures marked native) are also included here. There is another directory, native, under main; it mainly holds the JNI C++ code that interacts with the TensorFlow C API. There is an example that demos simple image classification with the tf java 1.x graph executor.
Sig-JVM
Sig-JVM has three packages. tensorflow-core provides low-level libraries similar to TensorFlow Java. tensorflow-framework offers higher-level APIs, like the DJL API package. ndarray is a utility for tensors and data I/O. Currently DJL only depends on tensorflow-core, so we will focus on it. In tensorflow-core, sig-jvm also copied a fair amount of Java code from the original TensorFlow repo. They replaced the JNI layer with JavaCPP mapping code. They are also adding new things, like more data types and integration code with their ndarray package.
Similarly, they use the TensorFlow Java Operators Generator and keep the tool code in tensorflow-core-api/src/bazel/op_generator/. The generated operators are all checked in under tensorflow-core-api/src/gen/java/org/tensorflow/op/.
TensorFlow Java Operators Generator
Both TensorFlow Java and sig-jvm use generated operators. The source code is in tensorflow/java/src/gen. Two pieces are involved: a Java class, OperatorProcessor.java, which uses JavaPoet, and a binary, gen_op, written in C++. The operators are produced at compile time: gen_op is built first, and the generated classes are then packed into the libtensorflow jar along with the other components. The operator definitions can be found in tensorflow/core/api_def/java_api.
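The generation step can be sketched in miniature (this uses plain string formatting rather than the real JavaPoet pipeline, and the emitted wrapper shape is illustrative, not the exact checked-in code): the generator reads an op name from the definition and emits a typed Java method around the builder API.

```java
// Sketch (plain String formatting, not the real JavaPoet/gen_op pipeline):
// turn an op definition into the source of a Java wrapper method.
class OpGenSketch {
    static String generate(String opName, String inputName) {
        return String.format(
            "public static Operand %s(Operand %s) {%n"
          + "  return scope.opBuilder(\"%s\")%n"
          + "      .addInput(%s)%n"
          + "      .build().output(0);%n"
          + "}%n",
            opName.toLowerCase(), inputName, opName, inputName);
    }
}
```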
Take concat, for instance. Here is the original proto txt file.
The generated ops on the Java side are actually operator functions that wrap around OperationBuilder.
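A toy sketch of what such a wrapper boils down to (the builder and the concat function here are illustrative stand-ins, not the real org.tensorflow.OperationBuilder or the generated Concat class): each generated op function just feeds the op type and its inputs into a builder and returns the result.

```java
import java.util.ArrayList;
import java.util.List;

// Toy builder, a stand-in for org.tensorflow.OperationBuilder.
class ToyOperationBuilder {
    private final String type;
    private final List<String> inputs = new ArrayList<>();

    ToyOperationBuilder(String type) { this.type = type; }

    ToyOperationBuilder addInput(String in) { inputs.add(in); return this; }

    String build() { return type + inputs; }    // stand-in for a built Operation
}

class ConcatOpSketch {
    // What a generated "concat" wrapper function boils down to.
    static String concat(String a, String b, String axis) {
        return new ToyOperationBuilder("Concat")
            .addInput(a).addInput(b).addInput(axis)
            .build();
    }
}
```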
We could actually get rid of the Java layer and implement the logic in TfNDArray.java.
The real JNI call is
Proposed Solutions
1. Switch to TensorFlow Java
We copy the JNI source code and whatever Java classes we need for DJL, then own and maintain them. As a result, we have full control over the source code and can solve any memory issue in the JNI layer. Building a custom libtensorflow should also be straightforward following the official doc. We might consider getting rid of the operators and generator and doing it the way the MXNet op builder does.
pros:
cons:
2. Stick with sig-jvm but workaround or fork JavaCpp
We study the JavaCPP source code to understand what mechanism it uses to release native resources, and work around it on our side, like what we did to reuse objects over JNA for the performance ads team. If that doesn't work out, we can fork TensorFlow Java or even JavaCPP and adjust the source code to meet our customers' needs. The benefit is that we stay close to sig-jvm and can leverage their efforts. But we might end up making lots of changes, and the time spent could exceed that of approach 1.
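The object-reuse idea mentioned above can be sketched as a simple pool (this is a hypothetical illustration of the technique, not the actual JNA workaround we shipped): recycling wrapper objects avoids allocating a fresh native-handle wrapper per call, which reduces GC pressure.

```java
import java.util.ArrayDeque;

// Sketch (hypothetical pool illustrating the object-reuse workaround):
// recycle handle wrappers instead of allocating one per native call.
class HandlePool {
    private final ArrayDeque<long[]> free = new ArrayDeque<>();
    int allocations = 0;                         // for observing reuse

    long[] acquire() {                           // reuse a wrapper if available
        long[] h = free.poll();
        if (h == null) {
            allocations++;                       // only allocate when pool is empty
            h = new long[1];
        }
        return h;
    }

    void release(long[] h) { free.push(h); }     // return the wrapper for reuse
}
```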
pros:
cons:
Q & A
Q: How do we deal with swig?
SWIG is on its way to being deprecated. It acted as a bridge connecting Python and C++; TensorFlow has now switched to pybind11. For TensorFlow Java, I can't find any *.i files, so the refactoring effort appears to already be done.
Reference Issues