Cannot run model using tensorflow-text ops #82
It's probably because we're building against CentOS, which uses the old C++ ABI of GCC.
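For context on the ABI question (this flag comes from TensorFlow's public build documentation, not from this thread): official Linux packages are built against the pre-GCC 5 ABI, which a downstream build can match explicitly:

```shell
# Build libtensorflow with the old C++ ABI to match the official packages
# (flag from TensorFlow's build docs; adjust the target to your setup).
bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" \
    //tensorflow/tools/lib_package:libtensorflow
```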
It seems the TensorFlow Python libraries are built with GCC 4 and use the old ABI as well, and tensorflow-text works with that, so I don't think the ABI is the problem here. Through debugging I found the error is caused by different symbols being referenced at run time for the hash_code check at https://github.com/tensorflow/tensorflow/blob/2b96f3662bd776e277f86997659e61046b56c315/tensorflow/core/framework/resource_mgr.h#L721 In my example the hash_code in the resource handle is the address of the hash_bit symbol in libtensorflow.so.2, but the newly created TypeIndex hash_code resolves to the address of the hash_bit symbol in libtensorflow_framework.so.2. All libraries are loaded up front, so naively I would have expected them to consistently resolve to the same hash_bit symbol, but evidently that is not the case here.
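The failure mode described above can be sketched in a few lines (a simplified model, not TensorFlow code: Python's `id()` stands in for the address of the per-library `hash_bit` static):

```python
class TypeIndex:
    """Simplified stand-in for tensorflow::TypeIndex, which derives a type's
    hash_code from the address of a per-type static variable (hash_bit)."""

    def __init__(self, hash_bit):
        self.hash_code = id(hash_bit)  # the "address" of the static

# If the hash_bit symbol is not shared between shared objects, each library
# ends up with its own copy -- modeled here as two distinct sentinel objects:
hash_bit_in_libtensorflow = object()   # copy in libtensorflow.so.2
hash_bit_in_framework = object()       # copy in libtensorflow_framework.so.2

stored = TypeIndex(hash_bit_in_libtensorflow).hash_code   # kept in the handle
looked_up = TypeIndex(hash_bit_in_framework).hash_code    # computed at lookup

# The equality check in resource_mgr.h fails even though the type is
# logically the same:
print(stored == looked_up)  # False
```

With a single shared copy of `hash_bit`, both `TypeIndex` constructions would see the same address and the check would pass, which is why the symbol-resolution detail matters.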
It seems to come down to how bindings are set up when loading the shared objects. Using
When running the model via Python, it doesn't bind to libtensorflow_framework but to _pywrap_tensorflow_internal.so (which, I guess, basically serves the role of libtensorflow for Python):
One difference seems to be that the tensorflow libraries are loaded with RTLD_GLOBAL in Python, which doesn't seem to be the case currently with the Java libs. I tested this by appending '!' to the library names in the TensorFlow JavaCPP preset properties and rebuilding, but that didn't help (I also tried to load the libraries directly with
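For reference, the CPython side of this: the dlopen flags used for extension modules come from `sys.setdlopenflags()`, and forcing `RTLD_GLOBAL` is the standard-library way to make one extension's symbols visible to libraries loaded afterwards. A minimal illustration (standard library only; the commented `import tensorflow` marks where it would take effect):

```python
import os
import sys

# Extension modules imported after this point are dlopen()ed with
# RTLD_GLOBAL, so their symbols can satisfy references from libraries
# loaded later (e.g. a custom op .so resolving TensorFlow symbols).
old_flags = sys.getdlopenflags()
sys.setdlopenflags(old_flags | os.RTLD_GLOBAL)
try:
    pass  # import tensorflow  # would now be loaded with global visibility
finally:
    sys.setdlopenflags(old_flags)  # restore the default for later imports
```

This is only an illustration of the loader behavior the comment describes, not a suggested fix for the Java bindings.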
It seems like the difference is that libtensorflow only exports TF_* and TFE_* symbols globally, while the Python library exports many more symbols (including the hash_bit symbols). When I modify the build of libtensorflow to export the same symbols as the Python shared library, it works. This seems to be more of a general issue with libtensorflow, since I guess anyone using it directly might run into similar problems with custom ops, and exporting all symbols doesn't feel like the best solution. But until that is fixed upstream, it is trivial to apply this change as a patch when building the Java libraries:
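For readers unfamiliar with how this export list is controlled: on Linux, libtensorflow's exported symbols are driven by a linker version script, which by default only matches the C API prefixes. A sketch of the kind of broadened script the comment implies (the added pattern is an illustrative assumption, not the exact patch):

```
VERS_1.0 {
  global:         /* symbols visible to other shared objects */
    *TF_*;
    *TFE_*;
    /* hypothetical addition so custom op libraries can resolve
       internal C++ symbols such as hash_bit: */
    *tensorflow*;
  local:          /* everything else stays hidden */
    *;
};
```

The script is passed to the linker via `--version-script=...`; anything matched under `global:` becomes part of the shared object's dynamic symbol table.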
I can put together a PR for it as well, of course, if you want.
We should find out why those symbols aren't exported. I suspect it's because the Python API is using things that aren't considered part of the C API. It is a little odd to me that other native libraries depend on those symbols. Maybe the modular TensorFlow people know more?
While on the Java side we restrict our development to the C API, we cannot prevent maintainers of these custom op libraries from going beyond it. So if we want to support them, it looks like we need to export these symbols. What are the side effects of this, @samuelotter? Does it increase the TF binary size significantly?
It shouldn't increase the size of the binaries in any material way, and while I haven't built and tested all targets, the ones I have built show no significant difference.

I'm not an expert on custom ops, but it seems they are usually implemented in C++ directly against the C++ API right now, which is not ABI stable. I'm not even sure the C API currently exposes everything you need to implement many custom ops. There seems to be ongoing work to provide an ABI-stable way of doing this by wrapping the C APIs, but that doesn't appear to be fully complete yet. I think chances are high that many custom ops would not work out of the box with libtensorflow right now, which is unfortunate.

Working with this has made me think about how using custom ops with the Java bindings could be improved long term. Even if this problem is solved, getting custom ops to work with the Java bindings won't be a great experience: since most custom ops are primarily distributed as Python packages, you either have to install the Python libraries and find the location of the custom op library paths on your system, or do what I've done and extract the shared objects from the Python wheels and repackage them in a jar. Neither option is very elegant.

I guess the ideal scenario would be for popular custom op libraries to also be wrapped and distributed as Java libraries. I'm not sure how to get there, but a start may be to provide some sort of template project that could be used as a starting point for wrapping a custom op library, to at least reduce the barrier somewhat. TensorFlow Serving comes with a bunch of commonly used/popular custom ops supported out of the box; in a similar vein, it might be worth considering maintaining packages for a few of the more popular ones (such as tensorflow-text and tensorflow-io, which are also TensorFlow projects but not in the core repo) under the umbrella of this project?
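As an aside on the wheel-repackaging workaround mentioned above: since wheels are plain zip archives, pulling the native op libraries out of one is straightforward (a generic sketch; the file names in the comments are illustrative):

```python
import pathlib
import zipfile

def extract_native_libs(wheel_path, out_dir):
    """Extract every shared object from a wheel (wheels are zip archives)."""
    extracted = []
    with zipfile.ZipFile(wheel_path) as whl:
        for name in whl.namelist():
            if name.endswith(".so"):
                whl.extract(name, out_dir)
                extracted.append(pathlib.Path(out_dir, name))
    return extracted

# e.g. extract_native_libs("tensorflow_text-<version>-....whl", "native/")
# and then repackage the resulting .so files into a jar.
```

This only automates the manual step described in the comment; the extracted libraries still have to be loaded into the JVM process at run time.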
I share your point of view here. Ideally, it should be distributed directly by the providers of these ops, and we should make it easy for them to do it. Basically, we could just let the providers define an API def of their ops like those ones and let them use the same generator we use to create concrete Java wrappers, which could then be packaged with Maven. There has been great progress to support custom ops from the C API with the addition of the Kernel endpoints. But probably these libraries were never updated to make use of them; could that also be a contribution we do for them? This way, we won't have these symbol export issues. But I don't know how complex this task is.
@samuelotter BTW, we would also need to modify exported_symbols.lds for Mac and something else for this to work on Windows. |
It looks like the
@samuelotter, now that we have applied these changes, can you please validate that you can now successfully load
It works now. Thanks for looking into this and fixing it!
System information
Describe the current behavior
It crashes with the following output:
Describe the expected behavior
Successful execution of the model.
Code to reproduce the issue
The following script can be used to generate a minimal saved model triggering the problem:
Other info / logs
There is an issue in tensorflow-text (tensorflow/text#272) where the same thing happens on macOS (while this is on Linux). However, this model does work on Linux when Python is used to load and execute it, so the root cause is most likely different. Looking at the fix for the macOS issue (tensorflow/tensorflow@1823f87#diff-991a6b786e16708ba1e6f5c9926cf151) makes me suspect that this may be caused by type ids being generated differently, due to tensorflow-java building the native TensorFlow libs separately in a slightly different way than the Python libraries.