Cannot run model using tensorflow-text ops #82
It's probably because we're building against CentOS, which uses the old C++ ABI of GCC.
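For context on the ABI question (this flag comes from TensorFlow's public build documentation, not from this thread): official Linux packages are built against the pre-GCC 5 ABI, which a downstream build can match explicitly:

```shell
# Build libtensorflow with the old C++ ABI to match the official packages
# (flag from TensorFlow's build docs; adjust the target to your setup).
bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" \
    //tensorflow/tools/lib_package:libtensorflow
```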
It seems the TensorFlow Python libraries are built with GCC 4 and use the old ABI as well, and tensorflow-text works with that, so I don't think the ABI is the problem here. Through debugging I found the error is caused by different symbols being referenced at run time for the hash_code check at https://github.com/tensorflow/tensorflow/blob/2b96f3662bd776e277f86997659e61046b56c315/tensorflow/core/framework/resource_mgr.h#L721 In my example the hash_code in the resource handle is the address of the hash_bit symbol in libtensorflow.so.2, but the newly created TypeIndex hash_code resolves to the address of the hash_bit symbol in libtensorflow_framework.so.2. All libraries are loaded up front, so naively I would have expected them to consistently resolve to the same hash_bit symbol, but evidently that is not the case here.
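The failure mode described above can be sketched in a few lines (a simplified model, not TensorFlow code: Python's `id()` stands in for the address of the per-library `hash_bit` static):

```python
class TypeIndex:
    """Simplified stand-in for tensorflow::TypeIndex, which derives a type's
    hash_code from the address of a per-type static variable (hash_bit)."""

    def __init__(self, hash_bit):
        self.hash_code = id(hash_bit)  # the "address" of the static

# If the hash_bit symbol is not shared between shared objects, each library
# ends up with its own copy -- modeled here as two distinct sentinel objects:
hash_bit_in_libtensorflow = object()   # copy in libtensorflow.so.2
hash_bit_in_framework = object()       # copy in libtensorflow_framework.so.2

stored = TypeIndex(hash_bit_in_libtensorflow).hash_code   # kept in the handle
looked_up = TypeIndex(hash_bit_in_framework).hash_code    # computed at lookup

# The equality check in resource_mgr.h fails even though the type is
# logically the same:
print(stored == looked_up)  # False
```

With a single shared copy of `hash_bit`, both `TypeIndex` constructions would see the same address and the check would pass, which is why the symbol-resolution detail matters.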
It seems to come down to how bindings are set up when loading the shared objects. Using
When running the model via Python, it doesn't bind to libtensorflow_framework but to _pywrap_tensorflow_internal.so (which, I guess, basically serves the role of libtensorflow for Python):
One difference seems to be that the tensorflow libraries are loaded with RTLD_GLOBAL in Python, which doesn't seem to be the case currently with the Java libs. I tested this by appending '!' to the library names in the TensorFlow JavaCPP preset properties and rebuilding, but that didn't help (I also tried to load the libraries directly with
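For reference, the CPython side of this: the dlopen flags used for extension modules come from `sys.setdlopenflags()`, and forcing `RTLD_GLOBAL` is the standard-library way to make one extension's symbols visible to libraries loaded afterwards. A minimal illustration (standard library only; the commented `import tensorflow` marks where it would take effect):

```python
import os
import sys

# Extension modules imported after this point are dlopen()ed with
# RTLD_GLOBAL, so their symbols can satisfy references from libraries
# loaded later (e.g. a custom op .so resolving TensorFlow symbols).
old_flags = sys.getdlopenflags()
sys.setdlopenflags(old_flags | os.RTLD_GLOBAL)
try:
    pass  # import tensorflow  # would now be loaded with global visibility
finally:
    sys.setdlopenflags(old_flags)  # restore the default for later imports
```

This is only an illustration of the loader behavior the comment describes, not a suggested fix for the Java bindings.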
It seems like the difference is that libtensorflow only exports TF_* and TFE_* symbols globally, while the Python library exports many more symbols (including the hash_bit symbols). When I modify the build of libtensorflow to export the same symbols as the Python shared library, it works. This seems to be more of a general issue with libtensorflow, since I guess anyone using it directly might run into similar problems with custom ops, and exporting all symbols doesn't feel like the best solution. But until that is fixed upstream, it is trivial to apply this change as a patch when building the Java libraries:
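For readers unfamiliar with how this export list is controlled: on Linux, libtensorflow's exported symbols are driven by a linker version script, which by default only matches the C API prefixes. A sketch of the kind of broadened script the comment implies (the added pattern is an illustrative assumption, not the exact patch):

```
VERS_1.0 {
  global:         /* symbols visible to other shared objects */
    *TF_*;
    *TFE_*;
    /* hypothetical addition so custom op libraries can resolve
       internal C++ symbols such as hash_bit: */
    *tensorflow*;
  local:          /* everything else stays hidden */
    *;
};
```

The script is passed to the linker via `--version-script=...`; anything matched under `global:` becomes part of the shared object's dynamic symbol table.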
I can put together a PR for it as well, of course, if you want.
We should find out why those symbols aren't exported. I suspect it's because the Python API is using things that aren't considered part of the C API. It is a little odd to me that other native libraries depend on those symbols. Maybe the modular TensorFlow people know more?
While on the Java side we restrict our development to the C API, we cannot prevent maintainers of these custom op libraries from going beyond it. So if we want to support them, it looks like we need to export these symbols. What are the side effects of this, @samuelotter? Does it increase the TF binary size significantly?
It shouldn't increase the size of the binaries in any material way, and while I haven't built and tested all targets, the ones I have built show no significant difference.

I'm not an expert on custom ops, but it seems they are usually implemented in C++ directly against the C++ API right now, which is not ABI stable. I'm not even sure the C API currently exposes everything you need to implement many custom ops. There seems to be ongoing work to provide an ABI-stable way of doing this by wrapping the C APIs, but that doesn't appear to be fully complete yet. I think chances are high that many custom ops would not work out of the box with libtensorflow right now, which is unfortunate.

Working with this has made me think about how using custom ops with the Java bindings could be improved long term. Even if this problem is solved, getting custom ops to work with the Java bindings won't be a great experience: since most custom ops are primarily distributed as Python packages, you either have to install the Python libraries and find the location of the custom op library paths on your system, or do what I've done and extract the shared objects from the Python wheels and repackage them in a jar. Neither option is very elegant.

I guess the ideal scenario would be for popular custom op libraries to also be wrapped and distributed as Java libraries. I'm not sure how to get there, but a start may be to provide some sort of template project that could be used as a starting point for wrapping a custom op library, to at least reduce the barrier somewhat. TensorFlow Serving comes with a bunch of commonly used/popular custom ops supported out of the box; in a similar vein, it might be worth considering maintaining packages for a few of the more popular ones (such as tensorflow-text and tensorflow-io, which are also TensorFlow projects but not in the core repo) under the umbrella of this project?
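As an aside on the wheel-repackaging workaround mentioned above: since wheels are plain zip archives, pulling the native op libraries out of one is straightforward (a generic sketch; the file names in the comments are illustrative):

```python
import pathlib
import zipfile

def extract_native_libs(wheel_path, out_dir):
    """Extract every shared object from a wheel (wheels are zip archives)."""
    extracted = []
    with zipfile.ZipFile(wheel_path) as whl:
        for name in whl.namelist():
            if name.endswith(".so"):
                whl.extract(name, out_dir)
                extracted.append(pathlib.Path(out_dir, name))
    return extracted

# e.g. extract_native_libs("tensorflow_text-<version>-....whl", "native/")
# and then repackage the resulting .so files into a jar.
```

This only automates the manual step described in the comment; the extracted libraries still have to be loaded into the JVM process at run time.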
I share your point of view here. Ideally, it should be distributed directly by the providers of these ops, and we should make it easy for them to do it. Basically, we could just let the providers define an API def of their ops like those ones and let them use the same generator we use to create concrete Java wrappers, which could then be packaged with Maven. There has been great progress to support custom ops from the C API with the addition of the Kernel endpoints. But probably these libraries were never updated to make use of them; could that also be a contribution we do for them? This way, we won't have these symbol export issues. But I don't know how complex this task is.
@samuelotter BTW, we would also need to modify exported_symbols.lds for Mac and something else for this to work on Windows. |
It looks like the
@samuelotter, now that we have applied these changes, can you please validate that you can now successfully load
It works now. Thanks for looking into this and fixing it!
System information
Describe the current behavior
It crashes with the following output:
Describe the expected behavior
Successful execution of the model.
Code to reproduce the issue
The following script can be used to generate a minimal saved model triggering the problem:
Other info / logs
There is an issue in tensorflow-text (tensorflow/text#272) where the same thing happens on macOS (while this is on Linux). However, this model does work on Linux when Python is used to load and execute it, so the root cause is most likely different. Looking at the fix for the macOS issue (tensorflow/tensorflow@1823f87#diff-991a6b786e16708ba1e6f5c9926cf151) makes me suspect that this may be caused by type ids being generated differently, due to tensorflow-java building the native TensorFlow libs separately in a slightly different way than the Python libraries.