Skip to content

Upgrade for TensorFlow 2.4.1 #212

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Feb 23, 2021

Conversation

saudet
Copy link
Contributor

@saudet saudet commented Feb 10, 2021

This draft PR includes initial changes required just to get the builds going with TF 2.4.1. Some tests are crashing, probably because we still need to update the handling of TF_String as per the breaking change below, but there's probably a bit more to be done.

  • The byte layout for string tensors across the C-API has been updated to match TF Core/C++; i.e., a contiguous array of tensorflow::tstring/TF_TStrings.

https://github.com/tensorflow/tensorflow/releases/tag/v2.4.0

It's quite a big release, and a lot of ops have been added, so even after getting everything fixed up, there's probably more to be done before we can get this merged. @karllessard Please take a look and feel free to push any changes to my fork and/or merge this into a branch of the main repository.

On the flip side, most toolchain workarounds have been moved from configure.py to .bazelrc (so we can clean up build.sh a little), we finally have support for CUDA 11 and cuDNN 8, MKL-DNN 1.x now works on all platforms (Linux, Mac, and Windows) without huge external binaries, and I was able to find what the issue was with CUDA on CentOS. Apparently, it doesn't like its new version of glibc, and we can prevent it from getting installed by calling yum this way:

          GLIBC="glibc glibc-common glibc-devel glibc-headers"
          yum --disablerepo updates -y install $GLIBC
          yum -x "$GLIBC" -y update ...

https://github.com/saudet/tensorflow-java/blob/upgrade-tensorflow-241/.github/workflows/ci.yml#L53-L58
@karllessard Feel free to merge this bit right away for TF 2.3.1!

Verified

This commit was signed with the committer’s verified signature.
lunny Lunny Xiao
@karllessard
Copy link
Collaborator

karllessard commented Feb 11, 2021

Thanks a lot @saudet , let's prioritize this merge and leave 2.3.1 broken for now. I'll pull your branch and see what I can do to fix the string issues + classify the new ops

@karllessard
Copy link
Collaborator

Finally, while I'm building your branch, let's give it a try on 2.3.1... 8b36d7e

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

// #include "tensorflow/core/platform/ctstring_internal.h"

// Initialize a new tstring. This must be called before using any function
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any clue why the documentation of these new string endpoints get all messed up by being aggregated together?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TF_TString type isn't defined in ctstring.h, so we need to include ctstring_internal.h as well, which contains the definitions of those functions too. JavaCPP simply skips the redundant declarations that follow. We can hack something up, but this is something they should fix upstream, and eventually will I assume, so probably not something worth spending time on here.

if ("size" == output.var().name()) {
// reserved name
continue;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not skip the getter of an output that have a reserved name but just rename it in the API def, though I'm not sure why size would cause any problem here. Can you give more context to this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are new ops with outputs named "size". Do whatever you want with them, I just did that to get them to compile.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll take care of it

@saudet
Copy link
Contributor Author

saudet commented Feb 22, 2021

Everything builds and all tests pass, so I believe this is ready to merge, unless we should refine all changes to the ops here as well? Also, the GPU builds on Linux are timing out:
https://github.com/saudet/tensorflow-java/actions/runs/585339990
For the past few days, the normal CPU builds have been taking 4 hours instead of 3, and that would explain why the GPU builds don't finish in 6 hours. Either something's changed with the VMs, but more likely this will pass.

@saudet saudet marked this pull request as ready for review February 22, 2021 11:09
@karllessard
Copy link
Collaborator

I'm actually working on classifying the new ops, which should be completed before merging. I'll make a PR to your branch probably today.

I'm almost done, I just hit a backward-compatibility issue with some of them which I'll duplicate for now but that's something I'd like to bring as a topic for our next session to decide how to handle this properly next time.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
@google-cla
Copy link

google-cla bot commented Feb 22, 2021

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

@karllessard
Copy link
Collaborator

Just pushed the changes, let's see how it goes with the build: https://github.com/tensorflow/java/actions/runs/589500056

@karllessard
Copy link
Collaborator

@googlebot I consent

@karllessard karllessard merged commit 3f0369a into tensorflow:master Feb 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants