Tracker for TensorFlow-DirectML #55226

Open
penpornk opened this issue Mar 14, 2022 · 25 comments
Assignees
Labels
type:others issues not falling in bug, performance, support, build and install or feature

Comments

@penpornk
Member

penpornk commented Mar 14, 2022

cc: @PatriceVignola @wchao1115

This issue tracks pending PRs, issues, and possible cherry-picks necessary for TensorFlow-DirectML for each TF release. Please post a comment with new things to track and I will update this post to reflect the changes.

New PRs:

PRs that need more investigation:

PRs that made it into TF nightly (post r2.10 branch cut):

PRs that made it into TF 2.10:

PRs that made it into TF 2.9:

Closed PR:

@penpornk penpornk added the type:others issues not falling in bug, performance, support, build and install or feature label Mar 14, 2022
@penpornk penpornk self-assigned this Mar 14, 2022
@sushreebarsa sushreebarsa removed their assignment Mar 15, 2022
@PatriceVignola
Contributor

@penpornk How can pluggable devices implement the Assign op which uses the legacy variables? TF_AssignVariable already exists to implement AssignVariableOp which uses resources, but there doesn't seem to be an equivalent for the legacy variables. Should we add a TF_AssignRefVariable function?

@penpornk
Member Author

@PatriceVignola

@penpornk How can pluggable devices implement the Assign op which uses the legacy variables? TF_AssignVariable already exists to implement AssignVariableOp which uses resources, but there doesn't seem to be an equivalent for the legacy variables. Should we add a TF_AssignRefVariable function?

Looping in @wangpengmit for more info on variables.

@wangpengmit
Member

Resource and ref vars are very different, so my first thought is that TF_AssignRefVariable is probably needed.

@PatriceVignola
Contributor

I have a working prototype in a fork that I put in the kernels_experimental header. I can submit a PR if that's ok with everyone.

Also, my understanding is that ref vars are deprecated in TF2 and replaced by resources at the Python API level. Is that correct? Even though they are deprecated, some popular benchmarks like AI-Benchmark run a frozen TF1 model on TF2, which yields bad results for pluggable devices since they don't currently support ref vars.

@wangpengmit
Member

my understanding is that ref vars are deprecated in TF2 and replaced by resources at the Python API level. Is that correct? But even though they are deprecated, some popular benchmarks like AI-Benchmark use a frozen TF1 model when running on TF2.

Yes, ref vars are deprecated at the Python level. Existing TF1 models may still be using them, so TF2 internals still support them.

@penpornk
Member Author

I have a working prototype in a fork that I put in the kernels_experimental header. I can submit a PR if that's ok with everyone.

@PatriceVignola Oh, that's great! Please submit the PR and we can continue the discussion there (e.g., whether this needs to be an RFC, etc). Thank you very much! :)

@PatriceVignola
Contributor

PatriceVignola commented Mar 26, 2022

@penpornk I submitted a PR here: #55379

@PatriceVignola
Contributor

PatriceVignola commented Mar 26, 2022

New PRs to add to the list:

#55381
#55382
#55384
#55385
#55386
#55387

@PatriceVignola
Contributor

3 new PRs:

#55392
#55393
#55395

@PatriceVignola
Contributor

@penpornk A simple PR that was opened a month ago. I think it flew under the radar.

#54746

@PatriceVignola
Contributor

New PR: #55544

@PatriceVignola
Contributor

PatriceVignola commented Apr 8, 2022

New PR: #55557

@PatriceVignola
Contributor

New PR: #55558

@PatriceVignola
Contributor

New PR: #55579

@PatriceVignola
Contributor

@penpornk We noticed that support for TF_VARIANT is not really available to pluggable devices through the API, since it's very hard (or even impossible) to read the contents of a C++ object across an ABI boundary. Even if we used the exact same headers as TensorFlow for all C++ objects, we would most likely also need to use the exact same compiler.

Since some models depend heavily on variant tensors which contain TensorList objects, we'd like to propose 2 new APIs that address this issue:

TF_CAPI_EXPORT extern void TF_AddNVariant(
    TF_OpKernelContext* ctx,
    void (*binaryAddFunc)(TF_OpKernelContext* ctx, const TF_Tensor* a, const TF_Tensor* b, TF_Tensor* out),
    TF_Status* status);

TF_CAPI_EXPORT extern void TF_ZerosLikeVariant(
    TF_OpKernelContext* ctx,
    void (*zerosLikeFunc)(TF_OpKernelContext* ctx, const TF_Tensor* input, TF_Tensor* out),
    TF_Status* status);

Like the TF_AssignVariable and TF_AssignUpdateVariable functions currently available in kernels_experimental.h, TF_AddNVariant and TF_ZerosLikeVariant would let plugins implement these two key operations while treating Variant objects as a black box. The Variant objects would be unwrapped within TensorFlow core, which would then call the user-provided binaryAddFunc or zerosLikeFunc on tensors that are guaranteed to hold primitive types (e.g. TF_FLOAT).

Is this something that the TensorFlow team would like to see an RFC or PR for? For one of the RNN models that we track (Pixel-RNN), we see up to a 5x performance improvement by not having to do those operations on the CPU.

@penpornk
Member Author

@penpornk We noticed that support for TF_VARIANT is not really available to pluggable devices through the API, since it's very hard (or even impossible) to read the contents of a C++ object across an ABI boundary. Even if we used the exact same headers as TensorFlow for all C++ objects, we would most likely also need to use the exact same compiler.

Yes, we currently don't have a generic way to support TF_VARIANT/DT_VARIANT. Even if you use the exact same header, compiler, and compiler flags, it's possible something else could still go wrong. The support so far has been on an op-by-op basis, e.g., Kernel C API extension for Variable ops, TensorList support, and your recent PRs.

Since some models depend heavily on variant tensors which contain TensorList objects, we'd like to propose 2 new APIs that address this issue:

TF_CAPI_EXPORT extern void TF_AddNVariant(
    TF_OpKernelContext* ctx,
    void (*binaryAddFunc)(TF_OpKernelContext* ctx, const TF_Tensor* a, const TF_Tensor* b, TF_Tensor* out),
    TF_Status* status);

TF_CAPI_EXPORT extern void TF_ZerosLikeVariant(
    TF_OpKernelContext* ctx,
    void (*zerosLikeFunc)(TF_OpKernelContext* ctx, const TF_Tensor* input, TF_Tensor* out),
    TF_Status* status);

Like the TF_AssignVariable and TF_AssignUpdateVariable functions currently available in kernels_experimental.h, TF_AddNVariant and TF_ZerosLikeVariant would let plugins implement these two key operations while treating Variant objects as a black box. The Variant objects would be unwrapped within TensorFlow core, which would then call the user-provided binaryAddFunc or zerosLikeFunc on tensors that are guaranteed to hold primitive types (e.g. TF_FLOAT).

Is this something that the TensorFlow team would like to see an RFC or PR for? For one of the RNN models that we track (Pixel-RNN), we see up to a 5x performance improvement by not having to do those operations on the CPU.

Thank you for the suggestion! The team thinks this sounds reasonable. If you already have a prototype code for this, would you mind opening a PR? We can take it from there. (If there are some points that need more discussion, we can start an RFC.)

@PatriceVignola
Contributor

Sure! I created a PR here: #55645

@PatriceVignola
Contributor

2 new PRs:

#55677
#55678

@PatriceVignola
Contributor

@penpornk I have a new PR to track: #56707

Do you want to create a new tracker for TF 2.10, or is this one fine?

@penpornk penpornk changed the title Tracker for TensorFlow-DirectML in TF 2.9.0 Tracker for TensorFlow-DirectML in TF 2.10.0 Jul 8, 2022
@penpornk
Member Author

penpornk commented Jul 8, 2022

@PatriceVignola Thank you! Added to the list.

Do you want to create a new tracker for TF 2.10, or is this one fine?

Let's just reuse this one. :)

@NeilGirdhar
Contributor

Would it be possible to ensure that 2.10 has #54330 in it?

@PatriceVignola
Contributor

@penpornk Thank you for monitoring the other PR! Another PR that is important to us, and that we believe fixes a pretty significant bug in the pluggable device implementation, is this one: #56707

The Pluggable Device RFC says that pluggable devices should be able to name themselves "GPU" and override the built-in CUDA GPU device, but in practice many duplicate-registration errors are thrown because the previous registrations for the CUDA GPU device are not removed. This forces users to install the tensorflow-cpu package instead of the tensorflow package that GPU users are more familiar with.

@penpornk
Member Author

penpornk commented Jul 29, 2022

@PatriceVignola Happy to report that #55558 went into TF 2.10.

Another PR that is important for us and we believe is a pretty significant bug in the pluggable device implementation is this one: #56707

Unfortunately, we need more time to carefully consider possible side effects of this one. I have replied on the PR.

I would like to introduce @rishikasinha-tf. Going forward, please cc both her and me on PRs that get stuck. I'll also try to go through the PRs tracked in the top post soon. Apologies again for the delay! :(

@penpornk
Member Author

@NeilGirdhar Yes, #54330 is in v2.10. We just cut the branch on Wednesday, so anything that was merged before then is in the release.

@penpornk penpornk changed the title Tracker for TensorFlow-DirectML in TF 2.10.0 Tracker for TensorFlow-DirectML Aug 24, 2022
@PatriceVignola
Contributor

@penpornk We have 2 new small PRs that we'd like to get into 2.11:

#58097
#58207
